DAFx17 Proceedings
5 – 9 September 2017
Edinburgh, UK
Credits:
Proceedings edited and produced by
Alberto Torin, Brian Hamilton, Stefan Bilbao, Michael Newton
Cover designed by Nicky Regan
LaTeX style by Paolo Annibale
DAFx17 logo by Michael Gill
Back cover photo by Michael Newton
www.dafx17.eca.ed.ac.uk
www.dafx.de
Foreword
September is finally upon us, and we the local organising committee for DAFx17 would like to welcome you to Edinburgh
for what is sure to be, as always, a mind- and ear-busting experience. We are especially happy to be able to host this 20th
edition of the International Conference on Digital Audio Effects, which provides an opportune moment to reflect on what has
changed since 1998, when DAFx was launched as a COST action, with its inaugural meeting in Barcelona.
Looking back at the proceedings from DAFx98, one may remark that, of the 35 accepted papers, none were concerned with virtual analog or physical modelling, and only a handful were concerned with time-frequency audio analysis and spatial
audio. At DAFx17, approximately two thirds of the 68 accepted papers may be classified as such. DAFx has increasingly
become the home of the sophisticated use of mathematics for audio—and at this point, there’s no other conference quite like
it.
Beyond this change in the scope of DAFx, what is perhaps most striking is that the average age of a DAFx paper author
is, if anything, lower than it was in 1998. Why? Well, an unspoken rule of DAFx seems to be: we all love audio of course,
but—it had better be fun too. Rest assured that DAFx17 will be no exception. Hosting a conference in Edinburgh makes this
easy. We will be giving the conference participants lots of opportunity to see some of the best parts of this spectacular city,
and to venture further afield in Scotland. As at all times here, there is a 70% chance of rain.
We have been very lucky to be able to secure some superb keynote and tutorial presenters from the academic world and
from industry. Miller Puckette of UCSD, Julius Smith of CCRMA at Stanford, and Avery Wang from Shazam will deliver the
keynotes, and tutorials will be presented by Dave Berners of Universal Audio, Jean-Marc Jot of Magic Leap, Julian Parker
from Native Instruments, and Brian Hamilton from the Acoustics and Audio Group here at the University of Edinburgh. To
top it all off, we are also delighted to be able to showcase new multichannel work by Trevor Wishart—one of the legends of
electroacoustic music.
One sure sign of the success of a conference series is the level of industrial support through sponsorship. We are happy to
report that for DAFx17, thirteen sponsors have lent a hand; they are listed on the previous page and on the back cover of these
proceedings. A number of the sponsors are new in 2017, and, furthermore, many volunteered support before being asked!
We are of course extremely grateful for this support, not only because it defrays operating costs, and permits the reduction
of registration fees, but also because of the increasing importance of the industrial perspective in the research landscape—in
digital audio, the academic/industrial boundary is more porous than ever before, and DAFx is very proud to serve as a gateway
of sorts.
DAFx17 is hosted by the Acoustics and Audio Group at the University of Edinburgh. We are very grateful for the support
from previous organisers, and especially Udo Zölzer, Vesa Välimäki, Damian Murphy, Sascha Disch, Victor Lazzarini, Joe
Timoney, Sigurd Saue, and Pavel Rajmic, as well as the Journal of the Audio Engineering Society, which has agreed to invite
best papers for submission. We are also grateful for support offered by the Edinburgh College of Art at the University of
Edinburgh, and especially Nicky Regan, who took care of our graphic design work at very short notice, Jared de Bruin who
looked after our financial affairs, and more generally, the Reid School of Music, the host department.
DAFx of course will continue, and we look forward to seeing you all again in Aveiro, Portugal in 2018.
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Conference Committees
DAFx Board
Daniel Arfib (CNRS-LMA, Marseille, France)
Nicola Bernardini (Conservatorio di Musica “Cesare Pollini”, Padova, Italy)
Stefan Bilbao (Acoustics and Audio Group, University of Edinburgh, UK)
Francisco Javier Casajús (ETSIS Telecomunicación - Universidad Politécnica de Madrid, Spain)
Philippe Depalle (McGill University, Montréal, Canada)
Giovanni De Poli (CSC-DEI, University of Padova, Italy)
Myriam Desainte-Catherine (SCRIME, Université de Bordeaux, France)
Markus Erne (Scopein Research, Aarau, Switzerland)
Gianpaolo Evangelista (University of Music and Performing Arts, Vienna, Austria)
Simon Godsill (University of Cambridge, UK)
Pierre Hanna (Université de Bordeaux, France)
Robert Höldrich (IEM, University of Music and Performing Arts, Graz, Austria)
Jean-Marc Jot (Magic Leap, USA)
Victor Lazzarini (Maynooth University, Ireland)
Sylvain Marchand (L3i, University of La Rochelle, France)
Damian Murphy (University of York, UK)
Søren Nielsen (SoundFocus, Arhus, Denmark)
Markus Noisternig (IRCAM - CNRS - Sorbonne Universities / UPMC, Paris, France)
Luis Ortiz Berenguer (ETSIS Telecomunicación - Universidad Politécnica de Madrid, Spain)
Geoffroy Peeters (IRCAM - CNRS SMTS, France)
Rudolf Rabenstein (University Erlangen-Nuernberg, Erlangen, Germany)
Davide Rocchesso (University of Palermo, Italy)
Jøran Rudi (NoTAM, Oslo, Norway)
Mark Sandler (Queen Mary University of London, UK)
Augusto Sarti (DEI - Politecnico di Milano, Italy)
Lauri Savioja (Aalto University, Espoo, Finland)
Xavier Serra (Universitat Pompeu Fabra, Barcelona, Spain)
Julius O. Smith III (CCRMA, Stanford University, CA, USA)
Alois Sontacchi (IEM, University of Music and Performing Arts, Graz, Austria)
Todor Todoroff (ARTeM, Brussels, Belgium)
Jan Tro (Norwegian University of Science and Technology, Trondheim, Norway)
Vesa Välimäki (Aalto University, Espoo, Finland)
Udo Zölzer (Helmut-Schmidt University, Hamburg, Germany)
Contents
Foreword iii
Conference Committees iv
Keynotes 1
History of Virtual Musical Instruments and Effects Based on Physical Modeling
Julius O. Smith III . . . 1
Robust Indexing and Search in a Massive Corpus of Audio Recordings
Avery Wang . . . 1
Time-domain Manipulation via STFTs
Miller Puckette . . . 1
Tutorials 2
Efficient reverberation rendering for complex interactive audio scenes
Jean-Marc Jot . . . 2
Room Acoustic Simulation: Overview and Recent Developments
Brian Hamilton . . . 2
Modeling Circuits with Nonlinearities in Discrete Time
David Berners . . . 2
From Algorithm to Instrument
Julian D. Parker . . . 2
Poster Session 1
Nonlinear Homogeneous Order Separation for Volterra Series Identification
Damien Bouvier, Thomas Hélie and David Roze . . . 3
Introducing Deep Machine Learning for Parameter Estimation in Physical Modelling
Leonardo Gabrielli, Stefano Tomassetti, Stefano Squartini and Carlo Zinato . . . 11
Real-time Physical Model of a Wurlitzer and Rhodes Electronic Piano
Florian Pfeifle . . . 17
A Mechanical Mapping Model for Real-time Control of a Complex Physical Modelling Synthesis Engine with a Simple Gesture
Fiona Keenan and Sandra Pauletto . . . 25
Simulating the Friction Sounds Using a Friction-based Adhesion Theory Model
Takayuki Nakatsuka and Shigeo Morishima . . . 32
Kinematics of Ideal String Vibration Against a Rigid Obstacle
Dmitri Kartofelev . . . 40
Doppler Effect of a Planar Vibrating Piston: Strong Solution, Series Expansion and Simulation
Tristan Lebrun and Thomas Hélie . . . 48
Latent Force Models for Sound: Learning Modal Synthesis Parameters and Excitation Functions from Audio Recordings
William J. Wilkinson, Joshua D. Reiss and Dan Stowell . . . 56
Poster Session 2
Efficient Anti-aliasing of a Complex Polygonal Oscillator
Christoph Hohnerlein, Maximilian Rest and Julian D. Parker . . . 125
Modeling Circuits with Operational Transconductance Amplifiers Using Wave Digital Filters
Ólafur Bogason and Kurt J. Werner . . . 130
Automatic Decomposition of Non-linear Equation Systems in Audio Effect Circuit Simulation
Martin Holters and Udo Zölzer . . . 138
WDF Modeling of a Korg MS-50 Based Non-linear Diode Bridge VCF
Maximilian Rest, Julian D. Parker and Kurt J. Werner . . . 145
Comparison of Germanium Bipolar Junction Transistor Models for Real-time Circuit Simulation
Ben Holmes, Martin Holters and Maarten van Walstijn . . . 152
Automatic Control of the Dynamic Range Compressor Using a Regression Model and a Reference Sound
Di Sheng and György Fazekas . . . 160
Poster Session 3
On the Design and Use of Once-differentiable High Dynamic Resolution Atoms for the Distribution Derivative Method
Nicholas Esterer and Philippe Depalle . . . 208
Nicht-negative Matrix Faktorisierung nutzendes Klangsynthesen System (NiMFKS): Extensions of NMF-based Concatenative Sound Synthesis
Michael Buch, Elio Quinton and Bob L. Sturm . . . 215
Validated Exponential Analysis for Harmonic Sounds
Matteo Briani, Annie Cuyt and Wen-shin Lee . . . 222
A System Based on Sinusoidal Analysis for the Estimation and Compensation of Pitch Variations in Musical Recordings
Luís F. V. Carvalho and Hugo T. Carvalho . . . 228
Gradient Conversion Between Time and Frequency Domains Using Wirtinger Calculus
Hugo Caracalla and Axel Roebel . . . 234
Live Convolution with Time-variant Impulse Response
Øyvind Brandtsegg and Sigurd Saue . . . 239
Modal Audio Effects: A Carillon Case Study
Elliot K. Canfield-Dafilou and Kurt J. Werner . . . 247
LP-BLIT: Bandlimited Impulse Train Synthesis of Lowpass-filtered Waveforms
Sebastian Kraft and Udo Zölzer . . . 255
Poster Session 4
EVERTims: Open Source Framework for Real-time Auralization in Architectural Acoustics and Virtual Reality
David Poirier-Quinot, Brian F.G. Katz and Markus Noisternig . . . 323
A Comparison of Player Performance in a Gamified Localisation Task Between Spatial Loudspeaker Systems
Joe Rees-Jones and Damian Murphy . . . 329
Poster Session 5
Estimating Pickup and Plucking Positions of Guitar Tones and Chords with Audio Effects
Zulfadhli Mohamad, Simon Dixon and Christopher Harte . . . 420
Unsupervised Taxonomy of Sound Effects
David Moffat, David Ronan and Joshua D. Reiss . . . 428
The Mix Evaluation Dataset
Brecht De Man and Joshua D. Reiss . . . 436
The Snail: A Real-time Software Application to Visualize Sounds
Thomas Hélie and Charles Picasso . . . 443
Beat-aligning Guitar Looper
Daniel Rudrich and Alois Sontacchi . . . 451
A Method for Automatic Whoosh Sound Description
Eugene Cherny, Johan Lilius and Dmitry Mouromtsev . . . 459
Analysis and Synthesis of the Violin Playing Style of Heifetz and Oistrakh
Chi-Ching Shih, Pei-Ching Li, Yi-Ju Lin, Yu-Lin Wang, Alvin W. Y. Su, Li Su and Yi-Hsuan Yang . . . 466
Keynotes
Julius O. Smith III
History of Virtual Musical Instruments and Effects Based on Physical Modeling
Abstract This presentation visits historical developments leading to today’s virtual musical instruments and effects based
on physical modeling principles. It is hard not to begin with Daniel Bernoulli and d'Alembert, who launched the modal representation (leading to both "additive" and "subtractive" synthesis) and the traveling-wave solution of the wave equation for vibrating strings, respectively, in the 18th century. Newtonian mechanics generally suffices mathematically for characterizing
physical musical instruments and effects, although quantum mechanics is necessary for fully deriving the speed of sound in
air. In addition to the basic ballistics of Newton's Law f = ma, and spring laws relating force to displacement, friction models are needed for modeling the aggregate behavior of vast numbers of colliding particles. The resulting mathematical models
generally consist of ordinary and partial differential equations expressing Newton’s Law, friction models, and perhaps other
physical relationships such as temperature dependence. Analog circuits are similarly described. These differential-equation
models are then solved in real time on a discrete time-space grid to implement musical instruments and effects. The external
forces applied by the performer (or control voltages, etc.) are routed to virtual masses, springs, and/or friction-models, and
they may impose moving boundary conditions for the discretized differential-equation solver. To achieve maximum quality
per unit of computation, techniques from digital signal processing are typically used to implement the differential-equation
solvers in ways that are numerically robust, energy aware, and computationally efficient. In addition to reviewing
selected historical developments, this presentation will try to summarize some of the known best practices for computational
physical modeling in existing real-time virtual musical instruments and effects.
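As a minimal, hypothetical sketch of the discrete-time differential-equation solvers described above (parameter values are illustrative and not from the talk), a damped mass-spring system m x'' + r x' + k x = 0 can be integrated with a centered finite-difference scheme:

```python
import numpy as np

# Illustrative toy example (not from the talk): a damped mass-spring system
#   m x'' + r x' + k x = 0
# discretized with centered finite differences, the simplest instance of a
# discrete-time differential-equation solver for physical modeling.
def simulate_oscillator(m=0.01, k=4000.0, r=0.2, x0=1.0, fs=44100, n=1000):
    dt = 1.0 / fs
    x = np.zeros(n)
    x[0] = x[1] = x0  # released from rest at displacement x0
    for t in range(1, n - 1):
        # x'' ~ (x[t+1] - 2 x[t] + x[t-1]) / dt^2,  x' ~ (x[t+1] - x[t-1]) / (2 dt)
        x[t + 1] = ((2 * m - k * dt**2) * x[t]
                    - (m - r * dt / 2) * x[t - 1]) / (m + r * dt / 2)
    return x

x = simulate_oscillator()  # a decaying sinusoid near sqrt(k/m)/(2*pi) ~ 100 Hz
```

The scheme is stable provided dt*sqrt(k/m) < 2; at 44.1 kHz the toy setting above sits comfortably inside that bound.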
Avery Wang
Robust Indexing and Search in a Massive Corpus of Audio Recordings
Abstract In this talk I will give an overview of the Shazam audio recognition technology. The Shazam service takes a query consisting of a short sample of ambient audio (as little as 2 seconds) from a microphone and searches a massive database of recordings comprising more than 40 million soundtracks. The query may be degraded by significant additive noise (< 0 dB SNR), environmental acoustics, and nonlinear distortions. The computational scaling is such that a query may cost as little as a millisecond of processing time. Previous algorithms could index only hundreds of items, required seconds of processing time, and tolerated 20-30 dB less noise and distortion. In aggregate, the Shazam algorithm represents a leap of more than a factor of 10^10 in efficiency over prior art. I will discuss the various innovations leading to this result.
Miller Puckette
Time-domain Manipulation via STFTs
Abstract Perhaps the most important shortcoming of frequency-domain signal processing results from the Heisenberg limit
that often forces tradeoffs between time and frequency resolution. In this paper we propose manipulating sounds by altering
their STFTs in ways that affect time spans smaller than the analysis window length. An example of a situation in which this
could be useful is the algorithm of Griffin and Lim, which generates a time-domain signal that optimally matches a (possibly
overspecified) short-time amplitude spectrum. We propose an adaptation of Griffin-Lim to simultaneously optimize a signal
to match such amplitude spectra on two or more different time scales, in order to simultaneously manage both transients and
tuning.
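For reference, the classic single-scale Griffin-Lim iteration the abstract builds on can be sketched as follows (a minimal sketch using SciPy's STFT; the window length, iteration count, and test signal are arbitrary choices, and this is not the proposed multi-scale variant):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(target_mag, n_iter=50, nperseg=256, seed=0):
    """Plain Griffin-Lim sketch: alternately enforce the target STFT
    magnitude and time-domain consistency, starting from random phase."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(target_mag.shape))
    spec = target_mag * phase
    for _ in range(n_iter):
        _, x = istft(spec, nperseg=nperseg)              # back to the time domain
        _, _, spec = stft(x, nperseg=nperseg)            # re-analyze
        spec = target_mag * np.exp(1j * np.angle(spec))  # keep phase, reset magnitude
    _, x = istft(spec, nperseg=nperseg)
    return x

# usage sketch: magnitudes taken from a real signal should be nearly recoverable
t = np.linspace(0, 1, 8000, endpoint=False)
sig = np.sin(2 * np.pi * 440 * t)
_, _, Z = stft(sig, nperseg=256)
rec = griffin_lim(np.abs(Z), n_iter=100)
```

Each iteration projects onto the set of spectrograms with the target magnitude and then onto the set of consistent STFTs, which is exactly the tradeoff the proposed multi-scale adaptation manages across two or more window lengths.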
Tutorials
Jean-Marc Jot
Efficient reverberation rendering for complex interactive audio scenes
Abstract Artificial reverberation algorithms originated several decades ago with Schroeder’s pioneering work in the late
60s, and have been widely employed commercially since the 80s in music and soundtrack production. In the late 90s, artificial
reverberation was introduced in game 3D audio engines, which today are evolving into interactive binaural audio rendering
systems for virtual reality. By exploiting the perceptual and statistical properties of diffuse reverberation decays in closed
rooms, computationally efficient reverberators based on feedback delay networks can be designed to automatically match
with verisimilitude the “reverberation fingerprint” of any room. They can be efficiently implemented on standard mobile
processors to simulate complex natural sound scenes, shared among a multiplicity of virtual sound sources having different
positions and directivity, and combined to simulate complex acoustical spaces. In this tutorial presentation, we review the
fundamental assumptions and design principles of artificial reverberation, apply them to design parametric reverberators,
and extend them to realize computationally efficient interactive audio engines suitable for untethered virtual and augmented
reality applications.
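The feedback-delay-network construction mentioned above can be illustrated with a toy four-line reverberator; the delay lengths and loop gain below are illustrative assumptions, not values from the tutorial:

```python
import numpy as np

def fdn_reverb(x, delays=(1031, 1171, 1301, 1433), g=0.8):
    """Tiny feedback delay network sketch: four delay lines fed back
    through a Householder matrix, a common lossless-mixing choice.
    Delay lengths (mutually prime) and gain g are illustrative."""
    N = len(delays)
    A = np.eye(N) - 2.0 / N * np.ones((N, N))  # Householder: orthogonal, cheap
    bufs = [np.zeros(d) for d in delays]
    idx = [0] * N
    y = np.zeros(len(x))
    for n, xn in enumerate(x):
        outs = np.array([bufs[i][idx[i]] for i in range(N)])  # delay-line outputs
        y[n] = outs.sum()
        back = g * (A @ outs)  # attenuated, energy-preserving feedback mix
        for i in range(N):
            bufs[i][idx[i]] = xn + back[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return y

imp = np.zeros(20000)
imp[0] = 1.0
h = fdn_reverb(imp)  # decaying, increasingly dense impulse response
```

Because the Householder matrix is orthogonal, all decay comes from the scalar gain g, which is what makes such reverberators easy to match to a measured decay time.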
Brian Hamilton
Room Acoustic Simulation: Overview and Recent Developments
Abstract Simulation of room acoustics has applications in architecture, audio engineering, and video games. It is also
gaining importance in virtual reality applications, where realistic 3D sound rendering plays an integral part in creating a
sense of immersion within a virtual space. This tutorial will give an overview of room acoustic simulation methods, ranging
from traditional approaches based on principles of geometrical and statistical acoustics, to numerical methods that solve the
wave equation in three spatial dimensions, including recent developments of finite difference time domain (FDTD) methods resulting from the recently completed five-year NESS project (www.ness-music.eu). Computational costs and practical
considerations will be discussed, along with the benefits and limitations of each framework. Simulation techniques will be
illustrated through animations and sound examples.
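A minimal instance of the FDTD family surveyed here, a toy 2D example far simpler than the 3D NESS codes, is the leapfrog scheme for the lossless wave equation:

```python
import numpy as np

def fdtd_2d(nx=60, ny=60, steps=200, lam=0.5):
    """Toy 2D wave-equation FDTD sketch (illustrative, unrelated to NESS code):
    leapfrog update p_next = 2 p - p_prev + lam^2 * Laplacian(p) on a square
    grid with pressure-release (p = 0) walls. Here lam = c*dt/dx, and 2D
    stability requires lam <= 1/sqrt(2)."""
    p_prev = np.zeros((nx, ny))
    p = np.zeros((nx, ny))
    p[nx // 2, ny // 2] = 1.0  # impulsive excitation at the grid centre
    for _ in range(steps):
        lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
               np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4 * p)
        p_next = 2 * p - p_prev + lam**2 * lap
        p_next[0, :] = p_next[-1, :] = 0.0  # enforce p = 0 on all four walls
        p_next[:, 0] = p_next[:, -1] = 0.0
        p_prev, p = p, p_next
    return p

field = fdtd_2d()  # bounded wavefield after 200 time steps
```

With lam fixed at 0.5 the scheme is well inside the 2D Courant limit, so the field remains bounded; the practical considerations in the tutorial (dispersion error, boundary modeling, memory bandwidth) are exactly what separates this sketch from a usable room simulator.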
David Berners
Modeling Circuits with Nonlinearities in Discrete Time
Abstract Modeling techniques for circuits with nonlinear components will be discussed. Nodal analysis and K-method
models will be developed and compared in the context of delay-free loop resolution. Standard discretization techniques
will be reviewed, including forward- and backward-difference and bilinear transforms. Interaction between nonlinearities
and choice of discretization method will be discussed. Piecewise functional models for memoryless nonlinearities will be
developed, as well as iterative methods for solving equations with no analytic solution. Emphasis will be placed on Newton’s
method and related techniques. Convergence and seeding for iterative methods will be reviewed. Relative computational
expense will be discussed for piecewise vs. iterative approaches. Time permitting, modeling of time-varying systems will be
discussed.
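To make the Newton's-method portion concrete, here is a sketch for a memoryless diode-pair clipper (a standard textbook nonlinearity; the component values and the sinh diode law are illustrative assumptions, not taken from the tutorial):

```python
import math

def diode_clipper(v_in, R=2200.0, Is=1e-12, Vt=0.02585, tol=1e-12, max_iter=50):
    """Static diode-pair clipper solved with Newton's method: find the node
    voltage v with f(v) = (v_in - v)/R - 2*Is*sinh(v/Vt) = 0.
    Component values are illustrative."""
    v = 0.0  # seed; in a real-time loop the previous sample's solution is better
    for _ in range(max_iter):
        f = (v_in - v) / R - 2 * Is * math.sinh(v / Vt)
        df = -1.0 / R - (2 * Is / Vt) * math.cosh(v / Vt)  # f'(v)
        step = f / df
        v -= step  # Newton update
        if abs(step) < tol:
            break
    return v

out = [diode_clipper(v) for v in (-1.0, -0.1, 0.0, 0.1, 1.0)]  # soft-clipped values
```

Seeding and convergence behave as the tutorial outline suggests: far from the solution the iteration crawls roughly one thermal voltage per step through the exponential region, then converges quadratically once close.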
Julian D. Parker
From Algorithm to Instrument
Abstract The discipline of designing algorithms for creative processing of musical audio is now fairly mature in academia,
as evidenced by the continuing popularity of the DAFx conference. However, this large corpus of work is motivated primarily
by the traditional concerns of the academic signal-processing community, namely technical novelty or improvement in
quantifiable metrics related to signal quality or computational performance. Whilst these factors are extremely important,
they are only a small part of the process of designing an inspiring and engaging tool for the creative generation or processing
of sound. Algorithms for this use must be designed with as much thought given to subjective qualities like aesthetics and
usability as to technical considerations. In this tutorial I present my own experiences of trying to bridge this gap, and the
design principles I’ve arrived at in the process. These principles will be illustrated both with abstract examples and with case
studies from the work I’ve done at Native Instruments.
2.2. Volterra operator properties

In the following definition, $L^p(E, F)$ denotes the standard Lebesgue space of functions from vector space $E$ to $F$ with $p$-norm.

Definition 2 (Volterra operator of order $n$). Let $h_n \in L^1(\mathbb{R}_+^n, \mathbb{R})$ be a Volterra kernel, for $n \in \mathbb{N}^*$. Then we introduce the multilinear operator $V_n : L^\infty(\mathbb{R}, \mathbb{R}) \times \ldots \times L^\infty(\mathbb{R}, \mathbb{R}) \mapsto L^\infty(\mathbb{R}, \mathbb{R})$ such that the function $V_n[u_1, \ldots, u_n]$ is defined $\forall t \in \mathbb{R}$ by

$$V_n[u_1, \ldots, u_n](t) = \int_{\mathbb{R}_+^n} h_n(\tau_1, \ldots, \tau_n) \prod_{i=1}^{n} u_i(t - \tau_i)\, \mathrm{d}\tau_i \qquad (4)$$

If $u_1 = \ldots = u_n = u$, then $V_n[u, \ldots, u] = y_n$.

Property 1 (Symmetry). Given a symmetric kernel $h_n$, the corresponding Volterra operator $V_n$ is also symmetric, meaning

$$V_n[u_{\pi(1)}, \ldots, u_{\pi(n)}](t) = V_n[u_1, \ldots, u_n](t) \qquad (5)$$

for any permutation $\pi$ and $\forall t \in \mathbb{R}$. In the following, symmetry of $h_n$ and $V_n$ will be assumed.

Property 2 (Multilinearity and homogeneity). The Volterra operator $V_n$ is multilinear, i.e. for any signals $u_1$, $u_2$ and any scalars $\lambda$, $\mu$,

$$V_n[\lambda u_1 + \mu u_2, \ldots, \lambda u_1 + \mu u_2](t) = \sum_{q=0}^{n} \binom{n}{q} \lambda^{n-q} \mu^{q}\, V_n[\underbrace{u_1, \ldots, u_1}_{n-q}, \underbrace{u_2, \ldots, u_2}_{q}](t) \qquad (6)$$

This also implies that $V_n$ is a homogeneous operator of degree $n$, i.e. for any signal $u$ and scalar $\alpha$,

$$V_n[\alpha u, \ldots, \alpha u](t) = \alpha^n V_n[u, \ldots, u](t) \qquad (7)$$

2.3. State-of-the-art order separation

Nonlinear homogeneous order separation implies the ability to recover the signals $y_n$ from the output $y$ of a system described by a Volterra series truncated to order $N$.

From (7), $V[\alpha u](t) = \sum_{n=1}^{N} \alpha^n y_n(t)$. Consider a collection of input signals $u_k(t) = \alpha_k u(t)$, with $\alpha_k \in \mathbb{R}^*$, $k = 1, \ldots, N$, and denote by $z_k(t) = V[u_k](t)$ their corresponding outputs through the system; then, for all time $t$:

$$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix}(t) = \begin{bmatrix} \alpha_1 & \alpha_1^2 & \ldots & \alpha_1^N \\ \alpha_2 & \alpha_2^2 & \ldots & \alpha_2^N \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_N & \alpha_N^2 & \ldots & \alpha_N^N \end{bmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}(t), \quad \alpha_k \in \mathbb{R}^*, \qquad Z(t) = A \cdot Y(t) \qquad (8)$$

Since $A$ is a Vandermonde matrix, it is invertible if and only if all $\alpha_k$ are different; hence it is possible to recover the terms $y_n$.

But for real-valued $\alpha_k$, this type of matrix is known for becoming rapidly ill-conditioned as its size grows (meaning that small noise in the measured outputs becomes a large error in the estimation), so robustness decreases rapidly when the truncation order $N$ increases. To circumvent that, it is possible to solve the linear problem $Z = AY$ using a Newton Recursive or Lagrange Recursive method [20, Algorithms 4.6.1 and 4.6.2]. But in practice, this approach is still very sensitive to the choice of the amplitude factors $\alpha_k$. Indeed, for small amplitudes, higher orders will be hidden in measurement noise, while high values of $\alpha_k$ will potentially overload the system, or simply lead it out of its Volterra convergence radius.

Despite those disadvantages, this separation method has been used intensively to simplify the identification process [13]; it is generally used in the frequency domain (the previous equations and remarks remain valid) and jointly with frequency probing methods [10, 11]; recently, a maximum-order truncation estimation method has been constructed from it [14]. In the following, this method will be referred to as the Amplitude Separation (AS) method.

3. PHASE-BASED HOMOGENEOUS SEPARATION METHOD

The starting points of the proposed separation method are the following remarks:
• using the AS method with factors 1 and −1, it is possible to separate odd and even orders by inverting a mixing matrix A with optimum condition number;
• multiplying a signal by the amplitude factor −1 is equivalent to taking its opposite phase;
• thus, for a system truncated to order N = 2, there exists a robust separation method that relies only on phase deconstruction and reconstruction between test signals.
The main idea of this paper is to generalize the use of phase for robust separation to Volterra systems with truncation order N > 2.

3.1. Method for complex-valued input

This section proposes a theoretical separation method relying on the use of complex signals $u(t) \in \mathbb{C}$ as system inputs.

3.1.1. Principle

Using complex signals, the factors $\alpha_k$ in the AS method are not limited to real scalars. So it is possible to choose values which differ only by their phase (i.e. which lie on the unit circle instead of the real axis). Noticing that 1 and −1 are the two square roots of unity, a natural extension of the toy method proposed for order-2 systems is to take the $N$-th roots of unity as factors $\alpha_k$. Choosing $\alpha_k = w_N^k$ with $w_N = e^{j 2\pi / N}$, (8) becomes, for all time $t$:

$$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix}(t) = \begin{bmatrix} w_N & w_N^2 & \ldots & w_N^N \\ w_N^2 & w_N^4 & \ldots & w_N^{2N} \\ \vdots & \vdots & \ddots & \vdots \\ w_N^N & w_N^{2N} & \ldots & w_N^{N^2} \end{bmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}(t), \qquad Z(t) = W_N \cdot Y(t) \qquad (9)$$

where $W_N$ is the Discrete Fourier Transform (DFT) matrix of order $N$ (after a column and row permutation¹). It is important to note that here the DFT does not apply to time but to the homogeneous nonlinear orders.

Since the DFT is invertible, order separation is possible. Furthermore, the DFT matrix is well conditioned², and the solution

¹ It suffices to consider the vectors $\hat{Z}(t) = [z_N, z_1, \ldots, z_{N-1}]^T$ and $\hat{Y} = [y_N, y_1, \ldots, y_{N-1}]^T$ to recover the usual DFT matrix.
² Its condition number is even optimum, since it is 1 for any order $N$.
can be computed using the Fast Fourier Transform algorithm3 . In for kernel hn ), the order n contribution is:
the following, this method will be referred to as the Phase Separa-
tion (PS) method. yn (t) = Vn [v, . . . , v](t)
= Vn w u + w u, . . . , w u + w u (t)
!
3.1.2. Nonlinear order aliasing and rejection factor Xn
n
= wn−q wq Vn u, . . . , u , u, . . . , u (t)
Given the N -periodicity of N -th root of unity wn , the ouptut of a q | {z } | {z }
q=0
(n−q) times q times
Volterra system with no order truncation is: !
n
X n
N ∞
= wn−2q Mn,q (t) (14)
X n
X q
q=0
V [wN u](t) = wN yn+rN (t) (10)
n=1 r=0
with Mn,q (t) = Vn u, . . . , u, u, . . . , u (t) ∈ C. This term rep-
| {z } | {z }
e of nonlinear orders 1 to
By applying the PS method, estimation Y n−q q
Equation P(11) reveals that estimation yen is perturbed by a resid- Remark 6. By their definition, terms Mn,0 (t) (respectively Mn,n (t))
ual term Σ_{r≥1} y_{n+rN}, which is structured as an aliasing with respect to the nonlinear order: we call this effect the nonlinear order aliasing.⁴

For a band-limited input signal, the presence of a term y_{n+k} in the estimated signal ỹ_n means the presence of higher-frequency components than expected; this artefact can therefore help detect a wrong truncation order N. Moreover, (11) makes it possible to create a higher-order rejection by using amplitude as a contrast factor. Taking α_k = ρ w_N^k, where ρ is a positive real number less than 1, the estimation Ỹ using the PS method becomes

    Ỹ(t) = [ ρ (y₁ + Σ_{r≥1} ρ^{rN} y_{1+rN}) ;
             ρ² (y₂ + Σ_{r≥1} ρ^{rN} y_{2+rN}) ;
             ⋮ ;
             ρᴺ (y_N + Σ_{r≥1} ρ^{rN} y_{N+rN}) ](t),    (12)

creating a ρᴺ ratio between the desired signal y_n(t) and the first perturbation y_{n+N}(t). Thus the parameters N and ρ enable a required Signal-to-Noise Ratio (SNR) to be reached.

However, the need for complex input (and output) signals prevents the use of the PS method in practice.

3.2. Application to real-valued input

Consider the real signal v(t) constructed as follows:

    v(t) = w u(t) + w̄ ū(t) = 2 Re[w u(t)],    (13)

where w is a complex scalar on the unit circle (such that w̄ = w⁻¹) and u(t) a complex signal. Therefore, using property (6), and assuming symmetry for the operator V_n (meaning symmetry […]) […] are the homogeneous contributions of order n of the system excited by the complex signal u(t) (resp. ū(t)).

Equation (14) shows that, in the output term y_n, there is more than one characterising phase factor w. So the PS method is not directly exploitable to separate the terms y_n.

3.3. Method for real-valued input

Difficulty analysis on an example: consider a system truncated to N = 3 with, as input, the real signal described in (13); then, omitting the temporal dependency, its nonlinear orders are:

    y₁ = w M_{1,0} + w⁻¹ M_{1,1}
    y₂ = w² M_{2,0} + 2 M_{2,1} + w⁻² M_{2,2}
    y₃ = w³ M_{3,0} + 3 w M_{3,1} + 3 w⁻¹ M_{3,2} + w⁻³ M_{3,3}    (15)

Only 7 different phase terms appear (from w⁻³ to w³). Consider a collection of K = 7 real signals v_k(t) = w_K^k u(t) + w̄_K^k ū(t), where w_K is the first K-th root of unity, and z_k(t) = V[v_k](t) their corresponding outputs through the system. Then:

    [ z₁ ; z₂ ; z₃ ; z₄ ; z₅ ; z₆ ; z₇ ](t) = W_K · [ 2 M_{2,1} ; M_{1,0} + 3 M_{3,1} ; M_{2,0} ; M_{3,0} ; M_{3,3} ; M_{2,2} ; M_{1,1} + 3 M_{3,2} ](t),    (16)

where W_K is the DFT matrix⁵ of order K.

Therefore, by application of the PS algorithm (i.e. inversion of W_K) on this collection of signals z_k, the right-hand side vector in (16) is recovered. A further separation could be made using two different amplitudes to discriminate between M_{1,0} and M_{3,1}, and thus be able to reconstruct the nonlinear orders y_n using (14).

³ Even if the gain in computation time is not significant, since generally N is not a power of 2 and is not very high.
⁴ We thank the anonymous reviewer who proposed this relevant expression.
⁵ It is important to notice that the right-hand side vector has Hermitian symmetry, due to its Fourier transform (the left-hand side) being real.
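The phase-separation step of (16) can be checked numerically. The sketch below uses a toy memoryless polynomial, y = v + 0.5 v² + 0.25 v³, as a stand-in for the Volterra operator V (the coefficients and probe signal are illustrative assumptions, not the paper's system); probing with K phase-shifted real inputs and taking a length-K DFT across the collection index separates the phase components:

```python
import numpy as np

# Toy stand-in for the Volterra system V (illustrative coefficients only)
def system(v):
    return v + 0.5 * v**2 + 0.25 * v**3

K = 7
wK = np.exp(2j * np.pi / K)                  # first K-th root of unity
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
u = np.exp(2j * np.pi * 5.0 * t)             # complex exponential probe, |u| = 1

# K real test signals v_k(t) = w_K^k u(t) + conj(w_K^k u(t)) and their outputs
Z = np.stack([system(2.0 * np.real(wK**k * u)) for k in range(K)])

# Inverting the DFT matrix W_K amounts to a DFT across the collection index k;
# row p then holds the combined term with phase factor w^p (p = 0, 1, 2, 3, -3, -2, -1)
components = np.fft.fft(Z, axis=0) / K
```

For this toy system, row p = 3 recovers the pure cubic phase term 0.25 u³, while row p = 1 mixes the linear term with a cubic contribution of the same phase, exactly the M_{1,0} + 3 M_{3,1}-type mixing that motivates the additional amplitude step of the PAS method.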
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
    y(t) = Σ_{−N≤p≤N, p even} w^p ( Σ_{|p|≤n≤N, n even} C(n, (n−p)/2) M_{n,(n−p)/2}(t) )
         + Σ_{−N≤p≤N, p odd} w^p ( Σ_{|p|≤n≤N, n odd} C(n, (n−p)/2) M_{n,(n−p)/2}(t) )    (17)

where C(n, q) denotes the binomial coefficient. So, by applying a PS algorithm of order K = 2N + 1, we can separate the terms

    Q_p(t) = Σ_{|p|≤n≤N, n≡p (mod 2)} C(n, (n−p)/2) M_{n,(n−p)/2}(t)    (18)

with −N ≤ p ≤ N. Then, the application of the AS algorithm on each Q_p (with ⌊N/2⌋ amplitudes) gives all the M_{n,q} terms; the terms y_n are reconstructed using (14).

This concatenation of the PS and AS algorithms constitutes the proposed phase-based method, which will be referred to as the Phase-Amplitude Separation (PAS) method. As will be pointed out in Section 4.3, it can be more interesting to use the terms M_{n,q} directly for identification, instead of the nonlinear orders y_n; this alternative process will be referred to as the raw-PAS (rPAS) method.

3.4. Condition number

In numerical analysis, the condition number κ measures a method's sensitivity to noise in the measured data; it depends only on the solving method itself (the matrix to invert, in linear problems), and not on the data it is applied to. In this section, the condition number is used to compare the robustness of AS and PAS⁶.

For the AS method, the amplitude factors α_k are chosen equally spaced on a dB scale, with alternating signs:

    α_{2p} = γ^{p−1} and α_{2p+1} = −α_{2p},    (19)

where γ is a chosen spacing gain.

Because the DFT matrix is optimally conditioned, the overall conditioning of the PAS method depends only on the K applications of the AS method. The amplitude factors are also chosen equally spaced on a dB scale; the need to separate terms with the same order parity prevents us from using alternating signs. Only the worst condition number over all K sub-problems is reported.

Figure 1 presents the conditioning of the AS and PAS methods. For all maximum truncation orders and gain spacings, an improvement from the amplitude-based to the phase-based method is visible. Furthermore, for negative gain spacings, the optimum conditioning is divided by a factor of 2 between AS and PAS. For both methods, however, the same behaviour is observed when the truncation order N increases (which is unsurprising, since PAS relies partly on AS).

Those results indicate that the robustness of PAS to noisy measurements should be better than that of AS.

⁶ As the PAS and rPAS methods share the same steps, their condition numbers are equal.

Figure 1: Condition number κ = ‖A‖‖A⁻¹‖ (Frobenius norm): evolution w.r.t. gain spacing γ and truncation order N (3 to 6, from bottom to top) for the AS (solid line, × indicating minima) and PAS (dashed line, • indicating minima) methods.

4. APPLICATIONS

In order to test and compare the different separation methods, data from the simulation of a nonlinear device were used.

4.1. Simulation of a loudspeaker with nonlinear suspension

The simulated system is a loudspeaker with nonlinear suspension, described by the following modified Thiele & Small equations:

    u(t) = R_e i(t) + L di(t)/dt + Bl dℓ(t)/dt    (20a)
    M d²ℓ(t)/dt² = Bl i(t) − R_m dℓ(t)/dt − Σ_{n=1}^{3} k_n ℓⁿ(t)    (20b)

where u is the voltage at the loudspeaker's terminals, i the current flowing through it, and ℓ the position of the diaphragm. The term Σ_{n=1}^{3} k_n ℓⁿ(t) corresponds to the nonlinear force that the suspension applies on the diaphragm.

Consider the state vector x = [i ℓ dℓ/dt]ᵀ. Then, using the state-space formalism, the system with input u(t) and output i(t) is written:

    ẋ = A x + B u + K₂(x, x) + K₃(x, x, x)
    i = C x    (21)

where ẋ indicates the temporal derivative of x, and:

    A = [ −R_e/L    0       −Bl/L  ]        B = [ 1/L ]        C = [ 1  0  0 ]
        [  0        0        1     ]            [  0  ]
        [  Bl/M    −k₁/M    −R_m/M ]            [  0  ]

and where K₂ and K₃ are multilinear functions of x such that:

    K₂(a, b) = [ 0 ; 0 ; −(k₂/M) a₂ b₂ ]  and  K₃(a, b, c) = [ 0 ; 0 ; −(k₃/M) a₂ b₂ c₂ ]

where a₂ denotes the second component of a vector a.
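Returning to the conditioning comparison of Section 3.4, it can be reproduced in outline as follows. The AS mixing matrix maps the orders y_n to the measured outputs, y(α_k u) = Σ_n α_k^n y_n, i.e. a Vandermonde-type matrix in the amplitudes, while the PS step inverts a DFT matrix; the values of N and γ below, and the exact amplitude layout, are illustrative assumptions:

```python
import numpy as np

def cond_frobenius(A):
    """kappa = ||A|| * ||A^-1|| with the Frobenius norm, as in Figure 1."""
    return np.linalg.norm(A) * np.linalg.norm(np.linalg.inv(A))

N = 4
gamma = 10.0 ** (-2.0 / 20.0)          # -2 dB gain spacing (illustrative)
# Eq. (19): amplitudes equally spaced in dB, with alternating signs
alpha = np.array([(-1.0) ** k * gamma ** (k // 2) for k in range(N)])
# AS mixing matrix: A[k, n-1] = alpha_k^n for orders n = 1..N
A_as = np.column_stack([alpha ** n for n in range(1, N + 1)])

K = 2 * N + 1
W_dft = np.fft.fft(np.eye(K))          # DFT matrix of order K

kappa_as = cond_frobenius(A_as)
kappa_ps = cond_frobenius(W_dft)       # equals K exactly for a DFT matrix
```

For these values the Vandermonde-type AS matrix is markedly worse conditioned than the DFT matrix, which is the qualitative behaviour summarised by Figure 1.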
Table 1 (simulation parameters):

    R_e = 5.7 Ω          R_m = 4.06·10⁻¹ N·m⁻¹·s
    L   = 1.1·10⁻¹ H     k₁  = 912.3 kg·s⁻²
    Bl  = 2.99 N·A⁻¹     k₂  = 611.5 kg·m⁻¹·s⁻²
    M   = 1.9·10⁻³ kg    k₃  = 8.0·10⁷ kg·m⁻²·s⁻²

Figure 2: Total output and first three nonlinear orders y_n of the simulated loudspeaker for a cosinusoidal input of frequency 100 Hz and amplitude 10 V.

Simulations were made using the numerical methods described in Appendix 8, with the parameters given in Table 1.

4.2. Error separation

The order separation methods are compared using the relative error ε_n of each order n for noisy measurements of the system outputs, where

    ε_n = RMS(ỹ_n[k] − y_n[k]) / RMS(y_n[k])    (22)

with y_n the true nonlinear homogeneous contribution, ỹ_n its estimation, and RMS the standard Root-Mean-Square measure.

Simulations were made at a sampling frequency of 20000 Hz, and with a truncation order N = 5. All signals were 1 second long.

Figure 2 shows the total output and the first three nonlinear orders of the system. Its linear behaviour is predominant: y₁ is one (respectively four) orders of magnitude above y₃ (resp. y₂). This large difference of amplitude between the signals y (or equivalently y₁) and y₂ means that the quadratic order could be hidden by noise even at relatively high SNR.

Figure 3: Relative error ε_n (in dB) of separation w.r.t. nonlinear order n and SNR (80 dB, 60 dB and 40 dB from bottom to top) for the AS (solid line with ×) and PAS (dashed line with •) methods. Input signals are linear sweeps of amplitude 10 V going from 30 Hz to 200 Hz.

Figure 3 compares the separation measure ε_n for both the AS and PAS methods applied to noisy output data with different SNRs. Estimation on clean output data is not shown here; in this case, both methods perform similarly and give the true homogeneous contributions y_n (within machine accuracy). The high values of the relative errors ε₂ and ε₄ are due to the smaller amplitude of the signals y₂ and y₄ with respect to the other orders, which makes their estimation more sensitive to measurement noise.

Furthermore, Figure 3 shows that PAS outperforms the AS method in the presence of noise. For the same order n and SNR, the gain in ε_n is around 6 dB.

4.3. Kernel identification using order separation

4.3.1. Identification algorithms

The standard Volterra identification method used in this paper is Korenberg's algorithm for Least-Squares problems (KLS), as described in [6]. It consists of solving the following linear-in-the-parameters problem:

    y = Φ f    (23)

where:
• vector y = [y[0], …, y[L−1]]ᵀ is the concatenation of all output samples, with L the signal length (in samples);
• Φ = [φ[0], …, φ[L−1]]ᵀ is the input combinatorial matrix, where vector φ[k] regroups all cross-products u[k−k₁] … u[k−k_n] of the input signal at time k (for all orders n);
• vector f is the concatenation of all kernel values (for all orders n).

Using order separation (the AS or PAS method), (23) is separated into N smaller problems:

    y_n = Φ_n f_n    (24)

Identification is then carried out separately for each kernel; this leads to the AS-KLS and PAS-KLS algorithms.
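As a toy illustration of the per-order least-squares problem of Eq. (24), the sketch below identifies the linear order (n = 1): Φ₁ stacks delayed input samples and f₁ is the first-order kernel. Plain least squares is used here in place of Korenberg's orthogonalised KLS algorithm of [6], and the kernel and memory length are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
L, mem = 512, 8                          # signal length, kernel memory (illustrative)
u = rng.normal(size=L)
h_true = np.exp(-np.arange(mem))         # "unknown" first-order kernel
y1 = np.convolve(u, h_true)[:L]          # linear-order output

# Phi_1: row k holds u[k], u[k-1], ..., u[k-mem+1] (zeros before k = 0)
Phi1 = np.zeros((L, mem))
for m in range(mem):
    Phi1[m:, m] = u[:L - m]

# Least-squares solve of y_1 = Phi_1 f_1
f1, *_ = np.linalg.lstsq(Phi1, y1, rcond=None)
```

With noisy measurements the same solve is applied to each separated order, which is where the better conditioning of the PAS separation pays off.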
Figure 4: First and second-order kernels computed from (30) and (31) in Appendix 8.2.
in Audio Engineering Society Convention 141. Audio Engineering Society, 2016.

[14] B. Zhang and S.A. Billings, “Volterra series truncation and kernel estimation of nonlinear systems in the frequency domain,” Mechanical Systems and Signal Processing, vol. 84, pp. 39–57, 2017.

[15] Wilson J. Rugh, Nonlinear System Theory, Johns Hopkins University Press, Baltimore, 1981.

[16] Stephen P. Boyd, Leon O. Chua, and Charles A. Desoer, “Analytical foundations of Volterra series,” IMA Journal of Mathematical Control and Information, vol. 1, no. 3, pp. 243–282, 1984.

[17] Thomas Hélie and Béatrice Laroche, “Computation of convergence bounds for Volterra series of linear-analytic single-input systems,” IEEE Transactions on Automatic Control, vol. 56, no. 9, pp. 2062–2072, 2011.

[18] Thomas Hélie and Béatrice Laroche, “Computable convergence bounds of series expansions for infinite dimensional linear-analytic systems and application,” Automatica, vol. 50, no. 9, pp. 2334–2340, 2014.

[19] Stephen Boyd and Leon Chua, “Fading memory and the problem of approximating nonlinear operators with Volterra series,” IEEE Transactions on Circuits and Systems, vol. 32, no. 11, pp. 1150–1161, 1985.

[20] Gene H. Golub and Charles F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.

[21] SICA, “Constructor datasheet for SICA loudspeaker model Z000900,” https://www.sica.it/media/Z000900.pdf, Accessed: 2017-04-07.

7. APPENDIX: DETAILED COMPUTATION OF (17)

Consider a system truncated to order N with, as input, the real signal described in (13). From (14), we have:

    y_n(t) = Σ_{q=0}^{n} C(n, q) w^{n−2q} M_{n,q}(t)    (26)

So, using (1), the overall output of the system is:

    y(t) = Σ_{n=1}^{N} y_n(t)
         = Σ_{n=1}^{N} Σ_{q=0}^{n} C(n, q) w^{n−2q} M_{n,q}(t)
         = Σ_{1≤n≤N, n even} Σ_{q=0}^{n} C(n, q) w^{n−2q} M_{n,q}(t) + Σ_{1≤n≤N, n odd} Σ_{q=0}^{n} C(n, q) w^{n−2q} M_{n,q}(t)
         = Σ_{1≤n≤N, n even} Σ_{−n≤p≤n, p even} w^p C(n, (n−p)/2) M_{n,(n−p)/2}(t)
           + Σ_{1≤n≤N, n odd} Σ_{−n≤p≤n, p odd} w^p C(n, (n−p)/2) M_{n,(n−p)/2}(t)

by posing p = n − 2q. Taking into account the fact that both sums are finite, it is possible to invert their order, and thus:

    y(t) = Σ_{−N≤p≤N, p even} Σ_{|p|≤n≤N, n even} w^p C(n, (n−p)/2) M_{n,(n−p)/2}(t)
         + Σ_{−N≤p≤N, p odd} Σ_{|p|≤n≤N, n odd} w^p C(n, (n−p)/2) M_{n,(n−p)/2}(t)    (27)

8. APPENDIX: SYSTEM NUMERICAL APPROXIMATION

8.1. Numerical simulation

The input signal u(t) is approximated at sampling rate f_s with a zero-order hold; then, following the state-space representation in (21), the output current i at sample time l is given by:

    i[l] = Σ_{n=1}^{N} C x_n[l]    (28)

where x_n are the states of nonlinear homogeneous order n; the first three orders are given by the following recursive equations:

    x₁[l+1] = e^{A T_s} x₁[l] + Δ₀ B u[l]    (29a)
    x₂[l+1] = e^{A T_s} x₂[l] + Δ₀ K₂(x₁[l], x₁[l])    (29b)
    x₃[l+1] = e^{A T_s} x₃[l] + Δ₀ K₃(x₁[l], x₁[l], x₁[l]) + 2 Δ₀ K₂(x₁[l], x₂[l])    (29c)

where T_s = 1/f_s is the sampling period and Δ₀ = A⁻¹(e^{A T_s} − I) is a bias term due to sampling.

8.2. Discrete-time kernel computation

The discrete-time kernels h_n corresponding to the loudspeaker simulation using (28) and (29) are given by:

    h_n[m₁, …, m_n] = C g_n[m₁, …, m_n]    (30)

where g_n are the input-to-states kernels of order n; the first three orders are given by:

    g₁[m] = (1 − δ_{m,0}) e^{A T_s (m−1)} Δ₀ B    (31a)
    g₂[m₁, m₂] = Σ_{l=0}^{max{m₁,m₂}} e^{A T_s l} Δ₀ K₂(g₁[m₁−l], g₁[m₂−l])    (31b)
    g₃[m₁, m₂, m₃] = Σ_{l=0}^{max{m₁,m₂,m₃}} e^{A T_s l} Δ₀ ( K₃(g₁[m₁−l], g₁[m₂−l], g₁[m₃−l]) + 2 K₂(g₁[m₁−l], g₂[m₂−l, m₃−l]) )    (31c)

with δ_{m,n} the Kronecker delta.
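The order-by-order recursion (29), truncated at N = 2, can be sketched as follows. An illustrative 2-state bilinear system stands in for the loudspeaker matrices of Section 4.1 (all numerical values below are assumptions chosen only so the recursion is stable and runnable):

```python
import numpy as np

def mat_exp(A):
    """Matrix exponential via eigendecomposition (fine for diagonalizable A)."""
    w, V = np.linalg.eig(A)
    return (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real

fs = 20000.0
Ts = 1.0 / fs
A = np.array([[0.0, 1.0], [-1.0e6, -100.0]])           # illustrative, stable
B = np.array([1.0, 0.0])
C = np.array([1.0, 0.0])
K2 = lambda a, b: np.array([0.0, -0.5 * a[1] * b[1]])  # toy quadratic coupling

Phi = mat_exp(A * Ts)                                  # e^{A Ts}
Delta0 = np.linalg.inv(A) @ (Phi - np.eye(2))          # A^{-1}(e^{A Ts} - I)

L = 400
u = np.cos(2.0 * np.pi * 100.0 * Ts * np.arange(L))    # cosinusoidal input
x1, x2 = np.zeros(2), np.zeros(2)
i_out = np.zeros(L)
for l in range(L):
    i_out[l] = C @ (x1 + x2)                           # Eq. (28), truncated at N = 2
    x1_new = Phi @ x1 + Delta0 @ (B * u[l])            # (29a)
    x2 = Phi @ x2 + Delta0 @ K2(x1, x1)                # (29b), driven by the old x1
    x1 = x1_new
```

Note that x₂ is updated with the previous value of x₁, as the recursion requires; summing the per-order state outputs then gives the (truncated) total output of Eq. (28).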
Section 4 reports the implementation details and the experiments conducted, while Section 5 provides the results of these experiments and discusses them. Finally, in Section 6, conclusions are drawn and open avenues for research are suggested.

2. THE PROPOSED METHOD

A physical model solves a set of differential equations that model a physical system and requires a set of parameters θ to generate a discrete-time audio sequence s(n). The goal of the model is to approximate acoustic signals generated by the physical system in some perceptual sense. If the model provides a mapping from the parameters to the discrete sequence s(n), the problem of estimating the parameters θ that yield a specific audio sequence identified as a target (e.g. an audio sequence sampled from the physical system we are approximating) is equivalent to finding the model inverse. Finding an inverse mapping (or an approximation thereof) for the model is a challenging task, and a first necessary condition is the existence of the inverse for a given s(n). Usually, however, in physical modelling applications, requirements are less strict: it is generally only expected that audio signals match in perceptual or qualitative terms, rather than on a sample-by-sample basis. This means that, although the target signal r(n) may not be obtainable from the model for any θ, a sufficiently close estimate r̂(n) may be. Evaluating the distance between the two signals in psychoacoustic terms is a rather complex task and is out of the scope of this work.

Artificial neural networks, and more specifically the recently introduced deep neural networks, are well established for solving a number of inverse modelling problems. Here, we propose the application of a convolutional neural network that, provided with an audio signal of maximum length L in a suitable time-frequency representation, can estimate model parameters θ̂ which, fed to the physical model, yield an audio signal ŝ(n) close to s(n).

To achieve this, the inverse of the model must be learned employing deep machine learning techniques. If a supervised training approach is employed, the network must be fed with audio sequences and the related model parameters, also called targets. The production of a dataset D of such tuples, D = {(θᵢ, sᵢ(n)), i = 1, ..., M}, allows the network to learn the mapping that connects them. The production of the dataset is often a lengthy task and may require human effort. However, in this application, the model is known and, once implemented, it can be employed to automatically generate a dataset D with which to train the neural network.

The neural network architecture proposed here allows for end-to-end learning and is based on convolutional layers. Convolutional neural networks have received attention from a large number of research communities and have found use in commercial applications, especially in the fields of image processing, classification, etc. They are also used with audio signals, where, usually, the signal is provided to the CNN in a suitable time-frequency representation, obtained by means of a Short-Time Fourier Transform (STFT) with appropriate properties of time-frequency localization. The architecture of a CNN is composed of several layers of the following form:

    Z⁽ᵐ⁾ = h(σ(Q⁽ᵐ⁾)),    (1)
    Q⁽ᵐ⁾ = W⁽ᵐ⁾ ∗ Z⁽ᵐ⁻¹⁾,    (2)

and Z⁽⁰⁾ ≡ X, where M denotes the total number of layers, W⁽ᵐ⁾, m = 1, ..., M are the filter weights to be learned, σ(·) is a non-linear sigmoid activation function, Z⁽ᵐ⁻¹⁾ is the output of layer m − 1, called a feature map, Q⁽ᵐ⁾ is the result of the convolution on the previous feature map, and h(·) is a pooling function that reduces the feature map dimensionality. After M convolutional layers, one or more fully connected layers are added. The final layer has size p and outputs an estimate θ̂ of the model parameters.

Learning is conducted according to an update rule, which is based on the evaluation of a loss function, such as

    ℓ(W, e) = ‖θ − θ̂(e)‖²    (3)

where e is the training epoch. Training is iterated until a convergence criterion is met or a maximum number of epochs has passed. To avoid overfitting and reduce training times, early stopping by validation can be performed, which consists in evaluating, after a fixed number of epochs, the loss (called the validation loss) calculated against a validation set, i.e. a part of the dataset that is not used for training and is, hence, new to the network. Even if the training loss is still improving, if the validation loss does not improve after some training epochs, training may be stopped, avoiding overfitting the network.

Finally, once the network is trained, it can be fed with novel audio sequences to estimate the physical model parameters that obtain a result close to the original. In the present work we employ additional audio sequences generated by the model in order to measure the distance in the parameter space between the parameters θᵢ and θ̂ᵢ. If non-labelled audio sequences are employed (e.g. sequences sampled from the real world), it is not straightforward to validate the generated result; this is why, in the present work, no attempt has been made to evaluate the estimation performance of the network with real-world signals.

3. USE CASE

The method described in the previous section has been applied to a specific use case of interest, i.e. a patented digital pipe organ physical model. A pipe organ is a rather complex system [14, 15], providing a challenging scenario for physical modelling itself. This specific model, already employed on a family of commercial digital pipe organs, exposes 58 macro-parameters to be estimated for each key, some of which are intertwined in a non-linear fashion and are acoustically non-orthogonal (i.e. they jointly affect some acoustic features of the resulting tone).

We introduce here some key terms for later use. A pipe organ has one or more keyboards (manuals or divisions), each of which can play several stops, i.e. sets of pipes, typically one or more per key, which can be activated or deactivated at once by means of a drawstop. When a stop is activated, air is ready to be conveyed to the embouchure of each pipe, and when a key is pressed, a valve is opened to allow air to flow into the pipe. Each stop has a different timbre and pitch (generally the pitch of the central C is expressed in feet, measuring the pipe length). From our standpoint, the concept of a stop is very important, since each key in a stop will sound similar to the neighboring ones in terms of timbre and envelope, and each key will trigger different stops which may have similar pitch but different timbre. In a pipe organ it can be expected that pipes in a stop have consistent construction features (e.g. materials, geometry, etc.), and a physical model that mimics that pipe stop may have correlated features along the keys; but this is not
an assumption that can be made, so it is necessary to conduct an estimate of the parameters for each key in a stop.

Figure 1: Overview of the proposed system including the neural network for parameter estimation and the physical model for sound synthesis.

The physical model employed in this work is meant to emulate the sound of flue pipes and is described in detail in the related patent. To summarize, it is constituted by three main parts:

1. exciter: models the wind-jet oscillation that is created in the embouchure and gives rise to an air pressure fluctuation,

2. resonator: a digital waveguide structure that simulates the bore,

3. noise model: a stochastic component that simulates the air noise modulated by the wind jet.

The parameters involved in the sound design are widely different in range and meaning, and are e.g. digital filter coefficients, non-linear function coefficients, gains, etc. The diverse nature of the parameters requires a normalization step, which is conducted on the whole training set and maps each parameter into the range [-1, 1], in order for the CNN to learn all parameters employing the same arithmetic. A de-normalization step is thus required to remap the parameters to their original range.

Figure 1 provides an overview of the overall system for parameter estimation and sound generation, including the proposed neural network and the physical model.

4. IMPLEMENTATION

The CNN and the machine learning framework have been implemented as a Python application employing Keras¹ libraries and Theano² as a backend, running on an Intel i7 Linux machine equipped with 2 x GTX 970 graphics processing units. The physical model is implemented as both an optimized DSP algorithm and a PC application. The application has been employed in the course of this work and has been modified to allow producing batches of audio sequences and labels for each key in a stop (e.g. to produce the dataset, given some specifications). Each audio sequence contains a few seconds of a tone of specific pitch with given parameters.

A dataset of 30 Principal pipe organ stops, each composed of 74 audio files, one per note, has been created, taking the parameters from a database of pre-existing stops hand-crafted by expert musicians to mimic different historic styles and organ builders. The dataset has been split by 90% and 10% for the training and validation sets respectively, for a total of 1998 samples for the former and 222 for the latter. The only pre-processing conducted is the normalization of the parameters and the extraction of the STFT. A trade-off has been selected in terms of resolution and hop size to allow good tracking of harmonic peaks and attack transients. Figure 2 shows the input STFT for an A4 tone used for training the network.

Figure 2: The STFT for an organ sound in the training set. The tone is an A4.

The CNN architecture is composed of up to 6 convolutional layers, with optional batch normalization [16] and max pooling, up to 3 fully connected layers and a final activation layer. For the training, stochastic gradient descent (SGD), Adam and Adamax optimizers have been tested. The training made use of early stopping on the validation set. A coarse random search has been performed first to pinpoint some of the best performing hyperparameters. A finer random search has later been conducted, keeping the best hyperparameters from the previous search constant. Tests have been conducted with other stops not belonging to the training set and averaged over all the keys in the stops.

¹ http://keras.io
² http://deeplearning.net/software/theano/

5. RESULTS

The loss used in training, validation and testing is the Mean Square Error (MSE) calculated at the output of the network with respect
to the target, before de-normalization. Results are, therefore, evaluated in terms of MSE.

Table 2 reports the 15 best hyperparameter combinations in the fine random search. The following activation function combinations are reported in Table 2:

1. A: employs tanh for all layers,

2. B: employs tanh for all layers besides the last one, which uses a Rectified Linear Unit (ReLU) [17],

3. C: employs ReLU functions for all layers,

4. all other combinations of the two aforementioned activation functions obtained higher MSE scores and are not included here.

Results are provided against a test set of 222 samples from three organ stops, and all combinations share the same learning rate (1E−5), momentum max (0.9), pool sizes (2x2 for each convolutional layer), receptive field sizes (3x3 for each convolutional layer) and optimizer (Adamax [18]). These fixed parameters have been selected as the best ones after the coarse random search.

Table 2: The best 15 results of the fine random hyperparameter search. Activation classes are described in the text. The MSE is evaluated before denormalization; thus, all parameters have the same weight. Please note: all the layers are convolutional with kernel size as indicated, except the second to last, which is a fully connected layer. The output layer has fixed size equal to the number of model parameters.

    Activations  Minibatch  Internal layer sizes        MSE
    class        size
    A            40         (16, 16, 512, 58)           0.261
    B            50         (4, 6, 8, 10, 512, 58)      0.203
    A            40         (16, 16, 32, 32, 512, 58)   0.164
    A            40         (16, 16, 32, 32, 512, 58)   0.139
    A            25         (16, 16, 32, 32, 1024, 58)  0.161
    A            40         (16, 16, 32, 32, 1024, 58)  0.266
    A            25         (16, 16, 32, 32, 512, 58)   0.156
    A            40         (16, 16, 32, 32, 1024, 58)  0.252
    A            50         (4, 6, 8, 10, 512, 58)      0.166
    A            40         (16, 16, 32, 32, 256, 58)   0.179
    A            25         (2, 2, 4, 4, 128, 58)       0.254
    C            740        (16, 16, 32, 32, 512, 58)   0.252
    B            50         (4, 6, 8, 10, 512, 58)      0.214
    C            740        (16, 16, 32, 32, 512, 58)   0.257
    B            40         (16, 16, 512, 58)           0.179

Figure 4 shows the training and validation loss plots for the first 200 training epochs for the first combination in Table 2. The loss is based on the MSE for all parameters before denormalization. This means all parameters contribute to the MSE with the same weight, which makes results clearer to evaluate. Indeed, if the MSE were evaluated after de-normalization, parameters with larger excursion ranges would have a larger effect on the loss (e.g. a delay line length versus a digital filter pole coefficient). Validation early stopping is performed when the minimum validation loss is achieved, to prevent overfitting and reduce training times. In Figure 4, e.g., the validation loss minimum (0.027) occurs at epoch 122, while the training loss minimum (0.001) occurs at epoch 198. Two spectra and their waveforms are shown in Figure 3, showing the similarity of the original tone and the estimated one, both obtained from the flue pipe model.

Results provided in terms of MSE, unfortunately, are not acoustically motivated: not all errors have the same effect, since parameters affect different perceptual features; thus, large errors on some parameters may not be as easily perceived as small errors on other parameters. To the best of the authors' knowledge there is no agreed method in the literature to objectively evaluate the degree of similarity of two musical instrument spectra. Previous works suggested the use of subjective listening tests [19, 20, 21, 22], but an objective way to measure this distance is still to be addressed.

In order to provide the reader with some cues on how to evaluate these results, we draw from the psychoacoustic literature, as an example, the work of Caclin et al. [23], where spectral irregularity is proposed as a salient feature in sound similarity rating. Spectral irregularity is modelled, in their work, as the attenuation of even harmonics in dB (EHA). The perceptual extremes are chosen to be a tone with a regular spectral decay and a tone with all even harmonics attenuated by 8 dB. The mean squared error calculated for these two tones (HMSE) for the first 20 harmonics (as done in their work) is 32 dB. In our experiments, results vary greatly, depending on the pipe sounds to be modelled by the CNN. As an example, Figure 3 shows time and frequency plots of two experiments. They both present two A4 signals created by the physical model with two different parameter configurations hand-crafted by an expert musician, called respectively "Stentor" and "HW-DE". The peak amplitudes of the harmonics for the tones in Figure 3 are evaluated in terms of HMSE for the first 10, 20 and 35 harmonics in Table 1³.

Table 1: Harmonics MSE (HMSE) for the first 10, 20 and all 35 harmonics for the tones shown in Figure 3.

              HMSE10    HMSE20     HMSE35
    Stentor   5.2 dB    9.4 dB     18.9 dB
    HW-DE     0.3 dB    10.1 dB    12.3 dB

The first tone, shown in Figure 3(a), has a spectrum with a good match for the first harmonics, but with some outliers and a generally bad match for harmonics higher than 12. The latter, shown in Figure 3(b), has a good match, especially in its first 10 partials, but the error rises with higher partials, especially from the 12th up. This is reflected by an HMSE10 of 5.2 dB vs. 0.3 dB and an error on the whole spectrum (HMSE35) of 18.9 dB vs. 12.3 dB. HMSE20 values for the two tones do not differ significantly, due to the averaging done on spectral ranges with different results, but we leave them to the reader so that they can be compared to the experiments of Caclin et al. The HMSE20 values are somewhere between the two extremes, "same" and "different", tending more towards the former. Informal listening tests conducted with expert musicians suggest that the estimated "Stentor" tone does not match the original well, while the "HW-DE" tone matches sufficiently well. We hypothesize that the spectral matching of the first harmonics is more relevant in psychoacoustic terms to assess similarity, but we leave this to more systematic studies as future work. The tones are made available to the reader online⁴.

³ The sampling frequency of the tones is 31250 Hz; thus, 35 is the highest harmonic for an A4 tone.
⁴ http://a3lab.dii.univpm.it/research/10-projects/84-ml-phymod
Figure 3: Spectra and harmonic content for two A4 tones from (a) the Principal stop named "Stentor", and (b) the Principal stop named "HW-DE". The gray lines and crosses show the spectrum and the harmonic peaks of ŝ(n), while the black line and dots show the spectrum and the harmonic peaks of s(n). In the waveform plots, the upper ones are obtained with the target parameters, while the lower ones are obtained with the estimated parameters.
Figure 4: Training (solid line) and validation loss (dashed line) for the best combination reported in Table 2. Please note that the validation loss minimum (0.027) occurs at epoch 122, while the training loss minimum (0.001) occurs at epoch 198.

6. CONCLUSIONS

In this paper a machine learning paradigm that is general and flexible is proposed and applied to the problem of estimating the parameters of a physical model for sound synthesis. To validate the idea, a specific use case of a flue pipe physical model has been employed. Results in terms of MSE are good and the tone spectra match well in terms of harmonic content, although results vary. Such results, coming from a real-world application scenario, motivate the authors in believing that a machine learning paradigm can be employed with success for the problem at hand. Nonetheless, this first achievement calls for more research work. First of all, a validation is required with data sampled from a real pipe organ for further assessment and to evaluate the robustness of this method to noise, reverberation and the like.

During the evaluation of the results, it has been discovered that results may greatly vary depending on the stop's acoustic character. A rigorous approach to machine learning requires understanding whether the training set, which is a sampling of the probability distribution of all flue pipe stops obtainable from the model, is representative of that probability distribution. Furthermore, it can be expected that the organ stops that can be obtained from the model are a subset of all organ stops that could be physically built, due to model limitations and simplifying hypotheses. On the other hand, due to its digital implementation, the model could circumvent some physical limitations of real flue pipes, thus yielding stops that are not physically feasible. This calls for a better understanding of how different stops are related to each other in the modelled and the physical realms, to understand, before trying the machine learning approach, whether satisfying results can be obtained. As a last remark, these are general issues that also apply to other physical models and musical instruments.
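The behaviour reported for Figure 4, where the validation loss bottoms out (0.027 at epoch 122) well before the training loss (0.001 at epoch 198), is the usual argument for keeping the weights saved at the validation minimum. A minimal, generic sketch of that selection (not the authors' training code):

```python
import numpy as np

def best_epoch(val_losses):
    """Return (epoch, loss) of the validation minimum; the model
    weights saved at this epoch would be the ones kept."""
    val = np.asarray(val_losses, dtype=float)
    i = int(np.argmin(val))
    return i, float(val[i])

# Toy curve: validation loss falls, then rises again (overfitting).
val = [0.30, 0.10, 0.05, 0.027, 0.04, 0.06]
epoch, loss = best_epoch(val)
```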
Florian Pfeifle
Systematic Musicology,
University of Hamburg
Hamburg, DE
Florian.Pfeifle@uni-hamburg.de
4. MEASUREMENTS
Figure 3: Measurement of one Rhodes tine. The upper row shows the deflection of the tine, the lower the measured direct-out signal behind the pick-up.

Figure 4: Phase plot of the two polarisations of the tine deflection of one tine.

4.3. Wurlitzer measurements

Figure 5 shows the first 12 milliseconds of the Wurlitzer's reed motion and the measured sound at the output.

5. PHYSICAL MODEL

5.1. Overview

The physical models presented in this section are based on the measured properties presented in section 4, qualitative observations of the FEM models published in [18] and some assumptions regarding material properties of the Wurlitzer's reed and the hammer tip of both instruments. The model of the Rhodes e-piano includes a formulation for the hammer impact, a model of a beam vibrating in two polarisations subject to large-deflection effects and a pre-processed approximation of the magnetic field distribution over the tip of the pick-up's magnet core. The model of the Wurlitzer EP200 shares conceptual similarities with the Rhodes model but is adapted to the different geometry of the sound production. The reed of the Wurlitzer is modeled as a non-uniform cantilever beam with large-deflection effects and a spatial transfer function describing the change in capacitance resulting from the motion of the reed.

with Σ_{x_L} indicating a weighted sum over the contact area. This model is based on a model for hammer impacts developed by Hunt & Crossley [33], which has been shown to yield good results for models of hammer impacts with moderate impact velocities and plain geometries [14, 29]. Here, α is the nonlinearity exponent depending on the geometry of the contact area and λ is a material-dependent damping term that dissipates energy in dependence on the velocity of the changing hammer-tip compression, written as ẋ_t.
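The Hunt & Crossley contact law referenced above is commonly written F = k·x^α + λ·x^α·ẋ for compression x > 0; the exact formulation used in the paper is in an equation lost to extraction, so the following is a generic sketch with illustrative constants:

```python
def hunt_crossley_force(x, x_dot, k=1e8, alpha=2.3, lam=0.5):
    """Generic Hunt & Crossley contact force:
    F = k * x^alpha + lam * x^alpha * x_dot  for compression x > 0,
    F = 0 when hammer and beam are not in contact (x <= 0).
    alpha: nonlinearity exponent (contact geometry),
    lam:   velocity-dependent damping coefficient (illustrative)."""
    if x <= 0.0:
        return 0.0
    return k * x ** alpha + lam * x ** alpha * x_dot

# No force without compression:
f_free = hunt_crossley_force(-1e-4, 0.1)
```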
exhibiting large and non-planar deflections, having non-homogeneous mass, and not considering the angle of rotation can be written as

ρ·u_tt + [EI·u_xx]_xx − (1/2)·EA·u_xx·K(u) − κ·u_2x2t − F(u_V[x], t) = 0    (2)

Here, u is the deflection in two transverse polarisations (H, V), E is the Young's modulus, I is the radius of gyration and A is the cross-sectional area. F(u_V[x], t) is the forcing function of the hammer model, impacting the bar in the vertical direction indicated by u_V. K(u) is a nonlinear Kirchhoff-like term given as

K(u) = ∫₀ˡ (u_x^H)² + (u_x^V)² dx    (3)

… integration of the field over the pickup. Due to the specific pickup geometry of the Rhodes, a different approach is taken here to calculate the induction field strength above the tip of the magnet. As depicted in Figure 6, our derivation makes use of several simplifying assumptions facilitating the computation of the magnetic field distribution over the magnet's tip, which are:

1. The tine vibrates in a sinusoidal motion in both planes in front of the pickup.
2. The tip of the tine vibrates on the trajectory of an ideal circle with its center at its fixation point.

… distribution for position (y′, z′) can be computed accordingly. Integrating this formula for all points on a trajectory given by the position of the Rhodes' tine tip

z′ = r − √(r² − (x′)²),  x′ = x̂·sin(2π·f_tine·t)    (8)

¹ This assumption proposes an equivalence between the effective causes of electric fields and magnetic fields and can be used as a mathematical modeling tool, see: [31, pp. 174 ff].
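Equation (8) can be evaluated directly. A sketch with assumed values for the circle radius r, excursion amplitude x̂ and fundamental f_tine (the paper's actual parameter values are not given here):

```python
import numpy as np

def tine_tip_trajectory(t, r=0.05, x_hat=0.002, f_tine=440.0):
    """Tip position of the tine per eq. (8): the tip moves on a
    circle of radius r about its fixation point, with sinusoidal
    horizontal excursion x' and resulting vertical offset z'."""
    x = x_hat * np.sin(2.0 * np.pi * f_tine * t)
    z = r - np.sqrt(r ** 2 - x ** 2)
    return x, z

t = np.linspace(0.0, 1.0 / 440.0, 64)   # one period of an A4 tine
x, z = tine_tip_trajectory(t)
```

Since √(r² − x′²) ≤ r, the vertical offset z′ is always non-negative: the tip never rises above its rest position on the circle.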
with f_tine the fundamental frequency of the tine, leads to a magnetic potential function characterising the magnitude of relative magnetic field change.

An idealised form of the magnetic field in front of one plane of the Rhodes pickup is depicted in Figures 7a and 7b; it is comparable to the measurement results published in [16].

Figure 7: An idealised schematic depiction of the pickup system of the Rhodes E-piano. The sinusoidal motion of the vibrating tine induces a … a) A low amplitude input of a sinusoidal vibration of the magnetic flux weighted by the magnet field's distribution. By differentiating the magnetic flux with respect to time, the alternating voltage present at the output is calculated. b) A similar model setup as before, consisting of a slightly displaced mid-point for the input motion, resulting in a different weighting function of the magnetic field. The output shows a different form than before. This condition is close to a realistic playing condition found in Rhodes E-pianos.

5.5. Wurlitzer reed model

The reed of the Wurlitzer is modeled using a similar PDE as for the Rhodes' tine, using only one direction of motion u; the coupling term between the two polarisations is omitted. The use of a beam model instead of a plate model is justifiable here because typical effects found in plates were not measured using the high-speed camera setup and thus are either not present or small compared to the transversal deflection of the fundamental mode. In addition to that, the measurements show that the influence of higher modes is comparably small or non-existent even under extreme playing conditions; thus the primary mode of vibration can be approximated by the reed's first natural frequency, which coincides with that of a cantilever beam of similar dimensions.

ρ·u_tt + [EI·u_xx]_xx − κ·u_2x2t − f(x, t) = 0

with the same variables as introduced before. Again, Equation 9 does not explicitly depend on the shear angle α (see [28]), thus it is not regarded here any further. Omitting the shear angle, the boundary conditions for the fixed/free beam are

u|₀ = 0,  k′GA·u_x|_L = 0 .    (9)

5.6. Wurlitzer pickup model

The influence of the Wurlitzer's pick-up system can be characterised as a change in capacitance of a time-varying capacitor that induces an alternating voltage, which is amplified as the instrument's sound. Using a basic definition of time-varying capacitance that induces a current i we get

i(t) = C(t)·∂u(t)/∂t + u(t)·∂C(t)/∂t    (10)

with u the voltage and C the capacitance, both depending on time t. For the derivation of the influence function of the capacitor we take two simplifying assumptions.

i(t) = u₀·∂C(t)/∂t    (11)

This alternating current induces an alternating voltage over a resistor that drives an amplification circuit.

To calculate the capacitance curve due to the deflection of the Wurlitzer's reed, a number of planes i through the geometry are taken and the electric field strength is computed for each resulting slice, simplifying the 3-dimensional problem to a 2-dimensional one. The capacitance for each slice can be computed from the electric field by

C_i = Q_i / U_i    (12)

with Q_i = ε·∮_A E·dA the charge, defined as the integral of the electric field over the surfaces of the geometries using Gauss's theorem, and ε an electric field constant for the material and free field properties. Three exemplary positions for the computation of the capacitance are depicted in Figure 8.

Figure 8: Distribution of the electric field for one exemplary reed deflection. On the left-hand side, one slice of the geometry; on the right-hand side, the results from an FEM model.

6. HARDWARE MODEL

6.1. Overview

The finite difference implementations of the physical models described in section 5 are implemented on an FPGA development board consisting of a XILINX Virtex-7 690T. This section gives a short overview on the methodology and the implementation of the real-time hardware models. A more detailed account is published in [21] or [34].
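Section 6 implements these PDEs as explicit finite difference schemes on the FPGA. As a generic illustration (not the author's scheme; see Bilbao [29] for the method), a leapfrog update for the ideal-bar part of the beam equation, u_tt = −κ²·u_xxxx, with fixed ends, looks like this:

```python
import numpy as np

def bar_step(u, u_prev, mu):
    """One explicit finite-difference step for the ideal bar
    u_tt = -kappa^2 * u_xxxx, with mu = kappa * dt / dx^2
    (the scheme is stable for mu <= 0.5)."""
    un = np.zeros_like(u)
    # fourth spatial difference on interior points; boundary
    # values are held at zero (fixed ends)
    un[2:-2] = (2 * u[2:-2] - u_prev[2:-2]
                - mu ** 2 * (u[4:] - 4 * u[3:-1] + 6 * u[2:-2]
                             - 4 * u[1:-3] + u[:-4]))
    return un

N = 64
u, u_prev = np.zeros(N), np.zeros(N)
u[N // 2] = 1e-3                       # initial displacement
for _ in range(100):
    u, u_prev = bar_step(u, u_prev, mu=0.4), u
```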
7. SIMULATION RESULTS
[8] Harold B. Rhodes and Steven J. Woodyard, "Tuning Fork Mounting Assembly In Electromechanical Pianos", U.S. Patent 4,373,418, 1983.
[9] Greg Shear and Matthew Wright, "The electromagnetically sustained Rhodes piano", in Proc. NIME, Oslo, 2011.
[10] Torsten Wendland, "Klang und Akustik des Fender Rhodes E-Pianos", Technische Universität Berlin, Berlin, 2009.
[11] Rhodes Keyboard Instruments, "Service Manual", CBS Musical Instruments, a Division of CBS Inc., Fullerton, CA, 1979.
[12] Benjamin F. Miessner, "Method and apparatus for the production of music", U.S. Patent 1,929,027, 1931.
[13] C. W. Andersen, "Electronic Piano", U.S. Patent 2,974,555, 1955.
[14] Federico Avanzini and Davide Rocchesso, "Physical modeling of impacts: theory and experiments on contact time and spectral centroid", in Proceedings of the Conference on Sound and Music Computing, 2004, pp. 287–293.
[15] Tatsuya Furukawa, Hideaki Tanaka, Hideaki Itoh, Hisao Fukumoto, and Masashi Ohchi, "Dynamic electromagnetic analysis of guitar pickup aided by COMSOL Multiphysics", in Proceedings of the COMSOL Conference Tokyo 2012, Tokyo, 2012.
[16] Nicholas G. Horton and Thomas R. Moore, "Modeling the magnetic pickup of an electric guitar", American Journal of Physics, vol. 77, no. 2, p. 144, 2009.
[17] Malte Muenster and Florian Pfeifle, "Non-linear behaviour in sound production of the Rhodes piano", in Proceedings of the International Symposium on Musical Acoustics (ISMA) 2014, Le Mans, France, 2014, pp. 247–252.
[18] Florian Pfeifle and Malte Muenster, "Tone production of the Wurlitzer and Rhodes e-pianos", in Albrecht Schneider, editor, Studies in Musical Acoustics and Psychoacoustics, volume 5 of Current Research in Systematic Musicology, pp. 75–107, Springer International Publishing, 2017.
[19] Malte Muenster, Florian Pfeifle, Till Weinrich, and Martin Keil, "Nonlinearities and self-organization in the sound production of the Rhodes piano", The Journal of the Acoustical Society of America, vol. 136, no. 4, pp. 2164–2164, 2014.
[25] Wurlitzer Company, The Electric Pianos Series 200 and 200A Service Manual. Available at: http://manuals.fdiskc.com/flat/Wurlitzer%20Series%20200%20Service%20Manual.pdf (accessed 17.02.2016).
[26] R. W. Traill-Nash and A. R. Collar, "The effects of shear flexibility and rotatory inertia on the bending vibrations of beams", The Quarterly Journal of Mechanics and Applied Mathematics, vol. 6, no. 2, pp. 186–222, 1953.
[27] Antoine Falaize and Thomas Hélie, "Passive simulation of the nonlinear port-Hamiltonian modeling of a Rhodes piano", Journal of Sound and Vibration, vol. 390, pp. 289–309, March 2017.
[28] Seon M. Han, Haym Benaroya, and Timothy Wei, "Dynamics of transversely vibrating beams using four engineering theories", Journal of Sound and Vibration, vol. 225, no. 5, pp. 935–988, September 1999.
[29] Stefan D. Bilbao, Numerical Sound Synthesis: Finite Difference Schemes and Simulation in Musical Acoustics, Wiley, Chichester, 2009.
[30] Charles Jordan, Calculus of Finite Differences, Chelsea Publishing Co., New York, 1st edition, 1950.
[31] John D. Jackson, Classical Electrodynamics, John Wiley & Sons, Inc., 3rd edition, 1998.
[32] John C. Strikwerda, Finite Difference Schemes and Partial Differential Equations, Society for Industrial and Applied Mathematics, Philadelphia, USA, 2nd edition, 2004.
[33] K. H. Hunt and F. R. E. Crossley, "Coefficient of restitution interpreted as damping in vibroimpact", Journal of Applied Mechanics, vol. 42, no. 2, pp. 440–445, 1975.
[34] F. Pfeifle, Physical Model Real-time Auralisation of Musical Instruments: Analysis and Synthesis, Ph.D. dissertation, University of Hamburg, 2014.
The first section of this article briefly describes the design of the
prototype digital sound synthesis engine that physically models
the workings of the acoustic wind machine; the second section
describes in detail the digital modelling of the action (a simple
rotational gesture) and its mapping and complex effect on the
sound engine. In the third and final section of this article, it is
proposed that this digital action model can be applied to the
reproduction of other historical theatre sound effects, and more
broadly to the design and control of real-time physical modelling …

Figure 1: The acoustic wind machine example.
To facilitate real-time performance with a digital wind sound while providing the same multisensory feedback, the acoustic wind machine itself was fitted with a rotary encoder mechanically coupled to its moving cylinder with 3D printed gearing. The movement of the encoder was read with an Arduino prototyping board connected to the computer via USB, sending a single stream of rotation data scaled to a range of 0°–360°. Max/MSP¹ was chosen as the platform for a design-led approach to programming and the production of a functional prototype of a digital wind sound using the Sound Designer's Toolkit (SDT)² suite of physics-based sound synthesis algorithms. The SDT algorithms are computationally efficient and designed for advanced control mapping, a particularly relevant feature for this project. While Max/MSP is an ideal platform for prototyping this effect, it is envisaged that the overall model could in the future be ported to another platform with similar real-time audio processing and physical modelling functions.

To model the digital wind sound, the acoustic wind machine's stages of sound production were first deconstructed and described as an entity-action model [6]. The acoustic wind machine creates sound through friction, but instead of just two surfaces (one cylinder, one cloth) in contact, there are twelve slat 'rubbers' at work. As the cylinder of the acoustic wind machine is rotated, each of its twelve slats comes into contact with the encompassing cloth and rubs it to produce sound through friction. Each slat falls silent for part of its rotation as it moves out of range of the cloth, and only seven of the slats contact the cloth and produce sound at any one time (Figure 2). This mechanical process, which gives an individual envelope to the friction sound produced by each slat of the cylinder, informed the sound programming. Twelve instances of a physical model of nonlinear friction, the most recent iteration of the SDT's dynamic friction model [11], were created, giving one voice for each slat on the acoustic wind machine example. The physical model, which simulates the friction between a sliding probe and an inertial object with a modal resonator, used the following SDT algorithms:

• sdt.scraping~: A scraping texture generator, which outputs a force to be applied to a resonator or another solid. Affords real-time control over the grain, velocity and force of the scrape.
• sdt.friction~: An elasto-plastic friction model, which takes its rubbing force input from sdt.scraping~.
• sdt.inertial~: Inertial object model.
• sdt.modal~: Modal resonator.

The parameters to these objects were chosen through a process of comparison and evaluation of the sonic outputs of the acoustic wind machine and the digital synthesis engine [as reported in 12, 10] with the aim of creating a digital sound as close as possible to the acoustic sound (Table 1).

Table 1: Parameters to the Sound Designer's Toolkit (SDT) objects in Max/MSP for each 'digital slat' [12, 10].

Parameters to sdt.scraping~:
  Surface profile (a signal): Noise
  Grain (density of micro-impacts): 0.080596, modulated by encoder data
  Velocity (m/s): Real-time optical encoder data
  Force (N): 0.546537, modulated by encoder data

Parameters to sdt.friction~:
  External rubbing force: Signal from sdt.scraping~
  Bristle stiffness (evolution of mode lock-in): 500.
  Bristle dissipation (sound bandwidth): 40.
  Viscosity (speed of timbre evolution and pitch): 1.2037
  Amount of sliding noise (perceived surface roughness): 0.605833
  Dynamic friction coefficient (high values reduce sound bandwidth): 0.159724
  Static friction coefficient (smoothness of sound attack): 0.5 (for hemp cloth and wood)
  Breakaway coefficient (transients of elasto-plastic state): 0.174997
  Stribeck velocity (smoothness of sound attacks): 0.103427
  Pickup index of object 1 (contact point): 0
  Pickup index of object 2 (contact point): 0

Parameters to sdt.inertial~:
  Mass of inertial object (kg): 0.01
  Fragment size (to simulate crumpling): 1

Parameters to sdt.modal~:
  Frequency factor: 1
  Frequency of each of three modes (Hz): 380, 836, 1710
  Decay factor: 0.005
  Decay of each mode (s): 0.8, 0.45, 0.09
  Pickup factor: 2.2
  Pickup0_1: 50.
  Pickup0_2: 100.
  Pickup0_3: 80.
  Fragment size: 1

¹ http://www.cycling74.com/
² For Max/MSP and Pure Data: http://soundobject.org/SDT/
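The parsing of the single 0°–360° encoder stream into one gain per slat can be sketched as follows; the even slat spacing and the 210° contact arc are illustrative choices, with the arc picked so that exactly seven of the twelve slats sound at once, as observed on the acoustic machine:

```python
def slat_gains(angle_deg, n_slats=12, contact_arc=210.0):
    """Map one 0-360 degree rotation reading to a gain per
    'digital slat'. Each slat is offset by 360/n_slats degrees
    and is audible (gain 1.0) only while its own angle lies
    inside the cloth's contact arc."""
    gains = []
    for k in range(n_slats):
        slat_angle = (angle_deg - k * 360.0 / n_slats) % 360.0
        gains.append(1.0 if slat_angle < contact_arc else 0.0)
    return gains

# With a 210 degree contact arc, 7 of the 12 slats sound
# at every cylinder position.
gains = slat_gains(0.0)
```

In the real patch each gain would additionally carry the amplitude envelope of its slat rather than a hard on/off value; the on/off version just shows the stream-splitting idea.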
The acoustic wind machine's cloth is its main resonator, and the sdt.modal~ object in each digital slat adds modal resonance to the nonlinear friction model. To account for the cloth's role in dispersing the sound, a bidirectional digital waveguide model [13] was adapted. The cloth is pulled tight on one side of the acoustic wind machine, and hangs freely on the other (Figure 3). The tight side of the cloth, which is coupled to a wooden pole 'bridge' pinned to the a-frame, is similar to a bowed string, with the slatted cylinder 'bowing' the cloth during rotation. A digital waveguide in series with a low-pass filter and an all-pass filter was used to simulate dispersion of the friction sound through the tight side of the cloth, allowing for some damping due to its coupling to the acoustic wind machine's 'bridge.' Dispersion through the freely hanging side of the cloth was modeled using the most basic method, without damping, of a delay line in series with an all-pass filter [14].

The rotational gesture afforded by the acoustic wind machine comprises two distinct stages due to the way the cloth has been fixed to one side of the frame, and the influence of gravity. This creates a rotational gesture that at first requires more effort to overcome oppositional force from gravity and the cloth's tension on the upstroke, and then requires less effort with the loose cloth and assistance from gravity on the downstroke (Figure 3). This produces an amplitude envelope that rises in the first half of the rotation, and then decreases in the second half [12]. This amplitude envelope is most pronounced at slow speeds, when the overall amplitude of the wind sound is lower. In addition, as the cylinder gains speed in rotation and overcomes inertia, it moves much more freely, reducing the difference in amplitude between the upstroke and downstroke of each rotation, while the overall amplitude of the acoustic wind machine is higher.
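The freely hanging side of the cloth, a plain delay line in series with a first-order all-pass filter and no damping, can be sketched as a direct-form loop; the delay length and all-pass coefficient here are illustrative, not fitted values:

```python
import numpy as np

def delay_allpass(x, delay=64, a=0.5):
    """Delay line in series with a first-order all-pass
    y[n] = a*v[n] + v[n-1] - a*y[n-1], where v is the delayed
    input. No damping term, as for the hanging cloth branch."""
    x = np.asarray(x, dtype=float)
    v = np.concatenate([np.zeros(delay), x])[: len(x)]  # pure delay
    y = np.zeros_like(v)
    v_prev = y_prev = 0.0
    for n in range(len(v)):
        y[n] = a * v[n] + v_prev - a * y_prev
        v_prev, y_prev = v[n], y[n]
    return y

impulse = np.zeros(256)
impulse[0] = 1.0
out = delay_allpass(impulse)   # silent for `delay` samples, then rings
```

The all-pass leaves the magnitude spectrum untouched and only smears phase, which is why this branch disperses the sound without damping it.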
… of metal shot, dried peas or other small materials rotated inside a sealed barrel [16].
2. Crash machines produced sound from multiple impacts and rolling, with pieces of masonry or large rocks rotated inside a sealed barrel [17]. Some variations of the design were based on a large ratchet mechanism, and produced sound through impacts between pieces of wood [18].
3. Thunder barrels produced sound from rolling metal cannonballs inside a metal-lined barrel or metal container [19].
4. The crackling of a fire was produced by slowly rotating a wooden ratchet or clapper [20].
5. Creaking was produced by rotating a clay pot inside a larger clay pot [18].
6. Designs to imitate machinery, vehicles or motors produced sounds through repetitive impacts between different materials, such as an aeroplane sound achieved by plucking catgut strings with a toothed wheel [21].
7. A simple gesture of rotation was also used to control the production of sound that had previously been achieved manually, as devices became more complex in the early twentieth century. A mechanism for the sound of horses' hooves [22] triggered impacts on wooden cups, which would usually have been held in each hand and hit on a wooden surface. The Allefex machine, which offered mechanisms for many kinds of sounds, produced a series of gunshots, thunder and a steam engine controlled with crank handles [23].

The gesture mapping model (using the SDT and data from a rotary encoder and Arduino configuration) could easily be adapted for use in the digital replication of many of these mechanical sound effects. For example, recreating a cylinder 'structure' from the rotation data using degree values, in a similar way to the digital wind machine described here, could trigger a series of sdt.crumpling~ (a model of a stochastic sequence of impacts) and sdt.friction~ objects to model the sliding and multiple impact sounds of an acoustic rain machine in rotation. As the cloth works against the rotational motion in the acoustic wind machine, so could the inertia of the material inside the rain machine's barrel be factored into the data stream through the proposed inertia model. Modelling these historical methods of sound production may reveal more useful strategies to increase the potential of simple rotational gestures in the control of physical modelling synthesis systems, as well as the design and DSP management of those systems for real-time performance.

3.2. A Mechanical Approach to Mapping: Extending Rotation

The mapping strategy implemented for the digital wind machine takes a mechanical approach, and recreates the stages of sound production of the acoustic wind machine outlined in the entity-action model. This enables the digital system to take the individuality of the acoustic wind machine into account through the use of the specific degree placement of its wooden slats. This design could be extended to model other specific examples of acoustic wind machines through the incorporation of the detail of their own structures into the digital domain.

Parsing a single stream of rotation data into twelve distinct streams extends the possibilities for the real-time control of synthesis with a simple gesture of rotation. This is particularly pertinent in the case of real-time performance with physical modelling synthesis, where action and resulting sound should be tightly coupled. Other simple single-gesture controllers could be investigated and extended in a similar way, perhaps to control physical models of other historical sound effects devices. For example, a digital fader could set off a chain of impact events, each triggered by a particular position on its travel.

More broadly, this approach opens a new design space for real-time performance with physical modelling synthesis. By building on historical techniques for making sound with acoustic materials and simple mechanisms, new interactions between virtual materials can be devised and explored while simultaneously affording a simple, but meaningful and rich control interface to the performer. This approach might also be applied in a meaningful way to other synthesis methods, enabling (as in the case of the crank handle example) the control of envelopes and triggers in real-time performance to bring obvious, or more 'mechanical,' affordances to potentially more abstract methods of sound creation. This is an area currently under investigation as part of this work.

4. CONCLUSIONS

This paper has proposed a strategy for real-time control over physical modelling synthesis informed by the design and operation of historical theatre sound effects. A description of the design of a digital wind machine synthesis engine based on the Sound Designer's Toolkit (SDT) in Max/MSP was detailed, along with the mechanical mapping model to add complexity to a single stream of rotational data for its performance in real-time. Further applications of this approach and suggestions for future work were also outlined.

5. ACKNOWLEDGMENTS

This research is supported by the Arts and Humanities Research Council (AHRC) through the White Rose College of the Arts and Humanities (WRoCAH) Doctoral Training Partnership.
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Figure 6: Diagram of signal flow and mapping for the digital wind machine in Max/MSP.
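The mapping shown in Figure 6 is implemented as a Max/MSP patch with SDT objects. Purely as an illustration of the idea, the sketch below (Python, with hypothetical names such as SLAT_ANGLES and smooth) turns a stream of rotation angles into impact-trigger events in the spirit of the rain-machine example, with a one-pole filter standing in for the proposed inertia model; it is not the authors' patch.

```python
# Illustrative sketch (not the authors' Max/MSP implementation): map a stream
# of rotation angles to impact-trigger events. SLAT_ANGLES models virtual
# 'slats' on a rain-machine cylinder; names here are hypothetical.

SLAT_ANGLES = [0, 90, 180, 270]   # trigger positions on the cylinder, in degrees

def smooth(angles, inertia=0.8):
    """One-pole low-pass as a crude stand-in for the proposed inertia model."""
    out, prev = [], angles[0]
    for a in angles:
        prev = inertia * prev + (1.0 - inertia) * a
        out.append(prev)
    return out

def triggers(angles):
    """Emit an event whenever the rotation crosses a slat angle going forward."""
    events = []
    for prev, cur in zip(angles, angles[1:]):
        for slat in SLAT_ANGLES:
            if prev < slat <= cur:          # crossed this slat on this step
                events.append((cur, slat))  # would fire e.g. an sdt.crumpling~ burst
    return events

stream = list(range(0, 360, 15))              # one slow revolution, 15-degree steps
evts = triggers(smooth(stream, inertia=0.0))  # no smoothing: crossings are exact
print(evts)
```

Raising the `inertia` coefficient delays and softens the angle stream, so trigger positions drift slightly behind the physical rotation, loosely analogous to the mass of material inside the barrel.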
sound synthesis: Library and documentation," Technical report, UNIVERONA (Verona), 2007.
[12] F. Keenan and S. Pauletto, "An Acoustic Wind Machine and its Digital Counterpart: Initial Audio Analysis and Comparison," presented at the Interactive Audio Systems Symposium (IASS), University of York, York, UK, 2016.
[13] M. Karjalainen, V. Välimäki, and T. Tolonen, "Plucked-string models: From the Karplus-Strong algorithm to digital waveguides and beyond," Computer Music Journal, vol. 22, pp. 17-32, 1998.
[14] J. O. Smith, Physical Audio Signal Processing: For Virtual Musical Instruments and Audio Effects. W3K Publishing, 2010.
[15] A. Hunt, M. M. Wanderley, and M. Paradis, "The importance of parameter mapping in electronic instrument design," Journal of New Music Research, vol. 32, pp. 429-440, 2003.
[16] D. Collison, The Sound of Theatre. London: Professional Lighting and Sound Association, 2008.
[17] F. Napier, Noises Off: A Handbook of Sound Effects. London: J. G. Miller, 1936.
[18] G. H. Leverton, The Production of Later Nineteenth Century American Drama: A Basis for Teaching. AMS Press, 1936.
[19] M. K. Culver, "A History of Theatre Sound Effects Devices to 1927," University of Illinois at Urbana-Champaign, 1981.
[20] A. E. Peterson, "Stage Effects and Noises Off," Theatre and Stage, vol. 1 and 2, 1934.
[21] A. Rose, Stage Effects: How to Make and Work Them. Chatham: Mackays Ltd, 1928.
[22] V. D. Browne, Secrets of Scene Painting and Stage Effects. Routledge, 1913.
[23] F. A. A. Talbot, Moving Pictures: How They Are Made and Worked. J. B. Lippincott Company, 1914.
Van den Doel et al. [1] generated friction sounds to remark on the fractal characteristics of asperities. However, the model is applicable only to simple and homogeneous material bodies. Ren et al. [2] synthesized friction sounds using the techniques of Raghuvanshi et al. [28] and a surface model of the object at three scales, following [1]. However, the model could not treat objects with nonlinear properties, such as the deformation of cloth. On the other hand, An et al. [3] synthesized friction sounds for cloth animation using a database of recorded sounds. The velocity of the cloth mesh is matched with entries in the database, the corresponding fragment of friction sound is returned, and the fragments are joined to synthesize the sound. Building such a database requires a large amount of time and labour to record the sounds of rubbing cloth at various speeds. Our proposed method partially solves these problems with a novel surface model using a physically well-grounded simulation, and enables synthesis of friction sounds for deformable objects.

3. BACKGROUND

3.1. Principle of Friction

Friction is a complicated phenomenon to solve analytically because it is an irreproducible, microscale phenomenon (an object surface deforms slightly under friction). As shown in Fig. 1, when an external force is applied to an object in contact with another object in order to move it, friction acts to impede the motion. The force caused by friction is called the friction force. The magnitude of the friction force depends on the objects in contact.

Figure 1: Forces during motion of two objects in contact. The friction force F_fric resists the relative motion of the two bodies.

The classical laws of friction derived from observation, known as the Amontons-Coulomb laws of friction, are as follows.

• Amontons' first law: the magnitude of the friction force is independent of the contact area.
• Amontons' second law: the magnitude of the friction force is proportional to the magnitude of the normal force.
• Coulomb's law of friction: the magnitude of kinetic friction is independent of the sliding speed.

A phenomenological discussion of the Amontons-Coulomb laws is incorporated in the theory of adhesion [6], which is deeply rooted in the field of tribology. Adhesion theory describes friction on the basis of the Amontons-Coulomb model, especially with a focus on the second property. The main principle of adhesion theory is that the contact between two objects occurs at points. The surface roughness of an object is on a sufficiently small scale. In particular, a convex portion of the object surface is termed an asperity. Friction is considered to be caused by contact between asperities of the two objects (called actual contact points). At each actual contact point, there is adhesion (bonding) of the two objects due to the interaction between molecules or atoms of the object surfaces. The sum of the areas of the actual contact points is called the actual contact area. The force required to overcome the adhesion at the actual contact points is the friction force, which is represented by the following equation:

F_fric = σ_s A_r    (1)

where F_fric is the friction force, σ_s is the shear strength, and A_r is the actual contact area. Assuming A_r is related to the load W as A_r = aW, with a constant of proportionality a, the friction coefficient µ is given by the following equation.

µ = F_fric / W = σ_s A_r / W = a σ_s    (2)

Therefore, the friction coefficient µ does not depend on the apparent area of contact (Amontons' first law of friction). An actual contact point is formed at the point surrounded by the green rectangle in Fig. 2, forming a "stick" state. When the two objects move against each other in the horizontal direction, the adhesion at the pair of asperities is cut by the elastic energy and the actual contact point collapses a few moments later, termed the "slip" state. The energy dissipated in this series of processes is equal to the work done by the friction force. The stick-slip phenomenon is the repetition of this process of adhesion and cutting of asperity pairs. The global motion of the object and the stick-slip phenomenon at actual contact points are not related. The actual contact points change with time as the object moves. Even while the object slides at a constant velocity, asperities on its surface are in the stick state, with pairs of asperities on the other surface forming the actual contact points. An asperity in the stick state is deformed by the motion of the object, and elastic energy accumulates around it. An actual contact point strained beyond the limit of elasticity slips under the elastic energy, and the neighbourhood of the asperity vibrates around the equilibrium state. The stick-slip motion is thus repeated locally within the system while the object moves at a constant velocity. Since the frequency of local slips per unit time is proportional to the sliding velocity, the energy dissipation per unit time is also proportional to the sliding velocity. Therefore, the friction force does not depend on the sliding velocity, as the energy dissipation per unit time equals the friction force times the sliding velocity. Such localized stick-slip motion explains the third Amontons-Coulomb law.

3.2. Sound Propagation

Sound propagates in air as a wave. In general, sound propagating through a space Ω can be expressed using the wave equation (3):

∇²φ(x, t) − (1/c²) ∂²φ(x, t)/∂t² = 0,   x ∈ Ω    (3)

where c is the speed of sound in air (at standard temperature and pressure it is empirically described as c = 331.5 + 0.6t [m/s], with temperature t [°C]). In the study of acoustics, the acoustic pressure p(x, t) and the particle velocity u(x, t) are used as variables of the wave equation as follows.

(∇² − (1/c²) ∂²/∂t²) p(x, t) = 0    (4)
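As a quick numerical reference, the empirical expression for the speed of sound quoted above, together with the matching empirical air-density expression quoted later in the text, can be evaluated directly. This is a plain restatement of those formulas, not part of the paper's method.

```python
# Empirical formulas quoted in the text for air at temperature t [deg C]:
# speed of sound  c  = 331.5 + 0.6 t            [m/s]
# air density     rho = 1.293 / (1 + 0.00367 t) [kg/m^3]

def speed_of_sound(t_celsius):
    return 331.5 + 0.6 * t_celsius

def air_density(t_celsius):
    return 1.293 / (1 + 0.00367 * t_celsius)

print(round(speed_of_sound(20.0), 1))  # 343.5
print(round(air_density(0.0), 3))      # 1.293
```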
Figure 2: Illustration of contact between two objects and the stick-slip phenomenon at an actual contact point. Due to the roughness of the
surface, objects are in contact with each other at points, instead of a large area.
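The stick-slip cycle illustrated in Fig. 2 can be reproduced qualitatively with a textbook toy model (a sketch for intuition, not the paper's adhesion model): a block dragged across a surface through a spring whose far end moves at constant velocity. Because static friction exceeds kinetic friction, the block alternately sticks (elastic energy accumulates in the spring) and slips (the energy is released). All parameter values below are arbitrary.

```python
# Qualitative 1-D stick-slip toy model (NOT the paper's adhesion model).
# A block is pulled through a spring whose far end moves at constant speed;
# static friction > kinetic friction produces repeated stick-slip events.

m, k = 1.0, 50.0                      # block mass [kg], spring stiffness [N/m]
mu_s, mu_k, normal = 0.6, 0.3, 9.81   # static/kinetic coefficients, normal force [N]
v_drive, dt = 0.1, 1e-3               # driver velocity [m/s], time step [s]

x, vel, sticking, slip_events = 0.0, 0.0, True, 0
for step in range(1, 100001):         # simulate 100 s
    driver = v_drive * step * dt
    spring = k * (driver - x)         # elastic force pulling the block forward
    if sticking:
        if spring > mu_s * normal:    # adhesion 'bond' breaks: a slip begins
            sticking = False
            slip_events += 1
    else:
        acc = (spring - mu_k * normal) / m  # kinetic friction opposes the slip
        vel += acc * dt
        x += vel * dt
        if vel <= 0.0:                # block momentarily stops: it sticks again
            vel, sticking = 0.0, True
print("slip events:", slip_events)
```

With these values the spring loads for roughly a second between slips; widening the gap between `mu_s` and `mu_k` lengthens the stick phase, just as stronger adhesion would in the description above.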
(∇² − (1/c²) ∂²/∂t²) u(x, t) = 0    (5)

The relation between acoustic pressure and particle velocity is described by a first-order partial differential equation (PDE):

ρ ∂u(x, t)/∂t = −∇p(x, t)    (6)

where ρ is the density of air (at standard temperature and pressure it is empirically described as ρ = 1.293/(1 + 0.00367t) [kg/m³], with temperature t [°C]).

4. GENERATING FRICTION SOUNDS

Our physics-based method comprises three steps (see Fig. 3). First, we separate an object surface into microrectangles (local shapes) around each vertex of the object and define their initial size. The initial size of the microrectangles is decided by the object texture (e.g., in the case of a cloth, the initial size is the thread diameter) and is deformed on the basis of two assumptions: the deformation (1) depends on the velocity of the contacting vertex, and (2) the initial size obeys normal distributions. The deformation of a microrectangle is given by the following equation:

D∇⁴w + ρ ∂²w/∂t² = 0    (7)

where D is the flexural rigidity, w is the displacement, and ρ is the density of the object [39]. We then determine the global shape of the object using PBD [7]. As constraints, we adopt a distance constraint between the vertices of the object and an adhesion constraint between two objects. Finally, we synthesize the sounds generated by each microrectangle at the observation point using a wave equation [40]. The synthesized sound can be expressed as a linear sum of sounds because the sound waves are independent.

4.1. Determination of Vibration Area

The spectrum of a friction sound depends on the velocity of the object. Therefore, An et al. [3] recorded friction sounds at several different speeds and built a friction sound database consisting of pairs of a sound spectrum and a velocity. We also conducted experiments to record friction sounds as described in their method [3] and confirmed the relation between spectrum and velocity (see Fig. 4). To reflect the dependence on the object's velocity, we define relationships between the sides a and b of the rectangular domain and the velocity v of an actual contact point as follows:

1/a = αv,   1/b = βv    (8)

where α and β are constants. The roughness of the object surface is expressed as the dispersions σ_a and σ_b, respectively, of the normal distributions of the lengths of a and b.

4.2. Determination of Global Shape and Amplitude

Several studies have used elastic body simulation for computer animation. There are three main methods used in computer graphics. The finite element method (FEM) is one of the most popular methods and is based on physics. The spring-mass model [41] sacrifices accuracy to achieve a low computational cost. In addition, there is a specialized technique for computer animation [42]. To determine the global transition of the body shape and the amplitude of vibration of the rectangular domains caused by friction, we handle the mesh vertices of the material body as true actual contact points and control their positions. This is because PBD is widely known in the field of computer graphics and has the advantages of robustness and simplicity. Let us postulate there are N particles with positions x_i and inverse masses g_i = 1/m_i. For a constraint C, the positional correction Δq_i is calculated as

Δq_i = −s g_i ∇_{q_i} C(q₁, …, qₙ)    (9)

where

s = C(q₁, …, qₙ) / Σ_j g_j |∇_{q_j} C(q₁, …, qₙ)|².    (10)

As constraints with respect to the positions of the object vertices, we adopt constraints representing the shape deformation of the body and the adhesion at actual contact points, the respective terms defined as C_Deformation and C_Adhesion. The total constraint C is then expressed by

C = C_Deformation + C_Adhesion.    (11)

In our method, we adopt a distance constraint between the vertices of the body for C_Deformation and a relational expression for C_Adhesion, which is given as

C_Adhesion = Z/r   (r > d),   Z   (r ≤ d)    (12)

where r is the distance between a vertex of the object and the other object, Z is the adhesion degree, which depends on material properties, and d is the effective distance of the adhesion. Equation (12) represents the fact that the attracting forces between the objects are given as a Coulomb potential. Then, the amplitude A of the rectangular vibration is determined by the following equation:

A = (1/♯X) (x_i(t + Δt) − x_i(t))    (13)
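The projection step of Eqs. (9)-(10) can be sketched for the distance constraint the method uses, C(p₁, p₂) = |p₁ − p₂| − d₀. This minimal Python version (function and variable names are ours) mirrors the standard PBD correction of Müller et al.

```python
# One PBD projection step for a distance constraint:
#   delta_q_i = -s * g_i * grad_i(C),  s = C / sum_j g_j |grad_j(C)|^2
# For C(p1, p2) = |p1 - p2| - rest, grad wrt p1 is the unit vector n and
# grad wrt p2 is -n, so |grad|^2 = 1 for both particles.

import math

def project_distance(p1, p2, g1, g2, rest):
    """Move two particles so their separation approaches `rest`."""
    dx = [a - b for a, b in zip(p1, p2)]
    dist = math.sqrt(sum(c * c for c in dx))
    C = dist - rest                       # constraint value
    n = [c / dist for c in dx]            # gradient direction
    s = C / (g1 + g2)                     # scaling factor of Eq. (10)
    q1 = [a - s * g1 * c for a, c in zip(p1, n)]
    q2 = [b + s * g2 * c for b, c in zip(p2, n)]
    return q1, q2

# Two unit-inverse-mass particles 2.0 apart, rest length 1.0:
q1, q2 = project_distance([0.0, 0.0], [2.0, 0.0], 1.0, 1.0, 1.0)
print(q1, q2)  # → [0.5, 0.0] [1.5, 0.0]
```

Because the correction is weighted by the inverse masses g_i, a heavy particle (small g) moves less than a light one, which is what makes the scheme robust for pinned or massive vertices.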
Figure 3: Overview: For the input object, we first define the vibration areas, which deform independently with regard to each surrounding vertex of the object surface. We then compute the global shape of the object and the amplitude of each vibration area using PBD, and the local shape of each rectangle. Finally, we synthesize the sounds generated by each surrounding vertex of the object at the observation point P based on the wave equation.
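The final step named in the caption, summing the contribution of every contact point at the observation point P, can be illustrated with the far-field point-source pressure term the paper derives later, p = iρcV₀ · k/(4πr) · e^(−ikr), exploiting the stated independence of the sound waves. All numeric values below are arbitrary placeholders.

```python
# Sketch of the point-source approximation and linear superposition of
# independent contact-point sources. Parameter values are placeholders.

import cmath, math

def point_source_pressure(rho, c, V0, freq, r):
    """Complex far-field pressure of one vibrating contact point."""
    k = 2 * math.pi * freq / c                      # wave number
    return 1j * rho * c * V0 * k / (4 * math.pi * r) * cmath.exp(-1j * k * r)

rho, c = 1.2, 343.0                                 # air density, speed of sound
sources = [(1e-3, 440.0, 0.5), (2e-3, 440.0, 0.7)]  # (V0 [m/s], freq [Hz], r [m])
total = sum(point_source_pressure(rho, c, V0, f, r) for V0, f, r in sources)
print(abs(total))   # magnitude of the summed pressure at the listener
```

Because the model is linear in V₀, doubling a source's surface velocity doubles its pressure contribution, and the listener signal is simply the complex sum over sources.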
M = {m | m = 2k + 1}    (14)
N = {n | n = 2l + 1}    (15)

where m and n are the mode orders of the vibration and k and l are natural numbers ℕ ∪ {0}.

4.3. Determination of Local Shape

This section describes the main principle of our method for defining the vibration area on an object surface. Waves propagate from the actual contact point when friction occurs. We consider that these waves spread out over a rectangular domain around the actual contact point. In general, the propagating waves are attenuated in the body or prevented from spreading to the neighbouring actual contact points. Therefore, the vibration of the rectangular domain

The wave equation with sound sources is generally solved to analyze sound propagation in air. These equations have a high computational cost due to the complex boundary conditions. In our case, a rectangular sound source can be approximated by a point sound source, because a and b (the sides of the rectangle) are sufficiently small compared with the distance |r| between the actual contact point and the observation point. Therefore, we take the limit a, b → 0 to approximate the plane sound source with a point sound source. Then, the spatial distribution of the sound pressure p(r, t), where r is the position and t is the time, is denoted as

p(r, t) = iρcV₀ (k / 4πr) e^(−ikr)    (19)

where i is the imaginary unit, ρ is the air density, c is the acoustic velocity, V₀ is the vibration velocity of the object surface, and k is
the wave number (see Appendix B). The vibration velocity V₀ can be expressed as

V₀ = iA ω_mn α_mn e^(iω_mn t)    (20)

where

α_mn = 1 if m + n ≡ 0 (mod 2),   −1 if m + n ≡ 1 (mod 2).    (21)

Finally, we synthesize the friction sounds caused by the vibration of each actual contact point at the observation point. Since the sound waves are independent, the synthesized sound at the observation point can be expressed as a linear sum of the sounds made by each actual contact point. Therefore, the synthesized sound can be written as

sound = Σ_i p_i(r_i, t)    (22)

where p_i is the sound pressure of the i-th vertex under friction.

5. RESULTS

The parameters used in the physical simulation are shown in Table 1. In the simulation, we use literature values [43, 44] for the shear modulus E, the Poisson ratio ν, and the density per unit area ρ. The execution environment is CPU: Intel(R) Core(TM) i7 2.93 GHz, GPU: NVIDIA GeForce GT 220, RAM: 4 GB, OS: Windows 7. The time step of the physical simulation is 1/44100 [s], the refresh rate of the animations is 60 [fps], and the audio sampling rate of the friction sounds is 44100 [Hz].

5.1. Examples

Cloth: We synthesize friction sounds in the case of pulling a cloth (cotton) over a sphere (see Fig. 6 and Fig. 7). Only the velocity of the cloth changes between cloth 1 and cloth 2. Our method generates a friction sound matched to the speed in each case. Looking at the figures, there are differences between cloth 1 and 2. In particular, the results show that (i) the fluctuating power of the sound denotes stick-slip, and (ii) the friction phenomena lead to different consequences in each frame.

Metal: We generate the friction sounds of a metal (copper) sliding on the floor (see Fig. 8). We succeed in creating a friction sound for the metal. The spectrum shows the natural frequency of the copper at 0.31 [s] and 0.54 [s], which is caused by a picking action.

In this study, we propose two examples: a cloth pulled over a sphere and a copper block sliding on the floor. In particular, the results synthesizing friction sounds for cloth are the first such illustrations. In addition, our proposed method is able to manage metal-like materials such as copper. In each case, our method can synthesize a friction sound comparable to the actual sound. Our method focuses only on friction sounds; however, it can be used in combination with other methods, owing to the postulation that the object surface vibrates independently, and can be expected to improve the resulting sound effects.

6. CONCLUSION

In this study, we presented a novel method to generate friction sounds that does not depend on the material body. The method involves physical simulation of friction on the basis of the adhesion theory. We synthesized friction sounds and confirmed that clear differences in tone appear when the parameters are varied. As future work, synthesis of high-quality friction sounds in complicated scenes and body shapes is important. In particular, we would like to consider the transfer of sound waves between the object and the observation point to reflect the spatial properties affecting the sound, such as reflection, refraction, and diffraction. Further, we want to evaluate the similarity between the synthesized sounds and actual sounds. In addition, we want to parallelize the calculations for each actual contact point using GPUs to speed up the synthesis.

7. ACKNOWLEDGMENTS

This work was supported by JST ACCEL Grant Number JPMJAC1602, Japan.

8. REFERENCES

[1] Kees Van Den Doel, Paul G Kry, and Dinesh K Pai, "FoleyAutomatic: physically-based sound effects for interactive simulation and animation," in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 2001, pp. 537–544.
[2] Zhimin Ren, Hengchin Yeh, and Ming C Lin, "Synthesizing contact sounds between textured models," in Virtual Reality Conference (VR), 2010 IEEE. IEEE, 2010, pp. 139–146.
[3] Steven S An, Doug L James, and Steve Marschner, "Motion-driven concatenative synthesis of cloth sounds," ACM Transactions on Graphics (TOG), vol. 31, no. 4, pp. 102–111, 2012.
[4] Camille Schreck, Damien Rohmer, Doug James, Stefanie Hahmann, and Marie-Paule Cani, "Real-time sound synthesis for paper material based on geometric analysis," in Eurographics/ACM SIGGRAPH Symposium on Computer Animation (2016), 2016.
[5] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman, "Visually indicated sounds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2405–2413.
[6] Lieng-Huang Lee, Fundamentals of Adhesion, Springer Science & Business Media, 2013.
[7] Matthias Müller, Bruno Heidelberger, Marcus Hennix, and John Ratcliff, "Position based dynamics," Journal of Visual Communication and Image Representation, vol. 18, no. 2, pp. 109–118, 2007.
[8] JT Oden and JAC Martins, "Models and computational methods for dynamic friction phenomena," Computer Methods in Applied Mechanics and Engineering, vol. 52, no. 1, pp. 527–634, 1985.
[9] JAC Martins, JT Oden, and FMF Simoes, "A study of static and kinetic friction," International Journal of Engineering Science, vol. 28, no. 1, pp. 29–92, 1990.
[10] Yu A Karpenko and Adnan Akay, "A numerical model of friction between rough surfaces," Tribology International, vol. 34, no. 8, pp. 531–545, 2001.
Table 1: Parameters used in the physical simulation.

        | Shear modulus E [GPa] | Poisson ratio ν | Density ρ [kg/m²] | a, b [m]   | Dispersion σ²  | Vertices | Time [min]
cloth1  | 928.7                 | 0.85            | 1.54 × 10¹        | 5.0 × 10⁻⁴ | 1.0 × 10⁻¹²    | 900      | 31
cloth2  | 928.7                 | 0.85            | 1.54 × 10¹        | 5.0 × 10⁻⁴ | 1.0 × 10⁻¹²    | 900      | 59
copper  | 129.8                 | 0.343           | 8.94              | 1.0 × 10⁻⁸ | 1.0 × 10⁻⁷     | 242      | 72
Figure 6: Key frames and sound waveform of a cloth (cotton). In this example, a cloth is pulled toward the front and slid over the sphere.
Figure 7: Key frames and sound waveform of a cloth (cotton). In this example, the velocity of pulling the cloth is changed (× 1/2).
Figure 8: Key frames and sound waveform of a metal (copper). In this example, a copper block slides on the floor.
[11] Federico Avanzini, Stefania Serafin, and Davide Rocchesso, "Modeling interactions between rubbed dry surfaces using an elasto-plastic friction model," in Proc. DAFx, 2002.
[12] Charlotte Desvages and Stefan Bilbao, "Two-polarisation finite difference model of bowed strings with nonlinear contact and friction forces," in Int. Conference on Digital Audio Effects (DAFx-15), 2015.
[13] Adnan Akay, "Acoustics of friction," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1525–1548, 2002.
[14] Tapio Takala and James Hahn, "Sound rendering," in ACM
SIGGRAPH Computer Graphics. ACM, 1992, vol. 26, pp. 211–220.
[15] William W Gaver, "Synthesizing auditory icons," in Proceedings of the INTERACT'93 and CHI'93 Conference on Human Factors in Computing Systems. ACM, 1993, pp. 228–235.
[16] Kees van den Doel and Dinesh K Pai, "Synthesis of shape dependent sounds with physical modeling," 1996.
[17] Kees van den Doel and Dinesh K Pai, "The sounds of physical shapes," Presence: Teleoperators and Virtual Environments, vol. 7, no. 4, pp. 382–395, 1998.
[18] Kevin Karplus and Alex Strong, "Digital synthesis of plucked-string and drum timbres," Computer Music Journal, vol. 7, no. 2, pp. 43–55, 1983.
[19] Perry R Cook, Real Sound Synthesis for Interactive Applications, CRC Press, 2002.
[20] Stefan Bilbao, Numerical Sound Synthesis: Finite Difference Schemes and Simulation in Musical Acoustics, John Wiley & Sons, 2009.
[21] Andrew Allen and Nikunj Raghuvanshi, "Aerophones in flatland: interactive wave simulation of wind instruments," ACM Transactions on Graphics (TOG), vol. 34, no. 4, p. 134, 2015.
[22] James F O'Brien, "Synthesizing sounds from physically based motion," in ACM SIGGRAPH 2001 Video Review on Animation Theater Program. ACM, 2001, pp. 59–66.
[23] James F O'Brien, Chen Shen, and Christine M Gatchalian, "Synthesizing sounds from rigid-body simulations," in Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. ACM, 2002, pp. 175–181.
[24] Dinesh K Pai, Kees van den Doel, Doug L James, Jochen Lang, John E Lloyd, Joshua L Richmond, and Som H Yau, "Scanning physical interaction behavior of 3d objects," in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. ACM, 2001, pp. 87–96.
[25] Cynthia Bruyns, "Modal synthesis for arbitrarily shaped objects," Computer Music Journal, vol. 30, no. 3, pp. 22–37, 2006.
[26] Nobuyuki Umetani, Jun Mitani, and Takeo Igarashi, "Designing custom-made metallophone with concurrent eigenanalysis," in NIME. Citeseer, 2010, vol. 10, pp. 26–30.
[27] Gaurav Bharaj, David IW Levin, James Tompkin, Yun Fei, Hanspeter Pfister, Wojciech Matusik, and Changxi Zheng, "Computational design of metallophone contact sounds," ACM Transactions on Graphics (TOG), vol. 34, no. 6, p. 223, 2015.
[28] Nikunj Raghuvanshi and Ming C Lin, "Interactive sound synthesis for large scale environments," in Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games. ACM, 2006, pp. 101–108.
[29] Doug L James, Jernej Barbič, and Dinesh K Pai, "Precomputed acoustic transfer: output-sensitive, accurate sound generation for geometrically complex vibration sources," in ACM Transactions on Graphics (TOG). ACM, 2006, vol. 25, pp. 987–995.
[30] Nicolas Bonneel, George Drettakis, Nicolas Tsingos, Isabelle Viaud-Delmon, and Doug James, "Fast modal sounds with scalable frequency-domain synthesis," in ACM Transactions on Graphics (TOG). ACM, 2008, vol. 27, pp. 24–32.
[31] Jeffrey N Chadwick, Steven S An, and Doug L James, "Harmonic shells: a practical nonlinear sound model for near-rigid thin shells," in ACM Transactions on Graphics (TOG). ACM, 2009, vol. 28, pp. 119–128.
[32] Changxi Zheng and Doug L James, "Rigid-body fracture sound with precomputed soundbanks," in ACM Transactions on Graphics (TOG). ACM, 2010, vol. 29, pp. 69–81.
[33] Changxi Zheng and Doug L James, "Toward high-quality modal contact sound," ACM Transactions on Graphics (TOG), vol. 30, no. 4, pp. 38–49, 2011.
[34] Jeffrey N Chadwick, Changxi Zheng, and Doug L James, "Precomputed acceleration noise for improved rigid-body sound," ACM Transactions on Graphics (TOG), vol. 31, no. 4, pp. 103–111, 2012.
[35] Timothy R Langlois, Steven S An, Kelvin K Jin, and Doug L James, "Eigenmode compression for modal sound models," ACM Transactions on Graphics (TOG), vol. 33, no. 4, pp. 40–48, 2014.
[36] Gabriel Cirio, Dingzeyu Li, Eitan Grinspun, Miguel A Otaduy, and Changxi Zheng, "Crumpling sound synthesis," ACM Transactions on Graphics (TOG), vol. 35, no. 6, p. 181, 2016.
[37] Matthias Rath, "Energy-stable modelling of contacting modal objects with piece-wise linear interaction force," in Proc. DAFx-08, Espoo, Finland, 2008.
[38] Stefano Papetti, Federico Avanzini, and Davide Rocchesso, "Energy and accuracy issues in numerical simulations of a non-linear impact model," in Proc. of the 12th Int. Conference on Digital Audio Effects, 2009.
[39] Arthur W Leissa, "Vibration of plates," Tech. Rep., DTIC Document, 1969.
[40] Miguel C Junger and David Feit, Sound, Structures, and Their Interaction, vol. 225, MIT Press, Cambridge, MA, 1986.
[41] Xavier Provot, "Deformation constraints in a mass-spring model to describe rigid cloth behaviour," in Graphics Interface. Canadian Information Processing Society, 1995, pp. 147–147.
[42] Matthias Müller, Bruno Heidelberger, Matthias Teschner, and Markus Gross, "Meshless deformations based on shape matching," in ACM Transactions on Graphics (TOG). ACM, 2005, vol. 24, pp. 471–478.
[43] Shigeta Fujimoto, "Modulus of rigidity as fiber properties," Textile Engineering, vol. 22, no. 5, pp. 369–376, 1969.
[44] National Astronomical Observatory of Japan, Ed., Chronological Science Tables 2016 A.D., Maruzen, 2015.

A. VIBRATION OF PLATE

Suppose a micro rectangle is an elastic isotropic plate and there are no normal forces N_i, in-plane shearing forces N_ij, or external
Dmitri Kartofelev
Department of Cybernetics, School of Science,
Tallinn University of Technology,
Tallinn, Estonia
dima@ioc.ee
ABSTRACT

This paper presents a kinematic time-stepping modeling approach of the ideal string vibration against a rigid obstacle. The problem is solved in a single vibration polarisation setting, where the string's displacement is unilaterally constrained. The proposed numerically accurate approach is based on the d'Alembert formula. It is shown that in the presence of the obstacle the lossless string vibrates in two distinct vibration regimes. In the beginning of the nonlinear kinematic interaction between the vibrating string and the obstacle the string motion is aperiodic with a constantly evolving spectrum. The duration of the aperiodic regime depends on the obstacle proximity, position, and geometry. During the aperiodic regime fractional subharmonics related to the obstacle position may be generated. After a relatively short-lasting aperiodic vibration the string settles into the periodic regime. The main general effect of the obstacle on the string vibration manifests in the widening of the vibration spectrum caused by the transfer of fundamental mode energy to upper modes. The results presented in this paper can expand our understanding of timbre evolution in numerous stringed instruments, such as the guitar, bray harp, tambura, veena, and sitar. Possible applications include, e.g., real-time sound synthesis of these instruments.

1. INTRODUCTION

Interaction and collision of a vibrating string with spatially distributed obstacles, such as a fretboard or bridge, plays a significant role in the mechanics of numerous stringed musical instruments. One elegant example is the Medieval and Renaissance bray harp, which has small bray-pins that provide a metal surface for the vibrating string to impact, increasing the upper partial content in the tone and providing a means for the harp to be audible in larger spaces and in ensemble with other instruments [1]. It is evident that for realistic physics-informed modeling of this instrument such nonlinearity-inducing interactions need to be properly examined and understood.

String–obstacle interaction modeling is a long-standing problem in musical acoustics. In the early twentieth century, Raman was the first to study the problem and identify the veena bridge as the main reason for the distinctive sound of the tambura and veena. He noted that all string frequencies in these instruments are excited irrespective of the location of the excitation, thereby violating the Young–Helmholtz law, which states that the vibrations of a string do not contain the natural modes that have a node at the point of excitation. He noted that this is caused by the geometry of the bridge, but did not explain the reason behind the inapplicability of the Young–Helmholtz law [2]. Since then, much effort has been devoted to modeling the collision dynamics of a vibrating string with various obstacles or boundary barriers. Over the years many authors have solved this problem using different approaches. The problem was considered by Schatzman [3] and Cabannes [4], who used the method of characteristics and assumed that the string does not lose energy when it hits an obstacle. Burridge et al. [5] assumed an inelastic constraint where the string loses kinetic energy during contact. Ducceschi et al. [6], Krishnaswamy and Smith [7], Han and Grosenbaugh [8], Bilbao et al. [9], Bilbao [10], Chatziioannou and van Walstijn [11], and Taguti [12] used a finite difference method to study the string interaction with the obstacle. Issanchou et al. [13], van Walstijn and Bridges [14], and van Walstijn et al. [15] used a modal analysis approach. Vyasarayani et al. [1], Mandal and Wahi [16], and Singh and Wahi [17] described the movement of the sitar string with partial differential equations or sets of partial differential equations. Rank and Kubin [18], Evangelista and Eckerholm [19], and Siddiq [20] used a waveguide modelling approach to study the plucked string vibration with nonlinear limitation effects.

This paper proposes an idealised approach for modeling the nonlinear string–obstacle interaction. We consider lossless ideal string vibration and assume that the obstacle is absolutely rigid. The interaction between the vibrating string and the obstacle is modeled in terms of kinematics, i.e., we study the string motion without considering its mass or the possible inertial and reactional forces acting between the string and the obstacle during the collisions. The heuristics of our approach is directly determined by the d'Alembert formula (travelling wave solution). The results presented here are a continuation of the work published in [21].

The organisation of the paper is as follows. In Sec. 2 the string vibration modeling is explained. In Sec. 3 the problem description and the kinematic numerical model of string–obstacle interaction are presented and explained. Section 4 presents the results and analysis of three case studies, where the effects of the obstacle geometry, proximity, and position are considered. In Sec. 5 the presented results and the accuracy and efficiency of the numerical model are briefly commented on. Section 6 presents the main results and conclusions.

2. MODELING STRING VIBRATION

Let us consider the vibration of a lossless ideal string in a single vibration polarisation setting. The one-dimensional equation of motion, called the wave equation, is of the following form:

    ∂²u/∂t² = c² ∂²u/∂x²,    (1)

where u(x, t) is the transverse displacement of the string, c = √(T/µ) is the speed of the waves travelling on the string, where T is the tension and µ is the linear mass density of the string, having the dimension [kg/m]. In the context of a real string Eq. (1)
DAFX-1
DAFx-40
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
can be used as an approximation of thin homogeneous elastic (and lossless) string vibration under a small amplitude restriction. In this case the wave speed c = √(T/(ρA∘)), where ρ is the volumetric density, A∘ = πr² is the cross-section area of a cylindrical string, and T is the tension. Equation (1) is studied here in a normalised and dimensionless form for increased clarity. We normalise the fundamental frequency f0 by setting c = 1 and by introducing the following dimensionless variables:

    t = T/P,    x = X/λ,    u = U/λ,    (2)

where P is the fundamental period and λ is the corresponding wavelength. The dimensional quantities T, X, and U are the time, space, and string displacement, respectively. Additionally, the speaking length L of the string is chosen to be a half of the wavelength, i.e., L = 1/2 [d. u.] (dimensionless units). This ensures that the fundamental frequency f0 = 1, because for Eq. (1) it holds that

    f0 = c/(2L) = c/λ.    (3)

In order to further simplify the frequency domain analysis presented in Sec. 4 the following initial condition is selected: at t = 0 the string is freely released and the initial displacement is selected to be equal to the shape of the fundamental mode

    u0(x) = u(x, 0) = A sin 2πx,    x ∈ [0, L],    (4)

    ∂u/∂t (x, 0) = 0,    x ∈ [0, L],    (5)

where A = 1 is the amplitude. This selection results in a sinusoidal (standing wave) vibration of all string points with fundamental frequency f0 = 1 determined by the normalised Eq. (1). Figure 1 shows the initial condition (4).

It is well known that Eq. (1) has an analytical solution known as the d'Alembert formula. For an infinite string (ignoring boundary conditions for now) and for initial conditions (4), (5) the solution takes the following form:

    u(x, t) = (1/2)(u0(x − ct) + u0(x + ct)).    (6)

This solution represents a superposition of two travelling waves: u0(x − ct)/2 moving to the right (positive direction of the x-axis), and u0(x + ct)/2 moving to the left. The function u0/2 describing the shape of these waves stays constant with respect to the x-axis, as they are translated in opposite directions at speed c = 1.

In the general case and under our assumptions a wave on any arbitrary segment of the string can be understood as a sum of two travelling waves that do not need to be equal. This means that one can write

    u(x, t) = r(x − ct) + l(x + ct),    (7)

where r(x − ct) is the travelling wave moving to the right and l(x + ct) is the travelling wave moving to the left. Conveniently, the one-dimensional advection equations

    ∂r/∂t + c ∂r/∂x = 0,    (8)

    ∂l/∂t − c ∂l/∂x = 0,    (9)

have similar general solutions. The travelling wave r(x − ct), on its own, is also a solution to Eq. (8), and the travelling wave l(x + ct) is a solution to Eq. (9). This result is not surprising, because Eq. (1) can be factored into

    (∂/∂t − c ∂/∂x)(∂/∂t + c ∂/∂x) u = 0.    (10)

Advection Eqs (8) and (9) are used below to numerically model the propagation of travelling waves. The finite difference method is used to approximate the solutions of these equations.

We discretise the xt-plane into n × m discrete samples using a grid with equal step sizes in the x and t directions. We discretise the x-axis with grid spacing ∆x = L/n and the t-axis with grid spacing ∆t = ∆x = tmax/m, where m = 2n. We let x_i = i∆x, where 0 ≤ i ≤ n, and t_j = j∆t, where 0 ≤ j ≤ m. From here it follows that u_i^j = u(x_i, t_j), r_i^j = r(x_i, t_j), and l_i^j = l(x_i, t_j). A combination of step forward finite difference (FD) approximations of the first order derivatives

    ∂u/∂t ≈ (u_i^{j+1} − u_i^j)/∆t,    ∂u/∂x ≈ (u_{i+1}^j − u_i^j)/∆x,    (11)

and step backward FD approximations

    ∂u/∂t ≈ (u_i^j − u_i^{j−1})/∆t,    ∂u/∂x ≈ (u_i^j − u_{i−1}^j)/∆x,    (12)

are used to approximate Eqs (8) and (9). The resulting FD approximations are of the following form:

    r_i^{j+1} − r_i^j + c (∆t/∆x)(r_i^j − r_{i−1}^j) = 0,    (13)

    l_i^{j+1} − l_i^j − c (∆t/∆x)(l_{i+1}^j − l_i^j) = 0.    (14)

Because c = 1 and ∆x = ∆t in our case, the Courant number c∆t/∆x = 1 and the above expressions simplify to

    r_i^{j+1} = r_{i−1}^j,    (15)

    l_i^{j+1} = l_{i+1}^j.    (16)

By applying these rules for all grid points i and j one gets a simple translation of the numerical values r_i^j and l_i^j propagating in opposite directions with respect to the x-axis (i-axis). This result agrees with the d'Alembert formula (7) and can be understood as a digital waveguide based on travelling wave decomposition and the use of delay lines. The equivalence between the FD approximation used here and digital waveguide modeling is shown in [22].

So far we have not addressed the boundary conditions of Eq. (1). We assume that the string is fixed at both ends. The following boundary conditions apply:

    u(0, t) = u(L, t) = 0,    t ∈ [0, tmax],    (17)

where tmax = 10P is the maximum integration time (size of the temporal domain). By applying boundary conditions (17) to the general solution (7) of the initial equation one finds for x = 0 the reflected travelling wave in the following form:

    u(0, t) = r(−ct) + l(ct) = 0  ⇒  r(−ct) = −l(ct),    (18)

and similarly for x = L:

    u(L, t) = r(L − ct) + l(L + ct) = 0  ⇒  l(L + ct) = −r(L − ct).    (19)
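The update rules (15)-(16) and the boundary reflections (20)-(21) introduced below condense into a few lines of NumPy. The following is a minimal sketch (the paper publishes no code, so the function and variable names are illustrative); it propagates the initial condition (4) for a chosen number of fundamental periods and, because the scheme is an exact shift at Courant number 1, recovers the initial string shape after every full period.

```python
import numpy as np

def simulate_linear_string(n=200, periods=1.0):
    """Travelling-wave (digital waveguide) update of Eqs (15)-(16)
    with fixed-end reflections of Eqs (20)-(21). Illustrative sketch."""
    L = 0.5                           # speaking length [d. u.]
    dx = L / n                        # grid spacing; dt = dx because c = 1
    x = np.arange(n + 1) * dx
    u0 = np.sin(2.0 * np.pi * x)      # initial condition (4) with A = 1
    r = 0.5 * u0.copy()               # right-going wave, so that u = r + l
    l = 0.5 * u0.copy()               # left-going wave
    steps = int(round(periods / dx))  # one period P = 2L/c = 1 [d. u.]
    for _ in range(steps):
        r = np.roll(r, 1)             # Eq. (15): shift one cell to the right
        l = np.roll(l, -1)            # Eq. (16): shift one cell to the left
        r[0] = -l[0]                  # Eq. (20): reflection at x = 0
        l[-1] = -r[-1]                # Eq. (21): reflection at x = L
    return x, r + l                   # Eq. (22): u = r + l

x, u = simulate_linear_string()
```

The wrap-around samples produced by `np.roll` are immediately overwritten by the boundary reflections, so the update is equivalent to a pair of delay lines with inverting terminations.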
These results are discretised according to the FD discretisation scheme discussed above. The travelling wave (18) reflected from the left boundary at x = 0 is thus

    r_0^j = −l_0^j,    j ∈ [0, m],    (20)

and the travelling wave (19) reflected from the right boundary at x = L is

    l_n^j = −r_n^j,    j ∈ [0, m].    (21)

In order to obtain the resulting string displacement u_i^j, for the selected initial and boundary conditions, a superposition of travelling waves (15), (16), (20), and (21) is found in accordance with the general solution (7):

    u_i^j = r_i^j + l_i^j,    {i, j} ∈ [0, n] × [0, m].    (22)

Figure 1: Top: Schematic of the problem studied. The initial condition (4) and the corresponding travelling waves are presented. The obstacle is shown with the grey formation at x = b. The resting position of the string is shown with the horizontal dash-dotted line. Bottom: The cross-section profile B(x) of the rigid parabolic obstacle and its discrete samples B_i, where B_β = B_{β+1} = −D.

3. STRING–OBSTACLE INTERACTION KINEMATICS

Let us consider a smooth and absolutely rigid impenetrable obstacle. The obstacle is placed near the vibrating string so that it is able to obscure the string's displacement u(x, t). We select an obstacle with the parabolic cross-section profile

    B(x) = −((x − b)²/(2R) + D),    (23)

where x = b is the position of the obstacle along the string, D is the vertical proximity of the obstacle to the string at its rest position, and R is the curvature radius of the parabola at its apex. The function B(x) is discretised according to the FD discretisation scheme presented in the previous Section. We let B_i = B(x_i). Figure 1 shows the schematic drawing of the problem studied, the cross-section profile function B(x) of the obstacle, and its discretised samples B_i.

The kinematic modeling of the string–obstacle interaction is a twofold problem. First, one needs to consider the travelling wave r(x − ct) approaching the obstacle from the left side. Second, one considers the travelling wave l(x + ct) approaching the obstacle from the right side.

3.1. Reflection of travelling wave r(x − ct)

The heuristics of the following approach is strictly determined by the d'Alembert formula (6) or (7). Any change in string displacement u(x, t) imposed by the obstacle must involve both travelling waves. The reflection of the travelling wave r(x − ct) approaching the obstacle from the left is determined by a change in the travelling wave l(x + ct). We assume that the travelling wave l(x + ct) resulting from the interaction first appears, or an already existing one is modified, only at the points x = x∗ ≤ b where the amplitude of string displacement u(x∗, t) < B(x∗), i.e., the string attempts to penetrate the obstacle. At these points the reflected travelling wave is

    l̂(x∗, t) = B(x∗) − u(x∗, t),    (24)

and the modified left-going wave is

    l(x∗, t) = l(x∗, t) + l̂(x∗, t).    (25)

The above modification of the travelling wave l(x + ct) simply ensures that the resulting string displacement u(x∗, t) = r(x∗, t) + l(x∗, t) does not geometrically penetrate the obstacle during the collision.

In determining the points x∗ one does not need to consider points x > b. In the absence of any waves approaching from the right and for x∗ = b the string displacement u(b, t) = −D (caused by the reflection process explained in the next Subsection). This means that the string displacement u(x, t) becomes truncated and equal to the maximum value of the obstacle as the propagating wave passes over its apex. This phenomenon is shown in Fig. 2 in the case of a short wavelength bell-shaped pulse reflecting from the obstacle during the first period of string vibration. Because the obstacle profile is an inverted parabola with its maximum at x = b, for x > b it holds that u(x, t) > B(x) ⇒ x ≠ x∗.

Numerical implementation of the procedure is straightforward. We let x_β = β∆x = b (see Fig. 1) and x_{i∗} = i∗∆x = x∗. Using this notation the reflected travelling wave (24) takes the following form:

    l̂_{i∗}^j = B_{i∗} − u_{i∗}^j,    i ∈ [0, β].    (26)

The resulting reflection and consequent string shape according to (22) and (25) for grid points i∗ becomes

    u_{i∗}^j = r_{i∗}^j + l_{i∗}^j,    i ∈ [0, β],    (27)

where

    l_{i∗}^j = l_{i∗}^j + l̂_{i∗}^j,    i ∈ [0, β].    (28)
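In array form the correction (26)-(28) is a masked update of the left-going wave. A minimal NumPy sketch of this step, together with the parabolic profile (23), could read as follows (the names are illustrative, not taken from the paper):

```python
import numpy as np

def reflect_left_approaching(r, l, B, beta):
    """Kinematic collision correction of Eqs (26)-(28): wherever the
    string u = r + l would penetrate the obstacle profile B on grid
    points i <= beta, the left-going wave l is modified so that the
    corrected displacement u = r + l coincides with B there."""
    u = r + l
    penetrating = np.zeros(u.shape, dtype=bool)
    penetrating[: beta + 1] = u[: beta + 1] < B[: beta + 1]
    l_hat = np.where(penetrating, B - u, 0.0)    # Eq. (26)
    return l + l_hat                             # Eqs (27)-(28)

# Parabolic obstacle profile, Eq. (23), on a small grid
n, L = 100, 0.5
x = np.linspace(0.0, L, n + 1)
b, D, R = 0.1, 0.05, 0.01
beta = int(round(b / (L / n)))                   # grid index of the apex
B = -((x - b) ** 2 / (2.0 * R) + D)

r = np.full(n + 1, -0.10)   # a flat right-going wave deep enough to collide
l = np.zeros(n + 1)
l_new = reflect_left_approaching(r, l, B, beta)
u_new = r + l_new
```

After the correction `u_new` equals `B` exactly at the formerly penetrating points and is untouched everywhere else; points to the right of the apex (`i > beta`) are handled by the mirror procedure of Sec. 3.2.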
Figure 2: Reflection of travelling wave r(x − ct), shown with the thin green line, from the obstacle that is shown with the grey formation. The reflected travelling wave l̂(x + ct) is shown with the thin blue line. Arrows indicate the directions of wave propagation. The bell-shaped initial condition is shown with the dashed line. (Panels from top to bottom: u(x, 0.13), u(x, 0.15), u(x, 0.18).)

3.2. Reflection of travelling wave l(x + ct)

The determination of the reflection of the travelling wave l(x + ct) approaching the obstacle from the right side is similar to the previous case. In fact, it is a mirror image of that problem with the symmetry axis at x = b. However, there is a slight difference in the region where it is necessary to evaluate and determine the points x = x∗. This difference stems from the selection and mathematical definition of the obstacle cross-section profile (23): namely, the function B(x), being a unimodal function, has one maximum max B(x) = −D at x = b. The problem arises from the fact that one has already evaluated this point inside the same time moment t and used it to calculate the reflection of the travelling wave r(x − ct). By using this point again one would introduce a discontinuity in the string displacement u(x, t). In principle it is not possible to use some other closely located neighbouring point, e.g., x = b′ = b + ∆x (see Fig. 1). This selection, too, would introduce a small discontinuity due to B(b) ≠ B(b′), and, more importantly, in a continuous domain of real numbers there is no such thing as a neighbouring number (point), because one can always find infinitely many numbers between any two real numbers. One way of resolving this problem is to slightly modify the discretised approximation of the obstacle profile B_i, as shown in Fig. 1. One simply squeezes in a second maximum point B_{β+1} = −D after grid point i = β corresponding to x_β = β∆x = b. Because ∆x ≪ 1, the overall change introduced in B_i compared to B(x) remains negligibly small.

Let us formalise the result. Similarly to the previous case, only at the points x = x∗ ≥ b′ ≈ b, where by definition and during the collision u(x∗, t) < B(x∗), a reflected travelling wave moving to the right is introduced or an existing one is modified:

    r(x∗, t) = r(x∗, t) + r̂(x∗, t),    (29)

where

    r̂(x∗, t) = B(x∗) − u(x∗, t).    (30)

The numerical implementation, using the notation proposed in the previous Subsection, takes the following form: the resulting reflection and consequent string shape according to (22), (29), and (30) for grid points i∗ is

    u_{i∗}^j = r_{i∗}^j + l_{i∗}^j,    i ∈ (β, n],    (31)

where

    r_{i∗}^j = r_{i∗}^j + r̂_{i∗}^j  and  r̂_{i∗}^j = B_{i∗} − u_{i∗}^j,  if i ∈ (β, n].    (32)

4. RESULTS

Next, three case studies are considered. The effects on the string vibration of varying the obstacle radius R (Sec. 4.1), the obstacle proximity D (Sec. 4.2), and the position b (Sec. 4.3), while keeping the other system parameters constant, are analysed. Table 1 displays the values of the parameters used in the case studies.

Table 1: Values of the parameters used in the case studies. The parameter that is being varied in each case study is listed with both of its values.

    Sec. | Radius R [d. u.]    | Proximity D [d. u.]                       | Position b [d. u.]
    4.1  | 1·10⁻⁵ and 6·10⁻³   | 0.5 u0(b) = 0.29                          | L/5 = 1/10
    4.2  | 3·10⁻³              | 0.3 u0(b) = 0.26 and 0.1 u0(b) = 0.09     | L/3 = 1/6
    4.3  | 4·10⁻³              | 0.35 u0(b) = 0.30 and 0.35 u0(b) = 0.21   | L/3 = 1/6 and L/5 = 1/10

The proximity D of the obstacle to the string at its rest position is expressed in terms of the initial condition (4), namely, the string displacement u0(x) = u(x, 0) at the obstacle location x = b (see Fig. 1). This is done in order to make the results comparable to each other. All frequency domain results are calculated using time series of the string displacement u(x, t) recorded at x = x_r = 0.47L = 0.235. This point is close to a node at x = L/2 shared by all even numbered modes (harmonics or overtones). The reader must keep in mind that this selection results in relatively small amplitude values of the even numbered modes in the spectra presented below. On the other hand, this selection ensures that the amplitude of the fundamental mode is nearly unity for the linear case where the obstacle is absent. This, in turn, will aid in drawing the conclusions. The power spectrograms, amplitude spectra, spectral centroid ⟨f⟩, and instantaneous spectral centroid ⟨f′⟩(t) are calculated using the fast Fourier transform (FFT). In calculating the spectrograms a sliding window approach, in combination with the Hanning window function, is used. Here, the window size is t = 3P and the window overlap value is 50% of the window size. The instantaneous spectral centroid ⟨f′⟩(t) is also calculated using the windowing approach, with window size t = P and window overlap value 70% of the window size. Video animations of the string vibrations for all case studies presented below can be downloaded at the accompanying web-page of this paper¹. The web-page also includes an additional example of a symmetric case where the obstacle is positioned at the midpoint of the string at b = L/2.

4.1. Effect of the obstacle radius

Figure 3 shows the influence of varying the value of the obstacle radius R on the string shape u(x, t) for the duration of the first period

¹ http://www.ioc.ee/~dima/dafx17.html
Figure 3: String shapes u(x, t) during the first period of vibration. The obstacles with radii R = 1·10⁻⁵ (top) and 6·10⁻³ (bottom) are positioned at b = L/5 and D = u0(b)/2.

only. The results of the frequency domain analysis of the resulting vibration are shown in Fig. 4. Visual inspection of the time series u(x_r, t) reveals that the string vibrates in two distinct vibration regimes (generally true for all presented case studies). The initial strong influence of the obstacle is manifested in the constantly evolving spectrum that persists until t = tp. We call this regime the aperiodic regime. After t = tp the string vibration settles into the periodic regime, where the spectrum remains constant. The time moment tp corresponds to the latest time instance where a point x∗ is determined.

During the aperiodic regime the value of the instantaneous spectral centroid grows with time (generally true for all presented case studies), resulting in a value of the spectral centroid that is one octave (two times) greater compared to the linear case where the obstacle is absent (in the linear case ⟨f′⟩(∞) = ⟨f⟩ = f0 = 1). The growth of the value of the spectral centroid is driven by nonlinear widening of the spectrum caused by the transfer of the fundamental mode energy to higher modes. For both presented cases, and according to the amplitude spectra shown in Fig. 4, approximately 65% of the initial fundamental mode amplitude A = 1 is redistributed to and between the higher modes. Mutual comparison of the cases shows that the case with the larger radius R results in an approximately one-third shorter-lasting aperiodic regime and a slightly higher value of the spectral centroid. A relatively large change in the obstacle radius R has a moderate effect (compared to the other case studies) on the final string vibration—at least for the given parameter values. This is best seen in the roughly equal amplitude spectra.

Figure 4: Top: Time series u(x_r, t) shown for two values of the obstacle radius R. Onset time tp of the periodic regime is shown with the colour-coded dashed line. Middle: Power spectrograms. The instantaneous spectral centroid ⟨f′⟩ is shown with the solid red line marked with bullets. The onset time tp of the periodic regime is shown with the dashed white line. Bottom: Amplitude spectra |F[u(0.235, t)]| in the periodic regimes, where t ∈ [⌈tp⌉, tmax]. The spectral centroid ⟨f⟩ is shown with the colour-coded dash-dotted line.

4.2. Effect of the obstacle proximity

Figure 5 shows the influence of varying the value of the obstacle proximity D on the string shape u(x, t) for the duration of the first period only. The results of the frequency domain analysis of the vibration are shown in Fig. 6. Inspection of the aperiodic regime of both cases shows that a short-lasting large amplitude fractional subharmonic at f = 1.5 is generated. This partial does not survive the aperiodic regime. Based on the relationship (3) one can conclude that this frequency is related to the fact that the obstacle is promoting a node at b = L/3 by effectively dividing the string into two vibrating segments with lengths L/3 and 2L/3.

Comparison of the periodic regime vibration to the linear case shows that the resulting value of the spectral centroid is increased by almost four to five octaves depending on the case. As in the previous case study, the string–obstacle interaction has widened the spectrum at the expense of the fundamental mode. The resulting fundamental mode amplitude, for both cases, is approximately 0.02. This means that during the aperiodic regime more than 97% of the initial mode amplitude A = 1 is redistributed to the higher modes. The greatest relative and absolute change in mode amplitude, caused by the reduction of the distance D between the string and the obstacle, is seen for the third and tenth modes. The amplitude of the third mode has doubled from approximately 0.02 to approximately 0.04. The amplification of the third mode is
[Figures 5 and 6 are not recoverable from the extraction. Figure 5 shows the string shape u(x, t) during the first period for the two obstacle proximities D = 0.3 u0 (top) and D = 0.1 u0 (bottom); Figure 6 shows the corresponding frequency domain analysis (cf. Fig. 4).]
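The frequency-domain quantities reported in Figs. 4 and 6 are built on the spectral centroid ⟨f⟩. The paper does not spell out its estimator; a conventional choice consistent with the windowing described in Sec. 4 is the amplitude-weighted mean frequency of a Hanning-windowed FFT frame. A sketch under that assumption:

```python
import numpy as np

def spectral_centroid(u, dt):
    """Amplitude-weighted mean frequency of the one-sided spectrum of a
    Hanning-windowed frame. The exact estimator used in the paper is an
    assumption here."""
    w = np.hanning(len(u))
    spec = np.abs(np.fft.rfft(u * w))
    freq = np.fft.rfftfreq(len(u), d=dt)
    return float(np.sum(freq * spec) / np.sum(spec))

# Sanity check: a pure fundamental (f0 = 1 [1/P]) over tmax = 10 P
dt = 0.01
t = np.arange(0.0, 10.0, dt)
u = np.sin(2.0 * np.pi * t)
c = spectral_centroid(u, dt)   # close to f0 = 1
```

Adding energy at higher modes raises the centroid, which is the behaviour the paper uses to quantify the spectral widening caused by the obstacle.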
[The figures for Sec. 4.3 are not recoverable from the extraction. The obstacles are positioned at b = L/3 (top) and b = L/5 (bottom), and in both cases D = 0.35 u0(b). The amplitude spectra legends read |F[u(0.235, t)]|, b = L/3 and |F[u(0.235, t)]|, b = L/5.]
[…]dent and this was to confirm that the problem is nonlinear.

It was shown that the ideal lossless string interacting with the obstacle vibrates in two distinct vibration regimes. In the beginning of the kinematic interaction between the vibrating string and the obstacle the string motion is aperiodic with a constantly evolving spectrum. After some time of aperiodic vibration the string vibration settles into the periodic regime, where the string motion is repetitious in time. The duration of the aperiodic regime depends on the obstacle proximity D, position b, and geometry (curvature radius R). The comparison of the resulting spectra in the periodic regime with the linear case where the obstacle was missing showed that the general effect of the obstacle manifests in the widening of the spectrum caused by the transfer of fundamental mode energy to upper modes. The analysis of the relatively short-lasting aperiodic regime showed that the obstacle position b may generate temporary fractional subharmonics related to the node point at x = b. In conclusion, the results presented in this paper can expand our understanding of timbre evolution in numerous stringed instruments, such as the guitar, bray harp, tambura, veena, and sitar. Possible applications include, e.g., sound synthesis of these instruments.

7. ACKNOWLEDGMENTS

This research was supported by the Estonian Ministry of Education and Research, Project IUT33-24, and by the Doctoral Studies and Internationalisation Programme DoRa Plus Action 1.1 (Archimedes Foundation, Estonia) through the ERDF. The author is grateful to the anonymous reviewers of this paper—their suggestions improved the manuscript greatly.

8. REFERENCES

[1] C. P. Vyasarayani, S. Birkett, and J. McPhee, "Modeling the dynamics of a vibrating string with a finite distributed unilateral constraint: Application to the sitar," The Journal of the Acoustical Society of America, vol. 125, no. 6, pp. 3673–3682, 2009.

[2] C. V. Raman, "On some Indian stringed instruments," Proceedings of the Indian Association for the Cultivation of Science, vol. 7, pp. 29–33, 1921.

[3] M. Schatzman, "A hyperbolic problem of second order with unilateral constraints: The vibrating string with a concave obstacle," Journal of Mathematical Analysis and Applications, vol. 73, no. 1, pp. 138–191, 1980.

[4] H. Cabannes, "Presentation of software for movies of vibrating strings with obstacles," Applied Mathematics Letters, vol. 10, no. 5, pp. 79–84, 1997.

[5] R. Burridge, J. Kappraff, and C. Morshedi, "The sitar string, a vibrating string with a one-sided inelastic constraint," SIAM Journal on Applied Mathematics, vol. 42, no. 6, pp. 1231–1251, 1982.

[6] M. Ducceschi, S. Bilbao, and C. Desvages, "Modelling collisions of nonlinear strings against rigid barriers: Conservative finite difference schemes with application to sound synthesis," in Proc. of 22nd International Congress on Acoustics (ICA 2016), Buenos Aires, Argentina, Sept. 5–9 2016, pp. 1–11.

[7] A. Krishnaswamy and J. O. Smith, "Methods for simulating string collisions with rigid spatial obstacles," in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 2003, pp. 233–236.

[8] S. M. Han and M. A. Grosenbaugh, "Non-linear free vibration of a cable against a straight obstacle," Journal of Sound and Vibration, vol. 237, pp. 337–361, 2004.

[9] S. Bilbao, A. Torin, and V. Chatziioannou, "Numerical modeling of collisions in musical instruments," Acta Acustica united with Acustica, vol. 101, no. 1, pp. 155–173, 2012.

[10] S. Bilbao, "Numerical modeling of string/barrier collisions," in Proc. of International Symposium on Musical Acoustics (ISMA 2014), Le Mans, France, Jul. 7–12 2014, pp. 1–6.

[11] V. Chatziioannou and M. van Walstijn, "Energy conserving schemes for the simulation of musical instrument contact dynamics," Journal of Sound and Vibration, vol. 339, pp. 262–279, March 2015.

[12] T. Taguti, "Dynamics of simple string subject to unilateral constraint: A model analysis of sawari mechanism," Acoustical Science and Technology, vol. 29, no. 3, pp. 203–214, 2008.

[13] C. Issanchou, S. Bilbao, J.-L. Le Carrou, C. Touzé, and O. Doaré, "A modal-based approach to the nonlinear vibration of strings against a unilateral obstacle: Simulations and experiments in the pointwise case," Journal of Sound and Vibration, vol. 393, pp. 229–251, April 2017.

[14] M. van Walstijn and J. Bridges, "Simulation of distributed contact in string instruments: A modal expansion approach," in Proc. of 24th European Signal Processing Conference (EUSIPCO-2016), Budapest, Hungary, Aug. 29–Sept. 2 2016, pp. 1–5.

[15] M. van Walstijn, J. Bridges, and S. Mehes, "A real-time synthesis oriented tanpura model," in Proc. of 19th International Conference on Digital Audio Effects (DAFx-16), Brno, Czech Republic, Sept. 5–9 2016, pp. 175–182.

[16] A. K. Mandal and P. Wahi, "Natural frequencies, mode-shapes and modal interactions for strings vibrating against an obstacle: Relevance to sitar and veena," Journal of Sound and Vibration, vol. 338, pp. 42–59, March 2015.

[17] H. Singh and P. Wahi, "Non-planar vibrations of a string in the presence of a boundary obstacle," Journal of Sound and Vibration, vol. 389, pp. 326–349, February 2017.

[18] E. Rank and G. Kubin, "A waveguide model for slapbass synthesis," in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, 1997, pp. 443–446.

[19] G. Evangelista and F. Eckerholm, "Player–instrument interaction models for digital waveguide synthesis of guitar: Touch and collisions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 4, pp. 822–832, 2010.

[20] S. Siddiq, "A physical model of the nonlinear sitar string," Archives of Acoustics, vol. 37, no. 1, pp. 73–79, 2012.

[21] D. Kartofelev, A. Stulov, H.-M. Lehtonen, and V. Välimäki, "Modeling a vibrating string terminated against a bridge with arbitrary geometry," in Proc. of 4th Stockholm Music Acoustics Conference (SMAC 2013), Stockholm, Sweden, Jul. 30–Aug. 3 2013, pp. 626–632.

[22] J. O. Smith, Physical Audio Signal Processing for Virtual Musical Instruments and Audio Effects, W3K Publishing, 2010.
1. INTRODUCTION

The Doppler effect in loudspeakers is due to the membrane motion: this introduces a varying propagation time between the piston and a fixed receiver. For large frequency range speakers, this results in distortion effects. This phenomenon has been highlighted by Beers and Belar [1], who recommend the use of a multi-way system to reduce its influence.

In [2], Butterweck proposes a simplified 1D model based on plane wave propagation generated by a moving piston. He proposes to approximate the solution by a truncated series expansion, from which a criterion to evaluate the distortion is built.

This paper restates the 1D model introduced in [2] and investigates the well-posedness of the problem. A necessary and sufficient condition is presented for the existence of a regular solution, characterized by an implicit equation. This equation admits a unique solution, and its regularity order is proven to be related to that of the membrane displacement function. The proof relies on the method of characteristics. This so-called moving boundary problem has already been investigated, and analytical solutions have been established [3, 4]. In this paper, infinitely regular displacement functions are considered, and the perturbation method proposed by [2] is adopted to derive the power series expansion in an exact recursive way. The series expansion corresponds to a Volterra series and its terms involve the partial Bell polynomials. Finally, simulations are carried out by truncation of the series expansion and both harmonic and intermodulation distortions are evaluated.

This paper is organized as follows. Section 2 introduces the physical model and the equations to solve. Then Section 3 presents the strong solution derived from the method of characteristics, and a

Figure 1: A piston vibrates around x = 0 in a semi-infinite duct.

2.2. Physical model

Following (H1), the wave propagation in the duct is described by

ρ0 ∂t v(x, t) + ∂x p(x, t) = 0, (1)

∂t p(x, t) + ρ0 c0² ∂x v(x, t) = 0, (2)

where p and v denote the acoustic pressure and the particle velocity, x and t are the space and time variables, and ρ0 and c0 are the air density and the sound velocity, respectively.

The general solution of (1)-(2) is decomposed into traveling waves as follows:

v(x, t) = v⁺(t − x/c0) − v⁻(t + x/c0),
p(x, t) = ρ0 c0 [ v⁺(t − x/c0) + v⁻(t + x/c0) ],

where the functions t ↦ v±(t) represent forward and backward waveforms.
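As a quick numerical sanity check (not from the paper; the constants and waveforms below are arbitrary illustrative choices in dimensionless units), any pair of smooth waveforms v± combined as above satisfies (1)-(2), which can be verified on a grid with central finite differences:

```python
import numpy as np

# Illustrative constants (dimensionless; not values from the paper)
rho0, c0 = 1.2, 1.0

# Arbitrary smooth forward/backward waveforms v+ and v-
vp = lambda s: np.sin(3.0 * s)
vm = lambda s: 0.5 * np.cos(2.0 * s)

# Space-time grid
x = np.linspace(0.0, 1.0, 801)
t = np.linspace(0.0, 1.0, 801)
X, T = np.meshgrid(x, t, indexing="ij")

# Traveling-wave solution of (1)-(2)
v = vp(T - X / c0) - vm(T + X / c0)
p = rho0 * c0 * (vp(T - X / c0) + vm(T + X / c0))

# Central differences (interior points only)
dvdt = np.gradient(v, t, axis=1)
dpdx = np.gradient(p, x, axis=0)
dpdt = np.gradient(p, t, axis=1)
dvdx = np.gradient(v, x, axis=0)

res1 = np.max(np.abs(rho0 * dvdt + dpdx)[1:-1, 1:-1])          # residual of (1)
res2 = np.max(np.abs(dpdt + rho0 * c0**2 * dvdx)[1:-1, 1:-1])  # residual of (2)
```

Both residuals vanish up to the finite-difference truncation error of the grid.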
τξa : Kξ → R
(x, t) ↦ t + (x − ξ(t))/c0, (8)

and

Kξ : Kξ → Kξa
(x, t) ↦ (x, τξa(x, t)), (9)

where the image set Kξa is

Kξa = { (x, τξa(x, t)) for (x, t) ∈ Kξ }. (10)

In this definition (see Figure 2), T = τξa(x = X, t = θ) provides the arrival time t = T at position x = X of an acoustic wave emitted at time t = θ and position x = ξ(θ), according to (3). Indeed, the particle velocity v at (x, t) = (X, T) is

v⁺(τξa(X, θ) − X/c0) = v⁺(θ − ξ(θ)/c0),

which corresponds to the particle velocity at (x, t) = (ξ(θ), θ).

Figure 3: Perspective view of an acoustic wave generated at (ξ(t), t) and observed at (x, τξa(x, t)). The piston displacement is described by the curve labelled ξ(t), and the velocity by the helicoidal trajectory ξ′(t). Four lines of characteristic Kξ are represented by arrows, connecting ξ′(t) with the particle velocity v at position x.
The characteristic defined in (9) is based on the emission time of the acoustic wave (propagation occurs at time τξa(x, t) > t). Then the expression of the particle velocity observed at x is v(x, τξa(x, t)). However, we seek a solution based on the observation time, so that the velocity is v(x, t). Thus, the existence of strong solutions involves the reciprocal application Kξ⁻¹.

The main result of this section is given by the following theorem.

Theorem 1 (Strong solution). Let ξ be a C^(n+1)-regular function with n ∈ N. Suppose that its (signed) Mach number is strictly less than 1, namely,

∀t ∈ R, ξ′(t)/c0 < 1. (11)

Then, the following results hold:

(i) Kξ is a C^n-regular diffeomorphism from Kξ to Kξa,

(ii) the strong solution v of (3-6-7) is C^n-regular and given by

v : Kξa → R
(x, t) ↦ ξ′ ∘ τξd(x, t), (12)

where the (unique) departure time τξd(x, t) of the travelling wave observed at (x, t) is given by the C^n-regular function

τξd : Kξa → R
(x, t) ↦ [Kξ⁻¹(x, t)]₂, (13)

where [Kξ⁻¹(x, t)]₂ denotes the second component of the function.

Consequently, if ξ is C^∞-regular, then all the above-defined functions are also C^∞-regular.

The proof of this theorem is given in Appendix B. It relies on the two following propositions.

Proposition 1 (Kξ is a bijection). Let ξ be a C¹-regular function. Then, function Kξ : Kξ → Kξa is a bijection if function ξ satisfies condition (11).

Proof. (i) Surjectivity. By construction of the image set Kξa (see (10)), Kξ is a surjective function.
(ii) Injectivity. ξ is a C¹-regular function, so that τξa and then Kξ are also C¹-regular functions. For all (x, t) ∈ Kξ, the jacobian of Kξ is given by

JKξ(x, t) = [ 1, 0 ; 1/c0, 1 − ξ′(t)/c0 ],

in which 1 − ξ′(t)/c0 is strictly positive. Therefore Kξ(x, t) is strictly increasing with respect to x and t. The monotonicity proves that Kξ is injective, which concludes the proof.

Remark 1 (Condition on the Mach number). (11) is a sufficient condition to ensure the bijectivity of Kξ. To obtain a necessary and sufficient condition, (11) must be modified as follows:
• the set E = {t ∈ R s.t. ξ′(t)/c0 = 1} has a zero measure, and
• ξ′(t)/c0 < 1 for all t ∈ R \ E.
Indeed, as E has a zero measure, the monotonicity is still fulfilled so that this condition is still sufficient. This condition is also necessary, because if E has a non-zero measure, there exists an open subset of Kξ on which the jacobian of Kξ has a zero determinant, making Kξ locally constant and so non-injective.

Remark 2 (Relations between Kξa and Kξ). By construction of Kξa (see (10)), Kξ ⊆ Kξa: the characteristic lines cover Kξ (see Lξ(t) in Figure 2). Moreover, if the Mach number condition (11) is fulfilled, the characteristic lines Lξ(t) start from the boundary ∂Kξ of Kξ at point (ξ(t), t), t ∈ R, and do not recross this boundary ∂Kξ = {(ξ(t), t), t ∈ R}. Therefore, Kξa = Kξ.

Proposition 2 (Kξ is a C^n-diffeomorphism). Let ξ be a C¹-regular function that satisfies condition (11). If function ξ is also C^(n+1)-regular with n ∈ N, then function Kξ is a C^n-regular diffeomorphism. Consequently, if ξ is C^∞-regular, then Kξ is a C^∞-regular diffeomorphism.

Proof. Let n ∈ N and consider ξ ∈ C^(n+1)-regular. Suppose that condition (11) is satisfied. Then, from Proposition 1, function τξd : (x, t) ∈ Kξa ↦ [Kξ⁻¹(x, t)]₂ ∈ R exists and is continuous.

Now, we prove by induction that τξd is C^p-regular for 1 ≤ p ≤ n.

• Case p = 1: τξd ∈ C¹.
From (8), τξa is C¹-regular so that D₂τξa is continuous, where D₂ stands for the derivative with respect to the second component of the function. Moreover, ∀(x, t) ∈ Kξa, (x, τξd(x, t)) ∈ Kξ and

f(x, t) = D₂τξa(x, τξd(x, t)) = 1 − ξ′(τξd(x, t))/c0

defines a function f : Kξa → R which is:
(i) continuous, because it is a composition of the continuous functions τξd and D₂τξa,
(ii) strictly positive, because (11) is satisfied.
The jacobian of Kξ⁻¹, given by

JKξ⁻¹ : Kξa → M₂,₂(R)
(x, t) ↦ [ 1, 0 ; −1/(c0 f(x, t)), 1/f(x, t) ], (14)

is then a continuous function. Hence, Kξ⁻¹ and then τξd (see (13)) are C¹-regular.

• Case p ≥ 2: if τξd is C^p-regular, then τξd is C^(p+1)-regular.
From (14), the jacobian JKξ⁻¹ is C^p-regular. Therefore τξd is C^(p+1)-regular, which concludes the proof.

From Proposition 1, there exists a function τξd : Kξa → R, such that for all (x, t) ∈ Kξa, equation (8) reads

τξd(x, t) = t − (x − ξ ∘ τξd(x, t))/c0. (15)

The calculation of this space-time function can be reduced to that of a simpler time function, because of the following property.

Property 1 (Translational symmetry). Let δ > 0 and Tδ : (x, t) ↦ (x + δ, t + δ/c0) be the space-time translation operator of space-time shift (δ, δ/c0).
Then, for any arrival space-time point (x, t) ∈ Kξa, the translated point Tδ(x, t) is in Kξa and the departure time of this translated point is unchanged:

∀(x, t) ∈ Kξa, δ > 0, τξd ∘ Tδ(x, t) = τξd(x, t), (16)

and v(Tδ(x, t)) = ξ′ ∘ τξd(x, t).

The proof is straightforward from (3).

Consequently, the next section addresses the derivation of a solver for the reduced problem given by

τξ(t) = t + ξ ∘ τξ(t)/c0, (17)

where τξ(t) = τξd(x = 0, t), and from which the strong solution is derived:

v(x, t) = ξ′ ∘ τξ(t − x/c0). (18)

Remark 3 (Equivalent Eulerian description). Consider function V : (x, t) ↦ ξ′ ∘ τξ(t − x/c0) defined on R². This function coincides with v on Kξa (see (12)) and is such that V ∘ Tδ is invariant with respect to δ. Consequently, function V extends solution v on the space-time domain R², and t ↦ V(0, t) stands for the equivalent source description at x = 0 (Eulerian description).

4. RECURSIVE METHOD BASED ON POWER SERIES

For the sake of readability, the "tilde" symbols are omitted in the sequel.

4.2. Perturbation method and series expansion

Consider ξ ∈ C^∞ (so that τξ ∈ C^∞ from Theorem 1), and let

ε(t) = τξ(t) − t, (24)

which corresponds to the variation of propagation time due to the source motion. Reformulating (22-23) with ε(t) yields

v(x, t) = ξ′(t − x + ε(t)), (25)

ε(t) = ξ(t + ε(t)). (26)

Now, we solve the implicit equation (26) using a perturbation method. Consider that (26) describes a system of input ξ and output ε(t), and apply the change of variable ξ = α·u, with α ≥ 0. The method consists in writing the output as a formal power series in α, such that

ε(t) = Σ_{n=0}^{∞} αⁿ εn(t)/n!. (27)

Then (26) becomes

Σ_{n=0}^{∞} αⁿ εn(t)/n! = α·u(t + Σ_{n=0}^{∞} αⁿ εn(t)/n!). (28)

Now function u is developed into a Taylor series at point t and, identifying equal powers of α, a recursion is obtained for the terms εn(t), ∀n > 1, where the amplitude α has been dropped, so that u = ξ.
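This order-by-order resolution of (26) can be reproduced with a few lines of sympy (not the paper's code): fixed-point iteration on truncated power series stabilises one more order of α per pass. Here u = sin is taken for concreteness, so only the structure of the low-order terms is checked, not the paper's normalisation of εn:

```python
import sympy as sp

t, alpha = sp.symbols("t alpha")

# Solve eps = alpha*u(t + eps) as a formal power series in alpha:
# each fixed-point pass fixes one more power of alpha.
N = 4  # truncation order
eps = sp.Integer(0)
for _ in range(N):
    eps = sp.series(alpha * sp.sin(t + eps), alpha, 0, N).removeO().expand()

c2 = eps.coeff(alpha, 2)  # expected: u(t)*u'(t) = sin(t)*cos(t)
c3 = eps.coeff(alpha, 3)  # expected: u*u'**2 + u**2*u''/2
```

The order-2 and order-3 coefficients come out as u·u′ and u·u′² + u²·u″/2, homogeneous of degrees 2 and 3 in u and its derivatives, in agreement with Property 2 below.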
Property 2 (Multivariate polynomial). εn(t) has the form

Cn : εn(t) = Pn(ξ(t), ξ′(t), ..., ξ^(n−1)(t)), (35)

where Pn is a homogeneous multivariate polynomial of degree n.

The proof of this property is given in Appendix C.

Now, the expression of the particle velocity (25) is also expanded into a Taylor series, at point t − x:

v(x, t) = Σ_{n=0}^{∞} [ ξ^(n+1)(t − x)/n! ] · ( Σ_{m=1}^{∞} εm(t)/m! )ⁿ. (36)

Applying again the Faà di Bruno formula for power series composition finally yields the expression of the particle velocity,

v(x, t) = Σ_{n=1}^{∞} vn(x, t), (37)

where

v1 = ξ^(1)(t − x),
vn = Σ_{k=1}^{n−1} [ ξ^(k+1)(t − x)/(k!(n−1)!) ] · B_{n−1,k}(ε1, ..., ε_{n−k}), ∀n > 1. (38)

The first orders of vn are listed below, where the dimensionless scaling has been dropped (dimensional variables are restored):

v1(x, t) = ξ^(1)(t − x/c0), (39a)
v2(x, t) = (1/c0) · ξ(t − x/c0) · ξ^(2)(t − x/c0), (39b)
v3(x, t) = (1/c0²) · [ (1/2) ξ(t − x/c0)² · ξ^(3)(t − x/c0) + ξ(t − x/c0) · ξ^(1)(t − x/c0) · ξ^(2)(t − x/c0) ]. (39c)

Remark 4 (Link with Volterra series). From Property 2, εn(t) has homogeneous order n with respect to ξ(t) and its derivatives. Then the system of input ξ(t), on which the perturbation method is applied, and output ε(t) (and by extension v(x, t)) can be represented by a Volterra series expansion [5], formally in the space of distributions.

Remark 5 (Convergence). Some results about the convergence of Volterra series are available in [6, 7] for L∞ input signals: they involve L1 Volterra kernels. In the present problem, the convergence and the class of admissible inputs are a more complicated issue: because of the time-derivative in (38), kernels are not in L1. In addition to (11), some properties about asymptotic behaviour (bounds on time derivatives or on frequency characteristics) have to be examined to set the convergence condition: this future work will provide the class of admissible waveforms.

Remark 6 (Practical implementation). For implementation purposes, the expansion (37) is truncated at a given order N. Moreover, for a signal in the frequency range fmax, a frequency oversampling by a factor of N is applied so that fs = N·(2 fmax), ensuring no aliasing (the frequency range of vn is then n·fmax ≤ N·fmax).

4.3. Numerical evaluation of the convergence

Although the convergence domain of (38) is not tackled from the theoretical side, this section presents simulations of the acoustic output for a 40 Hz, 1 s sinusoidal velocity input at various Mach numbers. Figure 4 shows the results for different truncation orders, from N = 1 to N = 15. Divergence of the series is noted at Mach 1 and above, which is consistent with the condition of existence of the strong solution (11). Since the usual range of membrane velocities is far below this limit, the convergence criterion should be met, at least for sinusoidal motions.

Figure 4: Convergence computation for various sinusoidal input velocities (four panels plotting max|v|/max|u| against the truncation order N, at Mach 0.6, 0.8, 1.0 and 1.5). Simulations at Mach 0.6 and Mach 0.8 converge, contrary to simulations at Mach 1 and Mach 1.5.

5. SIMULATION AND DISTORTION EVALUATION

5.1. Harmonic distortion

The harmonic distortion is evaluated for a sinusoidal piston velocity with constant amplitude over the frequency spectrum (loudspeaker with a flat frequency response), so that the input displacement takes the form

ξ(t) = A/(2πf1) · sin(2πf1 t). (40)

Two simulations are implemented, listed in Table 2 below. The velocity amplitude A corresponds to a high velocity of the driver's membrane, T is the simulation duration and N is the truncation order of (38). Results are presented in Figures 5 and 6.

Table 2: Simulation parameters used for harmonic distortion simulations.

Name | f1 | A | T | N | fs
HD1 | 40 Hz | 1 m/s | 50 s | 3 | N × 44100 Hz
HD2 | 1 kHz | 1 m/s | 50 s | 3 | N × 44100 Hz
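The truncated expansion can also be checked against an exact reference solution obtained by fixed-point iteration on (17), which is a contraction whenever the Mach number is below 1. A sketch (not the paper's code; the parameters mirror HD1 and c0 = 343 m/s is an assumed value), comparing the order-1 and order-2 truncations of (39) at x = 0:

```python
import numpy as np

# Illustrative parameters mirroring HD1; c0 = 343 m/s is assumed
c0, A, f1 = 343.0, 1.0, 40.0
w = 2 * np.pi * f1
xi   = lambda s: A / w * np.sin(w * s)   # displacement, eq. (40)
dxi  = lambda s: A * np.cos(w * s)       # velocity
d2xi = lambda s: -A * w * np.sin(w * s)  # acceleration

t = np.linspace(0.0, 1.0 / f1, 4096)

# Exact solution at x = 0: solve tau = t + xi(tau)/c0 by fixed-point
# iteration (a contraction since max|xi'|/c0 = Mach < 1 here).
tau = t.copy()
for _ in range(50):
    tau = t + xi(tau) / c0
v_exact = dxi(tau)  # eq. (18) at x = 0

# Truncated series, eqs. (39a)-(39b) at x = 0
v1 = dxi(t)
v2 = xi(t) * d2xi(t) / c0

err1 = np.max(np.abs(v_exact - v1))         # order-1 truncation error
err2 = np.max(np.abs(v_exact - (v1 + v2)))  # order-2 truncation error
```

At this low Mach number (about 0.003), adding v2 reduces the truncation error by roughly two orders of magnitude, consistent with the convergence behaviour reported in Figure 4.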
• The same harmonic distortion is observed for HD1 and HD2, respectively in Figure 5 and Figure 6, with the appearance of the second harmonic around −55 dB. Injecting the expression of ξ(t) into the first-order equations (39) leads to

v1(x, t) = A sin(2πf1 t),
v2(x, t) = −A²/(2c0) · (1 − cos(4πf1 t)).

It appears that harmonic distortion only depends on the piston velocity amplitude, as confirmed by the simulations. Moreover, the amplitude ratio between v1 and v2 is −56.7 dB, which is consistent with the numerical result.

• A THD¹ of 0.14 % is noted. In most cases, harmonic distortion can then be neglected. This is in agreement with previous studies [2, 8]. A truncation at order 2 appears to be sufficient to characterise this type of distortion (in the loudspeaker velocities range).

Figure 5: Results of simulation HD1. The amplitude spectrum of the input piston velocity (top) and the output particle velocity (bottom) are shown. Harmonic distortion is observed at 2f1.

Figure 6: Results of simulation HD2. The amplitude spectrum of the input piston velocity (top) and the output particle velocity (bottom) are shown. Harmonic distortion is observed at 2f1.

5.2. Intermodulation distortion

In this section, the intermodulation distortion is examined by simulating the particle velocity for an input displacement of the form

ξ(t) = A/(2πf1) · sin(2πf1 t) + A/(2πf2) · sin(2πf2 t).

Given the weak amplitude of harmonic distortion, it is assumed that both phenomena can be analyzed separately (the harmonic distortion that occurs in the following computations is neglected). The simulation parameters are listed in Table 3. Simulation results are presented in Figures 7 and 8.

Table 3: Simulation parameters used for intermodulation distortion simulation.

Name | f1 | f2 | A | T | N | fs
IMD1 | 40 Hz | 3 kHz | 1 m/s | 50 s | 5 | N × 44100 Hz
IMD2 | 40 Hz | 6 kHz | 1 m/s | 50 s | 5 | N × 44100 Hz

• An increase of intermodulation distortion is observed between both simulations, with a noticeable gain in sideband amplitude for IMD2. The Intermodulation Factors² are 7.5 % and 11.4 % for IMD1 and IMD2, respectively. This confirms that the intermodulation effect, which rises with the value of f2, should be taken into account for large frequency range speakers.

• Truncation at order 5 was sufficient for both simulations to capture the distortion phenomena in a dynamic of 100 dB. However, higher orders might be necessary in case of sound synthesis with an audio signal as input, since intermodulation distortion can be much higher with complex signals.

• Finally, Figure 9 shows the intermodulation distortion for simulation IMD2 in the time domain, resulting in phase modulation of the high frequency component.

Figure 7: Amplitude spectrum of the particle velocity (bottom) for simulation IMD1 (top). The bottom figure is rescaled around the frequency of interest, f2. Intermodulation distortion is observed at f2 − p·f1 and f2 + p·f1 for p = 1, 2, 3.

¹ Total Harmonic Distortion is calculated as the ratio of the RMS amplitude of the harmonics to the RMS amplitude of the fundamental.
² Intermodulation Factor is calculated as the ratio of the RMS amplitude of the sidebands to the RMS amplitude of the carrier frequency.
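The −56.7 dB figure follows from one line of arithmetic: the 2f1 component of v2 has amplitude A²/(2c0), while v1 has amplitude A, independent of f1. A quick check (c0 = 343 m/s is an assumed value, not stated explicitly in this excerpt):

```python
import math

A, c0 = 1.0, 343.0  # A from Table 2; c0 = 343 m/s assumed

# Amplitude of the 2*f1 component of v2, relative to the amplitude A of v1
ratio_db = 20 * math.log10((A**2 / (2 * c0)) / A)
```

This evaluates to about −56.7 dB, matching the level of the second harmonic reported above, and depends only on A/c0, not on f1, which is why HD1 and HD2 show the same distortion.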
• the improvement of the model by including the convection phenomenon.

Figure 8: Amplitude spectrum of the particle velocity (bottom) for simulation IMD2 (top). The bottom figure is rescaled around the frequency of interest, f2. Intermodulation distortion is observed at f2 − p·f1 and f2 + p·f1 for p = 1, 2, 3, 4.

Figure 9: Time samples of simulation IMD2, with and without Doppler effect, around (a) a falling edge and (b) a rising edge of the low-frequency component. Phase modulation is observed, since the high frequency component is phase-shifted on the falling edges and conversely on the rising edges.

Figure 10: Spectrograms of the particle velocities, normalized in dB, simulated with (bottom) and without (top) Doppler effect. The input piston velocity is composed of two chirps and one pure tone.
[7] T. Hélie and B. Laroche, "Computable convergence bounds of series expansions for infinite dimensional linear-analytic systems and application," Automatica, vol. 50, no. 9, pp. 2334–2340, 2014.

[8] W. Klippel, "Loudspeaker nonlinearities – causes, parameters, symptoms," in Audio Engineering Society Convention 119, 2005.

[9] D. W. van Wulfften Palthe, "Doppler effect in loudspeakers," Acta Acustica united with Acustica, vol. 28, no. 1, pp. 5–11, 1973.

[10] H. C. Kessler Jr., "Equivalent eulerian boundary conditions for finite-amplitude piston radiation," The Journal of the Acoustical Society of America, vol. 34, no. 12, pp. 1958–1959, 1962.

[11] B. Zóltogórski, "Moving boundary conditions and nonlinear propagation as sources of nonlinear distortions in loudspeakers," Journal of the Audio Engineering Society, vol. 41, no. 9, pp. 691–700, 1993.

[12] G. Lemarquand and M. Bruneau, "Nonlinear intermodulation of two coherent acoustic progressive waves emitted by a wide-bandwidth loudspeaker," Journal of the Audio Engineering Society, vol. 56, no. 1/2, pp. 36–44, 2008.

B. PROOF OF THE THEOREM 1

Proof. Let ξ be a C^(n+1)-regular function with n ∈ N that satisfies (11). Point (i) results from Propositions 1 and 2. The proof of point (ii) is divided into two steps.

a. Boundary condition: v(ξ(t), t) = ξ′(t).
Because of (i), τξd and v = ξ′ ∘ τξd are C^n-regular. From (9), the following equality can be computed:

Kξ(ξ(t), t) = (ξ(t), t), ∀t ∈ R. (41)

Moreover, by definition of the reciprocal function Kξ⁻¹,

∀(x, t) ∈ Kξa, (x, t) = Kξ⁻¹ ∘ Kξ(x, t), (42)

so that, replacing (x, t) by (ξ(t), t),

(ξ(t), t) = Kξ⁻¹(ξ(t), t). (43)

Taking the second component of (43), where [Kξ⁻¹]₂ = τξd, yields

τξd(ξ(t), t) = t. (44)

Therefore the solution ξ′ ∘ τξd(ξ(t), t) verifies the boundary condition (5).

b. Propagation: v(x, t) = v⁺(t − x/c0).
Define the function

L : Kξ0 → Kξa
(x, t) ↦ (x, t + x/c0), (45)

where Kξ0 = {(x, t) ∈ R² | (x, t + x/c0) ∈ Kξa}. Moreover, isolating the second component of

L(x, t) = Kξ ∘ Kξ⁻¹ ∘ L(x, t) = (x, τξa(x, τξd ∘ L(x, t))) (46)

leads to, for all (x, t) ∈ Kξ0,

t + x/c0 = τξd ∘ L(x, t) + (x − ξ ∘ τξd ∘ L(x, t))/c0. (47)

This equation is equivalent to, for all (x, t) ∈ Kξ0,

t = Fξ(τξd ∘ L(x, t)), (48)

where

Fξ : R → R
t ↦ t − ξ(t)/c0. (49)

Now, for all t ∈ R, Fξ′(t) = 1 − ξ′(t)/c0 > 0 and lim_{t→±∞} Fξ(t) = ±∞, so that Fξ is a strictly increasing bijective function. It follows that

τξd ∘ L(x, t) = Fξ⁻¹(t), (50)

proving that τξd ∘ L(x, t) does not depend on x. Consequently, the strong solution (12) can be written, for all (x, t) ∈ Kξa,

v(x, t) = ξ′ ∘ τξd(x, t) = ξ′ ∘ τξd ∘ L(x, t − x/c0), (51)

which has the form v⁺(t − x/c0), with v⁺ = ξ′ ∘ τξd ∘ L, which concludes the proof.

C. PROOF OF THE PROPERTY 2

Proof. Let us prove by induction the following property:

Cn : εn = Pn(ξ(t), ξ′(t), ..., ξ^(n−1)(t)),

where Pn is a homogeneous multivariate polynomial of degree n.

• Step 1: C2 is true, since ε2 = ξ(t)·ξ′(t), which is a polynomial of degree 2.

• Step 2: if Cn is true, then Cn+1 is true.
Let εn = Pn(ξ, ξ′, ..., ξ^(n−1)). Then

ε_{n+1} = (n+1) · Σ_{k=1}^{n} ξ^(k) · B_{n,k}(ξ(t), ..., P_{n−k+1}(ξ, ξ′, ..., ξ^(n−k))),

where the Bell polynomials are developed as

B_{n,k}(ξ, ..., P_{n−k+1}(ξ, ξ′, ..., ξ^(n−k))) = Σ_{(j1,...,j_{n−k+1}) ∈ Π_{n,k}} (ξ(t)/1!)^{j1} × ... × (P_{n−k+1}(ξ, ..., ξ^(n−k))/(n−k+1)!)^{j_{n−k+1}},

and

Π_{n,k} = { (j1, ..., j_{n−k+1}) : j1 + j2 + ... + j_{n−k+1} = k and j1 + 2·j2 + ... + (n−k+1)·j_{n−k+1} = n }.

B_{n,k} is a sum over products of polynomials, hence it is a polynomial. Moreover, the degree of each term of the sum over Π_{n,k} is 1·j1 + 2·j2 + ... + (n−k+1)·j_{n−k+1} = n, from the definition of Π_{n,k}; the extra factor ξ^(k) then raises the degree by 1. Therefore the degree of all nonzero terms of ε_{n+1} is n + 1, which concludes the proof.
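The partial Bell polynomials used in (38) and in this proof are available directly in sympy (this is a generic check of their homogeneity, not code from the paper): assigning weight m to the variable x_m, every monomial of B_{n,k} has total weight n and degree k.

```python
import sympy as sp

x1, x2, x3 = sp.symbols("x1 x2 x3")

# Partial Bell polynomials B_{n,k}(x1, ..., x_{n-k+1})
b32 = sp.bell(3, 2, (x1, x2))      # 3*x1*x2
b42 = sp.bell(4, 2, (x1, x2, x3))  # 4*x1*x3 + 3*x2**2

# Homogeneity check mirroring the degree argument in the proof above:
# weight(x_m) = m, so every monomial of B_{4,2} has weight 4 and degree 2.
poly = sp.Poly(b42, x1, x2, x3)
weights = [sum((m + 1) * e for m, e in enumerate(mono)) for mono in poly.monoms()]
degrees = [sum(mono) for mono in poly.monoms()]
```

With εm in place of x_m, weight m corresponds to homogeneous order m in ξ and its derivatives, which is exactly how the degree n + 1 of ε_{n+1} is obtained.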
Introducing this nonlinearity gives us our final model for the amplitude of the ith damped vibrational mode of a sounding object:

dxi(t)/dt + Di Re{xi^γ(t)} = Si g(u(t)), (14)

which is the target system formulated in (1) with gx(xi) = Re{xi^γ} and gu(u) = g(u).

4.3. Resynthesis with the State Space Model

After inference is performed, we obtain an optimised set of parameters θ, and a posterior distribution over the outputs and the latent input. We apply an inverse scaling operation to obtain the original magnitude weightings. The posterior distribution provides us with

An advantage of using a relatively simple state space model such as the one in equation (15) is its flexibility with regard to parameter control and time step size. We now illustrate how we can utilise these features to run our model in real time with user control, and to interpolate between parameter values to manipulate the sound timbre.
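The single-mode dynamics of (14) can be sketched with forward-Euler integration (this is an illustration only: g is taken as the identity, the parameter values are made up, and the paper's actual discretisation is not specified in this excerpt):

```python
import numpy as np

def simulate_mode(D, S, gamma, u, dt, x0=0.0):
    """Forward-Euler integration of dx/dt + D*Re{x**gamma} = S*g(u(t)),
    with g taken as the identity (an assumption for illustration)."""
    x = np.empty(len(u))
    x[0] = x0
    for k in range(1, len(u)):
        xg = np.real(np.complex128(x[k - 1]) ** gamma)  # Re{x^gamma}
        x[k] = x[k - 1] + dt * (S * u[k - 1] - D * xg)
    return x

# A short impulsive excitation: the mode amplitude rises, then decays
# at a rate governed by the damping D and the nonlinearity gamma.
dt = 1e-3
u = np.zeros(2000)
u[:20] = 1.0  # hypothetical excitation burst
x = simulate_mode(D=8.0, S=50.0, gamma=0.5, u=u, dt=dt)
```

The complex power keeps Re{x^γ} well defined for fractional γ, matching the Re{·} in (14); with γ < 1 the decay rate grows as the amplitude shrinks, which is the behaviour exploited for impact sounds in Section 6.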
Figure 3: Latent force modelling of a clarinet note. 6 modes are picked based on their amplitude, and the predictive mean of the output
distribution is compared to the real data (top left). The frequency data (bottom left) shows the modes are, in order of magnitude, the 1st,
3rd, 5th, 4th, 7th and 6th harmonics. The mean, ū, and 95% confidence interval (uncertainty) of the latent input u is shown (bottom right).
g(ū) is fed through the state space model to resynthesise the output (top right). Low uncertainty results in resynthesis very similar to the
predictive mean.
5.1. Real Time Synthesis

In the previous section ∆tk was fixed at the analysis time step size, corresponding to framewise modelling. During synthesis we can set the step size to be as large or small as required. Based on our desired sampling frequency, we modify ∆tk such that the model calculates sample-rate data and runs in real time.

This modification allows us to handle audio-rate input, which may be crucial for a synthesis model that requires expressive user control. As mentioned in Section 4.3, resynthesis can be performed by sampling from the posterior distribution over the latent excitation function and passing the sample through the model. However, with the aim of user-controllable synthesis in mind, and given that the excitation function is interpreted as physical energy forcing the system, it is possible to replace the mean of the latent distribution with a new function dependent on some user input.

We control the synthesis model with user input data corresponding to the pressure applied to a MIDI CC button or a force-sensing resistor, scaling the data appropriately such that it has similar properties to the learnt latent input. Alternatively, we provide the user with a modifiable plot of the excitation function, which they can re-draw and modify to create new sounds.

5.2. Sound Morphing

Our linear time-invariant synthesis model has fixed stiffness and damping parameters corresponding to each mode. Adjusting these parameters has an impact on perceptual characteristics relating to timbre such as attack time, decay time and the modes' amplitudes relative to one another. Individual modification of these parameters is possible, but not desirable if we wish to maintain coherence across dimensions. Instead, we interpolate parameters between models to create new sound timbres not present in the original recordings.

Prior to parameter interpolation we match the modes between models by ranking them in order of frequency. We also normalise the magnitude of the excitation functions, adjusting the stiffness parameters accordingly. For sounds without definable harmonic structure, pairing the modes is straightforward and simply based on their rank position. For harmonic sounds we must be careful to match the nth harmonic in model A to the nth harmonic in model B. If we fail to do so, interpolation of the frequency value will compromise the harmonic structure of the sound.

Once modes have been paired we perform linear interpolation of the physical parameters Si and Di and the initial conditions, and logarithmic interpolation of the frequency. Synthesis in this manner negates the need for time-domain modification (such as time-stretching) usually associated with morphing [20].

6. RESULTS

In order to show the versatility of our approach we consider two case studies: musical instruments, demonstrated here by a short clarinet note, and impact sounds, demonstrated by the sound of metal being struck by a solid object. We then measure the accuracy of our reconstructed data for a number of recordings, and show the output produced by morphing between two different sounds.
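The pairing-and-interpolation scheme of Section 5.2 (linear in Si and Di, logarithmic in frequency) can be sketched as follows; the mode values here are hypothetical and purely illustrative:

```python
import math

def morph(modes_a, modes_b, alpha):
    """Interpolate between two mode lists already paired by rank/harmonic.
    Each mode is a dict with stiffness S, damping D and frequency f.
    S and D are interpolated linearly, the frequency logarithmically."""
    out = []
    for a, b in zip(modes_a, modes_b):
        out.append({
            "S": (1 - alpha) * a["S"] + alpha * b["S"],
            "D": (1 - alpha) * a["D"] + alpha * b["D"],
            "f": math.exp((1 - alpha) * math.log(a["f"])
                          + alpha * math.log(b["f"])),
        })
    return out

# Hypothetical paired modes of two models, each with harmonic ratio 3:1
model_a = [{"S": 50.0, "D": 8.0, "f": 440.0},
           {"S": 20.0, "D": 9.0, "f": 1320.0}]
model_b = [{"S": 80.0, "D": 30.0, "f": 550.0},
           {"S": 35.0, "D": 45.0, "f": 1650.0}]

mid = morph(model_a, model_b, 0.5)
```

Because the interpolation of frequency is geometric, paired harmonics that share the same frequency ratio in both models keep that ratio exactly at every intermediate alpha, preserving the harmonic structure the pairing step is designed to protect.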
Figure 4: Latent force modelling of a metal impact sound. The real analysis data shows some variation in behaviour between modes (top
left). An increase in uncertainty in the posterior distribution after 0.1s reflects this fact (bottom right). The posterior mean of the latent
distribution, ū, is fed through the state space model, and the result shows that much of the variable behaviour was captured (top right).
PCA results are shown as a comparison, and we can see that the variable damping rates have not been reproduced (bottom left).
6.1. Musical Instruments

Most musical instruments have strong harmonic structure, and the majority of the signal energy tends to be contained within relatively few sinusoids representing these harmonics. By inspecting the data and experimentally testing the results for various values of γ, we found that musical instruments tend to have a relatively linear decay, and a choice of γ = 1/2 fits the data best.

Figure 3 shows the results of latent force modelling of a clarinet note. The first 6 modes are considered, and on viewing the mean of the distribution of the outputs (Figure 3a), we can see that much of the behaviour has been captured in the model. The attack of the largest mode is partially altered to fit the shape of the other modes, since the simple mechanistic model struggles to encode peaks that are out of phase with each other. However, the variable damping rates have successfully been learnt, with the largest mode reducing to zero at a much slower rate than the smallest modes.

We plot the 95% confidence interval for the latent input (Figure 3d), and observe that the uncertainty in the learnt model increases towards the end of the signal, as some partials reduce to zero and their behaviour no longer correlates with the non-zero partials. The resynthesised outputs (Figure 3b) are almost identical to the predictive mean of the outputs when passing the mean of the latent input, ū(t), through the model (14). This suggests that the observed degree of uncertainty is acceptable.

6.2. Impact Sounds

Impact sounds often lack clear harmonic structure, and energy is distributed across the frequency spectrum. In selecting just a small number of modes, we risk losing much of the audio content. However, our selected modes are capable of reproducing much of the deterministic character of the signal. The remainder is treated as a residual, and not addressed here. We found that for impact sounds a model choice of γ = 3/4 was more appropriate, since the decay rate varies as the amplitude decreases (in Figure 4a, the partials' gradients flatten out over time).

Figure 4a shows that for a metal impact sound large variation of behaviour occurred between modes. To account for this, a large variation of stiffness and damping parameters was learnt, enabling much of the behaviour to be captured. Comparing the synthesised outputs for the two largest modes in Figure 4b, we see that whilst they have a similar attack, encoded by the stiffness or sensitivity measure Si, they have a very different decay, encoded by the damping measure Di.

Uncertainty in the metal impact model (Figure 4d) increased more quickly than in the clarinet model, reflecting the fact that behaviour is less consistent across these vibrational modes than across the harmonics of the clarinet. In particular we observe an increase in uncertainty after the initial attack, when the modes' behaviour begins to diverge from one another.

6.3. Model Accuracy and Comparison with PCA

To evaluate our results we calculated the root-mean-square (RMS) error between the actual data and our synthesised outputs. This gives us a measure of our ability to reproduce the analysed sinusoidal partials. Readers are also invited to listen to the sound examples provided.
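The evaluation metric above is straightforward to reproduce. Below is a minimal sketch of a per-mode-normalised RMS error; the exact normalisation is not spelled out in the text, so unit-peak normalisation per mode is an assumption here, and the function name is illustrative:

```python
import numpy as np

def rms_error(data, recon):
    """RMS error between modal-amplitude tracks and a reconstruction.

    data, recon: arrays of shape (n_frames, n_modes). Each mode (column)
    is scaled by the peak of the reference track so that every dimension
    receives equal weight (assumed normalisation).
    """
    data = np.asarray(data, dtype=float)
    recon = np.asarray(recon, dtype=float)
    scale = np.abs(data).max(axis=0)
    scale[scale == 0.0] = 1.0        # leave silent modes unscaled
    err = (data - recon) / scale
    return float(np.sqrt(np.mean(err ** 2)))
```

A perfect reconstruction gives 0; values for different recordings are then comparable on the same scale, as in Table 1.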
Figure 5: Sound morphing between an oboe and a clarinet. The modes of an oboe (left) are matched with the modes of a clarinet (right)
and colour-coded based on their pairings. Since the modes represent harmonics, it is important to maintain the harmonic structure, so the
2nd mode of the oboe does not have a match. Similarly, the 6th mode of the clarinet is not matched. Stiffness and damping parameters are
interpolated, and a user-drawn excitation function of arbitrary length is used to produce the morphed output (middle).
As a comparison, we run principal component analysis (PCA) on our amplitude data. PCA is another dimensionality reduction technique that similarly maps high-dimensional data to a lower-dimensional space through an input-output process (a simple scalar weighting), providing us with a set of orthogonal variables, called principal components, ranked in order of how much of the data's variance they describe.

Latent force modelling has many benefits over PCA, such as physical interpretability, model memory (PCA is an instantaneous mapping), the ability to introduce nonlinear mappings between inputs and outputs, and a probabilistic framework for calculating uncertainty and resampling new data (although probabilistic PCA techniques also exist). Regardless, PCA is a worthwhile comparison due to its simplicity and reliability.

Figure 4c shows the results of PCA on a metal impact sound. Using just one principal component to reproduce the 8-dimensional output fails to capture much of the behaviour, most notably the variable damping rates. With the one-dimensional LFM we are able to capture much more of the behaviour (Figure 4b). Note that it is possible to introduce more principal components, and also possible to run latent force modelling with more than one latent dimension, but this violates our assumption that the modes are produced by a common excitation function.

Table 1 compares the RMS error for latent force modelling and PCA for a number of audio recordings. The data is normalised to give equal weighting to each dimension of the model. When disparate behaviour occurs across dimensions, latent force modelling is more accurate than reconstruction with one principal component. For recordings in which the dimensions have high correlation, such as the oboe, even one principal component sometimes outperformed the latent force model. This poor performance of the LFM for the oboe could be due to the optimisation procedure converging on a sub-optimal local minimum, or due to the fact that the oboe partials reduce to zero at an almost linear rate, and their behaviour was not fully captured by our choice of γ = 1/2, i.e. a more optimal choice of γ exists.

                        RMS error
    Audio recording     LFM       PCA
    Clarinet            0.0325    0.0593
    Oboe                0.0189    0.0156
    Piano               0.0441    0.0520
    Metal impact        0.0377    0.0609
    Wooden impact       0.0139    0.0291

Table 1: Root-mean-square (RMS) error between modal amplitude data and outputs of latent force modelling (LFM) and principal component analysis (PCA). The LFM outperforms PCA when disparate behaviour across dimensions is observed.

6.4. Morphing

Figure 5 shows the results of sound morphing between recordings of an oboe and a clarinet. A user-drawn excitation function is used as input to the morphed model (Figure 5b) and we observe the expected change in relative amplitudes. The modes of the oboe have much faster decay times than the clarinet, and visual inspection of the morphed sound confirms that decay rates in between these two extremes are achieved.
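The PCA baseline described above (an instantaneous scalar weighting onto orthogonal components) can be sketched with a plain SVD. This is a generic illustration, not the authors' code:

```python
import numpy as np

def pca_reconstruct(X, n_components=1):
    """Reconstruct data from its leading principal components.

    X: (n_frames, n_modes) modal-amplitude matrix. Returns the rank-
    n_components approximation obtained by projecting the mean-centred
    data onto the top principal directions and mapping back.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    # SVD of the centred data: principal directions are the rows of Vt
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components]            # (n_components, n_modes)
    scores = Xc @ W.T                # instantaneous scalar weighting
    return scores @ W + mu
```

Feeding this reconstruction and the LFM output through the same RMS-error measure reproduces the kind of comparison shown in Table 1.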
7. CONCLUSIONS

The aim of this work was to demonstrate our ability to learn about the physical behaviour of sound from recordings. Such an approach will aid those looking to design and build synthesis models that are faithful to the real-world sounds we hear around us, whilst also providing opportunities for control and expression.

We utilised knowledge about the way in which objects vibrate to produce sound to construct a simple mechanistic model for the behaviour of sinusoidal modes. Although this model does not describe all the physical interactions that create sound, its simplicity enables the application of nonlinear latent force modelling techniques to infer physically relevant parameters from audio recordings, in addition to the excitation required to produce meaningful output.

After the learning process was complete, we demonstrated how to perform synthesis in this framework, adapting the model to run in real time with user control. We then provided a way to manipulate sound characteristics through parameter morphing. We showed how the model often outperforms PCA when attempting to map sinusoidal data to a one-dimensional control space, but noted how higher accuracy is not guaranteed, since we rely on a high-dimensional optimisation procedure to find suitable parameter values.

As future work, the inference process would benefit greatly from intelligent selection of initial conditions to aid optimisation in finding appropriate solutions. Automatic identification of the linearity measure γ, or inclusion of γ as a parameter to be optimised during inference, would also be highly beneficial. The introduction of additional latent functions would allow us to model more complex systems with multiple control inputs.

Subjective evaluation of our ability to reproduce the quality of a given audio recording was not presented here, but is necessary to further assess the suitability of our approach. Complex amplitude modulation is difficult to model if the modes' peaks are out of phase with each other, and a system that allows for variable frequency would greatly improve its applicability. Finally, consideration of the residual component of the signal is crucial for further development of these techniques.
Indeed, the Tp coefficient in this case is directly related to Γp by Tp = 3Γp / (8ωp²). Then, provided that:

    Γp = Ap + Cp + (ωp² Bp) / 3

equation (1) is equivalent to (2). Consequently, the coefficient Γp, and therefore the nonlinear mode, can be experimentally identified through the measurement of Tp.

Finally, the softening Duffing model is assumed for two reasons: first, it provides a convenient basis to experimentally identify isolated nonlinear modes in the case of gongs; second, it gives the opportunity to define a single-parameter well-posed energy that can be manipulated through energy shaping control in order to change its softening or hardening behaviour.

Figure 1: Spectrogram of the sound of a xiaoluo after being struck by a mallet. The fundamental mode (∼449 Hz) displays a softening behaviour. (Axes: frequency [Hz], 0–2000, versus time [s], 0.5–5.5; colour scale: power/frequency, −90 to −55 dB/Hz.)

This study aims at controlling the softening behaviour of the xiaoluo gong's nonlinear fundamental mode, which we assume to be in the form of the softening Duffing equation described by Eq. (2). The control process relies on guaranteed-passive simulations that use the port-Hamiltonian approach. Port-Hamiltonian systems (PHS) are an extension of Hamiltonian systems, which represent passive physical systems as an interconnection of conservative, dissipative and source components. They provide a unified mathematical framework for the description of various physical systems. In our case, the PHS formalism allows for the writing of an energy-preserving numerical scheme [12] in order to simulate and control the Duffing equation (note that other, more precise guaranteed-passive numerical schemes [13] are available but not used in this work). The first observation when tackling the control problem is that the softening Duffing equation defined by Eq. (2) is energetically ill-defined (Section 2). For large amplitudes of vibration, the total system energy, written with the PHS approach, becomes negative and thus physically inconsistent. The first step of this study therefore consists in redefining the energy of the fundamental nonlinear mode. This new energy must be (i) as close as possible to the energy of the Duffing equation described in (2) and (ii) physically consistent. Secondly (Section 3), the Duffing energy and the new well-posed energy are both simulated using a guaranteed-passive numerical scheme that relies on the discrete energy gradient. Simulation results confirm the relevance of using the new energy for the control design. Thirdly (Section 4), control simulation of the fundamental mode's pitch glide is realised by shaping the system's new energy. The simulation results confirm the ability to modify the softening behaviour of the fundamental mode thanks to the new energy defined in Section 2. Finally, conclusions and perspectives for further research offered by this study are discussed in Section 5.

2. PHYSICAL MODEL

2.1. Original Duffing model

2.1.1. Equation of motion

As explained before, the nonlinear normal mode associated with the fundamental mode is modelled by a softening Duffing oscillator, expressed as in Eq. (2) with an added viscous modal damping:

    ẍ(t) + 2ξω0 ẋ(t) + ω0² x(t) − Γx³(t) = f(t)    (3)

ξ is the modal damping factor, ω0 is the modal pulsation, Γ is the nonlinear cubic coefficient (Γ > 0) and f is the input acceleration. These parameters have been experimentally identified; however, the description of the identification methods is beyond the scope of the paper. We then assume the following parameter values:

    ξ = 1.4 × 10⁻³
    ω0 = 2π × 449 rad/s
    Γ = 6.7 × 10⁶ (S.I.)

2.1.2. Dimensionless problem

For convenience, equation (3) is written with dimensionless amplitude x̃ and time t̃, defined such that x = X0 x̃ and t = τ t̃. The Duffing equation (3) becomes:

    (X0/τ²) x̃″(t̃) + 2ξω0 (X0/τ) x̃′(t̃) + ω0² X0 x̃(t̃) − ΓX0³ x̃³(t̃) = f(τt̃)

that is:

    x̃″(t̃) + 2ξω0τ x̃′(t̃) + ω0²τ² x̃(t̃) − Γτ²X0² x̃³(t̃) = f(τt̃) τ²/X0

Choosing τ and X0 such that τ = 1/ω0 and X0 = 1/√(τ²Γ) leads to the following Duffing equation:

    x̃″(t̃) + µ x̃′(t̃) + x̃(t̃) − x̃³(t̃) = f̃(t̃)    (4)

where µ = 2ξ and f̃(t̃) = √Γ f(τt̃)/ω0³.

For the sake of legibility, the tilde is omitted in the following.

2.2. Port-Hamiltonian approach

This section introduces some recalls on port-Hamiltonian systems in finite dimensions. The calculation of the Hamiltonian H of the Duffing system demonstrates that its potential energy H1 is negative for some displacement values. A new equivalent positive definite potential energy H1* is then defined for the control design.
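The softening character of the dimensionless Duffing equation (4) can be checked numerically: for the unforced, undamped oscillator x″ + x − x³ = 0, the period of free oscillation grows with amplitude, which is exactly the downward pitch glide visible in Figure 1. A small illustrative check (RK4 integration; not the paper's MATLAB simulation):

```python
import math

def rhs(x, v):
    """Unforced, undamped dimensionless softening Duffing: x'' + x - x^3 = 0."""
    return v, -x + x ** 3

def half_period(a, dt=1e-4):
    """Time for the oscillator released from rest at x = a (0 < a < 1) to
    reach the opposite turning point, i.e. half the period of oscillation."""
    x, v, t = a, 0.0, 0.0
    while True:
        # classical fourth-order Runge-Kutta step
        k1x, k1v = rhs(x, v)
        k2x, k2v = rhs(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
        k3x, k3v = rhs(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
        k4x, k4v = rhs(x + dt * k3x, v + dt * k3v)
        xn = x + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        vn = v + dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
        t += dt
        if v < 0.0 and vn >= 0.0:    # opposite turning point reached
            return t
        x, v = xn, vn
```

Doubling half_period gives the period: at small amplitude it is close to the linear value 2π, while at large amplitude (still |x| < 1) it is noticeably longer, i.e. the frequency has dropped.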
2.2.1. General formulation

A port-Hamiltonian system of state x(t), input u(t) and output y(t) can be represented by the following differential equations [14]:

    ẋ = [J(x) − R(x)] ∇x H(x) + G(x) u
    y = G(x)ᵀ ∇x H(x)

where ẋ is the time derivative of the state x, ∇x denotes the gradient with respect to the state x, H(x) is a positive definite function that represents the total energy of the system, the matrix J is skew-symmetric and R is positive definite (R ≥ 0). The power balance of the system can be expressed through the temporal energy variation Ḣ(x(t)) = ∇x H(x(t))ᵀ ẋ(t), that is:

    Ḣ = ∇xHᵀ J ∇xH − ∇xHᵀ R ∇xH + yᵀu

where the first term is zero (since J = −Jᵀ), the second is the dissipated power Pd > 0, and the third is the entering power Pe.

This power balance equation guarantees the passivity of the system. The variation of the system energy is expressed as the sum of elementary power functions corresponding to the storage, the dissipation and the exchanges of the system with the external environment. The dissipation term Pd is positive because R is positive definite. The power term Pe denotes the energy provided to the system by the ports u(t) and y(t) (external sources).

The formulation Ḣ(x) = ∇x H(x)ᵀ ẋ underlines the fact that each power function can be expressed as the product of a flux ([∇x H(x)]ᵢ or [ẋ]ᵢ) with its associated effort ([ẋ]ᵢ or [∇x H(x)]ᵢ). A concrete example is given below with the Duffing oscillator described by Eq. (4).

2.2.2. Softening Duffing oscillator energy

The port-Hamiltonian system corresponding to the Duffing equation (4) can be defined as follows:

• State: x = (x1, x2)ᵀ = (l, p)ᵀ, where l and p are the string elongation and the mass momentum, respectively.
• Dissipation: PD = µp² > 0.
• Source: input u = f and output −y = p, where p is the velocity of the nonlinear normal mode.

The total energy of the system H is the sum of the energy of the spring H1 and the energy of the mass H2:

    H(x1, x2) = H1(x1) + H2(x2) = (1/2)x1² − (1/4)x1⁴ + (1/2)x2²

The fluxes and efforts associated with the energies H1 and H2 are given in Table 1.

              Spring                              Mass
    Energy    H1(x1) = (1/2)x1² − (1/4)x1⁴        H2(x2) = (1/2)x2²
    Effort    dH1/dx1 = x1 − x1³                  dx2/dt
    Flux      dx1/dt = dl/dt                      dH2/dx2 = x2

Table 1: Energies and associated efforts and fluxes.

The port-Hamiltonian formulation of Eq. (4) can then be deduced:

    ẋ = (J − R) ∇x H(x) + G u

    (ẋ1; ẋ2) = [ (0, 1; −1, 0) − (0, 0; 0, µ) ] ∇x H(x1, x2) + (0; 1) u

The physical interpretation of a system is often analysed through the derivative of the potential energy (the forces), which in our case reads:

    H1′(x1) = x1 − x1³

The derivative H1′ is plotted in Figure 2(a) together with the potential energy derivative of the underlying linear system for comparison. One can see that looking at H1′ does not give any information about the physical existence of the softening Duffing system. It is only by plotting the potential energy H1 (see Figure 2(b)) that the softening system proves not to be physically defined for some displacement values x1: if |x1| > √2, H1 < 0 and H can be negative. Moreover, the equilibrium points x1 = 1 and x1 = −1 are saddle points, which means that the physical problem is restricted to |x1| < 1.

The softening Duffing system (4) is thus not energetically well-defined, and a new well-posed energy must be sought for the control design.

2.2.3. Well-posed problem

The aim of this paper is to seek functions H1* such that:

    • ∀x ∈ R, H1*(x) ≥ 0
    • H1* increases on R+
    • H1* decreases on R−                                            (5)
    • H1*″ is equivalent at order 2 to the dynamical stiffness of the softening Duffing, H1″(x) = 1 − 3x²

H1″(x) corresponds to the first terms of the Taylor expansion of x ↦ exp(−3x²). Then, one simple choice for H1*″ is:

    ∀x ∈ R,  H1*″(x) = exp(−3x²) = Σₙ₌₀^∞ (−3)ⁿ x²ⁿ / n!

If we assume the conditions H1*(0) = 0 and H1*′(0) = 0, single and double integration of H1*″ give, for all x ∈ R:

    H1*′(x) = Σₙ₌₀^∞ (−3)ⁿ x²ⁿ⁺¹ / (n!(2n+1)) = x − x³ + O(x⁵)

    H1*(x) = Σₙ₌₀^∞ (−3)ⁿ x²ⁿ⁺² / ((2n+2)(2n+1)n!) = x²/2 − x⁴/4 + O(x⁶)

and H1* meets the requirements (5). Note that H1*′ can be expressed with the help of the error function erf, defined for x ∈ R by:

    erf(x) = (2/√π) ∫₀ˣ e^(−λ²) dλ = (2/√π) Σₙ₌₀^∞ (−1)ⁿ x²ⁿ⁺¹ / (n!(2n+1))

leading to:

    H1*′(x) = (√π / (2√3)) erf(√3 x)
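The closed form for H1*′ can be cross-checked against the power series it was integrated from, using the standard library's erf (function names here are illustrative):

```python
import math

SQRT3 = math.sqrt(3.0)

def h1_star_prime(x):
    """Closed form: H1*'(x) = (sqrt(pi) / (2*sqrt(3))) * erf(sqrt(3)*x)."""
    return math.sqrt(math.pi) / (2.0 * SQRT3) * math.erf(SQRT3 * x)

def h1_star_prime_series(x, n_terms=40):
    """Term-by-term integrated Taylor series:
    sum_n (-3)^n x^(2n+1) / (n! (2n+1))."""
    return sum((-3.0) ** n * x ** (2 * n + 1) / (math.factorial(n) * (2 * n + 1))
               for n in range(n_terms))
```

Both forms agree to high precision, and for small x they reduce to the leading terms x − x³, i.e. the original Duffing force.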
where p(x) = x·erf(x) + e^(−x²)/√π is a primitive of the erf function. The potential energy H1* and its derivative are represented in Figure 2 along with H1 and the linear system potential energy for comparison. Note that H1* is positive and matches the Duffing potential energy for small displacements.

(Figure 2: second derivatives and potential energies against the displacement x1; legend: H1″, Duffing oscillator; H1*″, new potential energy (second derivative); linear case; H″ = 0.)

3. SIMULATION

This section describes the MATLAB guaranteed-passive simulation relying on the discrete energy gradient. The systems defined by (i) the Duffing potential energy H1 and (ii) the well-posed problem defined by the new potential energy H1* are simulated and compared.

Equation (6) is implicit and requires an iterative algorithm to be solved. In this work we use the Newton-Raphson method, which is written for time step k:
and

    ∇d H2(x2, δx2) = [H2(x2 + δx2) − H2(x2)] / δx2 = x2 + δx2/2

Then we have:

    F1(δx1, δx2) = δx1/δt − ∇d H2(x2, δx2)
    F2(δx1, δx2) = δx2/δt + ∇d H1(x1, δx1) + µ ∇d H2(x2, δx2) − u

and the Jacobian matrix is:

    JF = ( 1/δt ,  −1/2 ;
           1/2 − (3/2)x1² − 2x1 δx1 − (3/4)δx1² ,  1/δt + µ/2 )

The simulation of the Duffing oscillator is performed with an excitation force f(t) = f0 · g(t), where g is an impulse. The potential energy H1 versus the simulated displacement x1, for the limit excitation f0 = fmax = 9.5 · 10⁷, is plotted in Figure 3. The spectrogram of the oscillator response x1 is also shown in Figure 4 and highlights the softening behaviour of the oscillator. If the value of f0 exceeds fmax, the simulation fails, since |x1| > 1 (see Section 2). We will see in the next section that this difficulty can be overcome with the definition of a new potential energy.

(Figure 3: plot of the potential energy versus the simulated displacement; graphics not recovered.)

Figure 4: Spectrogram of the simulated Duffing system response x1 for an input force fmax = 9.5 · 10⁷.

In the case of the new problem defined by H1*, the discrete gradient is:

    ∇d H1*(x1, δx1) = [H1*(x1 + δx1) − H1*(x1)] / δx1
                    = (√π/6) [p(√3(x1 + δx1)) − p(√3 x1)] / δx1        (8)

∇d H2(x2, δx2) is the same as in the Duffing case. We can then deduce the Jacobian matrix:

    JF = ( 1/δt ,  −1/2 ;
           (√π/6) [√3 p′(√3(x1 + δx1)) δx1 − p(√3(x1 + δx1)) + p(√3 x1)] / δx1² ,  1/δt + µ/2 )

It is possible to simulate the system associated with the new energy H1* for an excitation force f0 = fmax, as in Section 3.2.

4.1. Energy reshaping

Let H1*^ε be the potential energy parameterised by ε ≠ 0 such that:

    H1*^ε(x) = x²/2 − ε x⁴/4 + O(x⁶)        (9)
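A minimal sketch of the resulting scheme for the Duffing case, using the discrete gradients and the Jacobian above, follows (function names are illustrative; this is not the authors' MATLAB code). With u = 0 and µ = 0 the discrete energy balance gives H(x + δx) = H(x) up to solver tolerance; with µ > 0 the energy decreases, which is the passivity guarantee being exploited:

```python
def grad_d_H1(x1, d):
    """Discrete gradient of H1(x) = x^2/2 - x^4/4, i.e.
    (H1(x1+d) - H1(x1))/d expanded exactly, so there is no 0/0 at d = 0."""
    return x1 + d / 2 - x1 ** 3 - 1.5 * x1 ** 2 * d - x1 * d ** 2 - 0.25 * d ** 3

def H(x1, x2):
    """Total energy: spring plus mass."""
    return 0.5 * x1 ** 2 - 0.25 * x1 ** 4 + 0.5 * x2 ** 2

def step(x1, x2, dt, mu=0.0, u=0.0, tol=1e-12, max_iter=50):
    """One discrete-gradient time step, solving F1 = F2 = 0 for the state
    increments (dx1, dx2) by Newton-Raphson with the 2x2 Jacobian JF."""
    d1, d2 = dt * x2, 0.0            # explicit-Euler initial guess
    for _ in range(max_iter):
        gH2 = x2 + d2 / 2            # discrete gradient of H2
        F1 = d1 / dt - gH2
        F2 = d2 / dt + grad_d_H1(x1, d1) + mu * gH2 - u
        a11, a12 = 1.0 / dt, -0.5
        a21 = 0.5 - 1.5 * x1 ** 2 - 2.0 * x1 * d1 - 0.75 * d1 ** 2
        a22 = 1.0 / dt + mu / 2
        det = a11 * a22 - a12 * a21
        nd1 = d1 - (a22 * F1 - a12 * F2) / det
        nd2 = d2 - (a11 * F2 - a21 * F1) / det
        done = abs(nd1 - d1) + abs(nd2 - d2) < tol
        d1, d2 = nd1, nd2
        if done:
            break
    return x1 + d1, x2 + d2
```

Because δx1/δt = ∇dH2 and δx2/δt = −∇dH1 − µ∇dH2 + u, the exact chain rule of discrete gradients gives δH = δt(−µ(∇dH2)² + u∇dH2), so the scheme is passive by construction.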
The gradient term +∇d H1*^ε1 aims at "cancelling" the original system defined by H1*^ε1, whereas the gradient term −∇d H1*^ε2 introduces the new target system defined by H1*^ε2. In our case, note that the original system is defined by ∇d H1* = ∇d H1*^ε1 with ε1 = 1.

Figure 5: Potential energy H1* as a function of the displacement x1, resulting from simulations of the new well-posed system for an input force f0 = 2 · 10⁸ > fmax = 9.5 · 10⁷.

4.2. Control simulations

Control simulations using the energy shaping principle are performed. The simulation parameters are:

    • initial (uncontrolled) energy: H1* = H1*^ε1 with ε1 = 1
    • target energy: H1*^ε2, with ε2 ∈ {0.0746, 0.746, 1.79, −2}
    • input force: f0 = 2 · 10⁸

Note that ε2 > 0 and ε2 < 0 lead to a softening and a hardening behaviour, respectively.

Figure 7 presents the potential energy H1*^ε2 computed from the simulated responses for the different values of ε2. The theoretical quadratic energy of the underlying linear system is also plotted to distinguish softening from hardening behaviour. The results show that a positive control parameter ε2 leads to a softening behaviour (which increases with the value of ε2), whereas a negative value of ε2 results in a hardening behaviour, as expected. This is confirmed by looking at the pitch glide variation of the system response, in Figures 8 to 11.

These results underline the benefits of the definition of the new energy H1*, i.e. the ability to compute system dynamics with important pitch glide (downward and upward) caused by both large input forces and nonlinear coefficients.

(Figure 7: energy H1*^ε2 for ε2 ∈ {1.79, 0.746, 0.0746, −2} and for the linear case ε2 = 0.)
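The cancelling/substitution logic of the two gradient terms can be checked in continuous time: adding u = H1^{ε1}′(x) − H1^{ε2}′(x) to the original oscillator must reproduce the target dynamics exactly. An illustrative sketch using the quartic truncation of (9) (not the paper's discrete-gradient controller):

```python
def rk4_step(f, x, v, dt):
    """One classical RK4 step for a second-order system given as f(x, v)."""
    k1x, k1v = f(x, v)
    k2x, k2v = f(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
    k3x, k3v = f(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
    k4x, k4v = f(x + dt * k3x, v + dt * k3v)
    return (x + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6,
            v + dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6)

def controlled(eps2):
    """Original oscillator (eps1 = 1) with the energy-shaping control
    u = +H1'(x) - H1_eps2'(x): the first term cancels the native
    stiffness, the second imposes the target stiffness."""
    def f(x, v):
        u = (x - x ** 3) - (x - eps2 * x ** 3)
        return v, -(x - x ** 3) + u
    return f

def target(eps2):
    """Target oscillator with stiffness x - eps2*x^3 (truncation of (9))."""
    def f(x, v):
        return v, -(x - eps2 * x ** 3)
    return f
```

Running both systems from the same initial condition, the controlled trajectory follows the target one: with ε2 < 0 the closed loop behaves as a hardening oscillator, exactly as reported for the simulations above.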
Figure 8: Spectrogram of the system response x1 for a control parameter value ε2 = −2.

Figure 9: Spectrogram of the system response x1 for a control parameter value ε2 = 0.0746.

Figure 10: Spectrogram of the system response x1 for a control parameter value ε2 = 0.746.

Figure 11: Spectrogram of the system response x1 for a control parameter value ε2 = 1.791.
7. REFERENCES

[1] N. H. Fletcher, "Nonlinear frequency shifts in quasispherical-cap shells: Pitch glide in Chinese gongs," The Journal of the Acoustical Society of America, vol. 78, no. 6, pp. 2069–2073, 1985.

[2] M. Ducceschi, C. Touzé, and S. Bilbao, "Nonlinear vibrations of rectangular plates: Investigation of modal interaction and coupling rules," Acta Mechanica, pp. 213–232, 2014.

[3] T. D. Rossing and N. H. Fletcher, "Nonlinear vibrations in plates and gongs," The Journal of the Acoustical Society of America, vol. 73, no. 1, pp. 345–351, 1983.

[4] N. H. Fletcher and T. D. Rossing, The Physics of Musical Instruments, Springer-Verlag, 1998.

[5] C. Touzé, O. Thomas, and A. Chaigne, "Hardening/softening behaviour in non-linear oscillations of structural systems using non-linear normal modes," Journal of Sound and Vibration, pp. 77–101, 2004.

[6] G. Kerschen, M. Peeters, J. C. Golinval, and A. F. Vakakis, "Nonlinear normal modes, part I: A useful framework for the structural dynamicist," Mechanical Systems and Signal Processing, vol. 23, no. 1, pp. 170–194, 2009.

[7] M. Ducceschi, O. Cadot, C. Touzé, and S. Bilbao, "Dynamics of the wave turbulence spectrum in vibrating plates: A numerical investigation using a conservative finite difference scheme," Physica D, pp. 73–85, 2014.

[8] S. Bilbao, "A family of conservative finite difference schemes for the dynamical von Kármán plate equations," Numerical Methods for Partial Differential Equations, pp. 193–216, 2008.

[9] C. Touzé and M. Amabili, "Nonlinear normal modes for damped geometrically nonlinear systems: Application to reduced-order modelling of harmonically forced structures," Journal of Sound and Vibration, vol. 298, no. 4, pp. 958–981, 2006.

[10] A. H. Nayfeh and D. T. Mook, Nonlinear Oscillations, J. Wiley, 1995.

[11] S. Peter, R. Riethmüller, and R. I. Leine, Tracking of Backbone Curves of Nonlinear Systems Using Phase-Locked-Loops, pp. 107–120, Springer International Publishing, Cham, 2016.

[12] A. Falaize and T. Hélie, "Passive guaranteed simulation of analog audio circuits: A port-Hamiltonian approach," Applied Sciences, vol. 6, p. 273, 2016.

[13] N. Lopes, T. Hélie, and A. Falaize, "Explicit second-order accurate method for the passive guaranteed simulation of port-Hamiltonian systems," IFAC-PapersOnLine, vol. 48, no. 13, pp. 223–228, 2015.

[14] V. Duindam, A. Macchelli, S. Stramigioli, and H. Bruyninckx, Modeling and Control of Complex Physical Systems: The Port-Hamiltonian Approach, Springer, 2009.

[15] S. Benacchio, B. Chomette, A. Mamou-Mani, and V. Finel, "Mode tuning of a simplified string instrument using time-dimensionless state-derivative control," Journal of Sound and Vibration, vol. 334, pp. 178–189, 2015.

[16] A. Chaigne, C. Touzé, and O. Thomas, "Nonlinear vibrations and chaos in gongs and cymbals," Acoustical Science and Technology, vol. 26, no. 5, pp. 403–409, 2005.

[17] A. Falaize and T. Hélie, "Passive simulation of the nonlinear port-Hamiltonian modeling of a Rhodes piano," Journal of Sound and Vibration, vol. 390, pp. 289–309, 2017.

[18] M. Jossic, A. Mamou-Mani, B. Chomette, D. Roze, F. Ollivier, and C. Josserand, "Modal active control of Chinese gongs," Journal of the Acoustical Society of America, 2017, submitted.

[19] K. A. Legge and N. H. Fletcher, "Nonlinearity, chaos, and the sound of shallow gongs," The Journal of the Acoustical Society of America, vol. 86, no. 6, pp. 2439–2443, 1989.

[20] T. Meurisse, A. Mamou-Mani, S. Benacchio, B. Chomette, V. Finel, D. B. Sharp, and R. Caussé, "Experimental demonstration of the modification of the resonances of a simplified self-sustained wind instrument through modal active control," Acta Acustica, vol. 101, no. 3, pp. 581–593, 2015.

[21] J. P. Noël and G. Kerschen, "Nonlinear system identification in structural dynamics: 10 more years of progress," Mechanical Systems and Signal Processing, vol. 83, pp. 2–35, 2017.

[22] A. Preumont, Vibration Control of Active Structures: An Introduction, vol. 50, Springer Science & Business Media, 2012.
DAFX-8
DAFx-71
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Following [10], (2) is cast into Hamiltonian form as

$\frac{dy}{dt} = \frac{\partial T}{\partial p},$   (6a)

$\frac{dp}{dt} = -\frac{\partial V_c}{\partial y} - r\,\frac{\partial V_c}{\partial t},$   (6b)

where $T = p^2/(2m)$ represents the kinetic energy, $p$ being the [...] approximations for all terms (see, e.g. [2, 3]). The approximation to the continuous variable $y(t)$ at time $n\Delta t$, where $\Delta t$ is the sampling interval, is denoted by $y^n$. Then (6) is discretised as

$\frac{y^{n+1} - y^n}{\Delta t} = \frac{T(p^{n+1}) - T(p^n)}{p^{n+1} - p^n},$   (7a)

$\frac{p^{n+1} - p^n}{\Delta t} = -\frac{V_c(y^{n+1}) - V_c(y^n)}{y^{n+1} - y^n} - r\,\frac{V_c(y^{n+1}) - V_c(y^n)}{\Delta t}.$   (7b)

Defining the normalised momentum $q^n = p^n \Delta t/(2m)$ yields

$y^{n+1} - y^n = q^{n+1} + q^n,$   (8a)

$q^{n+1} - q^n = -\frac{\Delta t^2}{2m}\,\frac{V_c(y^{n+1}) - V_c(y^n)}{y^{n+1} - y^n} - r\,\frac{\Delta t}{2m}\left(V_c(y^{n+1}) - V_c(y^n)\right).$   (8b)

Using the auxiliary variable $x = y^{n+1} - y^n$ leads to the following nonlinear equation in $x$ [...]

Figure 1: Simulation of a mass colliding with a barrier, using the Hunt-Crossley model. The system energy of the presented method and of the fourth-order Runge-Kutta method during the impact is compared to the continuous (analytical) solution from (3). A zoomed area shows a clearer comparison of the approximation methods.

... processors [13] and is hence labelled EC for the remainder of this text.

Following [12], the presented method is compared to the analytical solution (3) and to a higher-order approximation given by a fourth-order Runge-Kutta method (RK4). Figure 1 shows the energy during the impact as a function of the compression velocity. The parameter values used for the simulations (listed in Table 1) are taken from [12]. It can be observed that the EC method accurately reproduces the analytical result, outperforming the higher-order Runge-Kutta method. This is a result of the exact energy-conserving nature of this algorithm in the case of Hamiltonian systems, which is projected here to a dissipative case.
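The update (7)-(8) can be sketched in code as a per-step Newton solve for the auxiliary variable $x = y^{n+1} - y^n$. This is a hedged illustration: the contact potential `Vc` below is a generic one-sided power law, and all parameter values are assumptions for the sketch, not taken from the paper's Table 1.

```python
K_C, ALPHA = 1e8, 1.5  # illustrative contact stiffness and exponent (assumptions)

def Vc(y):
    """One-sided power-law contact potential (zero for y < 0)."""
    return K_C / (ALPHA + 1.0) * max(y, 0.0) ** (ALPHA + 1.0)

def dVc(y):
    """Derivative of Vc, used when x is too small to form a difference quotient."""
    return K_C * max(y, 0.0) ** ALPHA

def ec_step(y, q, dt, m, r, tol=1e-13, max_iter=50):
    """Advance (y^n, q^n) to (y^{n+1}, q^{n+1}), with q^n = p^n dt/(2m) as in (8)."""
    def G(x):
        # residual of (8b) after substituting q^{n+1} - q^n = x - 2 q^n from (8a)
        dV = Vc(y + x) - Vc(y)
        ratio = dV / x if abs(x) > 1e-12 else dVc(y)  # (Vc(y+x)-Vc(y))/x -> Vc'(y)
        return x - 2.0 * q + dt**2 / (2.0 * m) * ratio + r * dt / (2.0 * m) * dV

    x = 2.0 * q  # free-flight guess: exact solution away from contact
    for _ in range(max_iter):
        h = 1e-9 * max(1.0, abs(x))
        dG = (G(x + h) - G(x - h)) / (2.0 * h)  # numerical derivative
        dx = G(x) / dG
        x -= dx
        if abs(dx) < tol:
            break
    return y + x, x - q  # y^{n+1}, and from (8a): q^{n+1} = x - q^n
```

For $r = 0$ this update conserves the discrete energy $2m(q^n)^2/\Delta t^2 + V_c(y^n)$ up to the Newton tolerance, which is the property the EC label refers to.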
$F(x) - \frac{\Delta t^2}{2m}\,\mu_{t+} f_{ex}^{\,n} = 0.$   (19)

The energy balance accordingly becomes [...]

Figure 2: Sketch of a damped harmonic oscillator (a) before and (b) during contact with the barrier.
$B_y = \sqrt{2H_0/k}.$

3.3. Bound on the number of iterations

One common issue when using iterative methods to solve nonlinear equations is the number of iterations required for the numerical solution to converge. Indeed, for certain parameter choices similar iterative schemes may fail to converge, as reported in [7]. Therefore a formal calculation is presented here for the maximum number of iterations required for the numerical solution of (19) using Newton's method (or alternatively the bisection method).

In the presence of an external force, the uniqueness of the solution of $F(x) = 0$ needs to be analysed separately. Assuming uniqueness, the bisection method halves the interval whose mean is an approximation to the solution of $F(x) = 0$ in each iteration. From the bound (28b) on $x$ it therefore follows that it takes at most

$k = \log_2\left(\frac{B_x}{\varepsilon}\right)$   (30)

iterations for the bisection method to converge up to precision $\varepsilon$, when starting within the interval $[-B_x, B_x]$.

Uniqueness can be guaranteed when the external force does not depend on the state ($y$ and $y'$) of the oscillator, hence $f_{ex}^{\,n+1}$ is a function independent of $x$. Under this assumption, one can show that

$F'(x) \geq 1,$   (31)

$F''(x) \geq 0,$   (32)

and use these facts for the analysis of Newton's method. This leads (see Appendix) to the fact that in order to achieve a given precision [...]

... frequency. The other model parameters are the same as in Section 2 and the initial conditions are $y^0 = -0.1$ mm; $p^0 = 0.005$ kg m/s. The dashed line in Figure 3(a) shows the mass displacement in the absence of the external force. Figure 3(c) shows how many Newton iterations are required at each time step for convergence to machine precision. These are well below the theoretical bound calculated in Section 3.3, which is equal to 12 iterations for Newton's method and 38 iterations for the bisection method; note however that the bisection method requires fewer operations at each iteration. Figure 4 presents the case where the external force and the linear damping are omitted ($\gamma = 0$), and the impact damping factor $r$ is increased 500 times to exaggerate its effect. It can be observed that in both cases $K^n$ is conserved within machine precision and Newton's method converges quite fast, which can be explained by the presence of a good starting point for $x$, available from the solution at the previous time step.

Figure 5 shows how the number of iterations may increase when an arbitrary starting point is chosen. This starting point is enforced for all time-steps during a 10 ms long simulation, using the same parameters as in Figure 3. The maximum number of required iterations across all time-steps is plotted for each chosen starting point value. Since $F'(x) \geq 1$ and $F''(x) \geq 0$, only a poor starting point larger than $B_x$ will result in slower convergence rates², as explained in the Appendix.

² In practice, using the solution at the previous time-step as a starting point guarantees that $x^0 \in [-B_x, B_x]$. However, depending on the shape of $F(x)$, $x^1$ may indeed lie on the right-hand side of $B_x$. In that case one should set $x^1 = B_x$, since the solution $x^*$ is expected to lie in $[-B_x, B_x]$. When generating Figure 5, starting from an arbitrary $x^0$, this substitution was not carried out, in order to demonstrate the possibility of slow convergence rates in the absence of a bound on the discrete solution.
Figure 3: Simulation of a damped, driven oscillator involving multiple impacts modeled using the Hunt-Crossley approach ($\gamma = 3000$ s$^{-1}$, $r = 0.01$ s/m). (a): mass displacement (the dashed line shows the displacement in the absence of the external force). (b): error in the conservation of $K^n$. Horizontal lines indicate multiples of single bit variation. (c): Number of iterations required for Newton's method to converge (the dashed line indicates the theoretical bound on the number of iterations).

Figure 4: Simulation of an undamped oscillator involving multiple impacts modeled using the Hunt-Crossley approach ($\gamma = 0$, $r = 5$ s/m). (a): mass displacement. (b): error in the conservation of $K^n$. Horizontal lines indicate multiples of single bit variation. (c): Number of iterations required for Newton's method to converge (the dashed line indicates the theoretical bound on the number of iterations).
[Figure 5: maximum number of required Newton iterations, max(N_iter), as a function of the chosen starting point; the bound B_x is indicated.]
Figure 7: Top: schematic profile of a simplified clarinet bore (not to scale). Bottom: the simulated reflection function.

Figure 8: Spectrogram of the simulated mouthpiece pressure $p_{in}$ with (left) and without impact damping (right).

and $h = y_{eq} - y$ the reed opening, $y_{eq}$ being the equilibrium opening of the reed after the player positions his lip (see [23]). These parameters, related to the single-reed excitation mechanism, are visualized in Figure 6. Note that in principle $h$ is allowed to become negative, something avoided in the simulations presented here due to the effect of the collision force. Nevertheless, it is safer to define $h = \lfloor y_{eq} - y \rfloor$, in order to allow an arbitrary variation of model parameters that might affect the reed opening.

The mouthpiece pressure can be obtained using convolution with the reflection function of the tube [22]. The geometry of the tube (including the mouthpiece) used in the numerical simulations is shown in Figure 7. Its input impedance was calculated using the Acoustics Research Tool [24], including viscothermal losses at the walls and radiation losses at the open end. This can be converted to the reflection function of the tube (also plotted in Figure 7) following the procedure described in [25]. An energy balance for such a coupled system has been explored in [3], where the air column is also discretised using the finite difference method.

The effect of including the impact damping in the single-reed model is visualized in Figure 8, where the spectrogram of the mouthpiece pressure $p_{in}$ is compared to that of the same simulation but with the impact damping omitted. The model parameters used in the simulations are given in Table 2. It can be observed that taking impact damping into account results in the dissipation of higher harmonics, especially during the transient. The parameters related to this loss mechanism are here chosen arbitrarily; they are expected to vary depending on the type of reed used (plastic reeds have seen increased use lately) and on the material properties and exact geometry of the mouthpiece lay. The latter is often specified by musicians as having a significant effect on the response of the instrument.

5. DISCUSSION

A power balance model for impact damping has been presented. This leads to the numerical solution of a nonlinear equation, with Newton's method being a suitable solver. The fact that such equations need to be solved iteratively led to an efficiency analysis in terms of the maximum number of iterations required for convergence. Note that such a limit represents a worst-case scenario. Convergence is usually achieved earlier, due to the presence of a good starting point for the solver, which is given by the solution at the previous time step. Nevertheless, the presented analysis provides a guarantee that simulation algorithms will always converge within a given number of iteration steps.

Including impact damping when modelling object collisions, a quantity that is often neglected in sound synthesis applications, appears to have a significant effect on certain systems. This is illustrated by a simulation of a clarinet tone, in which case the influence of impact damping on the simulated tone is apparent in the calculated spectrum. The necessity of such a model for simulating other types of instruments (or different acoustic systems) remains to be investigated using both a numerical and a perceptual approach.

Including two types of damping in this study (parameterised using $\gamma$ and $r$) provides a framework for the treatment of a wide range of lumped systems involving nonlinear interactions. An interesting extension of the presented convergence study would be to analyse distributed systems, such as a string interacting with a barrier (see, e.g. [7, 8, 10, 13]). For such systems the nonlinear equation to be solved is a vector equation of the form

$\mathbf{F}(\mathbf{x}) = \mathbf{0}.$   (37)

In this case a direct analysis of Newton's method is more involved. However, (37) could be interpreted as a fixed-point iteration problem $\mathbf{x} = \mathbf{T}(\mathbf{x})$, where the Lipschitz constant of $\mathbf{T}$ [26] relates to the number of required iterations until convergence to the solution $\mathbf{x}^*$ is achieved.

6. ACKNOWLEDGMENTS

This research is supported by the Austrian Science Fund (FWF): P28655-N32. S. Bilbao was supported by the European Research Council, under grant 2011-StG-279068-NESS.

7. REFERENCES

[1] A. Chaigne and J. Kergomard, Acoustics of Musical Instruments, Springer, New York, 2016.
[2] V. Chatziioannou and M. van Walstijn, "An energy conserving finite difference scheme for simulation of collisions," in Sound and Music Computing (SMAC-SMC 2013), Stockholm, 2013, pp. 584–591.
[3] S. Bilbao, A. Torin, and V. Chatziioannou, "Numerical modeling of collisions in musical instruments," Acta Acustica united with Acustica, vol. 101, no. 1, pp. 155–173, 2015.
[4] M. Ducceschi, S. Bilbao, and C. Desvages, "Modelling collisions of nonlinear strings against rigid barriers: Conservative finite difference schemes with application to sound synthesis," in International Congress on Acoustics, Buenos Aires, 2016.
[5] S. Bilbao, Numerical Sound Synthesis, Wiley & Sons, Chichester, UK, 2009.
[6] G. Evangelista and F. Eckerholm, "Player-instrument interaction models for digital waveguide synthesis of guitar: Touch and collisions," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 4, pp. 822–832, 2010.
[7] M. van Walstijn and J. Bridges, "Simulation of distributed contact in string instruments: a modal expansion approach," in Proc. European Signal Processing Conference (EUSIPCO 2016), 2016, pp. 1023–1027.
[8] C. Issanchou, S. Bilbao, J. Le Carrou, C. Touzé, and O. Doaré, "A modal-based approach to the nonlinear vibration of strings against a unilateral obstacle: Simulations and experiments in the pointwise case," Journal of Sound and Vibration, vol. 393, pp. 229–251, 2017.
[9] K. Hunt and F. Crossley, "Coefficient of restitution interpreted as damping in vibroimpact," Journal of Applied Mechanics, vol. 42, pp. 440–445, 1975.
[10] M. van Walstijn and V. Chatziioannou, "Numerical simulation of tanpura string vibrations," in Proc. International Symposium on Musical Acoustics, Le Mans, 2014, pp. 609–614.
[11] C. Desvages and S. Bilbao, "Two-polarisation physical model of bowed strings with nonlinear contact and friction forces, and application to gesture-based sound synthesis," Applied Sciences, vol. 6, no. 5, 2016.
[12] S. Papetti, F. Avanzini, and D. Rocchesso, "Numerical methods for a nonlinear impact model: A comparative study with closed-form corrections," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 7, pp. 2146–2158, 2011.
[13] V. Chatziioannou and M. van Walstijn, "Energy conserving schemes for the simulation of musical instrument contact dynamics," Journal of Sound and Vibration, vol. 339, pp. 262–279, 2015.
[14] D.E. Hall, "Piano string excitation. VI: Nonlinear modeling," Journal of the Acoustical Society of America, vol. 92, no. 1, pp. 95–105, 1992.
[15] A. Chaigne and V. Doutaut, "Numerical simulation of xylophones. I. Time-domain modeling of the vibrating bars," Journal of the Acoustical Society of America, vol. 101, no. 1, pp. 539–557, 1997.
[16] T. Taguti, "Dynamics of simple string subject to unilateral constraint: A model analysis of sawari mechanism," Acoustical Science and Technology, vol. 29, no. 3, pp. 203–214, 2008.
[17] N. Lopes and T. Hélie, "Energy balanced model of a jet interacting with a brass player's lip," Acta Acustica united with Acustica, vol. 102, no. 1, pp. 141–154, 2016.
[18] V. Chatziioannou and M. van Walstijn, "Discrete-time conserved quantities for damped oscillators," in Proc. Third Vienna Talk on Music Acoustics, 2015, pp. 135–139.
[19] A. Torin, Percussion Instrument Modelling in 3D: Sound Synthesis Through Time Domain Numerical Simulation, Ph.D. thesis, The University of Edinburgh, 2016.
[20] M. van Walstijn and F. Avanzini, "Modelling the mechanical response of the reed-mouthpiece-lip system of a clarinet. Part II. A lumped model approximation," Acustica, vol. 93, no. 1, pp. 435–446, 2007.
[21] J.P. Dalmont, J. Gilbert, and S. Ollivier, "Nonlinear characteristics of single-reed instruments: Quasistatic volume flow and reed opening measurements," Journal of the Acoustical Society of America, vol. 114, no. 4, pp. 2253–2262, 2003.
[22] V. Chatziioannou and M. van Walstijn, "Estimation of clarinet reed parameters by inverse modelling," Acta Acustica united with Acustica, vol. 98, no. 4, pp. 629–639, 2012.
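As a scalar illustration of the fixed-point reading of (37) mentioned in Section 5: for a contraction $T$ with Lipschitz constant $L < 1$, the a priori estimate $|x_n - x^*| \leq \frac{L^n}{1-L}|x_1 - x_0|$ directly bounds the number of iterations needed for a given accuracy. The map `T` below is an arbitrary contraction chosen for the sketch, not the update map of any scheme in the text.

```python
import math

T = lambda x: 0.5 * math.cos(x)   # |T'(x)| <= 0.5, so Lipschitz constant L = 0.5
L, eps = 0.5, 1e-10

x0 = 0.0
x1 = T(x0)
# smallest n with L^n / (1 - L) * |x1 - x0| <= eps (Banach fixed-point estimate)
n_bound = math.ceil(math.log(eps * (1.0 - L) / abs(x1 - x0)) / math.log(L))

x = x0
for _ in range(n_bound):
    x = T(x)  # after n_bound iterations, x is within eps of the fixed point
```

A smaller Lipschitz constant shrinks `n_bound` logarithmically, which is the sense in which $L$ "relates to the number of required iterations".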
Michele Ducceschi
Acoustics and Audio Group,
University of Edinburgh
Edinburgh, UK
michele.ducceschi@ed.ac.uk
1. INTRODUCTION

In many musical instruments, collisions and contact forces are involved at various levels in the mechanism of sound production [1]. The interaction of strings with a bow, a membrane (like in the snare drum), a mallet, a finger or a fretboard are all examples of contact forces. Collisions of the reed in wind instruments have a major impact on the perceived tonal quality. Prepared piano strings (coupled to rattling elements) are yet another example of such collisions. Outside of musical acoustics, collisions represent an important field of study in robotics [2], and of course computer graphics [3, 4]. The models employed to describe all the above forces are necessarily nonlinear, and hence they represent a challenge in terms of numerical simulation. Although many methods have been used to simulate some specific examples of collisions (including digital waveguides [5], modal methods [6, 7], and time stepping methods [8]), a fairly recent general framework was proposed in order to simulate a large class of collisions and contact forces [1]. In this framework, the forces are generated by a potential which takes the form of some kind of power law, depending on one stiffness coefficient and one exponent. Such a framework allows one to simulate collisions of two lumped objects (like a mass and a spring), of one lumped and one distributed object (like a mallet and a string), and even of two distributed systems (like a string and a membrane). For very stiff collisions, the forces are generated by a spurious interpenetration, which can be made as small as possible by increasing the stiffness coefficient. Though extremely versatile, the associated energy-conserving numerical schemes rely on iterative root finding algorithms, such as Newton-Raphson, which in most cases represent a computational bottleneck. In this paper, a novel family of finite difference schemes is proposed for the lumped case, which do not require an iterative root finding method. It should be mentioned that previous works (see for example [9, 10]) do present numerical schemes able to [...]

2. MODEL EQUATIONS

In the course of this paper, the displacement $u(t)$ of a particle of mass $M$ is described by an ordinary differential equation of the following form

$M\ddot u = -\phi'(u) = -\frac{\dot\phi(u)}{\dot u},$   (1)

where the last equality was obtained by means of the chain rule. In the equation, the particle is assumed to be subjected to a force described by the potential $\phi(u)$. By multiplying both sides of the equation by $\dot u$, the following energy identity is obtained

$\frac{d}{dt}\left(\frac{M}{2}\dot u^2 + \phi(u)\right) \triangleq \frac{d}{dt}H = 0 \;\rightarrow\; H = H_0,$   (2)

and hence the energy $H$ is non-negative if and only if $\phi(u) \geq 0$ $\forall u$. Various potentials satisfy such a requirement. Specific forms of interest here are the following

$\phi(u) = \frac{K}{\alpha+1}\,u^{\alpha+1}, \qquad \alpha = 1, 3, 5, \ldots$   (3a)

$\phi(u) = \frac{K}{\alpha+1}\,[u - h]_+^{\alpha+1}, \qquad \alpha \in \mathbb{R},\ \alpha > 1$   (3b)

$\phi(u) = \frac{K}{\alpha+1}\,[|u| - h]_+^{\alpha+1}, \qquad \alpha \in \mathbb{R},\ \alpha > 1$   (3c)

In the equations, the symbol $[x]_+$ denotes the positive part, i.e. $2[x]_+ \triangleq x + |x|$. The constant $K$ is a stiffness coefficient. The forces originated by such potentials are depicted in Fig. 1. In musical acoustics, though often for distributed systems, these potentials are used in modelling large-amplitude nonlinearities, contact nonlinearities (such as the hammer-string interaction), rattling elements, and other important nonlinear interactions [11].
From (2), one finds immediately a bound on the growth of the solution, i.e.

$(\dot u)^2 \leq \frac{2H_0}{M},$   (4)

where $H_0$ (a constant) is the total energy.

3. FINITE DIFFERENCE SCHEMES

Solutions to (1) are now sought in terms of appropriate finite difference schemes, upon the introduction of a sample rate $f_s$ and the associated time step $k = 1/f_s$. The solution is evaluated at the discrete times $nk$, where $n \in \mathbb{Z}^+$, and is denoted $u^n$. Finite difference operators are introduced as:

• backward and forward shift operators: $e_{t-}u^n \triangleq u^{n-1}$, $e_{t+}u^n \triangleq u^{n+1}$
• backward, centered and forward first time derivatives: $\delta_{t-} \triangleq \frac{1}{k}(1 - e_{t-})$, $\delta_{t\cdot} \triangleq \frac{1}{2k}(e_{t+} - e_{t-})$, $\delta_{t+} \triangleq \frac{1}{k}(e_{t+} - 1)$
• averaging operators: $\mu_{t+} \triangleq \frac{1}{2}(1 + e_{t+})$, $\mu_{t-} \triangleq \frac{1}{2}(1 + e_{t-})$, $\mu_{t-}^{(s)} \triangleq s + (1 - s)e_{t-}$ ($s \in \mathbb{R}$)
• second time derivative: $\delta_{tt} \triangleq \delta_{t+}\delta_{t-} = \frac{1}{k^2}(e_{t+} - 2 + e_{t-})$

In order to derive a finite difference scheme, a general form for the Hamiltonian is given here in terms of the generalised averaging operator defined above, as

$h^{n-1/2} = \frac{M}{2}(\delta_{t-}u^n)^2 + \mu_{t-}^{(s)}\phi(u^n).$   (5)

Notice that the particular choice $s = \frac{1}{2}$ leads to the Hamiltonian considered in [1], whose associated finite difference scheme is second-order accurate, and whose update requires an iterative root finding method such as the Newton-Raphson algorithm.

In this work, the potential energy is Taylor-expanded around the point $u^{n-1}$ up to second order, giving

$\mu_{t-}^{(s)}\phi(u^n) \approx \phi(u^{n-1}) + s(u^n - u^{n-1})\phi'(u^{n-1}) + s\,\frac{(u^n - u^{n-1})^2}{2}\phi''(u^{n-1}) \triangleq P_{n-1,n}^{(s)}.$   (6)

Hence, the Hamiltonian considered in this work, depending on the parameter $s$, is

$h^{n-1/2} = \frac{M}{2}(\delta_{t-}u^n)^2 + P_{n-1,n}^{(s)}.$   (7)

3.1. Boundedness of Potential Energy

The potential energy defined in (6) is a parabola in $u^n - u^{n-1}$. Moreover, for the potentials considered in (3) the following identities hold

$\phi'(u) = (\alpha+1)\frac{\phi(u)}{\bar u}, \qquad \phi''(u) = \alpha(\alpha+1)\frac{\phi(u)}{\bar u^2},$   (8)

where

$\bar u \triangleq u$ for (3a), $\quad \bar u \triangleq u - h$ for (3b), $\quad \bar u \triangleq \frac{|u| - h}{\mathrm{sign}(u)}$ for (3c).

Hence, one has

$P_{n-1,n}^{(s)} = \phi(u^{n-1})\left(1 + s(\alpha+1)x + s(\alpha+1)\alpha\,\frac{x^2}{2}\right),$   (9)

where

$x \triangleq \frac{u^n - u^{n-1}}{\bar u^{n-1}}.$   (10)

The potential energy will be non-negative if and only if the discriminant of the quadratic above is less than or equal to zero, i.e.

$s^2(\alpha+1)^2 - 2s(\alpha+1)\alpha \leq 0.$   (11)

This is a parametric inequality that must be evaluated according to the sign of $s$ and $(\alpha+1)$. Notice that, for the potentials in (3), one must check that the solutions are valid $\forall \alpha \geq 1$. This gives

$0 < s \leq 1.$   (12)

Such values will ensure that $P_{n-1,n}^{(s)}$ is non-negative, and therefore that the discrete Hamiltonian (7) is non-negative, $\forall \alpha \geq 1$.

Boundedness of the potential energy can also be achieved for values of $s$ which do not guarantee its non-negativity. In fact, given the particular form of the potential energy (9), if the coefficient multiplying $x^2$ is positive, then the parabola will always have a minimum regardless of the value of $u^{n-1}$, and hence $\forall n$ (remember that $\phi$ is non-negative by definition). Such coefficient is $s(\alpha+1)\alpha$ and, because in this work $\alpha \geq 1$, the potential energy will then be bounded from below $\forall s \geq 0$. The bound depends on the initial conditions, and tends to zero as the sampling rate is increased; see also Fig. 5.

Summarising, in this work

$s > 0, \qquad \alpha \geq 1,$   (13)

with the particular case $0 < s \leq 1$ guaranteeing non-negativity of the potential energy.

3.2. Energy conservation. Finite difference scheme

A finite difference scheme can be derived from the Hamiltonian above by imposing

$\delta_{t+} h^{n-1/2} = 0.$   (14)

Before deriving the scheme, notice that when the potential energy is non-negative, one immediately finds a bound similar to (4), i.e.

$(\delta_{t-}u^n)^2 \leq \frac{2h^0}{M}.$   (15)

When the potential energy is not positive, but bounded from below, such inequality is true up to a correction of the order of $k^2$. Upon the introduction of the variable $y \triangleq u^{n+1} - u^{n-1}$, the scheme is

$Ay + B + \frac{C}{y} = 0,$   (16)

where the coefficients $A$, $B$, $C$ depend on previous values, and are given as

$A = \frac{M}{k^2} + s\phi''(u^n)$

$B = -\frac{2M}{k^2}(u^n - u^{n-1}) + 2s\phi'(u^n) + 2s(u^{n-1} - u^n)\phi''(u^n)$

$C = 2P_{n,n-1}^{(s)} - 2P_{n-1,n}^{(s)}$

Under the assumption $y \neq 0$, the scheme can be written as a quadratic in $y$, i.e.

$Ay^2 + By + C = 0.$   (17)
Figure 1: Nonlinear forces. (a): nonlinear power law, as per (3a), for $K = 1$, $\alpha = 1, 3, 5$. (b): one-sided power law, as per (3b), for $K = 1$, $\alpha = 1, 1.7, 2.8$, $h = 0.2$. (c): center limited power law, as per (3c), for $K = 1$, $\alpha = 1, 1.7, 2.8$, $h = 0.3$.
3.3. Existence and Uniqueness

Scheme (17) requires the knowledge of the roots of a quadratic function at each update, and is thus very attractive numerically, as one may employ the well-known closed-form solution for quadratics instead of a nonlinear root finding algorithm. However, existence of the roots must be checked, and a condition on existence must be given. Also, if the roots exist, they come in pairs, posing a question on uniqueness. Existence is first checked, and the question of uniqueness is discussed later.

3.3.1. Existence

[...] checked.

4. $\phi(u^n) > 0$, $\phi(u^{n-1}) > 0$. This scenario corresponds to (3a), as well as to (3b), (3c) when the particle and the barrier/spring are in full contact (spurious interpenetration). This case must also be discussed.

Hence, the only cases to discuss are 3 and 4.

Case 3. Upon the definition of $\bar K = K/M$ and $u_h = u - h$, this case gives

$\Delta^{(1)} = \frac{A}{k^2}(u_h^{n-1})^2 + \frac{1}{k^4}(u_h^{n})^2 - A\bar K + sA\bar K - \frac{2}{k}A\,u_h^{n}u_h^{n-1}.$

Remembering that for this case $u_h^{n} > 0$, $u_h^{n-1} < 0$, one can find a sufficient condition for positiveness by imposing

$-A\bar K + sA\bar K \geq 0 \;\rightarrow\; s \geq 1.$

Case 4. Using again the same definition of $\bar K$ and $u_h$, one has (up to a positive constant of proportionality)

$\Delta^{(1)} = \bar K s^2 (u_h^{n-1})^2 + \frac{1}{k^2}(\delta_{t-}u_h^{n})^2 - \frac{2s}{k}\left(\bar K u_h^{n-1} + \left(\frac{Ak}{2} - s\bar K\right)(\mu_{t-}u_h^{n})\right)(\delta_{t-}u_h^{n})$

3.3.2. Uniqueness

Assuming realness of the roots of (17), at each time step one has to choose either $y_+$ or $y_-$, defined as

$y_\pm = \frac{-B \pm \sqrt{\Delta^{(\alpha)}}}{2A}.$   (19)

The choice is made according to the following rule:

• if $B \geq 0$ choose $y_-$
• if $B < 0$ choose $y_+$
To understand why this rule holds, consider the case of a free particle, i.e. one for which the potential is zero at all times. In this case, scheme (17) reduces to

$Ay^2 + By = 0,$   (20)

with solutions

$y_\pm = \frac{-B \pm \sqrt{B^2}}{2A} = \frac{-B \pm |B|}{2A},$   (21)

but because the solution $y = 0$ is ruled out, one recovers the rule above.
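The free-particle argument (20)-(21) can be verified in a couple of lines (the helper name below is ours): with $C = 0$ the sign rule always returns the non-spurious root $y = -B/A$, never the rejected root $y = 0$.

```python
def choose_root(A, B, C=0.0):
    """Root of A*y^2 + B*y + C = 0 selected by the sign-of-B rule of Sec. 3.3.2."""
    disc = (B * B - 4.0 * A * C) ** 0.5
    return (-B - disc) / (2.0 * A) if B >= 0.0 else (-B + disc) / (2.0 * A)
```

For $C = 0$ the discriminant is $|B|$, so either branch of the rule evaluates to $-B/A$, matching (21).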
3.3.3. Comments on existence, $\Delta^{(\alpha>1)}$ and bounds

In this subsection, some comments are given regarding the realness of the roots of (17), for $\alpha > 1$. A discussion of the sign of $\Delta^{(\alpha>1)}$ is somewhat complicated by the fact that $u_h^{n}$, $u_h^{n-1}$ appear nonlinearly with rational exponents, or with high-order powers, and hence it is difficult to carry out a study of the sign of $\Delta^{(\alpha>1)}$ along the same lines as $\Delta^{(1)}$. A possible procedure is to consider a parametric study of the function $d\Delta^{(\alpha)}/d\alpha$, and hence find conditions on maxima and minima of $\Delta^{(\alpha)}$ for the various signs of $u_h^{n}$, $u_h^{n-1}$. This rigorous approach, though desirable, is somewhat lengthy and perhaps beyond the scope of the current work. A less rigorous, though revealing, approach is to make use of brute force, i.e. to launch many simulations testing out large portions of the parameter space, and to empirically verify the robustness of the algorithm.

In Fig. 2, the scheme is checked for potential (3a), for $s = \frac{1}{2}, 1, 3$. The particle has mass $M = 1$ kg, and the spring has stiffness $K = 10^3$. Each case presents two subcases, i.e. standard and very high initial velocities (1 m/s, 20 m/s). The figures report the minimum of $\Delta^{(\alpha)}$, for $\alpha \in [1, 3, 5, 7, 9, 11, 13]$. Each colour is associated with a different time step. Missing points correspond to simulations returning complex roots. The time steps are chosen as $k_i = \frac{2^{i-2}}{\sqrt{\bar K}}$, for $i = 1, 2, 3, 4$. Notice that $k_3$ is the limit of stability of the classic second-order accurate scheme for the simple harmonic oscillator, (22). For $s = \frac{1}{2}$, $\Delta^{(1)}$ is always positive, in accordance with the previous observation that for such a value of [...] asymptote. Notice that the values of $\alpha$ selected for the figures are unreasonably high for applications in musical acoustics (in practice, one always chooses $1 \leq \alpha \leq 3$). However, it is remarkable that the scheme still works under such extreme conditions, at no extra computational cost.

Figure 2: Computability of proposed scheme: values of $\Delta^{(\alpha)}$ for the selected values of the free parameter $s$. Missing points correspond to complex roots. For the simulations, $M = 1$ kg, $K = 10^3$, $\alpha \in [1, 3, 5, 7, 9, 11, 13]$. Time steps chosen as $k_1 = \sqrt{M/4K}$ (dark blue), $k_2 = \sqrt{M/K}$ (green), $k_3 = \sqrt{4M/K}$ (red), $k_4 = \sqrt{16M/K}$ (light blue). Notice that $k_3$ is the largest timestep allowed for the classic simple harmonic oscillator scheme, (22). Top row: initial velocity $v_0 = 1$ m/s, initial displacement $u_0 = -1$ mm. Bottom row: initial velocity $v_0 = 20$ m/s, initial displacement $u_0 = -1$ mm.

A discussion for potentials (3b) and (3c) is not reported here, but the same conclusions apply. Similar plots suggest that computability is increased as the parameter $s$ is increased.

Summarising, empirical observations suggest that scheme (17) gives real roots in the following cases:

• conditional realness for $1 \leq \alpha \leq 3$ if $s = 1$, the condition being (at worst) $k \leq \frac{2}{\sqrt{\bar K}}$
• unconditional realness $\forall \alpha \geq 1$, if $s > 2$

Figure 3: Simple harmonic oscillator. Comparison of current scheme, for potential (3a) with $\alpha = 1$ (solid line), and classic scheme (22) (dashed line). Top row: time domain. Bottom row: frequency domain. Free parameter $s$ chosen as indicated on top. Natural frequency of the oscillator is $f_0 = 11$ Hz. Sampling rate chosen as $f_s = 2300$ Hz. Initial conditions: $v_0 = 0.5$ m/s, $u_0 = -0.1$ m.
4. APPLICATIONS

Scheme (17) is now tested against various benchmark schemes. First, the problem of the simple harmonic oscillator is considered. Then the dynamics of a particle colliding against a stiff barrier is studied, followed by the cubic oscillator.

4.1. Simple Harmonic Oscillator

In order to assess the properties of scheme (17), the simple harmonic oscillator is numerically simulated under various choices of the parameter s, and compared against a second-order accurate benchmark scheme (see [11] for details) given by

u^{n+1} = (2 − k² K̄) u^n − u^{n−1},   k ≤ 2/√K̄.   (22)

Fig. 3 presents a few such comparisons, for s = 1/2, 1, 3. The question of accuracy for the current scheme is an interesting one. For the case of the simple harmonic oscillator considered here there are two sources of error: the first is numerical dispersion (in fact, known as phase error in the lossless case); the second is due to the truncation of the Taylor series (6) to second order. Frequency-domain analysis (i.e. z-transform techniques) is in this case out of hand, because for scheme (17) the values of the solution at the times n + 1, n, n − 1 appear nonlinearly even under linear conditions for the potential (3a).

There are some interesting facts about Fig. 3. First of all, the scheme is less and less accurate as s is increased. This observation is somewhat in contrast with the observation on existence of the roots of (17) (see also Fig. 2): there is a trade-off between accuracy and computability. In particular, for s = 1, 3 the fundamental frequency is higher than the one obtained with the classic scheme, resulting in the sinusoids shifting apart in the time domain. The second interesting fact is that ∀s ≠ 1/2 the simulated system is nonlinear, even though the model equation of the simple harmonic oscillator is completely linear. Nonlinearities appear as odd harmonics in the spectra of the cases for s = 1, 3. Even though such peaks are much lower in energy than the fundamental (for s = 1 the second harmonic is lower than 60 dB in amplitude), as s is increased they become more and more prominent. However, the amplitude of such peaks is insensitive to the initial conditions (in particular they do not grow when higher initial velocities or displacements are used).

It is worth pointing out that scheme (17) is probably not very well suited to the problem of the simple harmonic oscillator, at least ∀s ≠ 1/2. This is because in general the scheme does not make a distinction between linear and nonlinear cases, so long as the potential φ is positive-definite and therefore the properties of existence of the roots and of positiveness of the discrete Hamiltonian are preserved. In other words, the scheme offers a general way to treat a large class of nonlinear problems, including the linear case as some sort of "sub-case", but where the scheme remains nonlinear.

4.2. Colliding Mass

In this subsection the dynamics of a colliding mass against a stiff barrier is simulated. Fig. 4 presents the comparison of the current scheme, under various choices of the parameter s, and a benchmark scheme presented in [1], which reads

y + (k²/(M y)) [φ(y + u^{n−1}) − φ(u^{n−1})] + 2u^{n−1} − 2u^n = 0,   (23)

where y ≜ u^{n+1} − u^{n−1}. The scheme is second-order accurate, and unconditionally stable, provided that one is able to calculate y, which appears implicitly in the argument of φ. In order to solve such a scheme, one must employ a nonlinear root-finding algorithm, such as Newton–Raphson. Considering Fig. 4, it is seen that the current scheme departs more and more from the benchmark scheme as s is increased; this observation is consistent with what was already noted for the case of the harmonic oscillator. In particular, for s = 1/2 the proposed scheme is virtually indistinguishable from the benchmark scheme, whereas for s = 1, 3 differences can be noticed. When the sampling rate is increased, such differences are reduced, providing evidence that the benchmark scheme and the proposed scheme display the same dynamics in the limit of infinite sampling rate, ∀s.

Figure 4: Collision of a fast particle against a stiff barrier. Comparison of benchmark scheme (blue) (23) and proposed scheme, for s = 1/2 (green), s = 1 (red), s = 3 (cyan). Particle has mass M = 1 kg, and is started at u0 = −0.5 m with initial velocity v0 = 20 m/s against a barrier located at h = 0.1 m; barrier parameters are K = 5·10^9, α = 1.3. For both figures, solid lines and dashed lines are obtained using, respectively, fs = 44100 and fs = 441000. (a): Particle displacement during collision (spurious interpenetration). (b): Particle velocity during collision.

From this simulation, it is interesting to plot the energy components for the benchmark scheme and for the proposed scheme. This is done in Fig. 5. In accordance with what was noted previously, for s = 3 the potential energy is not non-negative, but remains bounded from below and thus stability is guaranteed. When the sampling rate is increased, the minimum of the potential energy tends to zero.

4.3. Cubic Oscillator

Another interesting system is offered by the cubic oscillator, described by an equation of the type

ü = −(K/M) u³,   (24)

and for which a second-order accurate, unconditionally stable scheme is offered by (see [11])

u^{n+1} = [2 / (1 + (K k²/(2M)) (u^n)²)] u^n − u^{n−1}.   (25)
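As a concrete illustration, the two benchmark updates quoted above, (22) for the simple harmonic oscillator and (25) for the cubic oscillator, can be sketched in a few lines. The parameter values below are illustrative and not taken from the figures:

```python
import numpy as np

def sho_classic(Kbar, k, u0, v0, n_steps):
    """Classic scheme (22): u^{n+1} = (2 - k^2 Kbar) u^n - u^{n-1}.
    Stable for k <= 2 / sqrt(Kbar)."""
    u_prev = u0
    u = u0 + k * v0  # simple first-order start from the initial velocity
    out = [u_prev, u]
    for _ in range(n_steps - 2):
        u, u_prev = (2.0 - k**2 * Kbar) * u - u_prev, u
        out.append(u)
    return np.array(out)

def cubic_explicit(K, M, k, u0, v0, n_steps):
    """Scheme (25) for the cubic oscillator: explicit, unconditionally stable."""
    u_prev = u0
    u = u0 + k * v0
    out = [u_prev, u]
    for _ in range(n_steps - 2):
        u, u_prev = 2.0 * u / (1.0 + K * k**2 * u**2 / (2.0 * M)) - u_prev, u
        out.append(u)
    return np.array(out)

fs = 44100.0
f0 = 110.0                        # illustrative natural frequency
Kbar = (2.0 * np.pi * f0) ** 2    # k = 1/fs is well below the limit 2/sqrt(Kbar)
u_lin = sho_classic(Kbar, 1.0 / fs, u0=0.0, v0=0.5, n_steps=4096)
u_cub = cubic_explicit(K=1e10, M=1.0, k=1.0 / fs, u0=0.0, v0=5.0, n_steps=4096)
print(np.max(np.abs(u_lin)), np.max(np.abs(u_cub)))
```

Both trajectories remain bounded, as expected from the stability statements above.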
Figure 5: Collision of a fast particle against a stiff barrier, energy components. Comparison of benchmark scheme (23) and proposed scheme, for s = 1 and s = 3. For all plots, red line is total energy, green is kinetic, and blue is potential. (a)-(c): fs = 44100, (d)-(f): fs = 441000. The energy components are taken from the same simulation as Fig. 4.
respect to the case of the simple harmonic oscillator, in this case the spectral content of the solution is sensitive to the initial conditions.

Figure 6: Cubic oscillator. Comparison of current scheme, for potential (3a) with α = 3 (solid black line), and benchmark scheme (25) (dashed blue line). Top row: time domain. Bottom row: frequency domain. Free parameter s chosen as indicated on top. Stiffness of the oscillator is K = 10^10, and mass is M = 1 kg. Sampling rate is fs = 44100. Initial conditions: v0 = 5 m/s, u0 = 0 m.

5. CONCLUSIONS

Nonlinear forces, and in particular collisions, are of prime importance for many applications in musical acoustics. In this paper, a novel family of finite difference schemes was presented for collisions in the lumped case, as well as for nonlinear oscillators of any odd power. With respect to previous numerical models, the current scheme requires only the evaluation of a square root at each update; therefore no iterative root finders are needed. The scheme is energy-conserving, and conditions for boundedness of the nonlinear energy can be given in terms of a free parameter in the Hamiltonian. The robustness of the algorithm was tested for a large number of cases, showing very good computability properties ∀α ≥ 1 for a choice of the free parameter s ≥ 1. The choice s = 1/2 gives the most accurate results; however, preliminary brute-force analysis shows that such a case is also the least computable (i.e. the roots of the quadratic are complex in many cases). Although brute force
6. ACKNOWLEDGMENTS

This work was supported by the Royal Society and by the British Academy, through a Newton International Fellowship. Dr Stefan Bilbao is kindly acknowledged for his valuable comments on the properties of the scheme.
7. REFERENCES
Finally, to express property (P3) we require the power balance

E′(t) = −Pd + Pe   (5)

where Pd and Pe are respectively the dissipated and external power, and E′(t) is the instantaneous energy variation of the system.

3. PORT-HAMILTONIAN SYSTEMS

In this article, nonlinear passive physical audio systems are described under their Port-Hamiltonian formulation. The theory of Port-Hamiltonian Systems (PHS) [2, 3] extends the theory of Hamiltonian mechanics to non-autonomous and dissipative open systems. It provides a general framework where the dynamic state-space equations derive directly from an energy storage function and power-conserving interconnection of its subsystems.

3.1. Explicit differential form

Consider a system with input u(t) ∈ U = R^P, with state x(t) ∈ X = R^N and output y(t) ∈ Y = R^P, with the structured state-space equations [2]

x′ = (J(x) − R(x)) ∇H(x) + G(x) u = f(x, u),
y = G(x)^T ∇H(x)   (6)

where H gives the stored energy of the system

E(t) = (H ∘ x)(t)   (7)

with H ∈ C¹(X, R+), ∇ being the gradient operator, J = −J^T a skew-symmetric matrix and R = R^T ⪰ 0 a positive-semidefinite matrix. The energy variation of this system satisfies the power balance given by the derivative chain rule

E′(t) = ∇H(x)^T x′   (8)

which can be decomposed as

E′(t) = Pc − Pd + Pe   (9)

with:

Pc — the stored energy level en and its variation law defined by e′n = ∇Hn(xn) x′n for the state variable xn.

Pd — the dissipated power qm(w) ≥ 0, with the component's flux and effort variables being in algebraic relation through a single variable w.

Pe — the external power up yp brought to the system through this port, with up being the controllable input of the system and yp being the observable output.

For a storage component, en = Hn(xn) gives the physical energy storage law. If x′n is a flux (resp. effort) variable then ∇Hn(xn) is the dual effort (resp. flux) variable. Similarly, for a dissipative component, the power is qm = Rm(wm), so that if wm is a flux (resp. effort) variable then z(wm) = Rm(wm)/wm is the effort (resp. flux) and gives the dissipation law.

We then consider a passive system obtained by interconnection of these components, given by

b = (x′, w, −y)^T = S(x, w) (∇H(x), z(w), u)^T = S(x, w) a   (13)

with S = −S^T skew-symmetric, H(x) = Σ_{n=1}^{N} Hn(xn) and z(w) = [z1(w1), …, zM(wM)]^T.

The S matrix represents the power exchange between components: since S = −S^T we have a · b = a^T S a = 0, which again leads to the power balance¹

∇H(x) · x′ + z(w) · w − u · y = 0,   (14)

where ∇H(x) · x′ = Pc = E′(t), z(w) · w = Pd and u · y = Pe. The explicit form (6) can be found by solving the second row of (13). The S matrix represents a Dirac structure [2] that expresses the power balance and can be constructed from a component connection graph [8, 18].

¹ The minus sign in −y in Eq. (13) is used to restore the receiver convention used for internal components.
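The structural fact used above, that a skew-symmetric S makes a · b = a^T S a vanish identically, regardless of which constitutive laws are stacked in a, is easy to check numerically. The dimensions and values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Build a random skew-symmetric interconnection matrix S = -S^T
A = rng.standard_normal((n, n))
S = A - A.T
a = rng.standard_normal(n)  # stands for the stacked vector (grad H(x), z(w), u)
b = S @ a                   # stands for (x', w, -y), as in Eq. (13)
# Power balance (14): a . b = a^T S a = 0 for ANY a, by skew-symmetry
print(abs(a @ b))
```

This is why the interconnection itself never creates or destroys power: dissipation and supply enter only through the component laws.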
The idea is to exploit the dynamic equation at each junction point xn, where the approximation is known to be O(h^{p+1}). Indeed, if we had the samples of the exact trajectory, by the Weierstrass approximation theorem, arbitrarily close polynomial approximations converging uniformly to the exact solution could be obtained by computing its derivatives to any desired order.

Since we only have an approximation of order p = 2, we restrict ourselves to a regularity k = 1. This gives four constraints

x̂(0) = xn,  x̂(1) = xn+1,  x̂′(0) = f(xn),  x̂′(1) = f(xn+1)

that can be satisfied by a cubic polynomial (r = 3). We choose to represent it using the Bézier form,

x̂(τ) = Σ_{i=0}^{3} Xi B³_i(τ),  Bⁿ_i(t) = (n choose i) (1 − t)^{n−i} t^i   (23)

with {Xi} being its control polygon and Bⁿ_i(t) being the Bernstein polynomial basis functions, because they have important geometric and finite-difference interpretations [23]. This choice immediately leads to the following equations,

X0 = xn,  X1 = xn + (1/3) f(xn)   (24)

X3 = xn+1,  X2 = xn+1 − (1/3) f(xn+1)   (25)

where the internal control points X1, X2 are computed from the end points xn, xn+1 by first-order forward / backward prediction using the derivative rule

x̂′(t) = Σ_{i=0}^{n−1} Di B^{n−1}_i(t),  Di = n (Xi+1 − Xi)   (26)

An example trajectory is shown in Figure 1.

6. ANTI-ALIASED OBSERVATION

Given an observed signal ũ(t) = y(t) belonging to the class of piecewise polynomials, in order to reject the non-band-limited part of the spectrum, we would like to apply an antialiasing filter operator given by its continuous-time ARMA transfer function H(s), then sample its output ỹ(t) to get back to the digital domain.

Since our anti-aliasing filter will be LTI, we will make use of exact exponential integration and decompose its output on a custom basis of exponential polynomial functions. Without loss of generality we only consider single-input single-output (SISO) filters, since we can always filter each observed output independently.

6.1. State-space ARMA filtering of polynomial input

We want to filter the trajectory by an ARMA filter given by its Laplace transfer function

H(s) = Y(s)/U(s) = (b0 s^N + b1 s^{N−1} + … + bN) / (s^N + a1 s^{N−1} + … + aN)   (27)

This filter can be realized in state-space form as

x̃′ = A x̃ + B ũ   (28)

ỹ = C x̃ + D ũ   (29)

Common choices are the observable and controllable state-space forms. Furthermore, when the denominator can be factored with distinct roots, it is possible to rewrite the transfer function using partial fraction expansion as

H(s) = c0 + c1/(s − λ1) + … + cN/(s − λN)   (30)

which leads to the canonical diagonal form

A = diag(λ1, …, λN),  B = [1, …, 1]^T   (31)

C = [c1 … cN],  D = c0   (32)

6.2. Exact exponential integration

The exact state trajectory is given by the integral

x̃(t) = x̃h(t) + x̃e(t) = e^{At} x̃0 + ∫₀ᵗ e^{A(t−τ)} B ũ(τ) dτ   (33)

as the sum of the homogeneous solution to the initial conditions x̃h and the forced state response with zero initial conditions x̃e, given by the convolution of the input with the kernel e^{At}. Furthermore, when A is diagonal we have

e^{At} = diag(e^{λ1 t}, …, e^{λN t})   (34)

which greatly simplifies the computation of the exponential map. In that case (33) can be evaluated component-wise as

x̃i(t) = e^{λi t} x̃i0 + ∫₀ᵗ e^{λi(t−τ)} ũ(τ) dτ,  i ∈ {1 … N}   (35)

where we used the notation xi to denote the i-th coordinate of the vector x.

6.2.1. Polynomial input

With ũ(t) being a polynomial of degree K in monomial² form and coefficients ũk

ũ(t) = Σ_{k=0}^{K} ũk t^k/k!   (36)

we can expand the forced response x̃e in (35) as a weighted sum

∫₀ᵗ e^{λi(t−τ)} ( Σ_{k=0}^{K} ũk τ^k/k! ) dτ = Σ_{k=0}^{K} ũk ϕ_{k+1}(λi, t)   (37)

with the basis functions {ϕk} being defined by the convolution

ϕk(λ, t) = ∫₀ᵗ e^{λ(t−τ)} τ^{k−1}/(k − 1)! dτ,  k ≥ 1   (38)

One of the main advantages of using a polynomial input (rather than a more general model) lies in the fact that these basis functions can be integrated exactly, avoiding the need of a quadrature
Figure 2: Normalized ϕ-functions for k ∈ {0 … 4}. The real parts of the impulse (blue), step (red), ramp (green), quadratic (magenta) and cubic (black) responses are shown for a complex pole λ = i2π (left plot) and a real pole λ = −5 (right plot) over the unit interval t ∈ [0, 1].

Figure 3: Exact continuous-time responses of a first-order lowpass filter to a polynomial input (in blue).
approximation formula. See Appendix 12 for a detailed derivation and a recursive formula, and Figure 2 for their temporal shapes. Using those we can decompose the local state trajectories as

x̃i(t) = x̃i0 ϕ0(λi, t) + Σ_{k=0}^{K} ũk ϕ_{k+1}(λi, t)   (39)

We note that the initial condition is equivalent to an impulsive input x̃i0 δ(t). This filtering scheme can thus be generalized to non-polynomial impulsive inputs.

Figure 4: Exact continuous-time response of the order-3 Butterworth filter with cutoff pulsation ωc = π to a triangle input at the Nyquist frequency.
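For a constant input ũ(t) = ũ0, the component-wise solution (35) reduces to x̃i(t) = e^{λi t} x̃i0 + ũ0 (e^{λi t} − 1)/λi, i.e. ũ0 ϕ1(λi, t). A small sketch, with assumed pole and input values, cross-checks this closed form against brute-force integration of the ODE:

```python
import numpy as np

# Exact update (33)-(35) for diagonal A and a constant input u(t) = u0:
# x_i(t) = exp(lam_i t) x_i0 + u0 (exp(lam_i t) - 1) / lam_i
lam = np.array([-1.0, -10.0])   # assumed stable real poles
x0 = np.array([1.0, 0.5])
u0 = 2.0
t = 0.25
x_exact = np.exp(lam * t) * x0 + u0 * (np.exp(lam * t) - 1.0) / lam

# Cross-check with a fine forward-Euler integration of x' = diag(lam) x + u0
dt = 1e-5
x = x0.copy()
for _ in range(int(round(t / dt))):
    x = x + dt * (lam * x + u0)
print(x_exact, x)
```

The exact expression is evaluated in one shot, whatever the step size, which is precisely what the numerical scheme of Sec. 6.2.2 exploits.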
6.2.2. Numerical update scheme

Since we only wish to sample the trajectory on a fixed grid tn ∈ Z, we just need to evaluate the local state trajectory x(t) and the output y(t) at t = 1 to finally get the following numerical scheme

x̃i_{n+1} = x̃i_n ϕ0(λi, 1) + Σ_{k=0}^{K} ũ_{k,n} ϕ_{k+1}(λi, 1)   (40)

ỹ_{n+1} = Σ_{i=1}^{N} ci x̃i_{n+1} + c0 ũn(1)   (41)

where the coefficients ϕk(λi, 1) can be pre-computed and the components x̃i_{n+1} evaluated in parallel.

6.3. Filter examples

6.3.1. Low-pass filter of order 1

We consider a first-order low-pass filter with transfer function H(s) = a/(s + a). The temporal response to a piecewise polynomial input {t², 1 − t, 0, 1} is shown in Figure 3 for a ∈ {1, 3, 6, 10}.

6.3.2. Butterworth filter of order 3

To further illustrate the non-band-limited representation capacity of piecewise polynomials, and the effectiveness of the filtering scheme, we have shown in Figure 4 the response of a third-order Butterworth filter with cutoff ωc = π to a triangular input signal. Its Laplace transfer function for a normalized pulsation ωc = 1 is given by H(s) = 1/((s² + s + 1)(s + 1)), with poles λ1 = (−1 − i√3)/2, λ2 = (−1 + i√3)/2, λ3 = −1 and coefficients c0 = 0, c1 = (−3 + i√3)/6, c2 = (−3 − i√3)/6, c3 = 1.

7. APPLICATION: NONLINEAR LC OSCILLATOR

In order to illustrate the proposed method, we consider the simplest example having nonlinear dynamics. For that purpose, we use a parallel autonomous LC circuit with a linear inductor and a saturating capacitor, with the Hamiltonian energy storage function given by

H(q, φ) = ln(cosh(q))/C0 + φ²/(2L)   (42)

where the state q is the charge of the capacitor and φ the flux in the inductor. Its circuit schematic is shown in Figure 5 and its energy storage laws are displayed in Figure 6.

Figure 5: A nonlinear LC oscillator circuit.

By partial differentiation of the Hamiltonian function H by respectively q and φ we get the capacitor's voltage and the inductor's current, while applying the temporal derivative on q, φ gives the capacitor's current and inductor's voltage:

VC = ∂q H = tanh(q)/C0,   IC = q′   (43)

IL = ∂φ H = φ/L,   VL = φ′   (44)

² We use the monomial form here instead of Bernstein polynomials because this is the one that leads to the most straightforward and meaningful derivation.
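The poles and partial-fraction coefficients quoted for the third-order Butterworth filter in Sec. 6.3.2 can be cross-checked numerically; this is a verification sketch, not part of the method itself:

```python
import numpy as np

# H(s) = 1 / ((s^2 + s + 1)(s + 1)); the denominator expands to s^3 + 2s^2 + 2s + 1
poles = np.roots([1.0, 2.0, 2.0, 1.0])

# For a strictly proper H(s) = 1/d(s) with simple poles, the partial-fraction
# coefficient at p_i is c_i = 1/d'(p_i) = 1 / prod_{j != i} (p_i - p_j)
res = np.array([1.0 / np.prod([p - q for q in poles if q != p]) for p in poles])

i = int(np.argmin(np.abs(poles + 1.0)))
print(poles[i], res[i])   # the real pole lambda_3 = -1 and its coefficient c_3 = 1
print(np.sum(res))        # residues sum to zero, consistent with c0 = 0
```

The residues sum to zero because H(s) decays like 1/s³ at infinity, which is exactly the statement c0 = 0 in (30).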
Figure 6: Respective energy storage functions (left plot) and their gradients (right plot), of the nonlinear capacitor (in red) and linear inductor (in blue), for C = 1, L = 1.

(Figure 7: state-space trajectory (q(t), φ(t)) and temporal trajectory φ(t).)

(Figure 8: magnitude spectra in dB over 10²–10⁵ Hz; panels labelled 10x, ZOH, FOH, proposed cubic, and proposed + AA.)
This gives the Branch Component Equations. Applying Kirchhoff Current and Voltage Laws gives the constraints IC = −IL, VC = VL. We can summarize the previous equations with the conservative autonomous Hamiltonian system

x′ = J ∇H(x)   (45)

with

x = [q; φ],  J = [0 −1; 1 0],  ∇H = [∂q H; ∂φ H]   (46)

Its state-space and temporal trajectories are shown in Figure 7. We can see that the numerical scheme preserves the energy, since the discrete points lie exactly on the orbit of the reference trajectory. The reconstructed state-space trajectory also shows a good match with the reference for most of the interpolated segments, except around transition regions at the bottom and top.

The spectrum of the flux φ is shown in Figure 8. One can see that the reference spectrum contains harmonics above twice the representable bandwidth, where they pass below −90 dB. The ZOH and FOH spectra contain spectral images of the non-bandlimited spectrum that decay respectively at −6 dB/oct and −12 dB/oct. Their aliased components in the audio bandwidth start around −80 dB at the Nyquist frequency and decay slowly toward approximately −100 dB at low frequencies.

On the contrary, our method, informed by the dynamics, exhibits both reduced aliasing in the audio bandwidth and a sharpened spectrum around the Nyquist frequency. It also has a higher spectral-image decay rate thanks to its C¹ regularity. Its aliased components start at −85 dB at the Nyquist frequency and decay much faster, to reach −100 dB at about 14 kHz, where they reach a kind of aliasing noise floor caused by higher-harmonics fold-back. Finally, as expected, the 12th-order Butterworth half-band lowpass filter removes components above the Nyquist frequency thanks to the piecewise continuous cubic input.

8. DISCUSSION

First, we highlight the fact that the vector field approximation in (17) acts as a first-order antialiasing filter: it is a projection of the vector field on a rectangular kernel. It prevents high-order spectral images from disturbing the low-frequency dynamics during the numerical simulation, and it is consistent with the underlying piecewise linear approximation model.

Second, the numerical scheme is energy-preserving. From a signal processing perspective, the lowpass filtering effect on the vector field is compensated by the finite difference approximation of the derivative. This is a direct generalization of the mid-point / bilinear methods to nonlinear differential equations.

Third, using the fact that the trajectory approximation has accuracy order p = 2 at the junctions, we can re-exploit the differential equation to reconstruct an informed C¹-continuous cubic trajectory. It exhibits reduced aliasing in the passband and better
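The mid-point connection mentioned in the second point can be illustrated on the LC oscillator (42) itself. The sketch below uses a generic implicit mid-point step solved by fixed-point iteration (not the paper's scheme), and checks that the Hamiltonian stays within a small bounded band:

```python
import numpy as np

C0, L = 1.0, 1.0

def gradH(x):
    q, flux = x
    return np.array([np.tanh(q) / C0, flux / L])  # gradients (43)-(44)

def H(x):
    q, flux = x
    return np.log(np.cosh(q)) / C0 + flux**2 / (2.0 * L)  # Hamiltonian (42)

J = np.array([[0.0, -1.0], [1.0, 0.0]])  # interconnection matrix of (46)

def midpoint_step(x, h, iters=8):
    # Implicit mid-point: x_next = x + h J gradH((x + x_next)/2),
    # solved by fixed-point iteration (a contraction for small h)
    x_next = x.copy()
    for _ in range(iters):
        x_next = x + h * (J @ gradH(0.5 * (x + x_next)))
    return x_next

x = np.array([1.0, 0.0])
h = 0.01
H0 = H(x)
drift = 0.0
for _ in range(2000):
    x = midpoint_step(x, h)
    drift = max(drift, abs(H(x) - H0))
print(drift)
```

The energy error stays O(h²) and does not accumulate, which is the qualitative behaviour the discussion attributes to the proposed energy-preserving scheme.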
The ϕ-functions, which appear when doing exact integration of an LTI system with polynomial input given in monomial form, are defined by the convolution integral

ϕk(λ, t) = ∫₀ᵗ e^{λ(t−τ)} τ^{k−1}/(k − 1)! dτ,  k ≥ 1   (47)

and by definition

ϕ0(λ, t) := e^{λt}   (48)

For λ = 0 it is immediate that

ϕk(λ = 0, t) = t^k/k!   (49)

12.1. Recurrence relation

We first prove that they satisfy the recurrence formula

ϕ_{k+1}(λ, t) = (ϕk(λ, t) − ϕk(0, t))/λ,  λ ≠ 0   (50)

Proof. Using integration by parts,

∫ₐᵇ u(τ) v′(τ) dτ = [u v]ₐᵇ − ∫ₐᵇ u′(τ) v(τ) dτ

with [a, b] = [0, t], u(τ) = e^{λ(t−τ)}, v′(τ) = τ^{k−1}/(k − 1)! and its primitive v(τ) = τ^k/k!, gives

ϕk(λ, t) = [e^{λ(t−τ)} τ^k/k!]₀ᵗ + λ ∫₀ᵗ e^{λ(t−τ)} τ^k/k! dτ = t^k/k! + λ ϕ_{k+1}(λ, t)

which, after using (49) and identification, gives

ϕ_{k+1}(λ, t) = (ϕk(λ, t) − ϕk(0, t))/λ.

Iterating the recurrence from (48) yields the closed form

ϕk(λ, t) = (1/λ^k) ( e^{λt} − Σ_{n=0}^{k−1} (λt)ⁿ/n! )   (56)

Proof. It is immediate to verify that (56) is satisfied for k = 0. Then, assuming that (56) is true for some k ∈ N and using the recurrence (50), we prove

ϕ_{k+1}(λ, t) = (ϕk(λ, t) − ϕk(0, t))/λ
= (1/λ^{k+1}) ( e^{λt} − Σ_{n=0}^{k−1} (λt)ⁿ/n! ) − (1/λ)(t^k/k!)
= (1/λ^{k+1}) ( e^{λt} − Σ_{n=0}^{k} (λt)ⁿ/n! )

so that (56) is also true for k + 1. By induction, (56) is thus satisfied for all k ∈ N.

The ϕ-functions thus represent the tail of the truncated Taylor series expansion of e^{λt}, up to a scaling factor. This is clear when rewriting (56) as

e^{λt} = Σ_{n=0}^{k−1} (λt)ⁿ/n! + λ^k ϕk(λ, t)   (57)
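The definition (47), the special cases (48)-(49), the recurrence (50) and the closed form can all be checked numerically; this is a verification sketch:

```python
import numpy as np
from math import factorial

def phi(k, lam, t):
    """phi_k via the closed form: the tail of the truncated Taylor
    series of exp(lam * t), scaled by lam**(-k)."""
    if k == 0:
        return np.exp(lam * t)            # definition (48)
    if lam == 0:
        return t**k / factorial(k)        # special case (49)
    tail = np.exp(lam * t) - sum((lam * t)**n / factorial(n) for n in range(k))
    return tail / lam**k

lam, t, k = -5.0, 1.0, 3

# Defining convolution (47), evaluated by a fine trapezoidal rule
tau = np.linspace(0.0, t, 200001)
f = np.exp(lam * (t - tau)) * tau**(k - 1) / factorial(k - 1)
num = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(tau))

# Recurrence (50): phi_{k+1} = (phi_k(lam, t) - phi_k(0, t)) / lam
lhs = phi(k + 1, lam, t)
rhs = (phi(k, lam, t) - phi(k, 0.0, t)) / lam
print(phi(k, lam, t), num, lhs - rhs)
```

Note that evaluating the closed form directly is numerically delicate for small |λt| (catastrophic cancellation in the tail), which is one practical reason to use the recurrence or series forms.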
ABSTRACT

The design of graphic equalizers has been investigated for decades, but only recently has fitting the magnitude response closely enough to the control points become possible. This paper reviews the development of graphic equalizer design and discusses how to define the target response. Furthermore, it investigates how to find the hardest target gain settings, the definition of the bandwidth of band filters, the estimation of the interaction between the bands, and how the number of iterations improves the design. The main focus is on a recent design principle for the cascade graphic equalizer. This paper extends the design method to the case of third-octave bands, showing how to choose the parameters to obtain good accuracy. The main advantages of the proposed approach are that it keeps the approximation error below 1 dB using only a single second-order IIR filter section per band, and that its design is fast. The remaining challenge is to simplify the design phase so that sufficient accuracy can be obtained without iterations.

1. INTRODUCTION

The design of graphic equalizers (EQ) is surprisingly difficult, and for this reason it has been investigated for decades [1]. The first application of graphic EQs was the enhancement of audio quality in movie systems in the 1950s [2, 1]. In the early years, graphic EQs were analog, but since the 1980s digital designs have been proposed [2, 3, 4, 5]. The graphic EQ has become one of the standard tools in music production [2, 6] and in audio systems [7, 8, 9, 10].

This paper reviews the development of graphic EQs, investigates the graphic EQ design problem, and discusses the recent efforts to improve and simplify the design of digital graphic EQs. Section 2 of this paper reviews the development of graphic equalizers. In Section 3, we first tackle the definition of the target response and how to evaluate the accuracy of the design. Section 4 elaborates further on a recently proposed cascade octave graphic EQ design [11], trying to understand whether there are ways to improve it. Furthermore, in Section 5, we expand the proposed cascade design to the third-octave case, which is another popular and useful configuration. Section 6 concludes this paper and shows avenues for future research on this topic.

2. HISTORICAL DEVELOPMENTS

The graphic EQ design problem is simple: to fit the magnitude response of a digital filter through control points, which define the gain at several predefined frequency points. These are often called the command gains. It was quickly understood that fitting the magnitude response of an EQ through the command points, which are usually spaced logarithmically in frequency, is very difficult. Only recent digital design methods have achieved sufficient accuracy for hi-fi audio [12, 13, 1].

One straightforward method is to fit the response of a finite impulse response (FIR) filter to the command point data using interpolation and to apply the inverse discrete Fourier transform to obtain the FIR filter coefficients [14, 15]. However, the frequency division in graphic EQs is usually logarithmic, such as octaves, and the use of an FIR filter leads to complications at low frequencies: the impulse response associated with a sharp change in a low-frequency band becomes very long [15]. One idea to reduce this complication is using a multirate system to optimally downsample the long filters at low frequencies [16, 17]. Alternatively, frequency warping can be used to shorten the filter at low frequencies, but still, FIR graphic EQs are currently more costly to implement than infinite impulse response (IIR) graphic EQs [18].

During the analog era, most graphic EQs were based on the parallel structure, which refers to a bank of bandpass filters all of which receive the same input signal [19, 5, 20, 21, 12, 1]. The outputs of the bandpass filters were amplified according to the command gain of each band, and then combined (summed). Such structures suffered from complications due to the interference of the magnitude and phase responses of the neighboring bandpass filters. These often led to notches in the transition region from one band to another, or to accumulation of gain in some bands due to leakage from neighboring bands. Early digital graphic EQs inherited the parallel structure from the analog world [5]. A few years ago, Rämö et al. showed how a graphic EQ can be designed accurately using a parallel filter, which is based on least-squares (LS) optimization of a bank of IIR filters having fixed poles [12].

As an alternative, the cascade structure has been considered for graphic EQs [19, 22, 23, 24, 1]. Traditional parametric EQ filters can be used as building blocks of a cascade graphic EQ [22, 25, 13]. It is well known in digital signal processing that the same transfer function can be implemented using either a parallel or a cascade filter [26]. However, the design of the filter parameter values for these structures can be very different. The main reason for this is that considering the phase response of the band filters is unnecessary in the case of the cascade structure [1, 11]. However, in the case of the parallel structure, the phase response of each band filter is critical, as in the end the output signals of all band filters are combined.

Cascade graphic EQs suffer from similar interference from neighboring band filters as parallel EQs [1]. Some solutions to this problem include variable-Q designs, which change the bandwidth of the band filter according to the gain [27], and higher-order band filters, which improve the summation of the neighboring bands

* J. Liski is supported by the Aalto ELEC Doctoral School. The work was partially funded by Nokia Technologies (Aalto project number 410642).
DAFX-1
DAFx-95
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Figure 1: Definition of the ±1-dB error limits in the case of (a) different and (b) same neighboring command gains (red circles). In both figures, the blue diamonds indicate the extra points midway between each command point.

Figure 2: Worst-case command gain (red circles) settings (a) reported by Oliver and Jot for their design [13] and (b) observed in this work for the new cascade design [11]. The black curve is the magnitude response obtained with the proposed design.
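The error criterion in these captions is easy to state in code: the intermediate check point between two command points lies at the geometric mean of their frequencies, with a target gain equal to the dB average of their gains. A minimal sketch (the function name is ours, not from the paper):

```python
import math

def intermediate_point(f_lo, f_hi, g_lo_db, g_hi_db):
    """Intermediate design point between two neighboring command points:
    geometric-mean frequency and dB-average target gain."""
    f_mid = math.sqrt(f_lo * f_hi)     # geometric mean of the frequencies
    g_mid = 0.5 * (g_lo_db + g_hi_db)  # average of the gains in dB
    return f_mid, g_mid

# Example: octave command points at 1 and 2 kHz with +12-dB and +6-dB gains
f_mid, g_mid = intermediate_point(1000.0, 2000.0, 12.0, 6.0)
print(round(f_mid, 1), g_mid)  # 1414.2 9.0
```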
at the cost of increased computational complexity [28, 23, 24]. Our recent work showed that in the case of the octave cascade graphic EQ, a sufficiently accurate result can be obtained using a fairly simple design, which involves an unusual definition of the filter bandwidth and LS optimization with one iteration [11]. Additionally, the new cascade filter can be implemented with fewer second-order sections than the best parallel graphic EQ, which requires twice as many biquad filter sections as there are command points. It now seems clear that the best graphic EQ design must be based on cascading second-order band filters.

3. DEFINING THE BEST OCTAVE GRAPHIC EQ

There are multiple ways to compare different EQ designs. Commonly, the maximum error, the computational cost of the implementation, and the complexity of the design are used as criteria [12]. This section discusses ways to define the target response and how to test graphic EQ designs. Both aspects affect the evaluation of the maximum error.

3.1. Target Response

Obviously, the magnitude response of a graphic EQ should match the command gains at the control frequencies very closely. In hi-fi audio, a 1-dB accuracy requirement is typical at those points [23, 12, 1]. The error function used is usually the magnitude frequency response error in dB.

However, how to define the target response between the command points is less obvious. Smooth and monotonous transitions from one command point to another are usually desirable, since in practice a large increase or decrease in the filter gain between the command points is not what an audio engineer expects from a graphic EQ. Various interpolation methods have been suggested for producing a virtually continuous target magnitude response from command gain data [12, 29]. Some design methods require such a high-resolution target response.

We have observed that for a cascade octave graphic EQ having low-order band filters, such as second-order IIR filters, interpolating a high-resolution target response is unnecessary, since the band filter's gain cannot make large deviations between command points. To guarantee a monotonous transition between command points, one option is to simply define one intermediate point between each pair of command points, where the gain error is evaluated, as suggested in [11]. The gain at the intermediate point is determined as the average of the adjacent command gains in dB, and its frequency is the geometric mean of the neighboring command frequencies. Figure 1 visualizes this error definition: Fig. 1(a) shows the ±1-dB error limits at three command points and at the two intermediate points between them.

Figure 1(b) presents a common special case in which two or more neighboring command gains are equal. In this case, it is meaningful to require the magnitude response to stay within 1 dB of the command gain at all frequencies between these points as well.

3.2. Hardest Gain Settings

In a graphic EQ, the gains can usually be adjusted in the range of ±12 dB [1]. When trying to find the hardest command gain combinations, it is natural to test the settings in which each gain is set to either +12 or −12 dB, since these produce the largest deviations between the bands. Since there are ten bands in an octave EQ and two alternatives per band, we obtain 2¹⁰ = 1024 different combinations, or binary settings. These combinations can easily be tested exhaustively to find the one producing the largest error.

Figure 2 shows two examples where the gains are selected from the binary cases. Oliver and Jot found that for their proposed design the worst-case gain setting was the one shown in Fig. 2(a) [13]. On the other hand, our previous work found the setting seen in Fig. 2(b) to cause the largest error. Clearly, these two examples have similarities: there are steep transitions between ±12 dB, but also plateaus, where the EQ has to produce the same gain in multiple adjacent bands. In both cases, the largest error was actually produced at such a plateau, at approximately 1 kHz [13] and at 9 kHz [11].

Figure 3: Effect of the value of c on the maximum error in three different command gain settings. The worst case is the one shown in Fig. 2(b). The dashed line indicates the 1-dB error that should not be exceeded.

Figure 4: Normalized amplitude responses of the 8-kHz band filter for three different peak gain values: (a) the original bandwidth, which leads to too wide a response (the responses do not cross inside the boxes), and (b) the adjusted bandwidth, which makes the responses cross at the lower neighboring center frequency. The squares indicate the points where the responses should meet.
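The exhaustive binary search described above is simple to script. The sketch below enumerates all 2¹⁰ ±12-dB settings and keeps the worst case; `eval_max_error_db` is a stand-in for evaluating a given design's maximum dB error (a real test would design the EQ and sample its magnitude response), so the toy metric here only illustrates the search loop:

```python
from itertools import product

def worst_binary_setting(n_bands, eval_max_error_db):
    """Try all 2**n_bands combinations of +/-12-dB command gains and
    return (worst_gains, largest_error) under the supplied error metric."""
    worst, worst_err = None, -1.0
    for signs in product((+12.0, -12.0), repeat=n_bands):
        err = eval_max_error_db(signs)
        if err > worst_err:
            worst, worst_err = signs, err
    return worst, worst_err

# Toy stand-in metric: number of sign changes (steep transitions);
# a real metric would measure the designed EQ's response error.
toy_metric = lambda g: sum(g[i] != g[i + 1] for i in range(len(g) - 1))
gains, err = worst_binary_setting(10, toy_metric)
print(err)  # 9 sign changes for the full zigzag
```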
Based on the two cases reviewed above and our own comprehensive testing, we propose that the 1024 hard binary cases should be utilized to test octave graphic EQs in order to reveal the largest approximation error. Note, however, that for the third-octave EQ, there are about 30 command gains, so exhaustive testing of all combinations may not be viable.

4. OCTAVE GRAPHIC EQ DESIGN

Here, a summary of the new cascade graphic EQ design proposed in [11] is presented. The method is based on designs proposed by Abel and Berners [30] and Oliver and Jot [13]: one filter per octave band is used, and its interaction with its two neighboring filters at their center frequencies is exactly controlled.

The method uses as band filters the following second-order IIR peak/notch filter given by Orfanidis, in which the reference gain at dc is set to 1 [31]:

    H(z) = \frac{1 + G\beta - 2\cos(\omega_c) z^{-1} + (1 - G\beta) z^{-2}}{1 + \beta - 2\cos(\omega_c) z^{-1} + (1 - \beta) z^{-2}},    (1)

where G is the linear peak gain, ω_c = 2πf_c/f_s is the normalized center frequency in radians (the ten standard octave frequencies 31.25 Hz, 62.5 Hz, 125 Hz, etc. are used), f_s is the sampling frequency, and β is defined as

    \beta = \begin{cases} \tan(B/2), & \text{when } G = 1, \\ \sqrt{\dfrac{|G_B^2 - 1|}{|G^2 - G_B^2|}} \, \tan(B/2), & \text{otherwise}, \end{cases}    (2)

where G_B is the linear gain at the bandwidth B = 2πf_B/f_s.

In the IIR section defined by (1) and (2), the bandwidth can be selected such that for the mth band filter, a specified dB gain g_{B,m} = c g_m is reached at the neighboring center frequencies. We found that the choice c = 0.3 leads to a successful octave graphic EQ design [11]. This is illustrated in Fig. 3, which shows the effect of the parameter c with three different command gain settings. As is seen, the desired accuracy is achieved when c has values between 0.28 and 0.38, and thus c = 0.3 can be used. This value of the parameter c leads to an unusual definition of the bandwidth, since traditionally the bandwidth of a resonance is determined as the difference of the −3-dB points on each side of a peak, which refers to 0.7 times the linear gain. However, this non-traditional choice appears to be crucial for accurate automatic design.

The bandwidth is selected as B_m = 1.5ω_{c,m}, which equals the difference between the neighboring upper and lower center frequencies. This way, the behavior of each band filter can be exactly controlled at the center frequencies of both its neighbors. However, at high center frequencies, the bandwidth needs to be adjusted in another way, because the filter response becomes asymmetric.

Figure 4(a) shows examples of the normalized magnitude response of the band filter at 8 kHz for three different gains. The magnitude responses have been normalized on the dB scale by dividing them by their respective dB gain, as suggested in [30]. Without the bandwidth adjustment, the filter responses do not cross at the desired points, i.e., at their center frequency and the two neighboring center frequencies, as seen in Fig. 4(a). This anomaly leads to difficulties in predicting the interaction between neighboring band filters. In the octave design, the asymmetry concerns the three band filters with the highest center frequencies, 4 kHz, 8 kHz, and 16 kHz. Their bandwidth is set so that the 0.3g_m point occurs at the lower neighboring center frequency (but not at the higher one), as shown in Fig. 4(b). This leads to the bandwidth values f_{B,8} = 5580 Hz, f_{B,9} = 9360 Hz, and f_{B,10} = 12160 Hz instead of 6000 Hz, 12000 Hz, and 24000 Hz, respectively.
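Equations (1) and (2) translate directly into code. The sketch below is our illustrative implementation, not the authors' published MATLAB code [33]; it computes the biquad coefficients from dB gains and checks that the peak gain is hit exactly at the center frequency:

```python
import cmath
import math

def orfanidis_peak(g_db, gb_db, wc, bw):
    """Biquad coefficients (b, a) of the Orfanidis peak/notch filter of
    Eqs. (1)-(2): peak gain g_db (dB) at the normalized center frequency
    wc (rad), and gain gb_db (dB) at the bandwidth bw (rad)."""
    G = 10.0 ** (g_db / 20.0)    # linear peak gain
    GB = 10.0 ** (gb_db / 20.0)  # linear gain at the bandwidth
    if G == 1.0:
        beta = math.tan(bw / 2.0)
    else:
        beta = math.sqrt(abs(GB ** 2 - 1.0) / abs(G ** 2 - GB ** 2)) * math.tan(bw / 2.0)
    c = -2.0 * math.cos(wc)
    b = [1.0 + G * beta, c, 1.0 - G * beta]  # numerator of Eq. (1)
    a = [1.0 + beta, c, 1.0 - beta]          # denominator of Eq. (1)
    return b, a

def gain_at(b, a, w):
    """Magnitude response |H(e^{jw})| of the biquad."""
    z1 = cmath.exp(-1j * w)
    num = b[0] + b[1] * z1 + b[2] * z1 ** 2
    den = a[0] + a[1] * z1 + a[2] * z1 ** 2
    return abs(num / den)

# 1-kHz band filter at fs = 44.1 kHz with a +12-dB peak gain and
# g_B = c*g = 0.3 * 12 = 3.6 dB reached at the band edges (c = 0.3).
wc = 2.0 * math.pi * 1000.0 / 44100.0
b, a = orfanidis_peak(12.0, 3.6, wc, 1.5 * wc)
print(round(20.0 * math.log10(gain_at(b, a, wc)), 6))  # 12.0 at the peak
```

Note that when G = 1 the numerator and denominator coincide, so the filter is transparent, which is why the flat case needs its own branch in Eq. (2).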
Figure 5: Normalized amplitude responses of all the band filters, which have gains of 0.3 times their peak gain in dB at the neighboring center frequencies, with different peak gain values: 2 dB (blue), 17 dB (red), and 24 dB (black).

The resulting filter-response shapes for all band filters of the cascade octave graphic EQ are shown in Fig. 5, where the responses are seen to meet at all the desired frequency points (the squares) except for the three highest filters, whose responses do not cross at their higher neighboring frequency. Additionally, Fig. 5 demonstrates the self-similarity of the band filters. Three responses are plotted at each band with different peak gains. Due to the similar shape of each normalized response, the samples taken from the dB amplitude responses can be used as basis functions in order to control the interaction of the band filters at the selected frequencies [30, 13, 11]. The normalized dB amplitude responses of the filters are stored in an interaction matrix B.

The octave design uses 19 design points [11] instead of the ten suggested by Oliver and Jot [13]. These 19 points include the filter center frequencies and the geometric means between them. This decreases the error and thus improves the behavior of the EQ between the command points. With ten octave bands and 19 design frequencies, we obtain a 19-by-10 interaction matrix, which is visualized in Fig. 6. The filters for the interaction matrix are designed with (1) and (2) using a prototype dB gain g_p, which in this case is 17 dB. The values inserted into the interaction matrix represent the magnitude response divided by g_p at the command points and the intermediate points. As is seen in Fig. 6, the diagonal values of the interaction matrix are 1, representing the filter center frequencies, and the other stems indicate the relative effect of each band filter at the other design points.

Figure 6: A 19-by-10 interaction matrix containing the normalized leakage of each band filter to the other frequency points.

The interaction matrix is utilized to determine the optimal dB gain for each filter in the LS sense via matrix inversion [32]. However, since the method uses a non-square matrix, the pseudoinverse B⁺ of the interaction matrix is necessary to obtain the optimal solution [32]. The filter gains g are then obtained as

Figure 7: (a) Maximum difference in filter gains between the iteration rounds, and (b) maximum error after each iteration for the hardest command gain setting of the octave graphic EQ, cf. Fig. 2(b).
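The text specifies a least-squares solution of the filter gains through the pseudoinverse of the interaction matrix. The toy sketch below is our illustration with invented numbers, not the paper's 19-by-10 matrix: it solves the same normal equations for a two-filter, three-design-point case:

```python
def ls_gains(B, t):
    """Least-squares filter gains g minimizing ||B g - t||, solved via the
    normal equations (B^T B) g = B^T t; here B has two columns, so the
    2x2 system is inverted in closed form."""
    btb = [[sum(B[k][i] * B[k][j] for k in range(len(B))) for j in range(2)]
           for i in range(2)]
    btt = [sum(B[k][i] * t[k] for k in range(len(B))) for i in range(2)]
    det = btb[0][0] * btb[1][1] - btb[0][1] * btb[1][0]
    g0 = (btb[1][1] * btt[0] - btb[0][1] * btt[1]) / det
    g1 = (btb[0][0] * btt[1] - btb[0][1] * btt[0]) / det
    return [g0, g1]

# Invented interaction matrix: two band filters sampled at three design
# points (1.0 on the diagonal, partial leakage elsewhere); targets in dB.
B = [[1.0, 0.2],
     [0.5, 0.5],
     [0.2, 1.0]]
t = [12.0, 0.0, -12.0]
g = ls_gains(B, t)
print([round(x, 3) for x in g])  # [15.0, -15.0]
```

The overshoot beyond ±12 dB shows why the optimizer must compensate for leakage between bands: each filter is pushed past its own target so that the sum of the overlapping responses hits the command gains.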
Table 1: Center frequencies and bandwidths for the 31 filters of the third-octave graphic equalizer. The adjusted bandwidths of the six
highest band filters are shown in italics.
1 2 3 4 5 6 7 8 9 10 11
fc (Hz) 19.69 24.80 31.25 39.37 49.61 62.50 78.75 99.21 125.0 157.5 198.4
fB (Hz) 9.178 11.56 14.57 18.36 23.13 29.14 36.71 46.25 58.28 73.43 92.51
12 13 14 15 16 17 18 19 20 21 22
fc (Hz) 250.0 315.0 396.9 500.0 630.0 793.7 1000 1260 1587 2000 2520
fB (Hz) 116.6 146.9 185.0 233.1 293.7 370.0 466.2 587.4 740.1 932.4 1175
23 24 25 26 27 28 29 30 31
fc (Hz) 3175 4000 5040 6350 8000 10080 12700 16000 20160
fB (Hz) 1480 1865 2350 2846 3502 4253 5038 5689 5573
Figure 8: Normalized amplitude responses of all third-octave band filters with three different peak gains, as in Fig. 5. The filter gain at the
neighboring center frequencies, indicated with squares, is set to be 0.4 times the peak dB-gain.
with extra frequencies is used with one iteration step to optimize the filter gains. Differences between the two designs are caused by the unequal number of bands and different bandwidths, and therefore some of the parameters must be reselected.

The third-octave design has 31 bands, whose center frequencies are given in Table 1. We use the center frequencies as well as the geometric mean frequencies between them as design frequencies, which leads to 61 design points. Thus, the size of the interaction matrix is now 61-by-31, and since it is non-square, its pseudoinverse is used in the LS design. The initial interaction matrix is designed using the same prototype gain value as in the octave design, g_p = 17 dB.

The bandwidths for the third-octave design are defined in the same way as in the octave design, by selecting a specified dB value that is achieved at the neighboring center frequencies. When using the same principle as in the octave design for calculating the bandwidths (i.e., the difference of the center frequencies above and below a filter), we obtain B_m = (2^{1/3} - 2^{-1/3})\,\omega_{c,m} ≈ 0.4662\,\omega_{c,m}. Due to the filter asymmetry, the bandwidths of the six uppermost filters are tuned by hand, resulting in the values presented in Table 1. The effect of the manual adjustments is also seen in Fig. 8, where the left sides of the six last filters now cross the boxes at the lower neighboring center frequencies.

The largest difference between the octave and the third-octave design is found in the selection of the parameter c. Initially the value c = 0.3 was tested, but this resulted in too narrow filters and large approximation errors. The EQ is unable to create a flat response when the filters are too narrow, since the overall response droops between the command points. This is shown in Fig. 9(a): when all command gains are set to +12 dB, the error exceeds 1 dB at high frequencies. Figure 9(c) shows that the same can happen in the middle frequencies with certain command gain settings.

To improve the behavior of the third-octave EQ, different values of c were tested. Additionally, a minor modification needed to be applied to the error criterion with respect to the octave design. For the third-octave design, the error is not evaluated at the intermediate points between the center frequencies, although those points are accounted for in the design. Even though the magnitude response of the EQ varies smoothly from one command point to another, the approximation error exceeds 1 dB in the narrow and steep transition bands. As the transition bands are very narrow, such minor undulations are not expected to be perceivable.

Figure 10 shows the effect of the parameter c on the maximum error in three different command gain settings. A suitable c parameter range for the third-octave design, where the maximum error of all three test cases remains below 1 dB, is observed to be 0.38–0.4. In this work, c = 0.4 is used. The improvements achieved by adjusting c to 0.4 are shown in Figs. 9(b) and 9(d), where the approximation error is now smaller than 1 dB and the desired accuracy is thus achieved.

Finally, a non-extreme third-octave EQ design example, taken from [12], is presented in Fig. 11(a). This command gain setting leads to a varied target curve, which the proposed design matches well, as confirmed by the error curve shown in Fig. 11(b). In the plateaus, the error has been evaluated at 16 frequency points between each pair of neighboring command points.
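The spacing rule above can be checked numerically: third-octave centers are spaced by a factor of 2^{1/3}, and each nominal bandwidth is the difference between the neighboring centers, (2^{1/3} − 2^{−1/3}) f_c ≈ 0.4662 f_c. A short sketch reproducing the unadjusted values in Table 1 (band 18 is the 1-kHz filter):

```python
K = 2 ** (1 / 3)  # third-octave spacing ratio

def third_octave_bands(n_bands=31, ref_hz=1000.0, ref_index=18):
    """Center frequencies and nominal (unadjusted) bandwidths of a
    third-octave EQ, referenced to the 1-kHz band."""
    fc = [ref_hz * K ** (k - ref_index) for k in range(1, n_bands + 1)]
    fb = [(K - 1 / K) * f for f in fc]  # neighbor spacing, ~0.4662 * fc
    return fc, fb

fc, fb = third_octave_bands()
print(round(fc[0], 2), round(fb[17], 1))  # 19.69 466.2 (cf. Table 1)
```

The six uppermost bandwidths in Table 1 deviate from this formula because they are the hand-tuned values discussed above.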
Figure 9: Effect of the value of c on the magnitude response. For (a) and (c), all the command gains are at 12 dB, and for (b) and (d), the command gains are at ±12 dB: the target seen in Fig. 2(a) repeats itself until the 31 target points are filled. The horizontal dashed lines indicate the ±1-dB error tolerances, and the arrows show the points where the error exceeds the 1-dB limit. The error is reduced when the value of c is changed from 0.3 to 0.4.

Figure 10: Effect of the parameter c on the maximum error using three different command gain settings. The hard setting is shown in Fig. 12(a).

Figure 11: Third-octave EQ example showing varying command gains: (a) the complete response of the EQ and (b) the error, as defined for the third-octave design in Sec. 5. The dots illustrate the frequency points where the error has been evaluated.

5.1. Comparison With a Previous Accurate Method

In this section, the proposed third-octave design is compared with another graphic EQ which, to our knowledge, is currently the most accurate, i.e., the high-precision parallel graphic EQ (PGE) [12]. It comprises twice as many second-order filter sections as there are bands and has a maximum approximation error of less than 1 dB in all tested command gain configurations. However, it is difficult to design, since a high-resolution interpolation of the target magnitude response is required, as well as a phase response estimation [29].

Table 2 compares the accuracy of the two EQs with different command gain settings. We are interested in the maximum error determined by the guidelines shown in Fig. 1, apart from using the intermediate frequency points, as explained in the previous section. The first test case is a zigzag setting in which the command gains alternate between ±12 dB; this reveals the EQ's ability to create steep transitions. As is seen in Figs. 12(a) and (b), the proposed design and the PGE produce very similar responses everywhere except at very low frequencies below the first command point, and they both stay within ±1 dB of the targets. However, when looking at the maximum error, the proposed method is slightly better, with an approximately 0.2-dB smaller error.

The responses of the second test case are shown in Figs. 12(c) and (d). Here too the command gains vary between ±12 dB, but there are also flat regions between the steep transitions. The gain setting is inspired by the ones seen in Fig. 2. However, in the case of third-octave filters we could not test all the hard binary settings as in the octave version, because of the huge number of combinations (2³¹ compared to 2¹⁰). Instead, some combinations that we thought were hard were tested, and we ended up using the following demanding command gain configuration: t = [12 −12 12 12 −12 12 −12 −12 −12 12 −12 12 12 12 −12 12 −12 −12 12 12 12 −12 −12 12 12 −12 12 −12 −12 12 12]^T. The two methods again produce similar responses, as seen by comparing Figs. 12(c) and (d). When investigating the maximum error, we see that the PGE is slightly better, but both methods stay within ±1 dB of the target.

Table 2: Maximum error of two third-octave graphic EQs in two test target settings. The best result in each case is highlighted.

Case      | PGE     | Proposed
Zigzag    | 0.65 dB | 0.41 dB
Hard case | 0.66 dB | 0.98 dB

Table 3 lists the number of operations during real-time filter-
Figure 12: Magnitude response of (a) the proposed design and (b) the PGE with zigzag (±12 dB) command settings and (c) the proposed
design and (d) the PGE with hard command gain settings (±12 dB) inspired by the targets in Fig. 2.
ing for the PGE and the proposed methods. As is seen, the proposed method is approximately 44% more efficient. This advantage mainly comes from using one biquad filter per band rather than the two per band in the PGE. Even though the PGE uses an optimized structure with one fewer numerator coefficient than the traditional second-order filter [12], the larger number of filter sections negates that advantage.

Table 3: Comparison of the operation count in third-octave EQs.

Additionally, we compared another computational aspect of the graphic EQ design, namely the parameter computing time when a command gain is changed. The update times were calculated in MATLAB as an average of 1000 updates using random values between ±12 dB as command gains. The Internet connection and all other programs were shut down so as not to affect the computation.

The average time for a command gain update was 24 ms for the PGE method and 6.0 ms for the proposed method. This implies that the proposed method is 75% faster than the PGE in updating its parameters. The proposed method requires linear interpolation between the command gains, a matrix inversion, and large matrix multiplications, which increase its update time. However, due to the computation of the high-resolution target magnitude and phase responses and the pseudoinverse of a large matrix, the PGE requires much more time to update its parameters.

In summary, the proposed method is approximately as accurate as the PGE, but requires fewer operations per output sample and is faster in command gain updating, making it the superior design.

6. CONCLUSION AND FUTURE WORK

This paper reviewed the methods and the target response definition for graphic EQ design, proposed a methodology for testing graphic EQs, and expanded a previously proposed accurate graphic EQ to the third-octave equalization problem. In the case of a cascade octave graphic EQ with low-order band filters, one alternative is to evaluate the error at intermediate points between the command points in addition to the command points themselves. This is enough to guarantee a monotonous transition between the bands, since large deviations between bands are impossible using second-order IIR filters having a restricted bandwidth. In addition, some hard command gain settings were presented that can be used to test graphic EQ designs. The largest errors were observed when all command gains were at the extreme values, usually at ±12 dB.

Finally, a previously proposed accurate graphic EQ design was expanded to the third-octave case. The design method uses one second-order IIR filter per band. The interaction between the different band filters is optimized at the band center frequencies and at one extra point between each pair of center frequencies with the help of an interaction matrix. With one iteration step in the interaction-matrix design, the method achieves 1-dB accuracy and is thus applicable to high-quality audio.

The new third-octave design was compared with a previously proposed parallel graphic EQ, which, to our knowledge, was the state-of-the-art graphic EQ prior to this work. The new method achieves approximately the same accuracy but requires fewer operations per output sample and is faster to design. The proposed method is thus currently the best graphic EQ design. The relevant MATLAB code is available online [33].

A remaining research challenge is to simplify the EQ design so that sufficient accuracy is achieved without an iteration step. Possible approaches to this end are the determination of data- or frequency-dependent c and g_p parameters. This would lead to computational benefits, since the interaction matrix and its pseudoinverse would not have to be calculated with each command gain change. Furthermore, the cascade EQ could be converted into a parallel form in order to reap benefits in the filter implementation.
ABSTRACT

In sound production, engineers cascade processing modules at various points in a mix to apply audio effects to channels and busses. Previous studies have investigated the automation of parameter settings based on external semantic cues. In this study, we provide an analysis of the ways in which participants apply full processing chains to musical audio. We identify trends in audio effect usage as a function of instrument type and descriptive terms, and show that processing chain usage acts as an effective way of organising timbral adjectives in low-dimensional space. Finally, we present a model for full processing chain recommendation using a Markov chain and show that the system's outputs are highly correlated with a dataset of user-generated processing chains.

1. INTRODUCTION

Mixing audio involves a range of complex processes that include balancing source amplitudes, applying audio effects, and positioning a sound source in a perceived space. These tasks can be time consuming and are often carried out for either corrective or creative reasons. Corrective tasks are straightforward but time-consuming operations, including noise removal, temporal alignment, and level correction using equalisation and dynamics processing. Creative tasks, on the other hand, require artistic interpretation and can involve perceptually mapping ideas and descriptions of audio to processing parameters.

During production, audio effect modules are often cascaded to create processing chains for each channel and bus in the mix, including the master. This allows combinations of linear and nonlinear systems to apply processing to the audio signal at various points in the workflow, typically using a Digital Audio Workstation (DAW). For corrective purposes, the effects included in each of the processing chains are selected by the engineer as a reaction to audible cues, such as artefacts in the mix. For creative purposes, they can be selected with a view to making a source more or less prominent in the mix, or to achieving a given aesthetic.

1.1. Intelligent Music Production

Intelligent Music Production aims to provide interfaces and algorithms to automate and facilitate the music production process [1]. This should effectively reduce the time spent by producers and engineers on time-consuming, menial tasks, and allow them to focus on the creative aspects of music production. Previously, these systems have been developed for automatic mixing, whereby aspects of the production workflow such as balance [2, 3] and panning [4] can be optimised according to psychoacoustic principles.

Studies in the area of Intelligent Music Production have explored methods for the automation of parameter settings based on external semantic cues, such as descriptive language [5]. This has been applied to a range of audio effects, such as equalisation [6, 7, 8], distortion [9], compression [10] and reverberation [11]. In each of these cases, parameter automation is applied within the context of a single effect, meaning the user of the system needs to be aware of the type of processing required to achieve the desired aesthetic.

The use of descriptive terms in music production can be shown to represent changes in musical timbre, which often require the application of multiple audio effects [12, 13]. This suggests that in order to provide users with a flexible interface, combinations of effects need to be explored. This process is nontrivial, as individual audio effects are complex, multidimensional processing units [6], and combinations of linear and nonlinear systems are noncommutative, resulting in a large number of possible combinations. For example, dynamic range compression placed before an equaliser (EQ) will provide a potentially different outcome than an EQ placed after the compressor [14], even when the settings of the two audio effects are retained and only their order is altered. This is confounded by additional contextual conventions, such as effects being used for specific instrument classes (e.g. drum compression or vocal equalisation).

1.2. Objectives

In this paper, we present a method for full processing chain recommendation, based on a dataset of empirically captured user data. We provide a comparative analysis of audio processing tools, based firstly on the frequency of individual effect usage within a chain, then on combinations thereof. Using this information, we can then identify commonalities and patterns in audio effect usage with respect to contextual attributes such as timbral descriptions, audio effects and genre. We conclude by presenting a system which is able to recommend a number of processing modules based on the likelihood of an effect's position in a processing chain, weighted by external factors. This method for processing chain recommendation can help lower the barrier to entry for novice engineers, and can reduce the time required for expert users when incorporated into an intelligent mixing environment [15].

the JS-Xtract feature extraction library. A full list of the audio features can be found in [19].

3. SINGLE EFFECTS
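The recommendation model named in the abstract and in Sec. 1.2, a Markov chain over effect positions in a chain, can be sketched as follows. This is our illustrative reconstruction from the paper's description, not the authors' system; the example chains, effect names, and greedy decoding are invented:

```python
from collections import Counter, defaultdict

START, END = "<start>", "<end>"

def fit_transitions(chains):
    """First-order Markov model: count how often effect b follows effect a
    in a set of user-generated processing chains."""
    counts = defaultdict(Counter)
    for chain in chains:
        seq = [START] + list(chain) + [END]
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def recommend_chain(counts, max_len=6):
    """Greedily follow the most likely transition from <start> to <end>."""
    chain, cur = [], START
    while len(chain) < max_len:
        cur = counts[cur].most_common(1)[0][0]
        if cur == END:
            break
        chain.append(cur)
    return chain

# Hypothetical user data: EQ-before-compressor is the dominant pattern.
data = [["EQ", "Compressor", "Reverb"],
        ["EQ", "Compressor"],
        ["Compressor", "EQ", "Reverb"],
        ["EQ", "Compressor", "Reverb"]]
model = fit_transitions(data)
print(recommend_chain(model))  # ['EQ', 'Compressor', 'Reverb']
```

A full system would additionally weight these transition counts by the external factors mentioned above, such as instrument class, descriptor, or genre.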
Term         EQ     Comp.  Dist.  Reverb
Genre        0.786  0.799  0.713  0.904
Instrument   0.692  0.625  0.656  0.640
Descriptor   0.766  0.631  0.291  0.464

Table 4: Generality of plugins against the type of term (genre, instrument, descriptor).

[Figure 1: Euclidean distances of the features according to plugin type (SAFE distortion, SAFE equaliser, SAFE compressor and SAFE reverb) and position in the chain.]
instance, it has a high generality score. Equations 1 and 2 calculate a single generality score from the data for a given contextual term (instrument, genre or descriptor). If a plugin only occurs for a small number of the terms, then g will be low. These were adapted from [12]. Table 1 gives the generality of each plugin according to the source instrument.

g_i(p) = \frac{2}{D-1} \sum_{d=0}^{D-1} d \,\mathrm{sort}\, x(d,p)_i    (1)

x(d,p)_i = \frac{n_p(d,i)}{\sum_{d=0}^{D-1} n_p(d,i)}    (2)

Here, g_i(p) is the generality of a plugin p on instrument i, and n_p(d,i) is the number of plugin occurrences on descriptor d and instrument i. These measurements refer to the range of plugins used to process a given instrument.

Table 1 shows that distortion is the least general effect, whilst the EQ exhibits a consistently high generality score. The compressor is most general on Piano, Drums and full Mixes, suggesting that in these cases the effects are applied irrespective of the genre and timbral descriptor. Table 2 measures the generality of descriptors across instrument classes. The genre, like the instrument, has little impact on the choice of plugin, with all plugins attaining similar generality scores across the various genres, shown in Table 4. Distortion is significantly less general when used on Folk samples, indicating that it is only used in very specific use cases, with only 2 instances of distortion across all 22 responses for Folk. The reverb effect followed with similarly low generality scores.

Table 4 shows the cumulative generality scores for an effect being used across contexts: genre, instrument and descriptor. A low score here indicates the term has a high impact on whether a plugin appears. This suggests the genre and instrument play a relatively small role in the selection of effects in a processing chain. However, the descriptor has an impact on whether distortion or reverb is used in the chain, indicating these only appear when a specific descriptor is used. EQ and compressor appear to be universally more general and can appear in any chain.

3.2. Effect Salience

Each processor in the chain has a varying number of parameters, each set empirically by the participant. We quantify this by extracting audio features before and after each effect in the chain. Over 30 temporal and spectral features are averaged over each audio sample, extracted using JS-Xtract [19], in line with the features extracted by the SAFE project [5]. The impact of each plugin can then be characterised by the change in feature space before and after processing. We measure this using the Euclidean distance over each feature dimension, defined in Eq. 3.

d(p,q) = \sqrt{\sum_{i=1}^{N} (q_i - p_i)^2}    (3)

Each feature is vector normalised against all other instances of that same feature within a processing chain, thus capturing relative salience in the chain. Fig. 1 shows the feature distance as a function of position in the chain across all entries in the dataset. This indicates the first plugin generally has the greatest impact, irrespective of plugin type. As the plugin index increases, the feature differences decrease. The mean effect chain length is 1.64; the probability of an effect being selected for a given position in the chain is presented in Table 5.

Effect          1st   2nd   3rd   4th
EQ (E)          0.44  0.43  0.21  0.33
Compressor (C)  0.22  0.28  0.25  0.33
Distortion (D)  0.15  0.10  0.16  0.00
Reverb (R)      0.17  0.18  0.38  0.33

Table 5: Probability of effects per chain position.

For the first position, which includes chains of single effects, the EQ is the most popular, appearing in 44.9% of the total instances, followed by the compressor (22%). The first and second positions retain the same structure, and the effects in order of descending popularity are EQ, compressor, reverb and distortion.
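The generality score of Eqs. 1 and 2 can be sketched as follows. This is a minimal numpy sketch, assuming the sort in Eq. 1 orders the normalised occurrences in descending order; the function and variable names are ours, not the paper's:

```python
import numpy as np

def generality(counts):
    """Generality of one plugin from its occurrence counts n_p(d, i)
    over D contextual terms (Eqs. 1-2): 1.0 when usage is spread evenly
    across all terms, 0.0 when concentrated on a single term."""
    counts = np.asarray(counts, dtype=float)
    D = len(counts)
    x = counts / counts.sum()       # Eq. 2: normalise to a distribution
    x = np.sort(x)[::-1]            # assumed: most-used term gets rank d = 0
    return 2.0 / (D - 1) * np.sum(np.arange(D) * x)   # Eq. 1
```

With this convention, `generality([10, 0, 0, 0])` is 0 (the plugin only ever appears for one term) and `generality([1, 1, 1, 1])` is 1.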
This aspect shifts when moving to the third position, where the most popular effect is the reverb (37.5%), followed by the compressor (25%), EQ (20%) and distortion (16%). Finally, the fourth position is split equally between EQ, compressor and reverb, with distortion never appearing in that position.

3.3. Plugin Order

We consider each processing chain to be a multi-dimensional vector, where each dimension represents a plugin instance. Each index in the vector is then considered to have a finite state. The likelihood of a plugin appearing at position k, given the state at positions {0, ..., k − 1}, can be evaluated using a Markov chain [20, 21]. Eqs. 4 and 5 give the state transition matrix for the chain, highlighting the probability of the next plugin type given the previous plugin, defined in Eq. 6. A fifth state is entered which represents a blank plugin state. The chain must start at this empty plugin, and the chain terminates once this state is re-entered. The state vector v comprises equalisation (E), compression (C), distortion (D), and reverb (R).

[Figure 2: Hierarchical clustering of unique chains based on term usage. The leaves are chains such as D, DE, ERC, ECR, CER, DCE, ED, DC, CED, EDC, DR, EC, CE, C, E, ER, RE, RCE and R.]
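The chain-generation step described above can be sketched as a random walk over the five states, terminating when the blank state is re-entered. The transition probabilities below are illustrative placeholders, not the matrix fitted from the dataset:

```python
import random

# States: blank ("-"), EQ (E), compressor (C), distortion (D), reverb (R).
STATES = ["-", "E", "C", "D", "R"]

# P[s][j] = probability of moving from state s to STATES[j].
# Placeholder values only; each row sums to 1.
P = {
    "-": [0.00, 0.45, 0.25, 0.10, 0.20],   # a chain must start with an effect
    "E": [0.45, 0.05, 0.25, 0.08, 0.17],
    "C": [0.50, 0.15, 0.05, 0.08, 0.22],
    "D": [0.55, 0.15, 0.15, 0.05, 0.10],
    "R": [0.65, 0.10, 0.15, 0.05, 0.05],
}

def generate_chain():
    """Walk the Markov chain from the blank state; the chain terminates
    once the blank state is re-entered."""
    chain, state = [], "-"
    while True:
        state = random.choices(STATES, weights=P[state])[0]
        if state == "-":
            return chain
        chain.append(state)
```

Because every effect row gives a large probability of returning to the blank state, short chains dominate, mirroring the mean chain length of 1.64 reported above.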
chains. For example, warm is a descriptor that can be achieved by 8 different unique plugin chains, using 42% of the unique chains in our dataset, while fuzz makes use of just 15%.

4.1. Transform Similarity

An audio effect can have a more significant effect on the audio signal than others in the chain, based on its respective parameter space. In order to quantify this, audio feature differences across each effect in the chain are captured using Euclidean distance. This is represented as a matrix P with dimensions M × E, where M represents the number of descriptors and E is the number of base effects (EQ, compressor, distortion, reverb). We can then apply dimensionality reduction to this matrix using PCA and project the audio effect classes into the low-dimensional space, as shown in Fig. 4.

[Figure 4: Low-dimensional descriptor mapping with prevalent dimensions. Descriptor labels include crunch, fuzz, punch, crisp, sharp, thick, smooth, boom, warm, cream, air, deep, thin, close, dream, room, bright and damp, plotted against the distortion, compressor, equaliser and reverb dimensions.]

The figure shows that terms associated with specific effects, i.e. those with low generality scores, are highly correlated with the effect dimensions. Fuzz and crunch, for example, are correlated with distortion, damp and room with the reverb, and sharp with the compressor. Terms with a more general representation, such as warm, tend to exhibit lower correlation scores.

5. FULL CHAIN RECOMMENDATION

5.1. Descriptor-based chain recommendation

The Markov chain approach to processing chain generation uses a matrix of conditional probabilities, based solely on the effect in the chain at position k − 1. This method will be inherently biased to favour plugins with a high generality (Tables 1 and 2). For example, there is a high probability that the matrix P (Eq. 5) will generate a chain consisting of a single effect, given the likelihood of 60.7%. To improve the recommendation, we use a state transition matrix based on the specific probabilities for each descriptive term, P_d. Sequentially generating states using P_d will then produce a set of chains for the specified descriptor d. For instance, fuzz will generate a chain of just distortion with a likelihood of 74.38%, and an EQ-distortion chain with a likelihood of 16.53%.

By implementing an independent matrix for each timbral descriptor, we are also able to compare the way in which terms are organised in our generated processing chains to the way the terms are organised in our dataset. By generating the same number of entries per transform, we construct a similarity matrix of terms, based on their frequency of plugin chain usage. We then apply dimensionality reduction, and the resulting mapping is presented in Fig. 5. Here, descriptors that generate similar chains are placed together, such as room and damp, or sharp and punch. This behaviour adds a level of versatility to the system, given that similar or identical chains, which can be used for achieving neighbouring descriptors, have a high probability of co-occurring.

To demonstrate the similarity of the original and generated descriptor mappings, the two spaces can be assessed using the trustworthiness (T_k) and continuity (C_k) metrics [23, 24], shown in Eqs. 8 and 9.

T_k = 1 - \frac{2}{nk(2n-3k-1)} \sum_{i=1}^{n} \sum_{j \in U_i^{(k)}} (r(i,j) - k)    (8)

C_k = 1 - \frac{2}{nk(2n-3k-1)} \sum_{i=1}^{n} \sum_{j \in V_i^{(k)}} (\hat{r}(i,j) - k)    (9)

Here, the distances of the n entries in the two spaces (U and V) are converted to ranks (r and \hat{r}) between points i and j. The measurements then evaluate the distributions of datapoints in the respective spaces over a number of neighbouring datapoints (k).

The low-dimensional space generated by the descriptor-wise state transition matrices (presented in Fig. 5) achieves a trustworthiness score of 0.78 for the original structure of unique terms (the space used to generate the clusters in Fig. 3). This suggests that the organisation of terms is preserved when generating processing chains using the Markov chain approach. Similarly, for continuity, the structure of the probability matrix is retained with a score of 0.782.

For terms with low generality, i.e. those that have very specific plugin usage patterns such as fuzz and dream (see Table 2), very specific plugin chains will appear from P_d. However, for entries which consist of more general effects (warm, smooth and cream), more chains can appear which have lower probabilistic scores. This can be interpreted as a low confidence score, thus producing more variance in the results. P_smooth generates the following chains: EQ (16.67%), EQ-compressor (13.33%), reverb (11.11%) and distortion (11.11%). These all have low and relatively equal probabilities of occurring, highlighting how unspecified this matrix is.

This can be improved by weighting the transition matrix probabilities to penalise plugins which are not related to the term. This is done by incorporating weights which indicate the plugin's prevalence in a chain. A weight w_p is found using Eqs. 10 to 12.

w_p = \frac{\sum_{n=0}^{N-1} \sum_{l=0}^{L-1} f(d,l)\, g(x(l),p)}{\sum_{n=0}^{N-1} n}    (10)

f(d,l) = \begin{cases} 1, & \text{if } l = \operatorname{argmax}(d) \\ 0, & \text{otherwise} \end{cases}    (11)

g(x,p) = \begin{cases} 1, & \text{if } x = p \\ 0, & \text{otherwise} \end{cases}    (12)
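The trustworthiness measure of Eq. 8 can be sketched as below. This is a numpy sketch under the standard interpretation that U_i^(k) contains the points among the k nearest neighbours of i in the low-dimensional space that are not among its k nearest neighbours in the original space; the function names are ours:

```python
import numpy as np

def _rank_matrix(dist):
    """rank[i, j] = position of j when all points are sorted by their
    distance from i (rank 0 is i itself)."""
    order = np.argsort(dist, axis=1)
    rank = np.empty_like(order)
    rank[np.arange(len(dist))[:, None], order] = np.arange(dist.shape[1])
    return rank

def trustworthiness(X, Y, k):
    """Eq. 8: penalise points j that are k-neighbours of i in the
    low-dimensional space Y but intruders w.r.t. the original space X."""
    n = len(X)
    dX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    dY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    rX = _rank_matrix(dX)                      # ranks r(i, j) in X
    knnY = np.argsort(dY, axis=1)[:, 1:k + 1]  # k nearest neighbours in Y
    penalty = sum(rX[i, j] - k
                  for i in range(n) for j in knnY[i] if rX[i, j] > k)
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))
```

Continuity (Eq. 9) is the same computation with the roles of the two spaces swapped.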
[Figure 5: Low-dimensional mapping of descriptors based on generated chain similarity; labels include fuzz, crunch, thin, warm, smooth, dream and thick.]
…shop on Applications of Signal Processing to Audio and Acoustics, Oct 2009, pp. 1–4.

[3] Stuart Mansbridge, Saorise Finn, and Joshua D. Reiss, "Implementation and evaluation of autonomous multi-track fader control," in Audio Engineering Society Convention 132, 2012.

[4] Stuart Mansbridge, Saorise Finn, and Joshua D. Reiss, "An autonomous system for multitrack stereo pan positioning," in Audio Engineering Society Convention 133, 2012.

[5] Ryan Stables, Sean Enderby, Brecht De Man, György Fazekas, and Joshua D. Reiss, "SAFE: A system for the extraction and retrieval of semantic audio descriptors," in 15th International Society for Music Information Retrieval Conference, 2014.

[6] Spyridon Stasis, Ryan Stables, and Jason Hockman, "Semantically controlled adaptive equalisation in reduced dimensionality parameter space," Applied Sciences, vol. 6, no. 4, 2016.

[7] Spyridon Stasis, Jason Hockman, and Ryan Stables, "Navigating descriptive sub-representations of musical timbre," in Proceedings of the International Conference on New Interfaces for Musical Expression, Copenhagen, Denmark, 2017.

[8] Mark Cartwright and Bryan Pardo, "Social-EQ: Crowdsourcing an equalization descriptor map," in Proceedings of the International Society of Music Information Retrieval, 2013, pp. 395–400.

[9] Sean Enderby and Ryan Stables, "A nonlinear method for manipulating warmth and brightness," in Proceedings of the 20th International Conference on Digital Audio Effects, Edinburgh, UK, 2017.

[10] Jacob A. Maddams, Saorise Finn, and Joshua D. Reiss, "An autonomous method for multi-track dynamic range compression," in Proceedings of the 15th International Conference on Digital Audio Effects, 2012.

[11] Prem Seetharaman and Bryan Pardo, "Reverbalize: a crowdsourced reverberation controller," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 739–740.

[12] Ryan Stables, Brecht De Man, Sean Enderby, Joshua D. Reiss, György Fazekas, and Thomas Wilmering, "Semantic description of timbral transformations in music production," in Proceedings of the 2016 ACM on Multimedia Conference, 2016, pp. 337–341.

[13] Taylor Zheng, Prem Seetharaman, and Bryan Pardo, "SocialFX: Studying a crowdsourced folksonomy of audio effects terms," in Proceedings of the 2016 ACM on Multimedia Conference, 2016, pp. 182–186.

[14] Roey Izhaki, Mixing Audio: Concepts, Practices and Tools, Taylor & Francis, 2013.

[15] Nicholas Jillings and Ryan Stables, "Investigating music production using a semantically powered digital audio workstation in the browser," in Audio Engineering Society Conference on Semantic Audio, Erlangen, Germany, June 2017.

[16] Michael H. Birnbaum, "Human research and data collection via the internet," Annual Review of Psychology, vol. 55, pp. 803–832, October 2003.

[17] Mark Cartwright, Bryan Pardo, Gautham J. Mysore, and Matt Hoffman, "Fast and easy crowdsourced perceptual audio evaluation," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 619–623.

[18] Nicholas Jillings, Yonghao Wang, Joshua D. Reiss, and Ryan Stables, "JSAP: A plugin standard for the web audio api with intelligent functionality," in Audio Engineering Society Convention 141, 2016.

[19] Nicholas Jillings, Jamie Bullock, and Ryan Stables, "JS-Xtract: A realtime audio feature extraction library for the web," in 17th International Society for Music Information Retrieval Conference, 2016.

[20] Andrey Markov, "Extension of the limit theorems of probability theory to a sum of variables connected in a chain," Dynamic Probabilistic Systems, vol. 1, 1971.

[21] George Tauchen, "Finite state markov-chain approximations to univariate and vector autoregressions," Economics Letters, vol. 20, no. 2, pp. 177–181, 1986.

[22] Warren S. Torgerson, Theory and Methods of Scaling, Wiley, 1958.

[23] Jarkko Venna and Samuel Kaski, "Visualizing gene interaction graphs with local multidimensional scaling," in ESANN, 2006, vol. 6, pp. 557–562.

[24] Laurens Van Der Maaten, Eric Postma, and Jaap Van den Herik, "Dimensionality reduction: a comparative review," J Mach Learn Res, vol. 10, pp. 66–71, 2009.
ABSTRACT

Electronic music often uses dynamic and synchronized digital audio effects that cannot easily be recreated in live performances. Cross-adaptive effects provide a simple solution to such problems, since they can use multiple feature inputs to control dynamic variables in real time. We propose a generic scheme for cross-adaptive effects where onset detection on a drum track dynamically triggers effects on other tracks. This allows a percussionist to orchestrate effects across multiple instruments during performance. We describe the general structure, which includes an onset detection and feature extraction algorithm, envelope and LFO synchronization, and an interface that enables the user to associate different effects to be triggered depending on the cue from the percussionist. Subjective evaluation is performed based on use in live performance. Implications for music composition and performance are also discussed.

Keywords: Cross-adaptive digital audio effects, live processing, real-time control, Csound.

1. INTRODUCTION

Adaptive audio effects are characterized by a time-varying control on a processing parameter. The control is computed by extracting features from the input audio and mapping their scaled versions to relevant control parameters of an effect [1]. Cross-adaptive audio effects are adaptive audio effects where the features of one signal are analysed to control the processing parameters of another signal. This opens a range of possibilities, as it presents a new form of communication between musicians. The idea of another performer interfering with the way your instrument sounds is disruptive at first, but if routed carefully it can have a profound impact on the way we perform music.

In recent years, we have seen a few implementations of cross-adaptive audio effects. [2] described how such effects can be used to automate the mixing process by utilizing features across multiple tracks to determine the processing applied to each track. Such intelligent mixing systems have been implemented and evaluated for automation of multitrack equalization [3] and of multitrack dynamic range compression [4]. [5–7] presented audio processing plugins with the capability to implement user-defined cross-adaptive effects. [8] developed a genetic algorithm (GA) and artificial neural network (ANN) based cross-adaptive audio effect that can control user-defined parameters to make a source audio file sound as close as possible to a given target audio file.

An advantage of audio post-production is the ability to move recorded segments and synchronize them after recording. Effects applied during production are time-aligned with audio events. Replicating such effects live, with musicians controlling these effects while playing their instruments, is often not feasible. However, it may be possible to use cross-adaptive digital audio effects to synchronize the effect applied to one source to events produced by another source. Using cues from a percussionist to synchronize effects across instruments could be highly beneficial in live performance. It can introduce live instruments to replace sampled sounds, since acting on such cues can keep effects synchronized while accommodating for human error and reaction times.

In this paper, we explore a cross-adaptive framework that enables a percussionist to orchestrate effects on other instruments in real time. In particular, we evaluate such an approach for several scenarios where musicians try to recreate, in live performance, a synchronized audio effect that otherwise would only be implemented in the studio. The effects presented here are feed-forward cross-adaptive audio effects, as the features are extracted from the drum input source(s) and the effects are applied to the synth input [1]. They include short-duration effects such as ducking, tremolo and filter sweeps that can be triggered by drum cues in real time. These effects have been implemented as a live performance tool which can synchronize effects across multiple instruments using cues from the drummer, enabling the drummer to orchestrate effects across multiple instruments during performance.

Though similar effects can be implemented using MIDI triggers and sidechaining, our framework has more flexibility. With MIDI triggers or sidechaining, each drum hit would trigger the cross-adaptive effect. In our framework, the onset detector can be adjusted by the performer to trigger the effect only on louder notes or to avoid re-triggering on closely spaced notes. Using an onset detector also enables the use of onsets from physical cymbals and/or drums, and does not require additional hardware.

2. IMPLEMENTATION

[Figure 1: A signal flowchart for the cross-adaptive audio effect framework.]
The cross-adaptive effects were implemented as a VST plugin using Csound (code available at https://code.soundsoftware.ac.uk/attachments/download/2231/CrossAdaptiveMain.csd). The effects run within a signal flow as depicted in Fig. 1, where an onset detector runs on the percussive input channels. Upon detection of an event, it initializes an envelope profile and an LFO, which are sent to the desired effects acting on another instrument.

The VST plugin has three inputs: two control inputs coming from the kick and snare drums, and one instrument input (in this case the synthesizer) on which the effect is applied. Onset detection was applied to both control inputs to trigger two different effects on the synth channel.

REAPER was used to run the VST plugin since it supports multitrack plugins. The effects were routed as follows:
• Track 1 – Kick Drum Input. The VST is loaded on this track.
• Track 2 – Snare Drum Input. Master Send is switched off; channel 3/4 is sent to Track 1.
• Track 3 – Synth Input. Master Send is switched off; channel 5/6 is sent to Track 1.

… the amount by which the signal needs to decay before allowing the next onset to be detected, and Tslope is the minimum increase in amplitude required between e[n] and e[n−delay] to detect an onset. Whenever the onset detector detects an event from the drum signal, it initiates an event for the duration set by the envelope control panel.

2.2. Envelope Control

The envelope control panel, shown in Fig. 2, sets the duration and behaviour of the ASR envelope that is mapped to the parameters of the effects. The parameters have the same meaning as in any conventional dynamic effect. There are individual envelope controls for the effects associated with different drums.

The attack time sets the time required for the control value to reach its maximum value. The sustain time sets the duration during which the control value will approach and hold its maximum value. Therefore, if the attack time is greater than the sustain time, the control value will never reach its maximum, and it will start reducing after the sustain time has passed. Thus, the actual duration of the effect is not affected by the attack time. The release time sets the amount of time the control value takes to drop to zero from its value at the end of the sustain. The sum of the sustain time and release time determines the effect duration.
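The envelope behaviour described above can be sketched with linear segments. This is a simplification (the plugin's curves may be exponential) and assumes attack and release are strictly positive:

```python
def asr_value(t, attack, sustain, release):
    """Control value at time t (seconds) after the trigger. The value
    ramps toward 1.0 over `attack`, holds while t < sustain, then ramps
    to zero over `release`; total duration is sustain + release and is
    unaffected by the attack time."""
    if t < 0 or t >= sustain + release:
        return 0.0
    if t < sustain:
        return min(t / attack, 1.0)        # attack ramp, clipped at 1
    peak = min(sustain / attack, 1.0)      # level reached when sustain ends
    return peak * (1.0 - (t - sustain) / release)
```

With attack greater than sustain (e.g. `asr_value(t, 0.5, 0.2, 0.3)`), the control value never reaches 1.0, matching the behaviour described above.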
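The onset-triggering rule of Section 2 — a rise of at least Tslope between e[n−delay] and e[n], with a required decay before the next onset may fire — can be sketched as follows. This operates on a precomputed amplitude envelope e; the names follow the text, and the re-arming logic is our reading of the decay threshold:

```python
def detect_onsets(e, delay, t_slope, t_decay):
    """Return indices n where the envelope e rises by more than t_slope
    over `delay` samples; after a trigger, the envelope must fall t_decay
    below its running peak before another onset may be detected."""
    onsets, armed, peak = [], True, 0.0
    for n in range(delay, len(e)):
        if armed and e[n] - e[n - delay] > t_slope:
            onsets.append(n)
            armed, peak = False, e[n]
        elif not armed:
            peak = max(peak, e[n])
            if e[n] < peak - t_decay:   # decayed enough: re-arm
                armed = True
    return onsets
```

Raising t_slope makes the detector fire only on louder hits, and the decay requirement prevents re-triggering on closely spaced notes, as described in the introduction.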
4. PERFORMANCE EVALUATION
[Figure 9: Song 2 recording without effect (Expt. No. 2).]

This song required the synthesizer to start a note with full-depth tremolo on each kick drum hit, and stop the note at every snare drum hit. This was implemented by using the kick-based trigger to enable the output and start an LFO controlling the amplitude modulator for a long duration (longer than the length of one bar), and setting the snare-based trigger to mute the output. The manual performance used a regular note-synchronized tremolo effect and required the keyboardist to control the note onset and release.

With Effect:
This song required the drummer to play normally (as seen in Figs. 8 and 9, the drum tracks are identical), but the keyboardist had to play the notes slightly before the drummer and hold them longer than the song required. This ensured that the note onset and release were controlled by the drum triggers alone. Since the song was slow, this was not difficult and the effect worked very well (see Fig. 8). A constant delay of 2.3 ms was observed between the peak of each drum transient and the note onset/release. This delay was limited by the latency of the sound card and driver.

Without Effect:
Fig. 9 shows that, when manually trying to replicate this song, there were onset differences similar to those in the previous experiment. The slower tremolo of this song caused the phase errors to be smaller than in the previous experiment. However, due to the slow tempo and the abrupt stop at every snare hit, it was very noticeable when the keyboardist released the note later than the snare hit.

4.3.3. Song 3 (Closer)

[Figure 11: Song 3 recording without effect (Expt. No. 13).]

In this experiment, each synthesizer note had ~0.1 second attack and release times and was played staccato. The synth notes were played only with the kick and snare notes.

With Effect:
The effect had an exponential ASR envelope to control the amplitude of the synth output. Due to the exponential nature of the attack curve, the output amplitude grows very slowly initially, staying inaudible for a while. Hence, the perceived onset of the note from the effect was delayed compared to the trigger time of the effect, especially for short attack times (<0.2 seconds). As seen in Fig. 10, performers perceived a lag of about 70 ms while using the effect. This was disruptive for the performers and made it difficult for them to use the effect. We observed that during performance, notes that occurred with the snare hits did not seem to have the distinct lag, although the processing complexity was the same. This was probably because the snare drum itself has a longer sustain, masking the period when the effect output is inaudible, so that the effect slowly blends in as it becomes louder and the snare fades away. The song had multiple fast chord changes. This made the effect more difficult to use, since the keyboardist needed to anticipate the drum onset and hit the note prior to it.

Without Effect:
This experiment was trivial, as the only variable was the onset times. The difference in onset times in this experiment was greater than in the previous experiments, since the song was faster and had frequent chord changes. We see in Fig. 11 an instance where the keyboardist's onset occurs significantly before the drummer's.

4.4.1. Drummer A
6. RESULTS

Performance tests showed that the effect works better for LFO synchronization and fast attacks. Effects with longer attack times do not provide immediate feedback to the performers, thus giving a sense of lag. This longer attack time setting was a larger contributor to lag than any signal processing latency. Fig. 15 compares the time taken to achieve the final recording for each of the experiments; D-A K-B denotes the experiment conducted with Drummer A and Keyboardist B. The time required for performers to achieve the desired performance while using the effect was consistently shorter than when performing with manual effects. This might be biased by the selected songs and tasks, since the effects in the songs were difficult to replicate manually and the effect was particularly useful for the given situations.

[Figure 15: Experiment duration comparison.]

Sound samples from the performance tests are available at: https://code.soundsoftware.ac.uk/attachments/download/2232/Performance%20Examples.rar
also assume that the fundamental is considerably stronger than the partials, and (1) reduces to

$$y_k = a_1 \cos(\omega_1 k T_s + \phi_1) + v_k \qquad (2)$$

where a1, ω1 and φ1 are the fundamental amplitude, frequency and phase respectively, and Ts is the sampling interval. The state vector is constructed as

$$x_k = \begin{bmatrix} \alpha \\ u_k \\ u_k^* \end{bmatrix} \qquad (3)$$

where

$$\alpha = \exp(j\omega_1 T_s), \quad u_k = a_1 \exp(j\omega_1 k T_s + j\phi_1), \quad u_k^* = a_1 \exp(-j\omega_1 k T_s - j\phi_1). \qquad (4)$$

This particular selection of state vector ensures that we can track all three parameters that define the fundamental: frequency, amplitude and phase. The relative advantage of choosing this complex state vector has been described in [13]. The state vector estimate update rule relates x_{k+1} to x_k, where H is the observation matrix. We can see that

$$H x_k = \frac{a_1}{2}\left[\exp(j\omega_1 k T_s + j\phi_1) + \exp(-j\omega_1 k T_s - j\phi_1)\right] = a_1 \cos(\omega_1 k T_s + \phi_1). \qquad (7)$$

2.1. Kalman Filter Equations

The recursive Kalman filter equations aim to minimize the trace of the error covariance matrix. The EKF equations are given as follows:

$$K_k = \hat{P}_{k|k-1} H^{*T} \left[H \hat{P}_{k|k-1} H^{*T} + 1\right]^{-1} \qquad (8)$$

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left(y_k - H \hat{x}_{k|k-1}\right) \qquad (9)$$

where:

• x̂k|k−1, x̂k|k and x̂k|k+1 are the a priori, current and a posteriori state vector estimates respectively.
• P̂k|k−1, P̂k|k and P̂k|k+1 are the a priori, current and a posteriori error covariance matrices respectively.
• Kk is the Kalman gain, which acts as a weighting factor between the observation yk and the a priori prediction x̂k|k−1 in determining the current estimate.
• σw² is modeled as the process noise variance, and I is an identity matrix of dimensions 3 × 3.
• The initial state vector and error covariance matrix are denoted as x̂1|0 and P̂1|0 respectively.

The fundamental frequency, amplitude and phase estimates at instant k are given as

$$f_{1,k} = \frac{\ln\left(\hat{x}_{k|k}(1)\right)}{2\pi j T_s}, \qquad a_{1,k} = \sqrt{\hat{x}_{k|k}(2) \times \hat{x}_{k|k}(3)}, \qquad \phi_{1,k} = \frac{1}{2j} \ln \frac{\hat{x}_{k|k}(2)}{\hat{x}_{k|k}(3)\,\hat{x}_{k|k}(1)^{2k}}. \qquad (14)$$

3.1. Detection of Silent Zones

It is important to keep track of note-off events (unpitched moments) in the signal, because the estimated frequency for such silent regions should be zero. Moreover, whenever there is a transition from note-off to note-on, the Kalman filter error covariance matrix needs to be reset. This is because the filter quickly converges to a steady-state value, and as a result the Kalman gain Kk and error covariance matrix P̂k|k settle to very low values. If there is a sudden change in the frequency of the signal (which happens at note onset), the filter will not be able to track it unless the covariance matrix is reset.

To keep track of silent regions, we divide the signal into frames of 20 ms. One way to determine whether a frame is silent is to calculate its energy.
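The state-vector parameterization of Eqs. (3)–(4) and the estimate read-out of Eq. (14) can be checked numerically. The following sketch (Python/NumPy rather than the paper's Matlab; the signal parameters are hypothetical) builds the complex state for a known sinusoid and recovers its frequency, amplitude and phase:

```python
import numpy as np

fs = 44100.0
Ts = 1.0 / fs
f1, a1, phi1 = 220.0, 0.5, 0.3   # hypothetical ground-truth parameters
k = 100                          # sample index

# State vector of Eqs. (3)-(4): x_k = [alpha, u_k, conj(u_k)]
omega1 = 2 * np.pi * f1
alpha = np.exp(1j * omega1 * Ts)
u_k = a1 * np.exp(1j * (omega1 * k * Ts + phi1))
x = np.array([alpha, u_k, np.conj(u_k)])

# Parameter estimates of Eq. (14)
f_est = (np.log(x[0]) / (2j * np.pi * Ts)).real
a_est = np.sqrt((x[1] * x[2]).real)
phi_est = (np.log(x[1] / (x[2] * x[0] ** (2 * k))) / 2j).real
```

In the ECKF itself the same read-out is applied to the filtered state x̂k|k at every sample, which is what gives the per-sample pitch trajectory.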
The energy of the ith frame, Ei, is given as the sum of the squares of all the signal samples in that frame. If the energy is below −60 dB, then the frame is classified as silent. If the frame has N samples, then

$$E_i = 20 \log_{10} \sum_{n=0}^{N-1} y(n)^2. \qquad (15)$$

However, for noisy input signals, the energy in silent frames is significant. To find silent frames in noisy signals, we make use of the fact that noise has a fairly flat power spectrum. Therefore, for each frame, we calculate the spectral flatness [16]. If the spectral flatness is approximately 1, the frame is classified as silent. Figure 1 shows spectral flatness vs. frame number for an audio file containing a single note preceded and followed by some silence.

[Figure 1: Spectral flatness vs. frame number for a single note preceded and followed by silence.]

The power spectral density (PSD) of the observed signal, Φyy, is given as

$$\Phi_{yy}(e^{j\omega}) = \sum_{n=-\infty}^{\infty} \phi_{yy}(n)\, e^{-j\omega n} \qquad (16)$$

where φyy(n) is the autocorrelation function of the input signal y, given as

$$\phi_{yy}(n) = \sum_{m=-\infty}^{\infty} y(m)\, y(n+m). \qquad (17)$$

The power spectrum is the DTFT of the autocorrelation function, and one way of estimating it is Welch's method [17], which makes use of the periodogram. The power spectral density can be calculated in Matlab with the function pwelch. The spectral flatness is defined as the ratio of the geometric mean to the arithmetic mean of the PSD:

$$\mathrm{spf} = \frac{\sqrt[K]{\prod_{k=0}^{K-1} \hat{\Phi}_{yy}(e^{j\omega_k})}}{\frac{1}{K} \sum_{k=0}^{K-1} \hat{\Phi}_{yy}(e^{j\omega_k})} \qquad (18)$$

where Φ̂yy(e^{jωk}) is the estimated power spectrum for K frequency bins covering the range [−π, π].

For white noise v ∼ N(0, σv²) corrupting the measurement signal y, the silent frames of y will contain pure white noise with the following properties:

$$\phi_{yy}(n) = \sigma_v^2 \delta(n), \qquad \Phi_{yy}(e^{j\omega}) = \sigma_v^2 \;\;\forall\, \omega \in [-\pi, \pi], \qquad \mathrm{spf} = \frac{\sqrt[K]{\sigma_v^{2K}}}{\frac{1}{K}\left(K \sigma_v^2\right)} = 1. \qquad (19)$$

3.2. Calculating Initial State and Resetting the Error Covariance Matrix

It has already been established that resetting the error covariance matrix is necessary whenever there is a change in the frequency content of the signal (i.e., whenever a new note is played). Along with resetting the covariance matrix, we need to recalculate our initial estimates for the state vector, because the rate of convergence of the Kalman filter depends on the accuracy of its initial estimates. Essentially, we need to calculate x̂1|0 and P̂1|0 whenever there is a note onset, i.e., whenever there is a transition from a silent frame to a non-silent frame.

Depending on how strong the attack of the instrument is, we may need to skip a few audio frames until we reach steady state in order to estimate the initial states accurately. This is because the transient is noisy, which makes the frequency estimates go haywire, and we must wait for the signal to settle. The number of frames to skip after detecting a transition from a silent to a non-silent frame can be a user-defined parameter.

To calculate the initial estimates of the state vector x̂1|0, we take an FFT of the first non-silent frame following a silent frame, after multiplying it with a Blackman window and zero-padding it by a factor of 4. Zero-padding increases the FFT size, which increases the sampling density along the frequency axis. We calculate the magnitude and frequency of the peaks in the magnitude spectrum, and take f1,0 as the minimum of the frequencies corresponding to the largest peaks; a1,0 is the magnitude corresponding to f1,0, normalized by the mean of the window and the number of points in the FFT, and φ1,0 is the corresponding phase. Next, we perform parabolic interpolation on f1,0, a1,0 and φ1,0 to get more accurate estimates.

Figure 2: Frequency spectrum used for calculating initial estimates of the state vector.
Figure 3: Plot of (a) ground truth (blue), (b) ECKF pitch detector (red) and (c) YIN estimator (green).

Figure 4: ECKF: estimated amplitude and unwrapped phase for the same input.
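The initial-estimate procedure of Section 3.2 (Blackman window, 4× zero-padding, peak picking with parabolic refinement) can be sketched as below; the note frequency and amplitude are hypothetical test values:

```python
import numpy as np

fs = 44100.0
n = 2048
f_true, a_true = 441.7, 0.4            # hypothetical note parameters
frame = a_true * np.sin(2 * np.pi * f_true * np.arange(n) / fs)

# Blackman window and zero-padding by a factor of 4 (Section 3.2)
win = np.blackman(n)
nfft = 4 * n
mag = np.abs(np.fft.rfft(frame * win, nfft))

# Coarse peak, then parabolic interpolation on log-magnitudes
k = int(np.argmax(mag))
la, lb, lc = np.log(mag[k - 1]), np.log(mag[k]), np.log(mag[k + 1])
delta = 0.5 * (la - lc) / (la - 2 * lb + lc)
f0_est = (k + delta) * fs / nfft
# Amplitude normalized by the window's sum (its mean times the frame length)
a0_est = 2 * mag[k] / np.sum(win)
```

The zero-padding shrinks the bin spacing so that the parabolic fit around the peak bin lands within a fraction of a hertz of the true frequency.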
A typical plot of the spectrum used to calculate the initial estimates is given in Figure 2. The initial states are as follows:

$$\hat{x}_{1|0} = E[x_1] = \begin{bmatrix} e^{2\pi j f_{1,0} T_s} \\ a_{1,0}\, e^{2\pi j f_{1,0} T_s + j\phi_{1,0}} \\ a_{1,0}\, e^{-2\pi j f_{1,0} T_s - j\phi_{1,0}} \end{bmatrix}, \qquad \hat{P}_{1|0} = E\left[(x_1 - \hat{x}_{1|0})(x_1 - \hat{x}_{1|0})^{*T}\right] = 0_{(3,3)}. \qquad (20)$$

3.3. Adaptive Process Noise

We only reset the covariance matrix when there is a transition from a silent to a non-silent frame. However, new notes may be played without any rest in between, and dynamics such as vibrato also need to be captured. To track these changes, an additional term is added to equation 12. σw² is modeled as the process noise variance, given as

$$\log_{10}(\sigma_w^2) = -c + |y_k - H \hat{x}_{k|k}| \qquad (21)$$

where c ∈ Z+ is a constant.¹ The term yk − H x̂k|k is known as the innovation. It gives the error between our predicted value and the actual data. Whenever the innovation is high, there is a significant discrepancy between the predicted output and the input, which is probably caused by a change in the input that the ECKF needs to track. In that case, σw² increases, and a term is added to the a posteriori error covariance matrix P̂k|k+1. This increase in the error covariance matrix causes the Kalman gain Kk to increase in the next iteration according to equation 8. As a result, the next state estimate x̂k|k+1 depends more on the input and less on the current predicted state x̂k|k. Thus, the innovation reduces in the next iteration, and so does σw². In this way, the process noise acts as an error correction term that is adaptive to the variance in the input.

4. RESULTS

The ECKF pitch detector was tested on cello notes downloaded from the MIS database, which contains ground truth labels in the form of annotated note names, and on cello and double bass notes played with various ornaments that we recorded ourselves.² The data was summed with white noise, normally distributed with 0 mean and 0.01 variance, to test the performance of our algorithm with noisy input.

Figure 3 shows the output when the notes A3-Bb3-E3-G3 were played on the cello one after another with pauses in between. The pitch detected by the ECKF is compared with the ground truth and the YIN estimator output. Figure 4 shows the corresponding estimated amplitude and phase plots for the same input given by the ECKF estimator. The mean and standard deviation of the absolute error

¹ For this paper, c lies in the range 7–9, but its value can be tuned according to the input.
² The Matlab implementation can be cloned from https://github.com/orchidas/Pitch-Tracking
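The adaptive process noise of Eq. (21), described in Section 3.3, maps the innovation to a variance. A small sketch with hypothetical numbers (c = 8, from the paper's 7–9 range):

```python
c = 8.0  # constant of Eq. (21); the paper tunes c in the range 7-9

def process_noise_var(y_k, y_pred):
    # Eq. (21): log10(sigma_w^2) = -c + |innovation|
    innovation = abs(y_k - y_pred)
    return 10.0 ** (-c + innovation)

steady = process_noise_var(0.500, 0.501)  # good prediction: variance stays tiny
onset = process_noise_var(0.9, -0.6)      # large innovation at a note change
```

The exponential mapping keeps σw² negligible while the filter tracks well, and inflates it sharply when the prediction breaks down, which in turn raises the Kalman gain on the next iteration.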
with the ECKF. Unlike the YIN estimator, which yields a single pitch estimate for an entire block of data, the ECKF yields a unique pitch value for every sample of data, which fluctuates about a mean value; this fluctuation is perceptually insignificant, hence we neglect it. It would be interesting to …

[Table 1: Error statistics for Figure 3 (YIN vs. ECKF).]

[Figure 5: Close-up of the estimated f0 (YIN vs. EKF estimate), ca. 199–205 Hz, between 17.565 and 17.6 seconds.]

Figure 6a shows the pitch detector output when the input is a portamento played on the note D3 and glided to E3. Figure 6b shows the output when the input is a vibrato on the note D3, and Figure 6c shows the output when the input is a vibrato trill on the notes D3–Eb3. Figures 6a and 6b were notes played on the cello, whereas Figure 6c was a note played on the double bass. All three plots show excellent agreement with what we expect and with the output of the YIN estimator. In fact, in Figure 6c the output of the ECKF is much smoother than that of the YIN estimator and shows less drastic fluctuations. Moreover, the YIN estimator gets the pitch wrong in a number of frames, where it dips and peaks suddenly. It can be concluded that the ECKF pitch follower is ideal for detecting ornaments and stylistic elements in monophonic music. It could also be used successfully in tracking minute changes in speech formants.

Figure 6: ECKF estimated pitch for various ornaments played on the cello and double bass: (a) portamento, (b) D3 vibrato, (c) D3–Eb3 vibrato trill.

4.1. Limitations

Although our proposed pitch detector performs well in many cases, it has certain drawbacks. Firstly, the additional processing that includes detecting silent frames and estimating the initial state makes its implementation more computationally expensive than the one in [13], which is a drawback in real-time processing. This is because estimating the power spectrum with Welch's method requires computing an FFT for each block of data, which is of complexity O(N log₂ N). However, depending on the noise environment, a cheaper algorithm can be used to distinguish between non-silent and silent frames. Calculating initial estimates also requires computing an FFT, but we only need to do that whenever there is a transition from a silent to a non-silent frame, not for every block of data.

Secondly, if the initial states estimated from the FFT are off by 20 Hz or more, the filter is slow to converge to its steady-state value. In Figure 6a, it is observed that the ECKF gets the pitch wrong during the transient, but that is expected since there is no well-defined pitch during the transient. To avoid getting a spurious value of pitch, the method of skipping a few buffers to wait until the steady state, as described in Section 3.2, can be used. Perhaps the ECKF pitch tracker's biggest limitation is tracking notes played quickly together without any pause, as demonstrated in Figure 7. The ECKF is slow in tracking very rapid and large changes: the faster the notes are played, the higher the latency in convergence. A solution to this could be to observe note onsets, and to estimate the initial state and reset the covariance matrix whenever there is an onset
detected. However, we leave this problem open for future work.

Figure 7: ECKF is slow in tracking fast note changes. Notes played are descending fifths from the A string on the double bass.

5. CONCLUSION

In this paper, we have proposed a novel real-time pitch detector based on the Extended Complex Kalman Filter (ECKF). Several adjustments have been made for optimum tracking. An algorithm based on spectral flatness has been proposed to detect silence in the incoming noisy audio signal. The importance of accurate initial state estimates and of resetting the error covariance matrix has been explained. A correction factor, σw², has been included to track fast, small changes in the input signal.

After all these changes have been incorporated into the ECKF pitch detector, the results match those of robust and successful pitch detectors like the YIN estimator. Perhaps its greatest advantage is the fact that it estimates pitch on a sample-by-sample basis, as opposed to most methods, which are block-based. Moreover, its performance is robust to the presence of noise in the measurement signal. The ECKF has been observed to be well suited for tracking fine changes in playing dynamics and ornamentation, which makes it an excellent candidate for real-time transcription of solo instrument music. Additionally, the ECKF also yields the amplitude envelope and phase of the signal, along with its fundamental frequency.

5.1. Future Work

We hope this work will encourage new uses of the Kalman filter in audio and music. In recent years, the Kalman filter has been used in music for online beat tracking [18] and partial tracking [19, 20]. It has also been used for frequency tracking in speech [21]. Since the Kalman filter is such a powerful tool that can work for any valid model with the right state-space equations, we believe it can have many more potential real-time applications in music. Some of the other topics we wish to explore include partial tracking and real-time onset detection with the Kalman filter. A real-time pitch detector along with an onset detector lays the groundwork for real-time transcription, which remains an exciting and advanced problem.

6. REFERENCES

[2] “…ception,” The Journal of the Acoustical Society of America, vol. 23, no. 1, pp. 147–147, 1951.
[3] Alain de Cheveigné and Hideki Kawahara, “YIN, a fundamental frequency estimator for speech and music,” The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[4] A. Michael Noll, “Cepstrum pitch detection,” The Journal of the Acoustical Society of America, vol. 41, pp. 293–309, 1967.
[5] A. Michael Noll, “Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and a maximum likelihood estimate,” in Proceedings of the Symposium on Computer Processing in Communications, 1969, vol. 779.
[6] James A. Moorer, “On the transcription of musical sound by computer,” Computer Music Journal, pp. 32–38, 1977.
[7] J. Wise, J. Caprio, and T. Parks, “Maximum likelihood pitch estimation,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 418–423, 1976.
[8] Etienne Barnard, Ronald A. Cole, Mathew P. Vea, and Fileno A. Alleva, “Pitch detection with a neural-net classifier,” IEEE Transactions on Signal Processing, vol. 39, no. 2, pp. 298–307, 1991.
[9] F. Bach and M. Jordan, “Discriminative training of hidden Markov models for multiple pitch tracking,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing. IEEE, 2005, vol. 5.
[10] Patricio de la Cuadra, Aaron S. Master, and Craig Sapp, “Efficient pitch detection techniques for interactive music,” in ICMC, 2001.
[11] Rudolph E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
[12] Adly A. Girgis, W. Bin Chang, and Elham B. Makram, “A digital recursive measurement scheme for online tracking of power system harmonics,” IEEE Transactions on Power Delivery, vol. 6, no. 3, pp. 1153–1160, 1991.
[13] Pradipta Kishore Dash, G. Panda, A. K. Pradhan, Aurobinda Routray, and B. Duttagupta, “An extended complex Kalman filter for frequency measurement of distorted signals,” in Power Engineering Society Winter Meeting. IEEE, 2000, vol. 3, pp. 1569–1574.
[14] Xavier Serra and Julius Smith, “Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition,” Computer Music Journal, vol. 14, no. 4, pp. 12–24, 1990.
[15] Gabriel A. Terejanu, “Extended Kalman filter tutorial,” Tech. Rep., University of Buffalo, 2008.
[16] A. Gray and J. Markel, “A spectral-flatness measure for studying the autocorrelation method of linear prediction of speech analysis,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 22, no. 3, pp. 207–217, 1974.
[17] Peter Welch, “The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms,” IEEE Transactions on Audio and Electroacoustics, vol. 15, no. 2, pp. 70–73, 1967.
[18] Yu Shiu, Namgook Cho, Pei-Chen Chang, and C.-C. Jay Kuo, “Robust on-line beat tracking with Kalman filtering and probabilistic data association (KF-PDA),” IEEE Transactions on Consumer Electronics, vol. 54, no. 3, 2008.
[19] Andrew Sterian and Gregory H. Wakefield, “Model-based approach to partial tracking for musical transcription,” in SPIE’s International Symposium on Optical Science, Engineering, and Instrumentation. International Society for Optics and Photonics, 1998, pp. 171–182.
[20] Hamid Satar-Boroujeni and Bahram Shafai, “Peak extraction and partial tracking of music signals using Kalman filtering,” in ICMC, 2005.
[21] Özgül Salor, Mübeccel Demirekler, and Umut Orguner, “Kalman filter approach for pitch determination of speech signals,” SPECOM, 2006.
ABSTRACT

Digital oscillators with discontinuities in their time-domain signal derivative suffer from an increased noise floor due to the unbounded spectrum generated by these discontinuities. Common anti-aliasing schemes that aim to suppress the unwanted fold-back of higher frequencies can become computationally expensive, as they often involve repeated sample rate manipulation and filtering.

In this paper, the authors present an effective approach to applying the four-point polyBLAMP method to the continuous-order polygonal oscillator by deriving a closed-form expression for the derivative jumps which is only valid at the discontinuities. Compared to the traditional oversampling approach, the resulting SNR improvements of 20 dB correspond to 2–4× oversampling at 25× lower computational complexity, all while offering a higher suppression of aliasing artifacts in the audible range.

1. INTRODUCTION

A novel complex oscillator algorithm was recently proposed [1], which generates waveforms by traversing a two-dimensional polygon over time. Such a polygon may contain any number of vertices, corresponding to a desired order n > 2, which expresses the vertices per rotation. The term complex conveniently combines both the internal dependence on the complex plane as well as the resulting complex spectral behaviour. A projection of the path around a shape in the complex plane can be interpreted as a time-domain signal, with the rotational speed corresponding to the fundamental pitch. Such a signal will naturally contain a number of discontinuities in its derivative. These discontinuities produce an unbounded spectrum and therefore introduce aliasing artifacts into the signal. In order to produce high-quality audio output, this aliasing should be minimized.

Anti-aliasing of digital oscillator algorithms is a well-developed topic in the literature, but to date has focused primarily on the generation of classical waveforms or on wavetable synthesis. Research into the anti-aliasing of classical analog waveforms began with the invention of the Band Limited Impulse Train (BLIT) method [2], which generates all waveforms by integrating an underlying sequence of bandlimited impulses. The next major advancement was the invention of the Band Limited stEP (BLEP) and derived Band Limited rAMP (BLAMP) methods [3], which can be applied to the anti-aliasing of discontinuities (of any order) in any type of waveform, as long as the position and magnitude of the discontinuity are known. Further research has concentrated on more efficient polynomial approximations of the ideal BLEP, known as polyBLEP [4, 5, 6, 7].

A parallel stream of research has investigated techniques based on pre-integration of the waveform to be generated, followed by digital differentiation. These techniques have been applied to classical waveforms [8, 9] and to wavetable synthesis [10, 11]. Work has also explored techniques that are a hybrid of these two streams [12, 13]. More recently, both approaches to anti-aliasing have been generalized to apply to the processing of arbitrary input signals with a nonlinear waveshaping function [14, 15, 16, 17, 18].

In the following Section 2, the implementation of the polygon-based oscillator is laid out, which is then anti-aliased in Section 3. The results in terms of SNR and performance are discussed in Section 4, followed by a short summary in Section 5.

2. POLYGON OSCILLATOR

The following is a quick recap of the continuous-order polygon waveform synthesis presented in [1]. To create the polygon P of order n, where n denotes the number of vertices after one rotation of the sampling phasor, a corresponding radial amplitude p(ϕ) is generated:

$$p(\varphi) = \frac{\cos\left(\frac{\pi}{n}\right)}{\cos\left(\frac{2\pi}{n}\, \mathrm{mod}\!\left(\frac{\varphi n}{2\pi},\, 1\right) - \frac{\pi}{n}\right)} \qquad (1)$$

where ϕ is a linearly incrementing phase whose slope depends on the desired pitch f0. The amplitude p(ϕ) can then be used to scale a unit circle, resulting in the polygon P in the complex plane:

$$P(\varphi) = p(\varphi) \cdot e^{j\varphi} \qquad (2)$$

Figure 1: Projections of polygons P(ϕ) of different orders n (2.5, 3, 4 and 5.5) from the 2D space (left) radially sampled into the time domain (right).
Several such polygons P(ϕ, n) and their corresponding time-domain projections fx(ϕ) are shown in Figure 1. One can see that with increasing order / edge count n, the polygon naturally approaches a unit circle, which corresponds to a pure sine in the time domain when sampled spatially. A more in-depth analysis of the link between the shape of the polygon and the resulting spectrum can be found in [1].

3. ANTI-ALIASING

Anti-aliasing of the polygon oscillator could be achieved via most of the available approaches. However, the BLAMP technique is particularly suited to this problem, as the exact position and magnitude of the discontinuities in the derivative of the waveform can be obtained in a very efficient closed-form solution.

To apply a four-point polyBLAMP based on third-order B-spline approximation (as presented in [18]), the jump of the first-order derivative has to be evaluated at the points of discontinuity, along with the fractional delay d between the exact time of the discontinuity and the next sample in discrete, sampled time. Table 1 lists the corresponding polynomials that are to be subtracted from the four samples surrounding the discontinuity.

[−2T, −T]: d⁵/120
[−T, 0]: [−3d⁵ + 5d⁴ + 10d³ + 10d² + 5d + 1]/120
[0, T]: [3d⁵ − 10d⁴ + 40d² − 60d + 28]/120
[T, 2T]: [−d⁵ + 5d⁴ − 10d³ + 10d² − 5d + 1]/120

Table 1: Four-point polyBLAMP residual for the four samples surrounding a discontinuity, where d is the fractional delay.

3.1. Fractional delay

In the sampled digital time domain, the samples left and right of the discontinuity can be traced from the continuous form, as the positions of the discontinuities are exactly at ϕ = 2π/n · k, k ∈ [0, 1, ..., ∞). The fractional delays d are the difference between the ceiled sample time and the exact time:

$$d = \lceil f_s/(n f_0) \cdot k \rceil - f_s/(n f_0) \cdot k \qquad (5)$$

where ⌈·⌉ denotes the ceiling function, f0 is the fundamental pitch and fs is the sampling frequency.

Figure 2 shows the discrete time-domain signal f(ϕ) along with its derivative f′(ϕ), marking the precise positions and amplitude jumps of the derivative at the discontinuities by × and their quantized positions by ◦.

Figure 2: Signal f(ϕ) of order n = 3.75 and pitch f0 = 1415 Hz with its derivative f′(ϕ). The discontinuities f̂′(ϕ) are shown at exact and quantized positions.

3.2. Derivative at the discontinuities

The closed-form derivative of equation (3), f′x(ϕ), with a = π/n, is

$$\frac{\mathrm{d}f_x(\varphi)}{\mathrm{d}\varphi} = \frac{\mathrm{d}}{\mathrm{d}\varphi}\, \frac{\cos(\varphi)\cos(a)}{\cos\left[\mathrm{mod}(\varphi, 2a) - a\right]}. \qquad (6)$$

Treating ϕm = mod(ϕ, 2a) as a special case of ϕ, which needs to be differentiated but marked, yields

$$f'_x(\varphi) = \cos(a)\, \frac{\cos(\varphi)\sin(\varphi_m - a) - \sin(\varphi)\cos(\varphi_m - a)}{\cos(\varphi_m - a)^2} \qquad (7)$$

$$f'_y(\varphi) = \cos(a)\, \frac{\cos(\varphi)\cos(\varphi_m - a) + \sin(\varphi)\sin(\varphi_m - a)}{\cos(\varphi_m - a)^2} \qquad (8)$$

3.3. Efficient implementation

We are only interested in the change in amplitude of the derivative at the discontinuities f̂′, which happens when ϕm ∈ [0...2a) wraps around. Looking from both sides (ϕm ↓ 0 and ϕm ↑ 2a) yields:

$$\lim_{\varphi_m \downarrow 0} \hat{f}'_x(\varphi) = \hat{f}'_\downarrow(\varphi) = \frac{-\sin(\varphi)\cos(a) - \sin(a)\cos(\varphi)}{\cos(a)} \qquad (9)$$

$$\lim_{\varphi_m \uparrow 2a} \hat{f}'_x(\varphi) = \hat{f}'_\uparrow(\varphi) = \frac{-\sin(\varphi)\cos(a) + \sin(a)\cos(\varphi)}{\cos(a)} \qquad (10)$$

This leaves us with a simple expression for the change in amplitude at the discontinuities:

$$\hat{f}'(\varphi) = \hat{f}'_\downarrow(\varphi) - \hat{f}'_\uparrow(\varphi) = -2\tan(a)\cos(\varphi) \qquad (11)$$

Using Equation (11), we can now correctly scale the residuum of Table 1. Figure 3 shows the original function and its anti-aliased version as well as their difference, which is zero everywhere except the four samples surrounding a discontinuity.
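The fractional delay of Eq. (5) and the residual polynomials of Table 1 can be sketched as below. A useful sanity check on the Table 1 coefficients: moving d from 0 to 1 shifts the four-sample correction pattern over by exactly one sample, as it must when the discontinuity crosses a sample instant:

```python
import numpy as np

def frac_delay(n, f0, fs, k):
    # Eq. (5): discontinuity k lies at sample time fs/(n*f0)*k;
    # d is the distance from there up to the next (ceiled) sample.
    t = fs / (n * f0) * k
    return np.ceil(t) - t

def polyblamp_residual(d):
    # Table 1: corrections for the four samples around a discontinuity,
    # before scaling by the derivative jump of Eq. (11).
    return np.array([
        d**5 / 120.0,
        (-3*d**5 + 5*d**4 + 10*d**3 + 10*d**2 + 5*d + 1) / 120.0,
        (3*d**5 - 10*d**4 + 40*d**2 - 60*d + 28) / 120.0,
        (-d**5 + 5*d**4 - 10*d**3 + 10*d**2 - 5*d + 1) / 120.0,
    ])

d = frac_delay(n=3.75, f0=1415.0, fs=44100.0, k=1)
r = polyblamp_residual(d)
```

In an oscillator inner loop, the four residual values are scaled by −2 tan(a) cos(ϕ) at the discontinuity and subtracted from the surrounding output samples.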
4. RESULTS

In the following section, the presented method is compared to traditional oversampling (2× and 4×) in terms of Signal-to-Noise Ratio (SNR) and computational load.

4.1. Signal-to-Noise Ratio

Sharp discontinuities exhibit unbounded frequency requirements, which results in fold-back at fNY = fs/2 and effectively raises the noise floor of the oscillator. Therefore, the Signal-to-Noise Ratio (SNR), which denotes the ratio between the energy in the fundamental plus harmonics and that of the rest of the spectrum, is a good metric for measuring and comparing the performance of different anti-aliasing approaches.

Firstly, the harmonic overtones fH,k, which depend on the order n and the fundamental pitch f0, need to be determined. For the continuous-order polygonal oscillator, the frequencies of the first K harmonics fH,k can be found to be

$$f_{H,k}(f_0, n) = f_0 \left( 2\left\lfloor \frac{k}{2} \right\rfloor + 1 + (n - 2)\left(1 + \left\lfloor \frac{k-1}{2} \right\rfloor\right) \right), \qquad (12)$$

where ⌊·⌋ denotes the flooring function, n is the order, f0 the fundamental frequency and k ∈ [1...K].

The energy of the fundamental and harmonics is extracted directly from the Fourier spectrum; the noise energy is simply the difference of the full and the signal energy:

$$\mathrm{SNR} = \frac{E_{\mathrm{sig}}}{E_{\mathrm{noise}}} = \frac{\sum |f_0 + f_H|^2}{\sum |f_\forall|^2 - \sum |f_0 + f_H|^2} \qquad (13)$$

The SNR can then be calculated for both the original signal and various anti-aliased versions. While the measured SNR depends on the order n as well as the fundamental frequency f0, large improvements of the SNR on the order of 20 dB compared to the original signal were found consistently, as shown in Table 2. It can be seen that the measured improvement of the employed method falls between 2× and 4× oversampling, which was implemented using a 64 / 128 (2× / 4× oversampling) order FIR lowpass filter and Matlab's polyphase FIR decimator as implemented in the dsp.FIRDecimator object.

Table 2: SNR (dB) of the original and anti-aliased oscillator signals.
f0 | n | original | 2× OS | 4× OS | BLAMP
400 Hz | 2.53 | 57.2 | 72.5 | 82 | 78.7
751 Hz | 4.42 | 66.7 | 82.6 | 94 | 87.8
1350 Hz | 3.75 | 59.4 | 74 | 83.6 | 80.3

Figure 3: Original and anti-aliased versions of the signal f(ϕ) of order n = 3.75 and pitch f0 = 1350 Hz.

Figure 4 shows the Fourier spectra of the original (top) and three anti-aliased versions: 2× oversampling (second), 4× oversampling (third) and closed-form BLAMP (bottom), with the overtones of the fundamental marked in each. The chosen settings correspond to Table 2 and are n = 2.53, f0 = 400 Hz in Fig. 4a and n = 3.75, f0 = 1350 Hz in Fig. 4b. Although the three anti-aliased versions exhibit slightly different characteristics, the lowered noise floor can be seen clearly in each. In both oversampling cases, a considerable drop in harmonics close to fs/2 can be observed due to the low-pass filtering before decimation, as well as a uniformly suppressed noise floor. On the other hand, the BLAMP implementation exhibits only a small drop in high-frequency harmonics and has a continuously decreasing noise floor.

The stronger aliasing at higher frequencies, although less audible, lowers the overall SNR measurement of the BLAMP approach, since the SNR uniformly weights all aliasing noise. This is an advantage that is not easy to measure but arguably of large importance: BLAMP suppresses the potentially more disturbing aliasing artifacts (below 10 kHz) more strongly than the other approaches.

4.2. Performance

As shown above, relatively large improvements in SNR can be achieved quite efficiently by precisely analysing the oscillator function at hand. The validity of the derivation is limited to the points at the discontinuities, but for a given order, only a single trigonometric function has to be evaluated.

Compared to traditional oversampling, the presented approach falls between 2× and 4× oversampling in terms of SNR improvement, as shown in Table 2. However, oversampling involves three computationally expensive steps (upsampling, lowpass filtering, downsampling), which on a current machine come with an average 25× performance hit when comparing execution times in Matlab.

5. CONCLUSIONS

It was shown that even for non-traditional digital oscillators, elegant and computationally cheap solutions for the polyBLAMP anti-aliasing approach may be found by considering only the necessary points of discontinuity. For an oscillator method that inherently generates high levels of aliasing noise, such as the one presented, this can increase the SNR by 20 dB, beating 2× oversampling, at 25× lower computational complexity. The perceptual comparison is even more favorable, as the polyBLAMP method continuously drives down the noise floor (compared to the constant uniform suppression of oversampling methods), which results in lower aliasing artifacts at audible frequencies below 10 kHz.
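Returning to Section 4.1, the harmonic-frequency formula of Eq. (12) can be sketched as below (floor placement as reconstructed above; for an order-4 polygon it yields sidebands at (kn ± 1)·f0, i.e. 300, 500, 700 and 900 Hz for f0 = 100 Hz):

```python
import numpy as np

def harmonic_freqs(f0, n, K):
    # Eq. (12): frequencies of the first K harmonic overtones of the
    # continuous-order polygon oscillator.
    k = np.arange(1, K + 1)
    return f0 * (2 * np.floor(k / 2) + 1
                 + (n - 2) * (1 + np.floor((k - 1) / 2)))

fH = harmonic_freqs(100.0, 4.0, 4)
```

These frequencies select the "signal" bins of the spectrum in the SNR computation of Eq. (13); everything else counts as noise.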
Figure 4: Magnitude spectra of the polygonal oscillator at different settings and different anti-aliasing strategies: original (top), 2× oversampled (second), 4× oversampled (third) and BLAMP anti-aliased (bottom). Sampling frequency fs = 44.1 kHz; fundamental and harmonics used for SNR computation marked with ◦. (a) Order n = 2.53, fundamental pitch f0 = 400 Hz (SNR: original 57.2 dB, 2× 72.5 dB, 4× 82 dB, BLAMP 78.7 dB); (b) order n = 3.75, fundamental pitch f0 = 1350 Hz (SNR: original 59.4 dB, 2× 74 dB, 4× 83.6 dB, BLAMP 80.3 dB).
ABSTRACT

In this paper, we show how to expand the class of audio circuits that can be modeled using Wave Digital Filters (WDFs) to those involving operational transconductance amplifiers (OTAs). Two types of behavioral OTA models are presented and both are shown to be compatible with the WDF approach to circuit modeling. As a case study, an envelope filter guitar effect based around OTAs is modeled using WDFs. The modeling results are shown to be accurate when compared to state-of-the-art circuit simulation methods.

[Figure 1 residue: two OTA schematic symbols, each with inputs v+/i+ and v−/i−, output vout/iout, and bias current iABC.]

(a) VCCS-Based OTA Symbol. (b) Alternate OTA Symbol.

Figure 1: OTA symbols.
1. INTRODUCTION
DAFX-1
DAFx-130
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
[Figure residue: two OTA schematics — the ideal VCCS model with controlled source gm·vin, and the macromodel with input capacitances Ci+, Ci−, output capacitance Co and output resistance Ro.]
2.1. Ideal VCCS Model of an OTA

An ideal OTA is a voltage dependent current source, with an adjustable gain, called transconductance gm [4]. The output current iout is equal to the multiplication of the differential voltage input vin = v+ − v− and gm:

    iout = gm vin.    (1)

The conductance between the input terminals is assumed zero, as is the case with traditional op amps [25]. The output is assumed to be a current source and so the output impedance is large. This is not the case for the standard op amp, which exhibits low impedance at its output terminal. Low output impedance is often desirable when designing audio circuits, and so modern bipolar OTAs, such as the LM13700 [28], include controlled impedance buffers on device.

The transconductance of a real world device is a multivariate dynamic nonlinear function, dependent on temperature, device geometry, the manufacturing process, etc. [27, 29]. In the ideal case it is a simple function of temperature and a bias current, which the device sinks through a dedicated input terminal. This bias current is often referred to as the Amplifier Bias Current iABC. For an OTA based on a bipolar transistor differential pair, the output current and transconductance are given by [3, 4]

    iout = iABC tanh( vin / (2 VT) )    (2)

    gm = d iout / d vin = ( iABC / (2 VT) ) sech²( vin / (2 VT) )    (3)

In these equations the transconductance depends instantaneously on the differential input voltage. It is however desirable for circuit designers if the transconductance is independent of the differential voltage input, as discussed in §2. Assuming that |vin| ≪ 2 VT, the transconductance becomes

    gm ≈ iABC / (2 VT).    (4)

VT, the thermal voltage, is usually on the scale of tens of millivolts, and so making this assumption limits the dynamic range of the input. However, modern OTAs, such as the LM13700 [28], include linearizing diodes that increase the input range of the differential voltage input while keeping the transconductance gain linear with respect to iABC [3].

Methods on how to handle nonlinearities in WDFs have been proposed in the past. Some have involved the introduction of ad-hoc unit delays [19, 18, 30], simplification of nonlinear devices [17, 31], or ad-hoc [21] or systematic [32] global iteration. A systematized method, sidestepping the aforementioned limitations, was proposed in [23]. To balance accuracy, physical interpretability and complexity we make the simplification that the transconductance is a linear function of iABC, as in (4).

Modified Nodal Analysis (MNA) is a systematic way to keep track of physical quantities within a circuit [33]. The method becomes automatable by stamping each component into a MNA matrix. MNA element stamps will also work with the Nodal DK-method for deriving nonlinear state-space systems [12], and in §3 we will describe how to transfer a populated MNA matrix from the Kirchhoff domain to the wave domain [22]. An ideal OTA is shown in Figure 2, while a MNA element stamp is given by (5):

           α      β      n
    γ  [   0      0      1  ]
    δ  [   0      0     −1  ]    (5)
    n  [  −gm     gm    −1  ]

Nodes shown in Figure 2 are α = v+, β = v−, γ = vout, δ = ground.

2.2. OTA Macromodel

Real-world OTAs exhibit multiple nonidealities, some of which may lead to audible effects and need to be taken into account when designing or analyzing audio circuits. Similar to the nonidealities exhibited by standard op amps [25], real-world OTAs have finite input and output conductances and capacitances, input offset voltage, input bias currents, input offset current, differential and common mode gain, and also other OTA-specific nonidealities such as frequency dependent transconductance gain [26].

Choosing which effects to include in a macromodel is a trade-off between complexity and accuracy. Complex nonlinear macromodels for CMOS OTAs exist in the literature [27] and can be adapted to bipolar OTAs by tuning the model parameters. In this paper we balance complexity and accuracy by proposing a linear macromodel which models the input and output capacitances Ci+, Ci− and Co as well as a finite output resistance Ro.

3. WAVE DIGITAL FILTERS

Here, we briefly elaborate on the recent theoretical developments that have allowed us to model circuits containing OTAs using WDF techniques. For the sake of brevity a detailed discussion of WDF theory is omitted. The interested reader is referred to the classic article by Fettweis [13] and other recent work in the field [17, 34, 35, 36].

The scattering behavior of multiport series and parallel adaptors has been known since the 1970s [37]. The issue with complicated topologies that cannot be divided into series and/or parallel adaptors was first recognized by Martens and Meerkötter [38].
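To illustrate equations (2)–(4), the following small sketch compares the exact and linearized transconductance; the room-temperature thermal voltage of about 26 mV is an assumed value, not taken from the paper:

```python
import math

VT = 0.0259  # thermal voltage, ~26 mV at room temperature (assumed value)

def gm_exact(v_in, i_abc):
    """Instantaneous transconductance, (3): (i_ABC / 2VT) * sech^2(v_in / 2VT)."""
    return i_abc / (2 * VT) / math.cosh(v_in / (2 * VT)) ** 2

def gm_linear(i_abc):
    """Linearized transconductance, (4), valid for |v_in| << 2 VT."""
    return i_abc / (2 * VT)

i_abc = 60e-6  # one of the bias currents used in the Results section
for v_in in (1e-3, 10e-3, 50e-3):
    err = 1 - gm_exact(v_in, i_abc) / gm_linear(i_abc)  # = tanh^2(v_in / 2VT)
    print(f"v_in = {1e3 * v_in:4.1f} mV -> gm error {100 * err:5.1f} %")
```

At 1 mV input the error is negligible; at 50 mV (about 2 VT) the linearization is off by more than half, which is why the linearizing diodes of devices like the LM13700 matter for usable input range.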
[Figure 4 residue: block diagram — the input feeds an input buffer, filter and output buffer; an envelope follower derives the envelope that controls the filter.]

Figure 4: An abstract building block diagram of an envelope filter.

Table 1: Component values

    Component                       Value
    R10                             10 kΩ
    R11, R12, R17, R18              1 kΩ
    R13, R15, R16, R19, R20         22 kΩ
    R14                             100 kΩ
    C6                              1 µF
    C7, C8                          10 nF
[Figure 5 residue: envelope filter circuit schematic, with two OTAs biased by iABC, bias voltage vbias, supply vEE, input vin, output vout, and components R10–R20, C6–C8.]

[Figure residue: idealization of the Darlington pair emitter follower in three steps — ideal realization, nullor shuffling, ideal buffer.]
    H(s) = H0 · [ s / (s + ωc) ] · [ (ω0/Q∞) s / ( s² + (ω0/Q∞) s + ω0² ) ]    (7)

where the three factors are the gain, the highpass section and the bandpass filter, respectively.

The transfer function this circuit realizes is essentially a 1st order highpass filter, composed of components C6, R10 and R11, cascaded with a 2nd order bandpass filter [46]. To simplify the expression of the transfer function, components with identical values are grouped together: Ra = R10, Rb = R11, R12, R17, R18, Rc = R15, R16, R20, Rd = R14, Ca = C6 and Cb = C7, C8. We define Rq = Rc + r Rd, where r ∈ ]0, 1] determines the range.

    ωc = 1 / ( Ca (Ra + Rb) )    (8)

    H0 = ( Rb Rc / Rq + Rb + Rc ) / ( 3 (Ra + Rb) )    (9)

    ω0² = Rb² gm² / ( Cb² (Rb + Rc) ( Rb Rc / Rq + Rb + Rc ) )    (10)

    Q∞ = ( Rq / Rc ) √( ( Rb Rc / Rq + Rb + Rc ) / (Rb + Rc) )    (11)

Interestingly, Rq, controlled by the range, influences all parameters of the transfer function except ωc. The transconductance gm, whose highest value can be set by the sensitivity knob in the envelope follower section, only influences the center frequency. That means that the filter is a constant-Q filter with respect to the transconductance, surely a desirable trait when sweeping the frequency spectrum.
[Figure 7 residue: the envelope filter circuit annotated with wave-domain ports and the corresponding WDF structure — sources Vin, Vbias, elements R10–R20, C6–C8, Ci+, Ci−, Co, Ro, gm1, gm2, adaptors S1, S2, P1–P9, I1, and root R1.]

(a) Input stage equivalence. (b) Rearranging circuit to highlight Wave Digital Filter structure.

Figure 7: Setting up circuit to highlight topology, and corresponding Wave Digital Filter structure.
[Figure 8 residue: SPQR tree with root R1; adaptors S1, S2, P1–P9 connecting R10–R20, C6–C8, Ci+, Ci−, Co, Ro, Vbias, I1 and vin.]

Figure 8: Macromodel OTA—Filter SPQR tree.

[Figure 9 residue: reduced SPQR tree with root R1.]

Figure 9: Ideal OTA—Filter SPQR tree.

4.2. WDF Using Macromodel OTA

By first modeling the filter using the macromodel OTA we can show how to derive the complete WDF structure. In §4.3 we simplify this WDF to model the ideal OTA case. We chose the macromodel parameters Ci+ = Ci− = 5 pF, Co = 700 pF and Ro = 50 MΩ so that a Bode plot of the macromodel-based filter matches a Bode plot of a component-level-based model².

We proceed to model the filter by following the steps described in §3.1. We limit the size of the resulting R-type adaptor by absorbing resistors into the bias voltage sources Vbias. Capacitors are discretized using the bilinear transform. The filter circuit rearranged to highlight the WDF structure is shown in Figure 7b, while the corresponding WDF structure is shown in Figure 7c.

4.3. WDF Using Ideal OTA

For the ideal OTA case we simply remove the components belonging to the macromodel (both Ro, both Co, both Ci−, and both Ci+) and follow the same procedure as before. Again choosing the R1 adaptor as the root, the SPQR tree for the resulting WDF structure is given in Figure 9.

5. RESULTS

Comparisons of Bode plots with three amplifier bias currents iABC = {6, 60, 600} µA and four range settings r = {0.01, 0.22, 0.6, 1.0} are shown in Figure 10. The input is supplied via the ideal voltage source, vin, and the output is taken at vout in Figure 5.

The ideal OTA based circuit shows excellent results when compared to the transfer function. The only visible difference between the two happens as the frequency approaches the Nyquist frequency, where the warping effects from the bilinear transform become noticeable. The plot of the macromodel is in good accordance with a component-level SPICE model, where the magnitude spectrum matches almost exactly throughout the range of amplifier bias currents and range controls. There are minor differences in the location of critical frequencies and bandwidth between the two plots in Figure 10.

We briefly study the circuit's behavior under time-varying conditions. The input is a 440 Hz sawtooth; r, the parameter controlling the range, is set to 0.01; and iABC is increased linearly over time, as indicated in the first row of Figure 11. In the second row of the same figure, a SPICE simulation of the filter circuit with an ideal OTA is compared to the ideal OTA WDF. The third row compares the ideal and macromodel WDFs to a component-level model of the LM13700 OTA as simulated in SPICE. Despite the assumptions of (4) and idealizing Darlington pairs as ideal buffers, good results are obtained.

[Figure 10 residue: two Bode plots, magnitude (dB) and phase (degrees) versus frequency (Hz), 10² to 10⁴ Hz.]

Figure 10: Bode plot comparison of magnitude (blue) and phase (orange) spectrums. Upper plot compares the ideal OTA transfer function (dashed lines) and WDF model (dotted lines). Lower plot compares the component-level SPICE model of the LM13700 (dashed lines) with a macromodel-based model (dotted lines).

² http://www.idea2ic.com/LM13600/SpiceSubcircuit/LM13700_SpiceModel.html
[Figure 11 residue: three rows of time-domain plots. Top: input vin (V) and linearly increasing iABC (A, ×10⁻⁵). Middle: vout, SPICE versus WDF. Bottom: vout, component-level SPICE versus WDF (macromodel) and WDF (ideal).]
The clipping behavior of the OTA (3) will have a more pronounced effect as the amplitude of the input is increased. This will cause the differences between the simulations from the component-level SPICE model and the WDFs to deviate more at higher amplitudes than lower ones.

6. CONCLUSIONS

Two behavioral models of the operational transconductance amplifier, a commonly found component in audio circuits, were presented. How to incorporate the models into WDFs was explained, and a WDF model of an envelope filter was derived and simulated. Excellent results in the frequency domain were obtained when compared to an analytically derived transfer function of the filter section of the envelope filter, while the results from a macromodel-based WDF performed well when compared to a component-level SPICE model of the LM13700 OTA.

The circuit was briefly studied under time-varying conditions and good results obtained when compared with state-of-the-art circuit simulation software, SPICE. In future work we hope to incorporate the clipping behavior of the OTA into our simulations. We hope also to further elaborate on the time-varying properties of WDFs, in particular with respect to the choice of s-to-z plane transform and/or numerical method, as well as the choice of wave variables.

7. REFERENCES

[1] R. Marston, "Understanding and using OTA op-amp ICs, part 1," Nuts and Volts, pp. 58–62, Apr. 2003.
[2] R. Marston, "Understanding and using OTA op-amp ICs, part 2," Nuts and Volts, pp. 70–74, May 2003.
[3] A. Gratz, "Operational transconductance amplifiers," Tech. Rep., 2008, Online: synth.stromeko.net/diy/OTA.pdf.
[4] T. Parveen, Textbook of Operational Transconductance Amplifier and Analog Integrated Circuits, I.K. International Pvt. Ltd., 2009.
[5] A. Huovilainen, "Enhanced digital models for analog modulation effects," in Proc. Int. Conf. Digital Audio Effects (DAFx-05), Madrid, Spain, Sept. 2005.
[6] J. Pakarinen, V. Välimäki, F. Fontana, V. Lazzarini, and J. S. Abel, "Recent advances in real-time musical effects, synthesis, and virtual analog models," EURASIP J. Adv. Signal Process., 2011.
[7] Roland Corporation, "Juno 60 service notes," Tech. Rep., Apr. 1983.
[8] Roland Corporation, "SH-101 service notes," Tech. Rep., Nov. 1982.
[9] Korg Electronic Laboratory Corporation, "MS-20 service notes," Tech. Rep., 1978.
[10] O. Kröning, K. Dempwolf, and U. Zölzer, "Analysis and simulation of an analog guitar compressor," in Proc. Int. Conf. Digital Audio Effects (DAFx-11), Paris, France, Sept. 19–23, 2011.
[11] M. Holters and U. Zölzer, "Physical modelling of a wah-wah pedal as a case study for application of the nodal DK method to circuits with variable parts," in Proc. Int. Conf. Digital Audio Effects (DAFx-11), Paris, France, Sept. 19–23, 2011.
[12] D. T. Yeh, J. S. Abel, and J. O. Smith, "Automated physical modeling of nonlinear audio circuits for real-time audio
effects—Part I: Theoretical development," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 4, pp. 728–737, May 2010.
[13] A. Fettweis, "Wave digital filters: Theory and practice," Proc. IEEE, vol. 74, no. 2, pp. 270–327, 1986.
[14] S. Bilbao, Wave and Scattering Methods for Numerical Simulation, John Wiley & Sons, Ltd., 2005.
[15] A. Sarti and G. De Poli, "Toward nonlinear wave digital filters," IEEE Trans. Signal Process., vol. 47, no. 6, pp. 1654–1668, 1999.
[16] S. Petrausch and R. Rabenstein, "Wave digital filters with multiple nonlinearities," in Proc. 12th European Signal Process. Conf. (EUSIPCO), Sept. 2004, pp. 77–80.
[17] G. De Sanctis and A. Sarti, "Virtual analog modeling in the wave-digital domain," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 4, pp. 715–727, 2010.
[18] J. Pakarinen and M. Karjalainen, "Enhanced wave digital triode model for real-time tube amplifier emulation," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 4, pp. 738–746, May 2010.
[19] A. Bernardini and A. Sarti, "Dynamic adaptation of instantaneous nonlinear bipoles in wave digital networks," in Proc. 24th European Signal Process. Conf. (EUSIPCO), Aug. 2016, pp. 1038–1042.
[20] M. Karjalainen and J. Pakarinen, "Wave digital simulation of a vacuum-tube amplifier," in IEEE Int. Conf. Acoust., Speech, Signal Process., May 2006.
[21] S. D'Angelo, J. Pakarinen, and V. Välimäki, "New family of wave-digital triode models," IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 2, pp. 313–321, 2013.
[22] K. J. Werner, J. O. Smith III, and J. S. Abel, "Wave digital filter adaptors for arbitrary topologies and multiport linear elements," in Proc. Int. Conf. Digital Audio Effects (DAFx-15), Trondheim, Norway, Nov. 2015.
[23] K. J. Werner, V. Nangia, J. O. Smith III, and J. S. Abel, "Resolving wave digital filters with multiple/multiport nonlinearities," in Proc. Int. Conf. Digital Audio Effects (DAFx-15), Trondheim, Norway, Nov. 2015.
[24] A. S. Sedra and K. C. Smith, Microelectronic Circuits, Oxford University Press, New York, sixth edition, 2015.
[25] K. J. Werner, W. R. Dunkel, M. Rest, M. J. Olsen, and J. O. Smith III, "Wave digital filter modeling of circuits with operational amplifiers," in Proc. 24th European Signal Process. Conf. (EUSIPCO), Budapest, Hungary, Aug. 2016.
[26] C. Acar, F. Anday, and H. Kuntman, "On the realization of OTA-C filters," Int. J. Circuit Theory Appl., vol. 21, pp. 331–341, 1993.
[27] H. Kuntman, "Simple and accurate nonlinear OTA macromodel for simulation of CMOS OTA-C active filters," Int. J. Electron., vol. 77, no. 6, pp. 993–1006, 1994.
[28] Texas Instruments, "LM13700 datasheet," Tech. Rep., Nov. 1999.
[29] R. L. Geiger and E. Sánchez-Sinencio, "Active filter design using operational transconductance amplifiers: A tutorial," IEEE Circuits Devices Mag., vol. 1, no. 2, pp. 20–32, Mar. 1985.
[30] K. Meerkötter and T. Felderhoff, "Simulation of nonlinear transmission lines by wave digital filter principles," in Proc. IEEE Int. Symp. Circuits Syst., May 1992, vol. 2, pp. 875–878.
[31] A. Bernardini, K. J. Werner, A. Sarti, and J. O. Smith III, "Modeling nonlinear wave digital elements using the Lambert function," IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 63, no. 8, pp. 1231–1242, 2016.
[32] T. Schwerdtfeger and A. Kummert, "A multidimensional approach to wave digital filters with multiple nonlinearities," in Proc. 22nd European Signal Process. Conf. (EUSIPCO), Lisbon, Portugal, Sept. 2014, pp. 2405–2409.
[33] C. Ho, A. Ruehli, and P. Brennan, "The modified nodal approach to network analysis," IEEE Trans. Circuits Syst., vol. 22, no. 6, pp. 504–509, June 1975.
[34] K. J. Werner, Virtual Analog Modeling of Audio Circuitry Using Wave Digital Filters, Ph.D. dissertation, Stanford University, 2016.
[35] A. Bernardini, K. J. Werner, A. Sarti, and J. O. Smith III, "Modeling a class of multi-port nonlinearities in wave digital structures," in Proc. 23rd European Signal Process. Conf. (EUSIPCO), 2015.
[36] K. J. Werner, W. R. Dunkel, and F. G. Germain, "A computational model of the Hammond organ vibrato/chorus using wave digital filters," in Proc. Int. Conf. Digital Audio Effects (DAFx-16), Brno, Czech Republic, Sept. 2016, pp. 271–278.
[37] A. Fettweis and K. Meerkötter, "On adaptors for wave digital filters," IEEE Trans. Acoust., Speech, Signal Process., vol. 23, no. 6, pp. 516–525, Dec. 1975.
[38] G. O. Martens and K. Meerkötter, "On N-port adaptors for wave digital filters with application to a bridged-tee filter," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Munich, Germany, Apr. 1976, pp. 514–517.
[39] D. Fränken, J. Ochs, and K. Ochs, "Generation of wave digital structures for networks containing multiport elements," IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 52, no. 3, pp. 586–596, Mar. 2005.
[40] Ó. Bogason, "Modeling audio circuits containing typical nonlinear components with wave digital filters," M.S. thesis, McGill University, Montreal, Quebec, Canada, 2017.
[41] G. Martinelli, "On the nullor," Proc. IEEE, vol. 53, no. 3, pp. 332, Mar. 1965.
[42] B. R. Myers, "Nullor model of the transistor," Proc. IEEE, vol. 53, no. 7, pp. 758–759, July 1965.
[43] L. T. Bruton, RC-Active Circuits, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1980.
[44] S. D'Angelo and V. Välimäki, "Wave-digital polarity and current inverters and their application to virtual analog audio processing," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2012, pp. 469–472.
[45] A. B. Yildiz, "Modified nodal analysis-based determination of transfer functions for multi-inputs multi-outputs linear circuits," Automatika, vol. 51, no. 4, pp. 353–360, 2010.
[46] U. Zölzer, Digital Audio Signal Processing, John Wiley & Sons, Ltd., 2nd edition, 2008.
In the digital simulation of non-linear audio effect circuits, the arising non-linear equation system generally poses the main challenge for a computationally cheap implementation. As the computational complexity grows super-linearly with the number of equations, it is beneficial to decompose the equation system into several smaller systems, if possible. In this paper we therefore develop an approach to determine such a decomposition automatically. We limit ourselves to cases where an exact decomposition is possible, however, and do not consider approximate decompositions.

1. INTRODUCTION

Digital emulation of analog circuits for musical audio processing, like synthesizers, guitar effect pedals, or vintage amplifiers, is an ongoing research topic. Various methods exist to derive a mathematical model for an analog circuit in a systematic, hence automatable way, most notably wave digital filters [1, 2, 3], port-Hamiltonian approaches [4, 5], and state-space based approaches [6, 7]. In the following, we will focus on the method of [7], as it has no limitations concerning the circuit topology and is general enough to also handle pathological circuits or element models (e.g. [8]). However, the underlying ideas should be equally applicable to the other approaches.

The downside of such automated approaches, and of [7] in particular, is that they often will lead to one large system of non-linear equations, collected from all the circuit's non-linear elements. If possible, however, it is usually more efficient to solve many small equation systems instead of a single large one. Typically, the convergence of small systems will be better, allowing for a smaller number of iterations in an iterative solver. And more tangibly, the complexity of solving the linear equation in e.g. the Newton method scales with the square of the number of involved equations, so that for a system of size N, the asymptotic complexity per iteration is O(N²), while for N systems of size 1, it is O(N).

We therefore develop a method to decompose a non-linear equation system into smaller subsystems. We limit ourselves to the case where this is possible without resorting to approximations, although this unfortunately precludes the method from being applied to many circuits of practical relevance, especially those with global feedback paths. But while automatically deriving approximate decompositions, e.g. like those of [9, 10, 11], is beyond the scope of this paper, we are confident it is still useful by itself and, furthermore, may form the basis for future methods to automatically find such approximate decompositions.

We shall first provide a short introduction into the method used to obtain the circuit model and the non-linear equation in particular, focusing on the example of Figure 1 (based on "Der Birdie"¹), as also discussed in [12], while the reader is referred to [7] for details.

First, the equations of the individual circuit elements are rewritten in a uniform way in terms of branch voltages and currents and internal states. For example, a resistor with resistance R is described as

    vR + R iR = 0,    (1)

where vR and iR are the voltage across and the current through the resistor, respectively. The discrete-time model² of a capacitor with capacitance C is derived using the bilinear transform as

    [ C ]            [ 0 ]            [ −1/2 ]            [  1/2 ]
    [   ] vC(n)  +   [   ] iC(n)  +   [      ] xC(n)  =   [      ] xC(n−1),    (2)
    [ 0 ]            [ 1 ]            [ −1/T ]            [ −1/T ]

where T denotes the sampling interval and n the current time step, vC and iC are again the voltage across and the current through the capacitor, respectively, and the state xC corresponds to the capacitor's charge. Voltage sources are easily expressed using a non-zero right-hand side, e.g. as

    vVCC + 0 · iVCC = 9 V    (3)

for the supply voltage source VCC, with a constant voltage vVCC = 9 V across it, driving an arbitrary current iVCC.

The non-linear equation of non-linear elements is expressed in terms of an auxiliary vector q, which is related to voltages and currents (and potentially also states) through a linear equation. Thus, a diode is expressed using the linear equation

    [ 1 ]            [ 0 ]            [ −1   0 ]            [ 0 ]
    [   ] vD(n)  +   [   ] iD(n)  +   [        ] qD(n)  =   [   ],    (4)
    [ 0 ]            [ 1 ]            [  0  −1 ]            [ 0 ]

fixing qD,1 = vD as the voltage across and qD,2 = iD as the current through the diode, and the non-linear Shockley equation, rewritten to implicit form as

    fD(qD) = Is · ( e^( qD,1 / (η vT) ) − 1 ) − qD,2 = 0,    (5)

¹ http://diy.musikding.de/wp-content/uploads/2013/06/birdieschalt.pdf
² In [7], the equations are first derived in the continuous-time domain and then transformed to discrete time using the bilinear transform. Equivalently, the bilinear transform may be applied per element, which we do here for brevity's sake.
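The capacitor model (2) is just the trapezoidal (bilinear) rule in disguise, which can be checked by iterating its two rows directly. This is a small sketch; the reading of (2) given in the comments is a reconstruction, and the component and signal values are arbitrary:

```python
# Row 1 of (2):  C*v(n) - x(n)/2 = x(n-1)/2  ->  v(n) = (x(n) + x(n-1)) / (2C)
# Row 2 of (2):  i(n) - x(n)/T = -x(n-1)/T   ->  x(n) = x(n-1) + T*i(n)
C, T = 100e-9, 1 / 44100.0   # arbitrary capacitance and sampling interval

def step(x_prev, i_n):
    """Advance the capacitor state by one sample per (2)."""
    x_n = x_prev + T * i_n
    v_n = (x_n + x_prev) / (2 * C)
    return x_n, v_n

current = [1e-3, -0.5e-3, 0.25e-3, 0.0, 2e-3]   # arbitrary current samples

x, v_model = 0.0, []
for i_n in current:
    x, v_n = step(x, i_n)
    v_model.append(v_n)

# Direct trapezoidal integration of v' = i/C with v(0) = 0, i(0) = 0:
v_trap, v, i_prev = [], 0.0, 0.0
for i_n in current:
    v += T / (2 * C) * (i_n + i_prev)
    i_prev = i_n
    v_trap.append(v)

assert all(abs(a - b) < 1e-9 for a, b in zip(v_model, v_trap))
print("capacitor model (2) reproduces the trapezoidal rule")
```

The same per-element discretization pattern applies to inductors, with the roles of voltage and current exchanged.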
    [ 1 0 ]            [ 0 0 ]            [ −1  0  0  0 ]            [ 0 ]
    [ 0 1 ]            [ 0 0 ]            [  0 −1  0  0 ]            [ 0 ]
    [ 0 0 ] vT(n)  +   [ 1 0 ] iT(n)  +   [  0  0 −1  0 ] qT(n)  =   [ 0 ],    (6)
    [ 0 0 ]            [ 0 1 ]            [  0  0  0 −1 ]            [ 0 ]

where vT and iT hold the voltages across and currents through the base-emitter and base-collector branch in that order, respectively, and the non-linear equation is obtained by rewriting the Ebers-Moll equation to implicit form as

    fT(qT) = [ IsE (e^(qT,1/(ηE vT)) − 1) − (βr/(1+βr)) IsC (e^(qT,2/(ηC vT)) − 1) − qT,3 ]    [ 0 ]
             [ −(βf/(1+βf)) IsE (e^(qT,1/(ηE vT)) − 1) + IsC (e^(qT,2/(ηC vT)) − 1) − qT,4 ] =  [ 0 ],    (7)

[Figure 1 residue: treble booster schematic with input vin, components C1, C3, R1, R2, R4, P1, a transistor T, and output y.]

Figure 1: Example treble booster circuit.

Table 1: Component/parameter values for the circuit of Figure 1.

    comp.   value        comp.   param.   value
    VCC     9 V          D       Is       350 pA
    R1      1 MΩ         D       η        1.6
    R2      43 kΩ        T       IsE      64.53 fA
    R3      430 kΩ       T       IsC      154.1 fA
    R4      390 Ω        T       ηE       1.06
    R5      10 kΩ        T       ηC       1.10
    P1      100 kΩ       T       βf       500
    C1      2.2 nF       T       βr       12
    C3      2.2 nF
    C5      100 µF

    … Ti i = 0,    (9)
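A single non-linear subequation such as the diode's (5) is cheap to solve with a few Newton iterations. In this sketch, Is and η come from Table 1, while the thermal voltage vT ≈ 25.85 mV and the Thévenin source (v0, R) standing in for the linear part of the circuit equations are made-up assumptions; the generic Newton loop is likewise not the paper's solver:

```python
import math

Is, eta, vT = 350e-12, 1.6, 25.85e-3   # diode parameters (Table 1) + assumed vT
v0, R = 5.0, 10e3                       # hypothetical Thevenin source seen by the diode

def f(vd):
    """Implicit diode equation (5) with q_D2 eliminated via q_D2 = (v0 - vd)/R."""
    return Is * (math.exp(vd / (eta * vT)) - 1) - (v0 - vd) / R

def fprime(vd):
    return Is / (eta * vT) * math.exp(vd / (eta * vT)) + 1 / R

vd = 0.5  # initial guess; f is convex, so Newton converges from here
for _ in range(50):
    step = f(vd) / fprime(vd)
    vd -= step
    if abs(step) < 1e-12:
        break

print(f"v_D = {vd:.4f} V, i_D = {(v0 - vd) / R * 1e6:.1f} uA")
```

Solving one scalar equation like this per subsystem, in sequence, is exactly the payoff of the decomposition: each Newton step costs O(1) instead of a factorization of the full coupled system.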
such that zm has as many entries as f̃m(q̃m), and let Fq,m,n denote the corresponding columns of Fq,m, that is

$$F_q = \begin{bmatrix} F_{q,1,1} & \cdots & F_{q,1,M} \\ \vdots & \ddots & \vdots \\ F_{q,M,1} & \cdots & F_{q,M,M} \end{bmatrix} \qquad (25)$$

such that

$$\tilde{q}_m = D_{q,m} x + E_{q,m} u + q_{0,m} + F_{q,m,1} z_1 + \cdots + F_{q,m,M} z_M. \qquad (26)$$

Now, let the element groups be chosen such that Fq,m,n = 0 for n > m. Then

$$\tilde{q}_1 = D_{q,1} x + E_{q,1} u + q_{0,1} + F_{q,1,1} z_1 \qquad (27)$$

so that z1 can be obtained by solving f̃1(q̃1) = 0. With z1 known, z2 can then be obtained by solving f̃2(q̃2) = 0, and so forth up to M. Thus, if Fq is such that a partitioning with Fq,m,n = 0 for n > m (and M > 1) exists, we can decompose the non-linear equation into individually solvable subequations, where in general the m-th subequation depends on the solution of the m−1 previous subequations.

Of course, we may rarely be lucky and find Fq to be suitable for this decomposition. However, [7] leaves some freedom in the exact choice of the coefficient matrices. In particular, only the space spanned by Fq z is of importance, which does not change if we substitute Fq ← Fq R for a regular matrix R. (This will, of course, change the resulting z, so Fv, Fi, and C have to be updated in the same way.) Now, finding an R such that the upper right subblocks of Fq become zero is similar to bringing Fq^T into upper triangular form, but with respect to the rectangular subblocks Fq,m,n. Thus, if the chosen decomposition into subequation groups allows a suitable choice of Fq, it can be found using standard tools of linear algebra, e.g. Gaussian elimination.

3.2. Identification of a suitable grouping

The remaining question is how to determine a suitable grouping. If fq and q were ordered in correspondence with the yet-to-be-found grouping, i.e. fulfilling (21) and (22), we could just determine R to eliminate the maximum number of elements in the upper right part of Fq and examine the zero-structure thus obtained. Unfortunately, the number of possible permutations of the entries in fq and q grows too fast with N to make trying all of them feasible. E.g. for N = 10 non-linear elements, we would need to examine N! = 3 628 800 permutations.

We can do a little better than that by greedily trying to separate a single subgroup by trying all 2^N − 1 non-empty subsets of […] pre-computation step may be well acceptable. E.g. for the N = 10 case, at most 1023 trials would be needed.

3.3. Dimensionality reduction of the input vector

After the decomposition, we can obtain zm from f̃m(q̃m) = 0, which depends on x, u, and z1, …, zm−1, a potentially high number of input values. This would pose a major problem if one would like to tabulate precomputed values in a look-up table. However, the method proposed in [12] can easily be extended to treat not only x and u as inputs, but also z1, …, zm−1, to find an index vector pm of minimal dimension that can be used as input.

The idea is to apply a rank factorization to the matrix

$$\begin{bmatrix} D_{q,m} & E_{q,m} & F_{q,m,1} & \cdots & F_{q,m,m-1} \end{bmatrix} = Q_m \cdot \begin{bmatrix} \hat{D}_{q,m} & \hat{E}_{q,m} & \hat{F}_{q,m,1} & \cdots & \hat{F}_{q,m,m-1} \end{bmatrix} \qquad (28)$$

such that $[\hat{D}_{q,m} \;\; \hat{E}_{q,m} \;\; \hat{F}_{q,m,1} \;\; \cdots \;\; \hat{F}_{q,m,m-1}]$ has a minimal number of rows. Then

$$p_m = \hat{D}_{q,m} x + \hat{E}_{q,m} u + \hat{F}_{q,m,1} z_1 + \cdots + \hat{F}_{q,m,m-1} z_{m-1} \qquad (29)$$

is the index vector of minimal dimension to be used in

$$\tilde{q}_m = q_{0,m} + Q_m p_m + F_{q,m,m} z_m. \qquad (30)$$

4. EXAMPLES

4.1. Treble booster

As a first, relatively trivial example, we consider the treble booster circuit of Figure 1 with the component values given in Table 1. The circuit contains two non-linear elements, a diode and a transistor. But note that the former's sole purpose is to protect the circuit against connecting the power supply with wrong polarity. Usually, one would omit the diode from the simulation as it has no influence on the output signal. Here, we include it to verify that we can then eliminate it algorithmically.

We choose to put the diode first, so that q1 has two elements and f1(q1) has one (see (5)), and q2 has four elements and f2(q2) has two (see (7)). Using the ACME implementation of [7], we find

$$F_q = \begin{bmatrix}
0 & 0 & 0 \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
0 & -2.917\times 10^{-4} & 9.705\times 10^{-5} \\
0 & 9.705\times 10^{-5} & -1.054\times 10^{-4}
\end{bmatrix}$$

where the first two rows belong to the diode D1 and the last four rows to the transistor T1.
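Assuming a block lower-triangular Fq has been found, the sequential solution of Sec. 3.1 is straightforward to sketch. The interface below (residual functions returning value and Jacobian, pre-assembled base vectors collecting Dq,m x + Eq,m u + q0,m) is a hypothetical simplification for illustration, not the ACME API:

```python
import numpy as np

def solve_decomposed(f_blocks, Fq_blocks, base_blocks, z_dims, newton_steps=25):
    """Sequentially solve the subequations f_m(q_m) = 0 of a block
    lower-triangular decomposition (Sec. 3.1).

    f_blocks[m](q) returns (residual, Jacobian w.r.t. q); Fq_blocks[m][n] is
    the sub-block F_{q,m,n} (zero for n > m); base_blocks[m] collects the
    terms D_{q,m} x + E_{q,m} u + q_{0,m} that do not depend on z.
    """
    z = []
    for m, f in enumerate(f_blocks):
        qbase = base_blocks[m].astype(float)
        for n in range(m):                       # known solutions of earlier groups
            qbase = qbase + Fq_blocks[m][n] @ z[n]
        zm = np.ones(z_dims[m])                  # Newton iteration on f_m
        for _ in range(newton_steps):
            residual, jac_q = f(qbase + Fq_blocks[m][m] @ zm)
            zm = zm - np.linalg.solve(jac_q @ Fq_blocks[m][m], residual)
        z.append(zm)
    return np.concatenate(z)
```

For instance, with f1(q) = q² − 4 solved first and f2(q) = q − 3 coupled to it through Fq,2,1 = [1], the solver returns z1 = 2 followed by z2 = 1.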
[Figure 2: schematic of the overdrive circuit; graphic with component values not reproduced.]

Figure 3: Power supply circuitry omitted from Figure 2 for (a) main supply voltage, (b) bias voltage, (c) simplified bias voltage.
4.2. Overdrive

As a second example, we consider the more complex overdrive circuit of Figure 2 (based on “Der Super Over”⁴). Again, there is a diode anti-parallel to the supply voltage source as shown in Figure 3a which we include in the model. However, we simplify the bias voltage Vb generation; while the original circuit contains a voltage divider and a stabilizing capacitor (see Figure 3b), we enforce a constant bias voltage by directly connecting an ideal voltage source (see Figure 3c). We assume the operational amplifiers ideal, leading to a model with six non-linear elements: the protective diode anti-parallel to the supply voltage, three diodes in the feedback around the first operational amplifier, and two transistors. The non-linear equation system f(q) = 0 therefore comprises eight equations (one per diode, two per transistor), while vector q has 16 entries (two per diode, four per transistor).

We again choose to put the protective diode first, then proceed left to right, and the ACME implementation of [7] yields

$$F_q = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & * & * & 0 & 0 & 0 & 0 & 0 \\
0 & * & * & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & * & * & * & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & * & * \\
0 & * & * & * & 0 & 0 & * & *
\end{bmatrix} \qquad (32)$$

where entries marked * are non-zero, rows 1–2 belong to D4, rows 3–6 to Q1, rows 7–8 to D2, rows 9–10 to D1, rows 11–12 to D3, and rows 13–16 to Q2.
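The greedy grouping search of Sec. 3.2, which examines the 2^N − 1 non-empty subsets of nonlinear elements as candidates for a separable subgroup, can be sketched as a simple generator (illustrative names, not the ACME implementation):

```python
from itertools import combinations

def candidate_subgroups(n_elements):
    """Enumerate all 2**n - 1 non-empty subsets of the nonlinear elements,
    smallest first, as candidates for a separable subgroup (Sec. 3.2)."""
    for size in range(1, n_elements + 1):
        for subset in combinations(range(n_elements), size):
            yield subset

# For N = 10 elements, at most 2**10 - 1 = 1023 candidates are examined,
# far fewer than the 10! = 3 628 800 orderings of a brute-force search.
```

Enumerating smallest subsets first matches the goal of splitting off subgroups that are as small as possible.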
The protective diode can again be extracted without problems. Continuing by just considering the remaining matrix, the first transistor can likewise be extracted without the need to modify Fq. Now looking at the remaining submatrix

$$\begin{bmatrix}
-1 & -1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
* & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & * & * \\
* & 0 & 0 & * & *
\end{bmatrix}$$

for the three diodes and the second transistor (rows 1–2 belong to D2, rows 3–4 to D1, rows 5–6 to D3, and rows 7–10 to Q2), none of the three diodes can be extracted by itself. Each of the three row pairs belonging to the three diodes obviously forms a matrix of rank 2, so we cannot possibly find a regular matrix with which to multiply from the right to zero out all but one of the columns. Likewise, the four lowest rows, belonging to the transistor, obviously form a rank-3 matrix, from which we cannot cancel all but two columns. As no single element can be extracted, the next thing to try is to extract pairs of elements. Again, none of the six possible pairs turn out to be extractable. But continuing with groups of three, the three diodes can be extracted as one group, without requiring modifications of Fq.

We thus arrive at the decomposition

$$F_q = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & * & * & 0 & 0 & 0 & 0 & 0 \\
0 & * & * & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & * & * & * & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & * & * \\
0 & * & * & * & 0 & 0 & * & *
\end{bmatrix}, \qquad (33)$$

with the same row grouping as in (32) (rows of D4, Q1, D2, D1, D3, and Q2, top to bottom), comprising four subsystems of one, two, three, and two equations, respectively. Unlike the first example, the sub-diagonal blocks do contain non-zero entries, so the solution of earlier equations is required for the later ones (with the exception of the protective diode). Looking at the index vectors obtained with the method of [12], p1 again has no entries, so the solution to f̃1(q̃1) = 0 can be pre-computed off-line. Furthermore, p2 and p4 both have two entries and p3 just one, greatly simplifying the construction of lookup tables compared to a p with five entries for the original (not decomposed) system.

Performing the same performance evaluation as for the treble booster example above, the processing time of the original model is determined as 6.82 s with 2.965 Newton iterations needed on average per sample. Decomposing the model reduces the number of iterations needed to 1.460, 2.924 and 1.696 for the three subequations, respectively. The needed processing time, however, is reduced insignificantly, to 6.79 s. Nevertheless, the advantage of making lookup tables feasible remains.

Unfortunately, the obtained results depend on the simplification of fixing the bias voltage Vb. With the original circuitry, all nodes connected to Vb influence each other, effectively creating a global feedback path, and the only decomposition possible is the extraction of the protective diode of the main power supply. In terms of the Fq matrix, this manifests itself in additional entries in the fourth and last column that cannot be canceled and preclude further decomposition.

5. IMPLEMENTATION ASPECTS

The proposed method has been implemented as part of the ACME framework. One of the main challenges encountered is the condition Fq,m,n = 0 which, when using floating point arithmetic, will usually only be fulfilled approximately. The obvious approach then is to treat entries with very small absolute value as zero. However, determining an appropriate threshold proves to be anything but trivial, as the scale of the entries in both q and z can differ by orders of magnitude from each other. So instead, we employ exact arithmetic, using rationals of arbitrary precision integers. This is possible as the whole model derivation process only needs a bounded number of additions, subtractions, multiplications, and divisions, so that the numerators and denominators may grow very large, but are still bounded. Of course, for the simulation itself, the values are converted to floating point for efficiency.

Unfortunately, the benefits reaped from the decomposition in terms of run-time turn out to be smaller than expected; sometimes, the decomposed model may even run slightly slower than the original one. The reason seems to be constant overhead that may dominate over the asymptotic behavior for small problem sizes. In an earlier implementation, where LAPACK routines were used for linear equation solving, this effect was even more pronounced. The constant calling overhead for the LAPACK routines is significant: solving a system of size eight takes less than five times as long as solving a system of size one. We hope for a future, optimized implementation to further reduce these overheads, however.

6. CONCLUSION

The proposed method is able to decompose a system of non-linear equations into a number of smaller subsystems if that is possible in an exact way. This may lead to more efficient simulations in terms of computational load (when using iterative solvers) or memory requirements (when using lookup tables). However, the time needed to find this decomposition during model creation grows exponentially with the number of non-linear circuit elements. Fortunately, typically modeled audio effect circuits do not contain enough non-linear elements to make the approach infeasible.

The present paper does not, however, tackle the more challenging problem of automatically finding an approximate decomposition, which may either give a solution of sufficient accuracy for the complete system directly, or could at least be used to find a good initial solution for an iterative solver then applied to the complete system. We hope this paper to be a valuable step in that direction, however. E.g. for the overdrive circuit, numerical analysis could reveal that the bias voltage is almost constant, and that in fact fixing it at a constant does not change the output in a significant way. Based […]
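The role of exact rational arithmetic described in Sec. 5 can be illustrated with Python's built-in Fraction type: after a row operation, a canceled entry is exactly zero, so no floating-point threshold is needed. This is only an illustrative sketch, not the ACME implementation:

```python
from fractions import Fraction

def eliminate_below(M, pivot_row, col):
    """Zero out column `col` below `pivot_row` by row operations, in exact
    rational arithmetic (M is a list of rows of Fraction entries)."""
    pivot = M[pivot_row][col]
    assert pivot != 0
    for r in range(pivot_row + 1, len(M)):
        factor = M[r][col] / pivot
        M[r] = [a - factor * b for a, b in zip(M[r], M[pivot_row])]

M = [[Fraction(1, 3), Fraction(1, 1)],
     [Fraction(1, 7), Fraction(1, 2)]]
eliminate_below(M, 0, 0)
# The eliminated entry is exactly zero -- no "small enough" test required:
assert M[1][0] == 0 and M[1][1] == Fraction(1, 14)
```

With floating point, the same elimination would leave a residue on the order of machine epsilon, whose classification as "zero" would require exactly the threshold choice the paper avoids.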
7. REFERENCES
DAFX-7
DAFx-144
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
ABSTRACT

The voltage-controlled low-pass filter of the Korg MS-50 synthesizer is built around a non-linear diode bridge as the cutoff frequency control element, which greatly contributes to the sound of this vintage synthesizer. In this paper, we introduce the overall filter circuitry and give an in-depth analysis of this diode bridge. It is further shown how to turn the small signal equivalence circuit of the bridge into the necessary two-resistor configuration to uncover the underlying Sallen-Key structure.

In a second step, recent advances in the field of WDFs are used to turn a simplified version of the circuit into a virtual-analog model. This model is then examined both in the small-signal linear domain as well as in the non-linear region with inputs of different amplitudes and frequencies to characterize the behavior of such diode bridges as cutoff frequency control elements.

1. INTRODUCTION

Modeling of electrical circuits used in music and audio devices is a topic of ongoing research in the audio signal processing domain. One particular method for modeling such circuits is the Wave Digital Filter (WDF) approach [1]. Recent work sought to expand the class of circuits that can be modeled using WDFs, focusing mainly on topological aspects of circuits [2, 3], iterative and approximative techniques for non-linearities [4, 5], methods for op-amp-based circuits [6, 7] and diode clippers [8, 9].

These developments enabled modeling a large number of circuits that were previously intractable, including guitar tone stacks [2], transistor-based guitar distortion stages [3], transistor [2] and triode [10] amplifiers, the full Fender Bassman preamp stage [11], and the bass drum from the Roland TR-808 [12].

In this paper, we examine the low-pass filter of the MS-50 synthesizer manufactured by Korg. This synthesizer was released in 1978 and intended as a companion for the more famous MS-20 model. The MS-50's filter is known primarily for exhibiting strongly nonlinear behaviors. These behaviors are due to the usage of a diode ring (similar to that seen in ring-modulators) as the cutoff controlling element in the circuit, which generates a more and more complex spectrum with rising input signal level.

Diode rings in the modulator context have been previously examined as a simplified digital model with static non-linearities [13] and as a WDF [14]. Such configurations within a filter structure have not been studied, and it is that which motivates this work.

The following paper is structured as follows: Section 2 describes the Korg MS-50 voltage-controlled low-pass filter (VCF) circuit [15] and its core element, the diode bridge (Sec. 2.1), which is used to control the cutoff frequency. Based on a small signal linearization of this diode bridge, the Sallen-Key structure [16] of the circuit is pointed out (Sec. 2.2). In Section 3, the original circuit is reduced to the diode bridge and surrounding filter components and turned into a WDF structure. Section 4 assesses transfer functions and nonlinear behaviors of a WDF implementation in the RT-WDF framework [17] and compares them with a SPICE [18] simulation. It also evaluates the performance of the iterative Newton-Raphson solver [4] and gives an estimate on the expected real-time requirements. Section 5 concludes the work presented here.

2. KORG MS-50 FILTER CIRCUIT

As mentioned above, in contrast to the MS-20, whose VCF is analyzed in great detail in [19], the MS-50 features a rather non-standard VCF low-pass circuit based on a biased diode bridge as the voltage controllable element to vary the cutoff frequency. This configuration has been used in Korg's 700, 700S, 800DV and 900PS synthesizers before and was covered by patent US 4039980 [20]. A contemporary reinterpretation can be found in the AMSynths Eurorack module AM8319 Diode Ring VCF [21].

Figure 1 shows the original schematic of the filter [15]. The diode bridge is based on the RCA CA 3019 diode array [22], which features a total of six matched silicon diodes. Left of the diode bridge is the biasing circuitry and input stage. Starting from the very left of the figure, voltages from external cutoff control, manual cutoff control and a temperature compensation are added and conditioned by IC7 into a symmetric positive and negative biasing voltage, which is fed into the two adjacent ends of the bridge. The input signal is buffered by a unity gain amplifier made of IC6 and capacitively coupled into the input node of the bridge. Right of the bridge is the actual low-pass filter circuitry. The output signal of the diode bridge is fed into a Sallen-Key filter structure [16] with controllable resonance, high gain and amplitude limiting clipping diodes D5–D12 built around IC4. The output of this op-amp is then fed back into two additional taps in the diode bridge, which forms the necessary feedback path. This structure will be analyzed in greater detail in Section 2.2. In the rightmost part of the schematic, the signal is then differentially taken from two points in the output circuitry of the filter and passed through a DC blocking stage to the signal output. The nested configuration of diode bridge non-linearities in the feedback path of the op-amp IC4 makes the MS-50 filter well suited for a topologically preserving modeling approach with WDFs.

2.1. Diode Bridge

The diode bridge around IC5 in the filter structure acts as a controllable impedance element for both the input signal and the feedback path. This technique is already pointed out in the component's application note [23], and is consistent with the use of such circuits as a signal multiplier in radio applications.
$$i_D \approx I_b \left( 1 + \frac{v_s}{n\,v_T} \right) - I_S = I_b + \frac{I_b\,v_s}{n\,v_T} - I_S = I_b + \frac{v_s}{r_d} - I_S \qquad (3)$$
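The linearization in (3) is easily checked numerically against the exact Shockley law. The diode parameter values below (IS, n, vT) are illustrative placeholders, not values from the paper:

```python
import math

# Numerical check of the linearization in (3); parameters are illustrative.
IS, n, vT = 1e-9, 1.8, 0.026

def diode_current(vD):
    """Exact Shockley law: iD = IS * (exp(vD / (n*vT)) - 1)."""
    return IS * (math.exp(vD / (n * vT)) - 1.0)

def linearized_current(vb, vs):
    """iD ~= Ib + vs/rd - IS for a small excursion vs around the bias vb."""
    Ib = IS * math.exp(vb / (n * vT))   # bias current
    rd = n * vT / Ib                    # dynamic resistance rd = n*vT/Ib
    return Ib + vs / rd - IS

vb, vs = 0.3, 0.5e-3                    # 0.5 mV excursion, well below n*vT
exact = diode_current(vb + vs)
approx = linearized_current(vb, vs)
```

For vs much smaller than n·vT the two agree to a small fraction of a percent, which is why the bridge diodes stay in an approximately linear region for small signals (cf. Sec. 4).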
[Small-signal equivalent circuits of the diode bridge and their stepwise reduction to the two-resistor configuration (rA‖rB, rC‖rD); graphics not reproduced.]

Figure 4: The filter's simplified, linearized Sallen-Key structure.
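In the linearized structure of Figure 4, the series resistances are the bias-dependent parallel diode resistances rA‖rB and rC‖rD, so the cutoff frequency tracks the bridge bias current. The dependence follows the standard unity-gain Sallen-Key relations, sketched here with generic component names rather than the paper's exact mapping:

```python
import math

def sallen_key_lowpass(R1, R2, C1, C2):
    """Cutoff frequency (Hz) and quality factor of a unity-gain Sallen-Key
    low-pass stage with series resistors R1, R2, feedback capacitor C1 and
    grounded capacitor C2 (generic textbook formulas)."""
    w0 = 1.0 / math.sqrt(R1 * R2 * C1 * C2)          # natural frequency (rad/s)
    Q = math.sqrt(R1 * R2 * C1 * C2) / (C2 * (R1 + R2))
    return w0 / (2.0 * math.pi), Q

# Equal components give the classic unity-gain result Q = 1/2:
f0, Q = sallen_key_lowpass(10e3, 10e3, 10e-9, 10e-9)
```

Since the diode resistances scale as rd = n·vT/Ib, raising the bias current lowers R1 and R2 and hence raises the cutoff frequency.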
[Schematic of the simplified circuit (Fig. 5) with component values, and magnitude-response plot axis (Mag./dB); graphics not reproduced.]
4. RESULTS

The performance of the model is verified in three ways. Firstly, its linear behavior is measured by taking a small-signal impulse of 50 mV and processing it with both the WDF model and an equivalent SPICE model of the circuit from Fig. 5 in LTspice. This impulse produces a voltage on the right end of C16 of around 0.5 mV ≪ n·vT, which keeps the diodes in an approximately linear region (see Sec. 2.1). The produced response thus characterizes the filter behavior in the linear region. Fig. 6 shows the magnitude responses of several such impulses for constant resonance setting and sweeping cutoff frequency. The match between the described WDF model and SPICE is good, with the exception of some small anomalies in the SPICE model, caused by the resampling algorithm used. The sampling rate of the WDF simulation was 176.4 kHz.

Secondly, the nonlinear behavior of the model is tested by examining its resonant behavior when driven by signals of varying amplitude. A sawtooth signal of 50 Hz is chosen for this purpose. Fig. 8 shows the output of the model and the equivalent SPICE model. Clearly visible with increasing input amplitude is the exaggeration of the initial transient of the sawtooth waveform, while the following resonant behavior stays relatively constant. The output signals of the presented WDF model and SPICE are very close.

In order to further examine the variation in resonance frequency and amplitude with input level, a further test was performed. The filter was given a resonance setting of 0.85, which produces self-oscillation. The filter input was then driven with a ramp waveform, peaking at 2 V, and the changing character of the self-oscillation observed. Fig. 9 shows the results of this process. The instantaneous frequency and amplitude of the output are estimated using a Hilbert transform of the output signal. The estimates are smoothed to remove higher-frequency components that relate to the waveshape of the self-oscillation rather than the fundamental. The instantaneous frequency of the output is seen to decrease as the input ramp increases in voltage and thus re-biases the diode bridge, starting at approximately 250 Hz and falling to approximately 200 Hz. The instantaneous amplitude of the output drops as the input voltage increases, with the self-oscillation almost completely damped as the input gets close to 2 V. Agreement between the SPICE and WDF models is close by these measures.

4.1. Performance

The maximum count of iterations for the six-dimensional non-linear system in the Newton solver with a stopping criterion of ‖F‖ ≤ 1.5×10⁻⁹ did not exceed 2 at any time during calculation of the results presented here. No damping steps were needed. An increase of iterations was mostly noticed on sharp transients in the input signal and zero crossings of the output signal, both of which cause switching between the diodes.
[WDF adaptor structure of the simplified circuit (series adaptors S1–S4, parallel adaptors P1–P6); graphic not reproduced.]
0
-2
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05
Output/V
2
0
-2
-4
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05
5
0
-5
0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05
Time/s
Figure 8: Output of model when driven by a 50 Hz sawtooth waveform of peak-to-peak amplitude of top: 1 V middle: 2 V, bottom: 3 V.
SPICE results are shown with dots. Resonance is set to 0.8, and the bias voltage is 0.2 V. The input sawtooth is shown as a reference.
In terms of computational load, the implementation in RT-WDF [17] consumes approx. 80% of one core at a sampling rate of 176.4 kHz on a laptop computer from 2013 with an Intel i7 2.4 GHz CPU (4 cores) and 8 GB RAM running OS X 10.11.4. The wdfRenderer [17] was built with Apple LLVM 7.1 at optimization level -O3 and ran with a single rendering thread without any other considerable applications in the background or performance tweaking.

5. CONCLUSION

The circuit of the Korg MS-50 voltage-controlled filter was examined, and shown to fundamentally be a form of the Sallen-Key topology, controlled by a highly non-linear diode bridge. A Wave Digital Filter model of a simplified circuit was presented to highlight the influence of the diode bridge on the filter behavior and shown to match in most parts a reference implementation in SPICE. For small-signal linear frequency analysis, small differences are mostly seen in terms of frequency warping effects in the WDF and aliasing artifacts in the SPICE results. Nonlinear frequency and amplitude variations exhibited by the filter are examined, with close agreement between the SPICE reference model and the WDF model shown. The model was implemented using the RT-WDF C++ framework, and is able to be run in real time. Future research should analyze the original filter circuit behavior including the amplitude-limiting clipping diodes and take advantage of the simplified diode bridge structure from Sec. 2 for a simplified model.

6. ACKNOWLEDGMENTS

Many thanks to Ross W. Dunkel for delightful initial discussions! Also, thanks to all reviewers for detailed comments.

7. REFERENCES

[1] A. Fettweis, “Wave digital filters: Theory and practice,” Proc. IEEE, vol. 74, no. 2, Feb. 1986.

[2] K. J. Werner, J. O. Smith III, and J. S. Abel, “Wave digital filter adaptors for arbitrary topologies and multiport linear
Figure 9: Output of models when set to self-oscillate with resonance of 0.85 and bias voltage of 0.2 V, driven by a slow 2 V ramp input. The input ramp vin, model output vout, and the estimated instantaneous frequency and amplitude of vout are shown. SPICE results are shown with dots.
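The instantaneous-frequency and -amplitude estimation used for Figure 9 can be sketched with an FFT-based Hilbert transform; the smoothing step described in Sec. 4 is omitted in this generic implementation:

```python
import numpy as np

def inst_freq_amp(x, fs):
    """Instantaneous amplitude and frequency of a real signal x via the
    analytic signal (FFT-based Hilbert transform)."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)                              # spectral mask for the
    h[0] = 1.0                                   # analytic signal
    if N % 2 == 0:
        h[N // 2] = 1.0
        h[1:N // 2] = 2.0
    else:
        h[1:(N + 1) // 2] = 2.0
    z = np.fft.ifft(X * h)                       # analytic signal
    amp = np.abs(z)                              # instantaneous amplitude
    freq = np.diff(np.unwrap(np.angle(z))) * fs / (2.0 * np.pi)
    return amp, freq
```

Applied to a pure sinusoid, the estimates recover its amplitude and frequency exactly; for the self-oscillating filter output, the phase-derivative estimate additionally picks up waveshape-related ripple, which is why the paper smooths the estimates before plotting.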
elements,” in Proc. Int. Conf. Digital Audio Effects (DAFx-15), Trondheim, NOR, Nov. 30 – Dec. 3 2015.

[3] K. J. Werner, V. Nangia, J. O. Smith III, and J. S. Abel, “Resolving wave digital filters with multiple/multiport nonlinearities,” in Proc. Int. Conf. Digital Audio Effects (DAFx-15), Trondheim, NOR, Nov. 30 – Dec. 3 2015.

[4] M. J. Olsen, K. J. Werner, and J. O. Smith III, “Resolving grouped nonlinearities in wave digital filters using iterative techniques,” in Proc. Int. Conf. Digital Audio Effects (DAFx-16), Brno, CZ, Sept. 5–9 2016.

[5] A. Bernardini, K. J. Werner, A. Sarti, and J. O. Smith, “Modeling a class of multi-port nonlinearities in wave digital structures,” in Proc. 23rd Europ. Signal Process. Conf. (EUSIPCO), Nice, FR, Aug. 31 – Sept. 4 2015.

[6] K. J. Werner, W. R. Dunkel, M. Rest, M. J. Olsen, and J. O. Smith III, “Wave digital filter modeling of circuits with operational amplifiers,” in Proc. 24th Europ. Signal Process. Conf. (EUSIPCO), Budapest, HU, Aug. 29 – Sept. 2 2016.

[7] R. C. D. Paiva, S. D’Angelo, J. Pakarinen, and V. Valimaki, “Emulation of operational amplifiers and diodes in audio distortion circuits,” IEEE Trans. Circuits Syst. II: Exp. Briefs, vol. 59, no. 10, Oct. 2012.

[8] K. J. Werner, V. Nangia, A. Bernardini, J. O. Smith III, and A. Sarti, “An improved and generalized diode clipper model for wave digital filters,” in Proc. 139th Conv. Audio Eng. Soc. (AES), New York, NY, USA, Oct. 29 – Nov. 1 2015.

[9] A. Bernardini and A. Sarti, “Biparametric wave digital filters,” IEEE Trans. Circuits Syst. I: Reg. Papers, 2017, to be published, DOI: 10.1109/TCSI.2017.2679007.

[10] M. Rest, W. R. Dunkel, K. J. Werner, and J. O. Smith, “RT-WDF—a modular wave digital filter library with support for arbitrary topologies and multiple nonlinearities,” in Proc. Int. Conf. Digital Audio Effects (DAFx-16), Brno, CZ, Sept. 5–9 2016.

[11] W. R. Dunkel, M. Rest, K. J. Werner, M. J. Olsen, and J. O. Smith III, “The Fender Bassman 5F6-A family of preamplifier circuits—a wave digital filter case study,” in Proc. Int. Conf. Digital Audio Effects (DAFx-16), Brno, CZ, Sept. 5–9 2016.

[12] K. J. Werner, Virtual Analog Modeling of Audio Circuitry Using Wave Digital Filters, Ph.D. dissertation, Stanford University, Stanford, CA, USA, Dec. 2016.

[13] J. Parker, “A simple digital model of the diode-based ring-modulator,” in Proc. Int. Conf. Digital Audio Effects (DAFx-11), Paris, FR, Sept. 19–23 2011.

[14] A. Bernardini, K. J. Werner, A. Sarti, and J. O. Smith III, “Modeling nonlinear wave digital elements using the Lambert function,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 63, no. 8, Aug. 2016.

[15] KORG, Model MS-50 Circuit Diagram, Nov. 1978.

[16] R. P. Sallen and E. L. Key, “A practical method of designing RC active filters,” IRE Trans. Circuit Theory, vol. 2, no. 1, Mar. 1955.

[17] M. Rest, W. R. Dunkel, and K. J. Werner, “RT-WDF—a modular wave digital filter library,” 2016–2017, available at https://github.com/RT-WDF, accessed Mar. 12 2017.

[18] A. Vladimirescu, The SPICE Book, John Wiley & Sons, New York, NY, USA, 1994.

[19] T. Stinchcombe, “A study of the Korg MS10 & MS20 filters,” online, Aug. 30 2006, http://www.timstinchcombe.co.uk/synth/MS20_study.pdf, accessed Mar. 12 2017.

[20] Y. Nagahama, “Voltage-controlled filter,” Aug. 2 1977, US Patent 4,039,980.

[21] AMSynths on ModularGrid, “AM8319,” online, Oct. 30 2012, https://www.modulargrid.net/e/amsynths-am8319, accessed June 20 2017.

[22] RCA, CA-3019 Ultra-Fast Low-Capacitance Matched Diodes, datasheet, RCA Corporation, Mar. 1970.
[23] B. Brannon, ICAN-5299 Application of the RCA-CA3019 Integrated-Circuit Diode Array, RCA Linear Integrated Circuits and MOSFETs Applications, RCA Corporation, 1983.

[24] A. S. Sedra and K. C. Smith, Microelectronic Circuits, Oxford University Press, New York, NY, USA, 5th edition, 2004.

[25] J. Parker and S. D’Angelo, “A digital model of the Buchla lowpass-gate,” in Proc. Int. Conf. Digital Audio Effects (DAFx-13), Maynooth, IE, Sept. 2–5 2013.

[26] K. J. Werner, J. S. Abel, and J. O. Smith III, “The TR-808 cymbal: a physically-informed, circuit-bendable, digital model,” in Proc. Int. Comput. Music Conf. (ICMC), Athens, EL, Sept. 14–20 2014.

[27] M. Verasani, A. Bernardini, and A. Sarti, “Modeling Sallen-Key audio filters in the wave digital domain,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), New Orleans, LA, USA, Mar. 5–9 2017.

[28] S. M. Sze, Physics of Semiconductor Devices, John Wiley & Sons, New York, NY, USA, 2nd edition, 1981.
ABSTRACT

The Ebers-Moll model has been widely used to represent Bipolar Junction Transistors (BJTs) in Virtual Analogue (VA) circuits. An investigation into the validity of this model is presented in which the Ebers-Moll model is compared to BJT models of higher complexity, introducing the Gummel-Poon model to the VA field. A comparison is performed using two complementary approaches: on fit to measurements taken directly from BJTs, and on application to physical circuit models. Targeted parameter extraction strategies are proposed for each model. There are two case studies, both famous vintage guitar effects featuring germanium BJTs. Results demonstrate the effects of incorporating additional complexity into the component model, weighing the trade-off between differences in the output and computational cost.

1. INTRODUCTION

Recent developments in processing power have allowed for real-time simulation of many audio circuits at a physical level, prompting the search for component models that achieve the highest accuracy whilst maintaining real-time compatibility. In the field of VA modelling, circuits featuring BJTs have been modelled successfully with a variety of different component models. Simplified models of the BJT are useful in applications featuring many BJTs. A notable case is exemplified in the modelling of the Moog Ladder filter, in which current gain is presumed infinite, reducing a stage of two BJTs to one nonlinear equation [1]. In circuits featuring fewer BJTs, the large-signal Ebers-Moll model has been used [2]. This component model has been used in circuit models of complete guitar pedals, including wah-wah [3] and phaser effects [4].

Despite having been replaced by their silicon counterpart in most areas of electronics, vintage germanium BJTs (GBJTs) have remained consistently popular in guitar pedals, particularly fuzz effects. In previous work, two circuits featuring GBJTs have been studied: the Fuzz-Face [5] and Rangemaster [6] guitar pedals, both using the Ebers-Moll BJT model. However, each differs in how the parameters are derived: the Fuzz-Face using parameters extracted from a datasheet, and the Rangemaster using parameters extracted through an optimisation procedure based on input/output data. Comparisons between the output of the circuits and the models demonstrate a good fit in certain regions of operation, though there are errors present in both models which remain unattributed. The component with the most complex behaviour in each circuit is the BJT, which suggests it as a likely source of error.

[…] model [8], and the MEXTRAM model [9]. These models have not yet featured in the field of VA, leaving the question of what difference they may make.

The aim of this paper is to provide an analysis of GBJTs within audio circuits for the purpose of VA modelling. This analysis consists of two primary sections: firstly the characterisation of models based on measurements, followed by the analysis of each model within the context of an audio circuit model. The analysed models include the Ebers-Moll model and models similar to Gummel-Poon, considering both additional DC and AC effects. Procedures for DC parameter extraction from measurements are discussed for all models. We revisit the case studies already presented: the Dallas Rangemaster Treble Booster and the Dallas Arbiter Fuzz-Face. In order to focus the analysis firmly on the differences between the BJT models, comparisons are made only between circuit models, as any separate circuit measurement would be subject to a range of further system variances and errors.

The rest of the paper is structured as follows: Section 2 describes the compared BJT models, Section 3 discusses extraction procedures for the DC parameters, Section 4 covers the case studies, the methodology, and results of the BJT model comparison, and Section 5 concludes with suggestions for modellers working on circuits featuring GBJTs.

2. BJT MODELS

This section describes the BJT models used in the analysis. Both GBJTs that are studied are PNP, which is reflected in the description of the models. The difference with an NPN model is only in notation, not behaviour.

In this work we define the term ‘external’ to refer to behaviour modelled by additional components, i.e. resistors and capacitors. ‘Internal’ will refer to the remaining terms, modelled as voltage-controlled current sources. This is illustrated by the differences between Figure 1 (a) and (b), in which the BJT in (b) is modelled by (a). External component values are modelled as being independent of the BJT bias point, i.e. constant. Combination of the internal and external components will result in three models: the Ebers-Moll model, a DC Gummel-Poon model, and an AC Gummel-Poon model (including capacitances).

Table 1 provides a reference for the name of each parameter in each model. The effect of changes in each parameter value will be
A step further in complexity from the Ebers-Moll model is discussed through the explanation of the extraction procedure in
a possible solution to this issue. Significant improvements have Section 3. Only necessary discussion is included about each BJT
been published, including: the Gummel-Poon model [7], the VBIC model; for a more comprehensive description see e.g. [10].
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Table 1: List of all parameters and extracted values from both the OC44 and AC128; constraints used in the intermediate optimisation
stages; initial values used for parameters that were not found through direct extraction.
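As a point of reference for the simplest of the three models, a minimal numerical sketch of the large-signal Ebers-Moll equations for a PNP device (in the common transport form) might look as follows. The parameter values are illustrative placeholders, not the extracted values listed in Table 1:

```python
import numpy as np

def ebers_moll_pnp(v_eb, v_cb, Is=1e-6, beta_f=100.0, beta_r=5.0, Vt=0.02585):
    """Large-signal Ebers-Moll currents for a PNP BJT (transport form).

    v_eb, v_cb : emitter-base and collector-base voltages (V)
    Is         : saturation current (A); illustrative value only
    beta_f/r   : forward/reverse current gains (illustrative)
    Vt         : thermal voltage at room temperature (V)
    """
    ict = Is * (np.exp(v_eb / Vt) - np.exp(v_cb / Vt))   # transport current
    ibf = (Is / beta_f) * (np.exp(v_eb / Vt) - 1.0)      # forward base term
    ibr = (Is / beta_r) * (np.exp(v_cb / Vt) - 1.0)      # reverse base term
    i_c = ict - ibr   # collector current
    i_b = ibf + ibr   # base current
    return i_c, i_b
```

In the forward-active region (emitter-base forward biased, collector-base reverse biased) the ratio of collector to base current approaches beta_f, which is the behaviour the extraction procedures below exploit.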
Figure 2: Measurement circuits for parameter extraction. Common-Emitter (left), Forward Gummel (middle), Reverse Gummel (right).
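In the ideal region of the forward-Gummel measurement, the collector current follows the diode law, log Ic = log Is + Veb/(Nf·Vt), so Is and the forward ideality factor Nf are recoverable from a straight-line fit. A minimal sketch (illustrative only; a practical procedure, as in the staged optimisation described later, would first window out the low-current leakage and high-current resistive regions):

```python
import numpy as np

VT = 0.02585  # thermal voltage at room temperature (V)

def extract_is_nf(v_eb, i_c):
    """Extract Is and Nf from forward-Gummel data (arrays of Veb and Ic)
    via a straight-line fit of log(Ic) against Veb in the ideal region:
        log Ic = log Is + Veb / (Nf * VT)
    """
    slope, intercept = np.polyfit(v_eb, np.log(i_c), 1)
    Is = np.exp(intercept)          # intercept gives saturation current
    Nf = 1.0 / (slope * VT)         # slope gives the ideality factor
    return Is, Nf
```

Applied to synthetic ideal-region data, the fit recovers the generating parameters exactly; on measured germanium data it only provides a starting point for the optimisation stages.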
Figure 5: Optimisation strategy used to extract the parameters of the Gummel-Poon model. The arrow indicates the flow of stages of extraction. The resultant parameters of each stage are passed on to the next.

reduces the effects of Is, Vaf, Var, Rc, Nf, Nr, reducing the parameter set and thus the number of dimensions in the optimisation. A second stage was used to further tune a subset of these parameters to the Gummel plots. Finally, all parameters were optimised against all of the data sets. A weighting was applied to the objective function in the final optimisation, making the objective value of the common-emitter characteristic 10× higher than that of the Gummel plots.

3.4. Results

Multiple GBJTs were used in the comparison: 4 AC128 and 3 OC44 BJTs. Figures 6 and 7 show the results of the best fit of the DC Gummel-Poon to the measurements as determined by the objective function of the final optimisation. As these models had the best fit to the data, they were used in the comparison in the case studies. See Table 1 for the parameters of each model.

Thermal effects were noticed in the measurements despite considerations to reduce the effects. Post processing was used to reduce the observable effect of these; however, there remains a possibility that some thermal error remains in the measurements which would affect the extracted parameters.

In the high-current region of the Gummel plots the model deviates from the measurements. During the measurement stage, a current limit was enforced to prevent damage to components or the measurement system, set slightly above the maximum current specified by the GBJT datasheets. This limited the amount of high-current data that could be gathered. Further, the common-emitter plots were taken at low currents to ensure they were close to that of the 'ideal' region of operation, which reduces the amount of data about the high-current region. If the measurements were to

The schematics of the Dallas Rangemaster and Dallas Arbiter Fuzz-Face can be found in Figures 8 and 9 respectively. These case studies were selected because each biases the BJT in different regions, exhibiting different behaviour. Three tests were performed on each case study: informal listening tests, waveform comparisons, and a comparison of the computational requirements. Each circuit was modelled using the Nodal DK-method (for reference see e.g. [3]), although it should be noted that the use of a different simulation technique would yield similar results for both audio and waveform comparisons. Computation time would however require evaluation for different simulation techniques. For each test all potentiometers of both case studies were set to maximum. Other potentiometer settings were tested but are omitted as they illustrate no substantial difference from those presented.

4.1. Informal listening tests

Several guitar signals were processed by both case studies at 8× oversampling as a means of comparing each model. Listeners agreed that differences could be heard between each model, with the Ebers-Moll model having the most high-frequency content due to distortion and the AC Gummel-Poon having the least. Audio examples can be found on the first author's website¹.

¹ http://bholmesqub.github.io/DAFx17/

4.2. Waveform comparison

An objective comparison of each BJT model is achieved here using time-domain waveforms. Sinusoids at different frequencies and amplitudes were processed by both case studies and each model. To remove transient behaviour from the results, the final period of each of these signals is shown in Figures 10 and 11. Plots at 1200 Hz show the largest difference for the AC effects, illustrating the low-pass type behaviour of the capacitances. Differences due to the increased DC complexity are most prominent at lower amplitudes.
(Common-emitter characteristic measured at Ib = 8, 25, 38, 50 μA; simulated and measured curves overlaid.)
Figure 6: Forward and reverse Gummel plots and the common-emitter characteristic of the OC44 BJT.
Figure 7: Forward and reverse Gummel plots and the common-emitter characteristic of the AC128 BJT.
Figure 8: Schematic of the Rangemaster circuit.

Figure 9: Schematic of the Fuzz Face circuit.
Figure 10: Single cycles of the Rangemaster output’s response to a sinusoidal input signal.
(f) A = 100 mV, F = 200 Hz (g) A = 100 mV, F = 500 Hz (h) A = 100 mV, F = 1200 Hz
Figure 11: Single cycles of the Fuzz Face output’s response to a sinusoidal input signal. The fuzz control of the circuit was set to maximum.
nisms, such as the use of semantic descriptors or simplified graphical interfaces. Loviscach [7] for instance describes a control method for equalisation that allows drawing points or using freehand curves instead of setting parameters. Subsequent work [8] provides a creative method to map features to shapes. However, the shapes of this system do not link directly with the "meaning" of settings; rather, they are classifications of the settings. Cartwright and Pardo [9] outline a control strategy using description terms such as warm or muddy, and demonstrate a method of applying high-level semantic controls to audio effects. Wilmering et al. [10] describe a semantic audio compressor that learns to associate control parameters with musical features such as the occurrence of chord patterns. This system provides intelligent recall functionality.

The idea of cross-adaptive audio effects [2] was introduced in the context of automatic multitrack mixing. In this scenario, audio features extracted from multiple sources are utilised and automatic control is applied to optimise inter-channel relations and the coherence of the whole mix. These systems often follow expert knowledge obtained from the literature about music production practice, or interviews with skilled mixing engineers. In [3], the authors explored how different audio engineers use the compressor on the same material to train an intelligent compressor. An automatic multitrack compressor based on side-chain feature extraction is presented in [6], providing automatic settings for attack/release time, knee, and make-up gain based on low-level features and heuristics. An alternative implementation of DRC using non-negative matrix factorisation (NMF) is proposed in [11]. Here, the authors consider raising the NMF activation matrix to the power 1/R to obtain a compressed signal with ratio R after re-synthesis. This is viewed as a compressor without a threshold parameter.

The goal of our work is substantially different from these solutions. At this initial stage, we focus on individual track compression, rather than trying to fit the signal into a mix; therefore we do not assume the presence of other channels. We do not aim to incorporate expert knowledge into the system or automate control parameters in a time-varying manner. Instead, we assume an audio example, and aim to configure the compressor to yield an output that sounds similar to the example. This requires the prediction of each parameter.

To estimate the settings of a compressor with hidden parameter values, Bitzer et al. [12] propose the use of purposefully designed input signals. Although this provides insight into predicting DRC parameters, in most cases, including ours, only the audio signal is available and the original compressor and its settings are not available for black-box testing. A reverse engineering process is proposed for the entire mix in [13]. The authors estimate a wide range of mixing parameters including those of audio effects. This work however focusses on the estimation of time-varying gain envelopes associated with dynamic non-linear effects rather than their individual parameters. A recent work proposes the use of deep neural networks (DNN) for the estimation of gain reduction [14]. The use of DNN in intelligent control of audio effects is quite novel, but this work targets only the mastering process and only considers the ratio factor.

Our work focuses on comparing linear and non-linear regression models to map low-level audio features discussed in Section 3 to common parameters of the dynamic range compressor. We propose using a reference audio example as target and evaluate the regression models in terms of how close the processed sounds get to the target in overall dynamic range and audio similarity. This is motivated by the need to analyse and compare changes in the dynamic and spectral characteristics of the processed sounds, since both are affected by DRC. To this end, we measure peak-to-RMS ratio and also use a simple baseline model of audio similarity consisting of a Gaussian Mixture Model (GMM) trained on Mel Frequency Cepstrum Coefficients (MFCCs) [15]. The Kullback-Leibler (KL) divergence is a robust method to measure the similarity between single Gaussian distributions [16, 17]. However, the divergence between multiple Gaussian models is not analytically tractable; therefore we use the approach proposed in [18] based on variational Bayes approximation. In the next section, we discuss regression model training and DRC parameter estimation.

3. METHODS

3.1. Training procedure

This study uses a single-channel, open-source dynamic range compressor developed in the SAFE project [19]. In the interest of brevity, we do not discuss the operation of the compressor and assume the reader is familiar with relevant principles [20, 6]. We consider the estimation of the most common parameters: threshold, ratio, attack and release time from reference audio, and leave other parameters, e.g. make-up gain and knee width, for future work. To form an efficient regression model, we need to choose the most relevant features first. Section 3.1.1 describes the features derived from a series of experiments. The system is then discussed in Section 3.1.2, outlining the data flow and system structure.

3.1.1. Feature extraction

Audio features are selected or designed for each specific effect parameter. Since DRC affects perceptual attributes in terms of loudness and timbre, six statistical features are selected for all four parameters. The RMS features reflect energy, which is related to loudness, while the spectral features reflect the spectral envelope, which is related to timbre. The statistical features are calculated frame-wise, with a frame size of 1024 samples and a 50% overlap. For spectral features, we use 40 frequency bins up to 11 kHz. We assume this bandwidth is sufficient for the control of selected DRC parameters. We define the magnitude spectrogram Y(n, k) = |X(n, k)| with n ∈ [0 : N−1] and k ∈ [0 : K], where N is the number of frames and k is the frequency index of the STFT of the input audio signal with a window length of M = 2(K+1). We extract the spectral features described in Equations 1-4 as follows:

SCmean = E[ (∑_{k=0}^{K−1} k · Y(n, k)) / (∑_{k=0}^{K−1} Y(n, k)) ],  (1)

SCvar = Var[ (∑_{k=0}^{K−1} k · Y(n, k)) / (∑_{k=0}^{K−1} Y(n, k)) ],  (2)

SVmean = E[ (E[Y(n, k)²] − (E[Y(n, k)])²)^{1/2} ],  (3)

SVvar = Var[ (E[Y(n, k)²] − (E[Y(n, k)])²)^{1/2} ],  (4)

where SC stands for spectral centroid, and SV stands for spectral variance. The mean and variance in the equations are calculated across all M length frames.
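A direct frame-wise implementation of Equations 1-4 can be sketched as follows, assuming a precomputed magnitude spectrogram `Y`; the small constant guarding against division by zero in silent frames is an added safeguard, not part of the original formulation:

```python
import numpy as np

def spectral_features(Y):
    """Statistics of Eqs. (1)-(4) from a magnitude spectrogram Y with
    shape (N, K): N frames, K frequency bins.

    SCmean/SCvar: mean and variance (over frames) of the per-frame
    spectral centroid, in bin-index units as in the paper.
    SVmean/SVvar: mean and variance (over frames) of the per-frame
    spectral standard deviation of the magnitudes.
    """
    N, K = Y.shape
    k = np.arange(K)
    centroid = (Y @ k) / (Y.sum(axis=1) + 1e-12)   # guard against silence
    spread = np.sqrt(np.mean(Y**2, axis=1) - np.mean(Y, axis=1)**2)
    return centroid.mean(), centroid.var(), spread.mean(), spread.var()
```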
Figure 1: Workflow of initial system with access to the reference sound and its corresponding unprocessed version (see Section 3.2.2)
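The dynamic-range side of this comparison rests on the peak-to-RMS ratio (crest factor); a minimal sketch:

```python
import numpy as np

def crest_factor(x):
    """Peak-to-RMS ratio of signal x: large for dynamic (uncompressed)
    material, approaching 1 as the dynamics are flattened."""
    return np.max(np.abs(x)) / np.sqrt(np.mean(np.square(x)))
```

A heavily compressed signal has a crest factor near 1, so driving the output's crest factor towards the reference's is a direct proxy for matching overall dynamic range.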
We also extract the following temporal features described in Equations 5-6:

RMSmean = E[ ((1/M) ∑_{m=0}^{M−1} x(m)²)^{1/2} ],  (5)

RMSvar = Var[ ((1/M) ∑_{m=0}^{M−1} x(m)²)^{1/2} ],  (6)

where x(m) represents the magnitude of audio sample m within each M-length frame, and the mean and variance are calculated across all the N time frames as with the previous spectral features.

We designed four types of time-domain features related to the attack and release of the notes as well as the speed of the compressor. The attack and release times TA = TendA − TstartA and TR = TendR − TstartR are calculated using the RMS envelope through a fixed-threshold method (cf. [21]) that determines start and end times of the attack and release parts of the sound. The end of the attack is considered to be the first peak that exceeds 50% of the maximum RMS energy. The RMS curve is smoothed by a low-pass filter with a normalised cut-off frequency of 0.47 rad/s. We also extract the RMS amplitude at the end of the attack and the start of the release, rms(TendA) and rms(TstartR) respectively, as well as the mean amplitude during the attack and release parts of the sound:

Aatt = (1/TA) ∑_{n=TstartA}^{TendA} rms(n),  (7)

Arel = (1/TR) ∑_{n=TstartR}^{TendR} rms(n),  (8)

where Tstart and Tend are indices of the start and end of the attack or release. Finally, we compute a feature related to how fast the compressor operates. We first calculate the ratio between the time-varying amplitudes of the input or original sound and the reference sound, s(n) = rms_ref(n)/rms_orig(n). We then calculate the amount of time for s(n) to reach a certain value using a fixed threshold. This relates to the speed of the compressor to reach the desired compression ratio, which is controlled primarily by its attack time. The same process is applied at the end of the note to extract how fast the ratio curve drops back to one. The design of these features was motivated by visual inspection of the signal. They are shown to improve the ability of the regression model to predict the parameters (see Section 4).

3.1.2. Regression model training

This section outlines the datasets and training procedure for the regression models that are used to map features to effect parameter settings. In the first stage of our research, we consider two types of instruments: snare drum and violin. The former is one of the most common instruments that requires at least a light compression to even out dynamics. The drum samples are typically short and, considering a typical energy envelope, exhibit only the attack and release (AR) part. The violin recordings typically consist of a long note with fairly clear attack, decay, sustain and release (ADSR) phases. All audio samples in our work are taken from the RWC isolated note database [22].

Table 1 describes the four violin note datasets denoted A, ..., D that are used for training. In each dataset, one parameter of the effect is varied while the others are kept constant. The number of training samples in each dataset equals the number of notes, i.e., 60 in the case of the violin dataset, times the number of grid points (subdivisions) for each changing parameter. In this study, we use 50 settings for threshold and ratio, and 100 settings for attack and release time, as shown in the first column of Table 1. The same process is applied to 12 snare drum samples to form the drum dataset. Each training set A, ..., D is used for predicting a specific parameter.

Training set (size) | Thr (dB) | Ratio    | Att (ms) | Rel (ms)
A (60×50)           | 0:1:49   | 2        | 5        | 200
B (60×50)           | 37.5     | 1:0.4:20 | 5        | 200
C (60×100)          | 37.5     | 2        | 1:1:100  | 200
D (60×100)          | 37.5     | 2        | 5        | 50:10:1000

Table 1: Training set generation
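The RMS envelope underlying Equations 5-8 and a simplified version of the fixed-threshold attack-time measurement can be sketched as follows; the 10% onset floor is an assumption for illustration (the paper smooths the envelope and locates the first peak above 50% of the maximum):

```python
import numpy as np

def rms_envelope(x, frame=1024, hop=512):
    """Frame-wise RMS of x; frame=1024 with hop=512 matches the 50%
    overlap analysis settings quoted in Section 3.1.1."""
    return np.array([np.sqrt(np.mean(x[i:i + frame] ** 2))
                     for i in range(0, len(x) - frame + 1, hop)])

def attack_time(rms, frame_rate, thresh=0.5, floor=0.1):
    """Simplified fixed-threshold attack estimate: time from the first
    frame above `floor` * max to the first frame above `thresh` * max.
    The 10% floor is an illustrative assumption, not a value from the
    paper."""
    peak = rms.max()
    start = np.argmax(rms > floor * peak)    # onset of the attack
    end = np.argmax(rms >= thresh * peak)    # first frame above 50% of max
    return max(end - start, 0) / frame_rate  # seconds
```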
Figure 3: Realistic workflow with only the input and reference sound available (see Section 3.2.3)
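The distance computation in this workflow (fit a GMM to the MFCC features of each sound, then compare with a symmetrised KL divergence) can be sketched in a single-Gaussian simplification. The paper fits full GMMs and uses a variational approximation of the KL divergence, whereas the closed form below holds only for single Gaussians; MFCC extraction itself is omitted:

```python
import numpy as np

def fit_gaussian(X):
    """Fit a single full-covariance Gaussian to feature vectors X (n, d);
    the small diagonal term is an added regulariser."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return mu, cov

def kl_gauss(p, q):
    """Closed-form KL divergence KL(p || q) between two multivariate
    Gaussians given as (mean, covariance) pairs."""
    mu_p, S_p = p
    mu_q, S_q = q
    d = len(mu_p)
    iSq = np.linalg.inv(S_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(iSq @ S_p) + diff @ iSq @ diff - d
                  + np.log(np.linalg.det(S_q) / np.linalg.det(S_p)))

def sym_kl(p, q):
    """Symmetrised KL divergence used as the distance measure."""
    return kl_gauss(p, q) + kl_gauss(q, p)
```

A smaller sym_kl between the output note and the reference indicates that the estimated compressor settings have moved the output closer to the reference in this feature space.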
constraint of mono-timbre, i.e., only a single instrument is sounding at a time, we assume that the statistical properties of the underlying features remain constant. Therefore, the rest of the algorithm remains the same. A possible method to apply our algorithm to loops is to consider them as notes, with the attack of the first note in the loop and the release of the last. It is a reasonable simplification in practice, noting that the rest of the attack and release parts of notes are likely to be overlapping. As before, the results are reported and discussed in Section 4.3.

4. EVALUATION

4.1. Direct assessment of parameter estimation

Using the test procedure outlined in Section 3.2.1, here we report the accuracy of direct parameter estimation using random sub-sampling validation. Two regression models are compared and evaluated: simple Linear Regression (LR) and Random Forest regression (RF) [23] available in [24]. Tables 2 & 3 show the absolute errors for both instruments and regression models. Since the observed feature values are relatively small, we linearly scale the feature values to [0, 1] and compare the errors. The highlighted values in the tables show that the smallest error is always observed when using the scaled features and the random forest regression model. For completeness, the ranges for the four parameters are (0, 50] dB for threshold, [1, 20] for ratio, (0, 100] ms for attack time, and (0, 1000] ms for release time. Scaling is reasonable in this pilot experiment, because the test data (reference) is selected randomly from the database mentioned above, which means every Npro has its original N available. However, in a real-world scenario, the reference sound is not taken from the database prepared to train the regression models. It is more likely to be a produced sound or track without access to its unprocessed version. Therefore scaling will not be used in the subsequent studies. Please note that we do not overfit the models, because in the evaluation the reference note and the corresponding original are excluded from the training set of the regression model.

The results also illustrate that the prediction accuracy for drums is better than for violins in all cases. One reason is that drum samples are shorter and exhibit a simpler structure: short sustain, followed by release, and there is no pitched content. It shows that the system can predict the compressor parameters from drums better than for violins. In subsequent studies, we will therefore focus on the more complex case of similarity measurement for violin notes and loops.

Violin                       | LR      | RF
Threshold (dB), error        | 3.756   | 2.601
Threshold (dB), scaled error | 1.860   | 1.731
Ratio, error                 | 2.065   | 1.583
Ratio, scaled error          | 0.110   | 0.091
Attack (ms), error           | 15.503  | 0.719
Attack (ms), scaled error    | 1.012   | 0.686
Release (ms), error          | 210.43  | 13.973
Release (ms), scaled error   | 78.913  | 10.583

Table 2: Numerical test using linear and random forest regression model for violin notes

Snare Drum                   | LR      | RF
Threshold (dB), error        | 1.185   | 0.800
Threshold (dB), scaled error | 0.408   | 0.345
Ratio, error                 | 1.571   | 0.999
Ratio, scaled error          | 0.669   | 0.305
Attack (ms), error           | 6.867   | 0.860
Attack (ms), scaled error    | 2.260   | 0.017
Release (ms), error          | 23.045  | 0.999
Release (ms), scaled error   | 40.960  | 6.851

Table 3: Numerical test using linear and random forest regression model for snare drum samples

4.2. Results of similarity assessment between notes

In this section, we evaluate the changes in estimated similarity using the system outlined in the right-hand side of Figure 1 and Figure 3. Firstly, we extract the crest factor, i.e., the peak-to-RMS ratio, as the similarity feature because it is correlated with the overall dynamic range of the signal. Based on our design, the crest factor of the reference note Npro should be closer to the output note T than the input note R. An example of this test is given in Figure 4 with 25 test cases. The crest factor of the input signal is represented by
the constant at the top of the figure and the crest factors of a series of reference notes are depicted by the blue curve at the bottom. The crest factor of the output signal from the system is shown in the middle (green curve). It is consistently brought closer to the reference, which fits our expectation here.

Violin              | Threshold | Ratio | Attack | Release
DCrest(Npro, R)     | 60.31     | 94.13 | 104.93 | 85.31
DCrest(Npro, T)LR   | 12.53     | 39.72 | 46.76  | 48.62
DCrest(Npro, T)RF   | 15.27     | 38.24 | 45.23  | 47.19

Table 4: Average of Crest factor difference - Violin

Snare Drum          | Threshold | Ratio | Attack | Release
DCrest(Npro, R)     | 49.14     | 70.04 | 50.47  | 70.68
DCrest(Npro, T)LR   | 27.99     | 43.85 | 27.28  | 44.63
DCrest(Npro, T)RF   | 29.33     | 43.45 | 27.20  | 43.44

Table 5: Average of Crest factor difference - Snare drum

Next, we discuss the results of similarity assessment as described in Section 3.2.2. At this stage, we use a simple audio similarity model to test the efficiency of the system. We use MFCC coefficients as features and fit a GMM on the MFCC vectors. An approximation of the symmetrised KL divergence is then calculated and used as a distance measure. Using the same procedure as in the previous part, the compressor settings provided by this algorithm should bring the output note T closer to Npro compared to the input note R. Thus it is reasonable to assume that D(Npro, R) > D(Npro, T) holds and the performance of the regression models can be tested. In this experiment, we select 50 Npro and change one parameter at a time. Tables 6 & 7 indicate the efficiency of the algorithm. Since the similarity algorithm theoretically captures the timbre information as well, it will yield different results on different instruments. In this test, the distance reduction achieved by the system is larger for violins, i.e., the violin notes exhibit better results than the snare drums. This is possibly due to the fact that the MFCC features used here do not model the drum sounds well enough. Finding a better feature representation for percussive instruments constitutes future work.

Violin              | Threshold | Ratio | Attack | Release
DCrest(Npro, R)     | 60.31     | 94.13 | 104.93 | 85.31
DCrest(Npro, T)LR   | 34.86     | 29.37 | 68.74  | 75.94
DCrest(Npro, T)RF   | 30.11     | 25.36 | 67.82  | 54.02

Table 8: Average of Crest factor difference - Violin

Snare Drum          | Threshold | Ratio | Attack | Release
DCrest(Npro, R)     | 49.14     | 70.04 | 50.47  | 70.68
DCrest(Npro, T)LR   | 38.01     | 66.18 | 21.37  | 44.05
DCrest(Npro, T)RF   | 42.90     | 45.24 | 27.37  | 42.75

Table 9: Average of Crest factor difference - Drum

The result using the MFCC-based similarity model is illustrated in Figures 5 & 6, with D(Npro, R) on the left, D(Npro, T)LR in the middle and D(Npro, T)RF on the right of each subplot. In Figure 5, for violin notes, the average divergence is not as promising, especially when compared with the results in Table 6, but it is clear that even if the given reference sounds have a large diversity, the algorithm reduces this significantly, and shows a very stable improvement in the similarity result. On average, the random forest regression performs better than linear regression in all cases except when predicting threshold. This shows the benefit of modelling non-linearities. Therefore we will build our system using random forest regression. Figure 6 shows that the output of the system did not manage to achieve a dramatic reduction in the similarity distance in the case of the snare drum. As explained before, we
Figure 5: Similarity for four parameters in the second workflow, assuming the origin note N is not available. - Violin

4.3. Results of similarity between loops

We can extend the study by changing the audio material from mono-timbral notes to mono-timbral loops. The final experiment tests the efficiency of the workflow in Figure 3 without using the original notes, and replaces both R and Npro with violin loops. The results displayed in Table 10 and Figure 7 correspond to 50 violin loops with durations of 3-5 seconds. A promising trend is observed using the average divergence. However, unlike in the previous studies, the divergence does not drop very significantly. These results indicate that the algorithm works better for attack/release time, but the performance is not yet satisfactory for threshold. A possible reason is that we selected six features for threshold/ratio but ten for attack/release, and the features used may not be sufficient or accurate enough in this more generic case.

                       Threshold     Ratio    Attack   Release
DCrest(Npro, R)           545.48    580.10    574.33    569.69
DCrest(Npro, T)LR         301.59    237.24    326.68    314.11
DCrest(Npro, T)RF         301.75    209.06    325.08    321.23

Table 10: Average of Crest factor difference - Loops

5. CONCLUSION AND FUTURE WORK

In this paper, we proposed a method to estimate dynamic range compressor settings using a reference sound. We demonstrated the first steps towards a system that configures audio effects using sound examples, with the potential to democratise the music production workflow. We discussed progress from using a linear regression model to a random forest model, and from a designated test case to a real-world scenario. The evaluation shows a promising trend in most cases and provides an initial indication of the utility of our system. Our current research focuses on simple audio material: notes and mono-timbral loops. The study discussed in Section 3.2.4 treats loops as single notes, which ignores a lot of information. In future work, we will assess the use of onset event detection to identify notes from loops or more complex recordings, and measure their attributes using the method applied to individual notes. A useful intelligent audio production tool would have to handle more complex audio tracks, and polyphonic tracks will also be considered in future research.

The algorithm itself may also need improvement. The features we chose to train the regression model can be extended. Thorough research on how to design audio features specific to compressor and other audio effect parameters would be beneficial. At the same time, the regression models employed in this work may be improved using optimisation techniques that take the similarity features into account. An improved similarity model may be used as an objective function, rather than being used only during evaluation. We note however that the technique currently employed in the assessment of similarity is only considered a starting point. More realistic and complex auditory models will be applied in future work. Additionally, we plan to evaluate the system using human perceptual evaluation, i.e., a listening test. In conclusion, this paper provides an innovative and feasible method for intelligent control of the dynamic range compressor. However, this research is still at an early phase, and further considerations are needed to optimise feature design and the regression model.

Figure 6: Similarity for four parameters in the second workflow, assuming the origin note N is not available. - Snare drum

6. ACKNOWLEDGEMENTS

This work has been part funded by the FAST IMPACt EPSRC Grant EP/L019981/1 and the European Commission H2020 research and innovation grant AudioCommons 688382.
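Table 10 above summarises distances built on the crest factor. As a minimal illustration of that kind of feature distance (our own sketch, not the paper's implementation; the function names and the toy "squashing" nonlinearity are assumptions):

```python
import numpy as np

def crest_factor(x, eps=1e-12):
    """Crest factor: peak amplitude over RMS, a standard dynamics feature."""
    return np.max(np.abs(x)) / (np.sqrt(np.mean(x ** 2)) + eps)

def d_crest(a, b):
    """Illustrative crest-factor distance in the spirit of the DCrest columns
    of Table 10 (our own minimal reading, not the paper's exact code)."""
    return abs(crest_factor(a) - crest_factor(b))

t = np.linspace(0, 1, 44100, endpoint=False)
raw = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)   # decaying "note"
squashed = np.tanh(3 * raw) / 3                      # crude dynamics squashing
print(d_crest(raw, raw), d_crest(raw, squashed))     # 0.0 and a positive distance
```

Compression flattens peaks relative to the signal body, so the crest factor of the processed signal drops and the distance grows with the amount of squashing.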
7. REFERENCES
François G. Germain
Center for Computer Research in Music and Acoustics (CCRMA),
Stanford University, Stanford, CA, USA
francois@ccrma.stanford.edu
2.2. System discretization

The exact solution to Eqs. (1a) and (1c) is theoretically given by

    x(t) = x_0 + ∫_0^t f(x(τ), u(τ)) dτ,   (2)

but we can also express the solution at times t_n = nT (with a fixed interval T) by solving iteratively the problem

    x(t_{n+1}) = x(t_n) + ∫_{t_n}^{t_{n+1}} f(x(τ), u(τ)) dτ.   (3)

In typical cases, solving Eq. (2) or (3) analytically is intractable. One can however compute a discrete-time series x^n approximating x(t_n). To do so, we use numerical integration methods to approximate the integral in Eq. (3) using some N past input values and approximated state values. The update equation of x^n is then

    x^{n+1} = x^n + T f(x^{n+1}, ..., x^{n-N}, u^{n+1}, ..., u^{n-N}).   (4)

Coupled with Eq. (1b), we can then form a sequence of approximated output values for our system of interest. Note that when f depends on x^{n+1}, Eq. (4) becomes implicit and cannot always be solved analytically.

3. NUMERICAL INTEGRATION METHODS

Numerical integration methods aim at approximating the value of the integral of a function using a finite number of function evaluations. As stated earlier, we can apply those methods to Eq. (3) to form a discretized update equation as in Eq. (4). In this paper, we discuss specifically two common methods, the trapezoidal method and the implicit midpoint method.

3.1. Trapezoidal method

The trapezoidal method approximates the value of f in the interval [t, t+T] as the average of the function values at the start and end points of the integral to compute the integral as [22]

    ∫_t^{t+T} f(τ) dτ ≈ (T/2) (f(t+T) + f(t)).   (5)

Eq. (3) is then discretized to form the implicit update equation for the trapezoidal method as

    x^{n+1} = x^n + (T/2) f(x^{n+1}, u^{n+1}) + (T/2) f(x^n, u^n).   (6)

3.2. Implicit midpoint method

The midpoint method approximates the value of f in the interval [t, t+T] as its value at the midpoint of the integral to compute the integral as [22]

    ∫_t^{t+T} f(τ) dτ ≈ T f(t + T/2).   (7)

To use this approach in Eq. (3), several implementations are possible. It can be implemented as the explicit midpoint method, but that method requires a way to compute the state and input values at time t_{n+1/2}. Additionally, this method (also called the leapfrog method) has poor stability properties, as its stability region is reduced to the imaginary axis in the s-plane [1]. We focus instead on the implicit midpoint method, whose implicit update equation is

    x^{n+1} = x^n + T f((x^{n+1} + x^n)/2, (u^{n+1} + u^n)/2).   (8)

For conciseness, we will refer to the implicit midpoint method simply as the "midpoint method" in the rest of the paper.

3.3. Stability analysis and pole mapping

In the following sections, we use the scalar version of Eq. (4) to simplify the notation, as the results extend readily to the multidimensional case using diagonalization and multivariable calculus.

A typical way of studying the stability of discretization methods is by studying the solution to linear time-invariant ordinary differential equations (ODEs) of the form

    x_t(t) = λ x(t),  λ ∈ C,   (9)

whose update equation can typically be written in the form

    x(t_{n+1}) = Σ_{m=0}^{N} a_m(λ) x(t_{n-m}),  a_0(λ), ..., a_N(λ) ∈ C.   (10)

If we denote λ_m(λ) the N+1 roots of the polynomial

    p(z) = z^{N+1} − Σ_{m=0}^{N} a_m(λ) z^{N-m},   (11)

the stability region of a method is then defined as the set of λ such that ∀m ∈ {0, ..., N}, we have |λ_m(λ)| < 1 (i.e., λ_m(λ) is inside the unit sphere). For both the trapezoidal method and the midpoint method, we have a single root λ_0 written as

    λ_0(λ) = (1 + Tλ/2) / (1 − Tλ/2),   (12)

so that the stability region corresponds to the left half of the complex plane {λ, Re(λ) < 0}. This means that the two methods have the desirable property of being A-stable [23]. This also means that, for linear systems, the two methods map the system poles and zeros exactly the same way. We then expect both methods to exhibit similar qualitative behavior, such as resonant peaks near the Nyquist frequency for stiff systems (i.e., systems with strongly damped poles) and frequency warping [14].

3.4. Discretization error and order of accuracy

The discretization error of a method is typically characterized using the equation x_t(t) = f(x(t)) by deriving the error between x(t_{n+1}), solution of

    x(t_{n+1}) = x(t_n) + ∫_{t_n}^{t_{n+1}} f(x(τ)) dτ,   (13)

and x^{n+1}, solution of the discretized equation

    x^{n+1} = x(t_n) + T f(x^{n+1}, x(t_n), ..., x(t_{n-N})).   (14)

This error is typically computed in terms of a polynomial in T using the Taylor expansion of x(t_{n+1}) − x^{n+1} [1], so that

    x(t_{n+1}) − x^{n+1} = Σ_{n=0}^{+∞} ε_n(t_n) T^n,   (15)
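The two implicit updates of Eqs. (6) and (8) can be sketched for a scalar system as follows (a minimal illustration; the fixed-point solver and the test values are our own choices, not the paper's):

```python
def trapezoidal_step(f, x, u, u_next, T, iters=50):
    """One step of Eq. (6), solving the implicit equation by plain
    fixed-point iteration (a sketch; a real solver would use Newton)."""
    x_next = x
    for _ in range(iters):
        x_next = x + 0.5 * T * (f(x_next, u_next) + f(x, u))
    return x_next

def midpoint_step(f, x, u, u_next, T, iters=50):
    """One step of Eq. (8): states and inputs are averaged before
    evaluating f, unlike Eq. (6) which averages the slopes."""
    x_next = x
    for _ in range(iters):
        x_next = x + T * f(0.5 * (x_next + x), 0.5 * (u_next + u))
    return x_next

# On the linear test ODE x_t = lam*x of Eq. (9), both rules apply the same
# amplification factor (1 + T*lam/2)/(1 - T*lam/2) from Eq. (12).
lam, T, x0 = -3.0, 0.01, 1.0
f = lambda x, u: lam * x
alpha = (1 + T * lam / 2) / (1 - T * lam / 2)
print(trapezoidal_step(f, x0, 0, 0, T), midpoint_step(f, x0, 0, 0, T), alpha * x0)
```

For a nonlinear f the two steps differ, which is exactly what the error terms of Secs. 3.4 and 3.7 quantify.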
where the coefficients ε_n characterize the error. The first non-zero coefficients are expressed, for the trapezoidal method, as

    ε_3^tr = −((f)^2 f_xx + f (f_x)^2) / 12  and
    ε_4^tr = −((f)^3 f_xxx + 4 (f)^2 f_x f_xx + f (f_x)^3) / 24,   (16)

and, for the midpoint method, as

    ε_3^md = ((f)^2 f_xx − 2 f (f_x)^2) / 24  and
    ε_4^md = ((f)^3 f_xxx − (f)^2 f_x f_xx − 4 f (f_x)^3) / 48.   (17)

We then see that, depending on the nonlinear function characteristics, different patterns of constructive or destructive interference between the different terms will lead to different behaviors between the two methods.

3.5. First-order fixed-point behavior

Many systems of interest usually tend towards a steady-state solution after infinite time. Note however that not all systems behave this way (e.g., relaxation oscillators [24, 25]). Those steady-state (equilibrium) solutions x_e solve the implicit equation

    f(x_e) = 0.   (18)

A typical analysis of the equilibrium is to look at the value f_x(x_e) to find if the equilibrium is stable (f_x(x_e) < 0) or unstable (f_x(x_e) > 0). Physical systems composed of passive and dissipative components typically present one or more stable equilibria due to the energy of the system being dissipated over time. For such equilibria, as the state variable of a system approaches x_e, the Hartman-Grobman theorem [26] guarantees they will behave similarly to the solutions of the linearized system around x_e, which follow the exponentially decaying profile

    x(t) ≈ x(0) exp(t f_x(x_e)) + x_e.   (19)

Sampled at times t_n, the update formula is expressed as

    x(t_{n+1}) ≈ (x(t_n) − x_e) exp(T f_x(x_e)) + x_e.   (20)

By construction, the equilibria of the discretized system using either the trapezoidal rule or the midpoint rule are identical to those of the original system, as they also verify Eq. (18). In the vicinity of those equilibria, both methods then behave as the linearized system:

    x^{n+1} ≈ α (x^n − x_e) + x_e,   (21)

with α related to f_x(x_e) following Eq. (12), meaning

    α = (1 + T f_x(x_e)/2) / (1 − T f_x(x_e)/2).   (22)

The stability properties of both methods guarantee that stable equilibria (f_x(x_e) < 0) are necessarily stable for the discretized sequences (|α| < 1). However, we have no guarantee on the sign of α, so that if α < 0, the sign of x^n − x_e near the equilibrium alternates at each iteration, creating an oscillation that is not present in the original x(t_n) − x_e, which remains of the same sign according to Eq. (19). This oscillatory behavior matches the well-known oscillations exhibited by the solution of stiff systems discretized with the trapezoidal rule [14].

Figure 1: Solution sequence behavior regions for x^{n+1} as a function of x^n (in green) with reference to the point (x_e, x_e).

3.6. General fixed-point behavior

Further from an equilibrium, the first-order behavior from Eq. (21) may be insufficient to understand the behavior of the system in some regimes. Another tool we can use is to compute the transition equation x^{n+1} as a function of x^n. Once that transition function is known, we can deduce regions indicating whether the method is converging or diverging, i.e., (respectively):

    |x^{n+1} − x_e| < |x^n − x_e|  or  |x^{n+1} − x_e| > |x^n − x_e|,   (23)

and whether the method is oscillating or not, i.e., (respectively):

    (x^{n+1} − x_e)·(x^n − x_e) < 0  or  (x^{n+1} − x_e)·(x^n − x_e) > 0.   (24)

Those properties translate graphically as shown in Fig. 1. In that representation, a quasi-exponential decay such as the one described in Eq. (21) for α > 0 corresponds to a linear section of curve in the convergent non-oscillating region. An oscillating exponential decay (α < 0) corresponds to a linear section of curve in the convergent oscillating region.

3.7. Discretization error with variable input

When considering the influence of the input variables, the equation of interest becomes x_t(t) = f(x(t), u(t)). The error terms are then expressed as functions of the partial derivatives f_{x^k u^l} of f. Expectedly, with the added input variables, both methods remain second-order accurate (ε_0 = ε_1 = ε_2 = 0). The 3rd-order error term for the trapezoidal method becomes

    ε_3^tr = −((f)^2 f_xx + 2 u_t f f_xu + (u_t)^2 f_uu) / 12 − ((f f_x + u_t f_u) f_x + u_tt f_u) / 12,   (25)
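The first-order fixed-point behavior of Eqs. (21)-(22) is easy to reproduce numerically (an illustrative sketch; the pole value and step sizes below are our own choices):

```python
def alpha(fx, T):
    """Linearized amplification factor of Eq. (22)."""
    return (1 + T * fx / 2) / (1 - T * fx / 2)

fx = -1000.0                      # strongly damped (stiff) pole, our choice
assert 0 < alpha(fx, 0.001) < 1   # T*fx = -1: monotone decay toward x_e
assert -1 < alpha(fx, 0.01) < 0   # T*fx = -10: decaying but oscillating

# Iterating Eq. (21), x^{n+1} = alpha*(x^n - x_e) + x_e, in the stiff case
# flips the sign of x^n - x_e at every step:
xe, x, signs = 0.0, 1.0, []
for _ in range(5):
    x = alpha(fx, 0.01) * (x - xe) + xe
    signs.append(x > xe)
print(signs)   # [False, True, False, True, False]
```

The sign alternation appears exactly when T·f_x(x_e) < −2, i.e., when the pole is stiff relative to the step size, matching the oscillations described above for Eq. (19)'s monotone reference.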
and the one for the midpoint method becomes

    ε_3^md = ((f)^2 f_xx + 2 u_t f f_xu + (u_t)^2 f_uu) / 24 − ((f f_x + u_t f_u) f_x + u_tt f_u) / 12.   (26)

Here again, we see how the two methods will present different patterns of interference altering their behavior as a function of f, as the influence of the input on the error terms differs as well.

4. IMPLEMENTATION CONSIDERATIONS

Many publications have presented various implementations of the trapezoidal method for the simulation of audio lumped systems, so we refer the reader to those articles for more details [3–10]. We detail here approaches regarding the implementation of the midpoint method in typical audio circuit simulation frameworks. Generally, the implementation of the midpoint method can be derived in two ways: either through simple modifications of the system obtained using the more typical trapezoidal method, or directly from the output of that trapezoidal-rule system with a simple transformation of the output, as detailed below.

4.1. Direct implementation of the midpoint method

A direct implementation of the midpoint method computes x_md^{n+1} by solving the implicit equation given in Eq. (8), similarly to what we do for the trapezoidal method when solving Eq. (6). Depending on the equation, similar strategies can be exploited, using analytical inverse functions when available or numerical root-finding methods otherwise. One potential downside of the midpoint method is that the expressions relative to the time step t_{n+1} and the time step t_n are grouped together inside the nonlinear function f, which may complicate the derivation and use of analytical inverse functions. On the other hand, notice that the update equation does not depend on two independent input samples u_md^{n+1} and u_md^n, but on the combination (u_md^{n+1} + u_md^n)/2, so that the dimensionality of the update equation with respect to the input can be reduced compared to the trapezoidal method.

4.2. Midpoint method using the trapezoidal rule

The midpoint method and the trapezoidal method are conjugate methods [11]. More precisely, if the sequence x_md^n verifies Eq. (8) for the input sequence u_md^n, we can show that the sequence x_tr^n = (x_md^n + x_md^{n−1})/2 verifies Eq. (6) for the modified input sequence u_tr^n = (u_md^n + u_md^{n−1})/2 [27]. This means that if we want to implement the midpoint rule to simulate the state equation

    x_t(t) = f(x(t), u(t)),   (27)

we can do so by simulating that same equation using the trapezoidal rule, replacing the input u(t) by (u(t) + u(t − T))/2. As a consequence, if a system has been designed to generate a simulated sequence x_tr^n of Eq. (27) using the trapezoidal rule, we can obtain a simulated sequence x_md^n for the input sequence u_md^n iteratively as follows:

• Generate a new sample x_tr^{n+1} from the trapezoidal-rule system using the input samples defined as

    u_tr^{n+1} = (u_md^{n+1} + u_md^n)/2,
    u_tr^n = (u_md^n + u_md^{n−1})/2.   (28)

• Generate a new sample x_md^{n+1} using either the (recurrent) equation

    x_md^{n+1} = 2 x_tr^{n+1} − x_md^n,   (29)

or the (direct) equation

    x_md^{n+1} = x_tr^{n+1} + (T/2) f(x_tr^{n+1}, u_tr^{n+1}).   (30)

We also need to use the appropriate initial conditions for the trapezoidal-rule system based on the initial conditions of the desired midpoint sequence. In a typical case, those initial conditions correspond to x_md^0 = x(0) for the state variables and u_md^0 = u(t_0) for the input variables, typically based on a specified input function u(t) defined for t ≥ 0. In addition to specifying the initial state value x_tr^0, the trapezoidal-rule system also requires defining u_tr^0, which depends on the unspecified u_md^{−1}. Note that x_tr^0 and u_tr^0 cannot be chosen independently. We can pick any initial condition pair (x_tr^0, u_tr^0) as long as it verifies

    x_tr^0 + (T/2) f(x_tr^0, u_tr^0) = x_md^0.   (31)

A typical assumption is that the system was in a steady-state condition before the simulation started, so that the initial conditions are given as a function of x_md^0 by solving

    x_tr^0 = x_md^0,
    0 = f(x_md^0, u_tr^0).   (32)

4.3. Trapezoidal method using the midpoint rule

A similar principle can be used to generate a simulation based on the trapezoidal rule from a system designed to simulate that same system using the midpoint rule, using the procedure:

• Generate a new sample x_md^{n+1} from the midpoint-rule system using the input sample defined recursively as

    u_md^{n+1} = 2 u_tr^{n+1} − u_md^n.   (33)

• Generate a new sample x_tr^{n+1} using either

    x_tr^{n+1} = (x_md^{n+1} + x_md^n)/2,   (34)

or solving the implicit equation

    x_tr^{n+1} + (T/2) f(x_tr^{n+1}, u_tr^{n+1}) = x_md^{n+1}.   (35)

Finding the initial conditions is less complex in that case. The initial source sample u_md^0 can be chosen arbitrarily. If the input function u(t) is known for t ≥ 0, a possible choice would be u_md^0 = u(T/2). An alternative option is u_md^0 = (u_tr^1 + u_tr^0)/2, which does not require explicit knowledge of u(t). The initial state x_md^0 is obtained as

    x_md^0 = x_tr^0 + (T/2) f(x_tr^0, u_tr^0).   (36)

4.4. Considerations for typical audio systems

A large share of the audio systems presented in the literature are characterized by having the dynamical elements only be linear (e.g., capacitor, inductor), and having the nonlinear elements only be memoryless (e.g., diode, transistor, operational amplifier). Several simulation frameworks for audio systems (e.g., wave digital filters [9, 28], nodal DK method [29], generalized state space [8]) segregate the system between sets of memoryless nonlinear equations and sets of linear equations between the system variables.

For the midpoint method, discretizing the linear equations is the same process as for the trapezoidal method, so the discrete update equations are identical. Memoryless nonlinear equations (e.g., voltage–current characteristics) are often of the form b(t) = h(a(t)) (e.g., with a voltages and b currents for a voltage–current characteristic). Contrary to the trapezoidal method, where the update equations remain memoryless as b^n = h(a^n), the midpoint method equations become

    b^{n+1} = −b^n + 2 h((a^{n+1} + a^n)/2).   (37)

This transformation would then be applied to the nonlinear elements at the root of a wave digital filter tree [9, 28], or to the nonlinear characteristic equations in the nodal DK approach [3] and in the generalized state-space [8].

Alternatively, we can use the process described in Sec. 4.2 to exploit simulations that use the trapezoidal method, applying the appropriate transformation to the input sequence (following Eq. (28)), and adding the conversion equations (following Eqs. (29) and (30)) to get the state variables for the midpoint simulation.

5. CASE STUDY

We study the diode clipper [14, 19, 20, 30] as shown in Fig. 2 to illustrate the concepts developed in the previous sections, as we study and compare the behavior of both methods in several scenarios and provide tools to understand and forecast such behavior.

Figure 2: Diode clipper circuit.

[Figure 3 plots: new state and single-step error as a function of the current state; legend: Trapezoidal, Midpoint, Reference; panel (a): e = 0 V]
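The conversion procedure of Sec. 4.2 (Eqs. (28) and (30), with the steady-state initialization of Eq. (32)) can be sketched on a simple linear test system; all names and numerical values below are our own assumptions, not the paper's:

```python
import numpy as np

A = -50.0                           # pole of the linear test system (our choice)
def f(x, u):
    return A * x + u

T, n_steps = 0.001, 200
u_md = np.sin(2 * np.pi * 10 * T * np.arange(n_steps + 1))  # midpoint input u_md^n

# Steady-state initialization (Eq. (32)): x_tr^0 = x_md^0 and f(x_md^0, u_tr^0) = 0.
x_md = [0.0]
x_tr, u_tr = x_md[0], -A * x_md[0]
for n in range(n_steps):
    u_tr_next = 0.5 * (u_md[n + 1] + u_md[n])                  # Eq. (28)
    # Trapezoidal step (Eq. (6)), in closed form for this linear f:
    x_tr_next = ((1 + 0.5 * T * A) * x_tr
                 + 0.5 * T * (u_tr + u_tr_next)) / (1 - 0.5 * T * A)
    x_md.append(x_tr_next + 0.5 * T * f(x_tr_next, u_tr_next))  # Eq. (30)
    x_tr, u_tr = x_tr_next, u_tr_next

# The recovered sequence satisfies the implicit midpoint update (Eq. (8)):
res = max(abs(x_md[n + 1] - x_md[n]
              - T * f(0.5 * (x_md[n + 1] + x_md[n]),
                      0.5 * (u_md[n + 1] + u_md[n])))
          for n in range(n_steps))
print(res)   # numerically zero (rounding error only)
```

The vanishing residual is the conjugacy property in action: a trapezoidal-rule simulator driven by the averaged input reproduces the midpoint-rule sequence exactly, not approximately.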
Figure 4: Error (in dB) for both methods. The equilibria are indicated in red.
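To make the case study concrete, a minimal implicit-midpoint simulation of a diode clipper can be sketched as below. The ODE form follows the usual diode clipper model (cf. [15]); the component values are illustrative assumptions, not the paper's Tab. 1, and bisection stands in for the paper's fzero-based root finding:

```python
import math

# Illustrative component values (assumptions, not the paper's Tab. 1).
R, C, Is, Vt = 2.2e3, 10e-9, 2.52e-9, 45.3e-3

def f(v, e):
    """Diode clipper state equation: C*v_t = (e - v)/R - 2*Is*sinh(v/Vt)."""
    return ((e - v) / R - 2 * Is * math.sinh(v / Vt)) / C

def midpoint_step(v, e_mid, T):
    """Solve the implicit midpoint update (Eq. (8)) by bisection on
    g(w) = w - v - T*f((w + v)/2, e_mid), which is increasing in w
    because f is decreasing in v."""
    lo, hi = -2.0, 2.0
    for _ in range(80):
        w = 0.5 * (lo + hi)
        if w - v - T * f(0.5 * (w + v), e_mid) > 0:
            hi = w
        else:
            lo = w
    return 0.5 * (lo + hi)

# Step response: constant e = 0.5 V from a zero initial state.
T, v, out = 1.0 / 44100, 0.0, []
for _ in range(200):
    v = midpoint_step(v, 0.5, T)
    out.append(v)
print(out[-1])   # settles near the clipped equilibrium, well below the 0.5 V input
```

The bisection bracket works because the update residual is monotone in the new state; a production implementation would use Newton's method for speed, as the direct-implementation discussion in Sec. 4.1 suggests.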
5.2. Simulation parameters

We adopt typical parameter values for fixed-rate virtual analog simulations, as shown in Tab. 1. The response of the reference system is approximated using the MATLAB adaptive solver ode45 set to work with a target error at machine precision. The equilibria and the implicit equations for both numerical methods are computed using the MATLAB root finder fzero set to work with a target error at machine precision.

5.3. Constant-input study

Many typical inputs are characterized by a constant-input regime. For example, an impulse response will lead to a constant zero input after the impulse. Similarly, a step response will lead to a constant non-zero input after the transient. In such a case, the system can be interpreted as an ODE (i.e., as a system without input), with the input voltage becoming a constant in the equation such that

    v_t(t) = h(v(t)) = f(v(t), e).   (41)

First, we investigate the error made by each method on a single simulation step, i.e., when computing the new state v^{n+1} (or v(t_{n+1})) from a current state v^n (or v(t_n)). Fig. 3 shows the new state v^{n+1} and the error |v(t_{n+1}) − v^{n+1}| as a function of the current state v^n (or v(t_n)) for two constant voltage source values (respectively e = 0 V and 0.5 V). The plots show that, while the two methods expectedly compare with one another and with the reference for lower v(t_n) values where the system behaves quasi-linearly, the error for the trapezoidal rule rises sharply for higher values (roughly 200 mV and higher). That region corresponds to the region where the system is highly non-linear and stiff. Those observations extend readily to other choices of input voltage and state values, as shown in Fig. 4, where we display the error as a function of the current state v(t_n) and the constant input value e.

5.4. Fixed-point analysis

Using the analysis described in Secs. 3.5 and 3.6 for Fig. 1, we further understand the behavior of the two methods and the reference. In Fig. 3a, we see that for both methods and the reference, the first-order behavior around the system equilibrium is a non-oscillating decaying exponential. Furthermore, the transition functions are quasi-linear functions for a large region of state values around the equilibrium, where we have a quasi-exponential decay of the state sequence once a state value falls within that region. For high state voltages, however, while the midpoint method stays close to the reference in the convergent non-oscillating region, the trapezoidal method drops sharply into the oscillating region. This means that a state sample in that region will iterate to a sample with a much lower voltage than the true solution (i.e., overshoot the true voltage) before entering the convergent non-oscillating regime.

In Fig. 3b, we see clearly that the first-order behavior near the equilibrium is substantially different for the reference on the one hand and the two methods on the other hand. The reference again follows a non-oscillating exponential decay in that region, while the two methods present an oscillating exponential decay, as shown by the negative slope of the transition function at the equilibrium. Beyond that region, the behaviors of the two methods are actually very different. The midpoint method shows a decaying oscillating behavior over a large section of state values both above and below the equilibrium, and only matches the non-oscillating exponential decay behavior of the reference solution for low state values (≲ −100 mV). On the other hand, the trapezoidal method presents a non-oscillating quasi-exponential decay close to the reference for almost all state values below the equilibrium, with small oscillations only close to the equilibrium. However, the behavior above the equilibrium is again characterized by a transition function dropping sharply into the oscillating region.

Fig. 5 shows the behavior of both methods for a wide range of e and v^n values. Knowing that the reference is always converging non-oscillating (i.e., light gray), its behavior is matched by the trapezoidal method only if the state variable does not exceed a value with a similar order of magnitude across the tested e values. The method becomes however oscillating above those values, becoming even diverging oscillating for very high v^n. The midpoint method is convergent non-oscillating over a wide range of e and v^n values. As hinted in Fig. 3b, we also see that high values of v^n, for which the trapezoidal method is oscillating, keep a converging non-oscillating behavior if the input value e is low. However, the midpoint method is oscillating for an increasing range of high v^n values as the input value e becomes high.
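The region labels of Eqs. (23)-(24) used throughout this analysis can be computed directly from a method's transition function. A sketch on the linear stiff case, where both rules share the same transition function via Eq. (12) (names are our own):

```python
def classify(step, x, xe):
    """Label a state x against equilibrium xe via Eqs. (23)-(24)."""
    w = step(x)
    conv = "convergent" if abs(w - xe) < abs(x - xe) else "divergent"
    osc = "oscillating" if (w - xe) * (x - xe) < 0 else "non-oscillating"
    return conv, osc

# Linear stiff case x_t = fx*x (xe = 0): both rules give the transition
# function w = alpha*x with alpha = (1 + T*fx/2)/(1 - T*fx/2) (Eq. (12)).
def make_step(fx, T):
    alpha = (1 + T * fx / 2) / (1 - T * fx / 2)
    return lambda x: alpha * x

print(classify(make_step(-1000.0, 0.0001), 1.0, 0.0))  # ('convergent', 'non-oscillating')
print(classify(make_step(-1000.0, 0.01), 1.0, 0.0))    # ('convergent', 'oscillating')
```

Sweeping such labels over a grid of e and v^n values is how region maps like Fig. 5 can be produced, with the transition function obtained by root finding for a nonlinear system such as the clipper.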
6. CONCLUSION
7. REFERENCES
[Figure 6 plots; legend: Trapezoidal, Midpoint, Reference]

(c) v(0) = 0.1 V and e = 0.5 V    (d) v(0) = 0.3 V and e = 0.5 V

Figure 6: Step response simulation x^n and reference x(t) for different initial conditions and constant input values.
Kurt James Werner (The Sonic Arts Research Centre (SARC), School of Arts, English and Languages, Queen's University Belfast, UK; k.werner@qub.ac.uk)
Michael Jørgen Olsen (Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, CA, USA; mjolsen@ccrma.stanford.edu)
Maximilian Rest (E-RM Erfindungsbüro, Berlin, Germany; m.rest@e-rm.de)
Julian D. Parker (Native Instruments GmbH, Berlin, Germany; julian.parker@native-instruments.de)
…cuit modeling, since the most common problematic arrangement of diodes—series combinations of diodes and anti-parallel combinations of diodes in series—can be reasonably and intuitively modeled as a single one-port device in many circumstances [35].

In this paper, we extend a WDF formalism involving grouped nonlinearities at the root of an SPQR tree [11–13] to accommodate a wider variety of independent–dependent variable descriptions of nonlinear elements, enabling simulation of the two problematic classes of circuits mentioned above. For the first class of circuits, we exploit the proposed generalization to accommodate whatever pair of independent–dependent variables is required for each nonlinearity. For the second class of circuits, we explain the meaning of the forbidden topological connections and variable choices in terms of graph and network theory, give principles for choosing proper independent–dependent variable pairs, and show how to implement these principles using the proposed generalization. In this way, the class of problematic circuits which have been accommodated in the State Space and Port-Hamiltonian context may be accommodated in the Wave Digital Filter context as well.

The rest of the paper is structured as follows. §2 outlines the proposed generalization to the WDF approach. §3 explains the first problematic class of circuits and how they can be accommodated using the proposed generalization. §4 explains the second problematic class of circuits and how they can be accommodated using the proposed generalization. To support a general search for problematic cutsets and loops, §5 reviews two graph theorems.

[Figure 1 diagrams not reproduced.] Figure 1: Proposed method signal flow graphs: (a) signal flow graph; (b) loop resolved. a_e is supplied by and b_e delivered to classical WDF subtrees below.

2. WDFS WITH GROUPED NONLINEARITIES

In this section we review the method of grouped nonlinearities in WDFs, simultaneously extending it to handle nonlinearities and nonadaptable linear circuit elements expressed with a wide variety of independent and dependent variables.

2.1. Overview

The method of grouped nonlinearities in WDFs is outlined diagrammatically in Fig. 1. In this formulation there are five conceptual relationships to consider:

• All nonlinearities of the circuit grouped together into a nonlinear multiport element at the root of a tree structure (labeled "nonlinearities"). The behavior of the group of nonlinearities is expressed mathematically by y_c = f(x_c), where x_c is a vector of independent network port variables and y_c is a vector of dependent network port variables.

• A change of variables from x_c and y_c to the vectors of incident and reflected voltage waves a_c and b_c (labeled "change of variables"), the subscript c denoting "converter."^2 This change of variables is expressed mathematically by the matrix C with partitions C11, C12, C21, and C22.

• A compatibility relationship between a_c and b_c and "internal" port wave variables a_i and b_i, simply enforcing port connection criteria between the scattering matrix below and the multiport nonlinear element y_c = f(x_c). This is expressed mathematically by a_c = b_i and a_i = b_c.

• Scattering between a_i and b_i and "external" port wave variables a_e and b_e. This is expressed mathematically by the scattering matrix S with partitions S11, S12, S21, and S22. Because the circuit topology described by this scattering is usually a complicated R-type topology [25], this is labeled "R scattering."

• The rest of the WDF structure, which is fed reflected waves b_e by the root topology and provides incident waves a_e. The rest of the structure has the form of a standard WDF "connection tree" [11, 37] and is not shown in the diagram.

These relationships are shown as a vector signal flow graph in Fig. 1a, where delay-free loops are represented by red dashed arrows. The root is described by the system of equations

    nonlinearities:       y_c = f(x_c)                (1)
    change of variables:  x_c = C11 y_c + C12 a_c     (2)
                          b_c = C21 y_c + C22 a_c     (3)
    compatibility:        a_c = b_i                   (4)
                          a_i = b_c                   (5)
    scattering:           b_i = S11 a_i + S12 a_e     (6)
                          b_e = S21 a_i + S22 a_e.    (7)

Some of the delay-free loops can be resolved using matrix algebra [11], yielding a consolidated version of (1)–(7):

    y_c = f(x_c)            (8)
    x_c = E a_e + F y_c     (9)
    b_e = M a_e + N y_c     (10)

^2 This relates to earlier approaches to interfacing State Space and WDF systems [36].
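The root system (1)–(7) can be checked numerically. The sketch below uses random stand-in partitions for C and S (not derived from any real circuit) and treats y_c as given: substituting the compatibility relations (4)–(5) into (3) and (6) leaves a single linear system in a_c, after which the remaining waves follow directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3  # number of root ports (arbitrary for the demo)

# Random stand-in partitions of C and S (not from a real circuit);
# scaled down so that (I - S11 C22) is comfortably invertible.
C11, C12, C21, C22 = (0.1 * rng.standard_normal((n, n)) for _ in range(4))
S11, S12, S21, S22 = (0.1 * rng.standard_normal((n, n)) for _ in range(4))

y_c = rng.standard_normal(n)  # dependent root variables, taken as given
a_e = rng.standard_normal(n)  # incident waves from the subtrees

# Substituting (4)-(5) into (3) and (6) gives
#   a_c = S11 (C21 y_c + C22 a_c) + S12 a_e,
# a linear system in a_c alone:
I = np.eye(n)
a_c = np.linalg.solve(I - S11 @ C22, S11 @ C21 @ y_c + S12 @ a_e)

b_c = C21 @ y_c + C22 @ a_c  # (3)
a_i = b_c                    # (5)
b_i = S11 @ a_i + S12 @ a_e  # (6)
x_c = C11 @ y_c + C12 @ a_c  # (2)
b_e = S21 @ a_i + S22 @ a_e  # (7)

assert np.allclose(a_c, b_i)  # compatibility (4) is recovered
```

In a real simulation y_c is not given but coupled back through f(·); that remaining delay-free loop is the one resolved iteratively below.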
[Figure 2 panels not reproduced: (a) schematic, (b) schematic rearranged, (c) SPQR tree, (d) Wave Digital Filter.]
Figure 2: Relaxation oscillator schematic and derivation of Wave Digital Filter structure.
with

    E = C12 (I + S11 H C22) S12,    F = C12 S11 H C21 + C11
    M = S21 H C22 S12 + S22,        N = S21 H C21

where I is the identity matrix and H = (I − C22 S11)^−1. This system of equations is shown as a vector signal flow graph in Fig. 1b; notice that one delay-free loop through F and f(·) still remains. This delay-free loop can be resolved, e.g., by Newton–Raphson iteration [13], table lookup / the K method [12, 38], custom nonlinear solvers [33], or a final matrix inversion in the case that all nonadaptable elements are linear [15].

2.2. Populating S

To use the method just outlined, matrices S and C must be populated. S can be found using the techniques of [11], which apply Modified Nodal Analysis [39] to the R-type topology, where instantaneous Thévenin equivalents represent the incident wave at each port, combining with the WDF wave definition to solve for S. If the R-type adaptor contains no absorbed non-reciprocal elements, the method of [40] may be used to solve for S.

Table 1: t11,n, t12,n, t21,n, t22,n for different dependent and independent variables at a port n.

    independent variable              dependent variable
    x_n    t11,n       t12,n         y_n    t21,n       t22,n
    v_n    1/2         1/2           i_n    1/(2 R_n)   −1/(2 R_n)
    i_n    1/(2 R_n)   −1/(2 R_n)    v_n    1/2         1/2
    a_n    1           0             b_n    0           1

The requirement that x_n and y_n at each port n be linear combinations of a_n and b_n is expressed by

    [x_n]   [t11,n  t12,n] [a_n]
    [y_n] = [t21,n  t22,n] [b_n].    (11)

Solving for x_n and b_n yields an equation in the form of C,

    [x_n]   [c11,n  c12,n] [y_n]
    [b_n] = [c21,n  c22,n] [a_n],    (12)

where

    [c11,n  c12,n]   [t12,n / t22,n    (t11,n t22,n − t12,n t21,n) / t22,n]
    [c21,n  c22,n] = [1 / t22,n        −t21,n / t22,n                    ].    (13)
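The per-port conversion (13) is straightforward to implement. The sketch below is a hypothetical helper (not from the paper) that maps the T-parameters of (11) to the C-parameters of (12); the example stamp is the Table 1 row x_n = v_n, y_n = i_n.

```python
import numpy as np

def port_c_params(t11, t12, t21, t22):
    """Map the T-parameters of (11) at one port to the C-parameters
    of (12) via (13). Requires t22 != 0."""
    det = t11 * t22 - t12 * t21
    return t12 / t22, det / t22, 1.0 / t22, -t21 / t22

# Table 1 stamp for x_n = v_n, y_n = i_n at a port of resistance R_n:
Rn = 1000.0
c11, c12, c21, c22 = port_c_params(0.5, 0.5, 1 / (2 * Rn), -1 / (2 * Rn))

# Sanity check: for any waves (a_n, b_n), (12) must reproduce (11).
a, b = 0.37, -1.21
x = 0.5 * a + 0.5 * b            # x_n = v_n from (11)
y = a / (2 * Rn) - b / (2 * Rn)  # y_n = i_n from (11)
assert abs(x - (c11 * y + c12 * a)) < 1e-9
assert abs(b - (c21 * y + c22 * a)) < 1e-9
```

For this stamp the helper yields (c11, c12, c21, c22) = (−R_n, 1, −2 R_n, 1), the same pattern that appears per port in the relaxation-oscillator C matrix later in the paper.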
[Figure 3 panels not reproduced: (a) schematic, (b) schematic rearranged, (c) SPQR tree, (d) Wave Digital Filter.]
Figure 3: Series clipper schematic and derivation of Wave Digital Filter structure.
Table 2: Determinant of (I − C22 S11) for different variable choices for the series clipper (Fig. 3). Γ = R1 + R2 + R3.

    port 2 variables x2 → y2:   v2→i2   v2→b2    i2→v2    i2→b2    a2→v2    a2→i2   a2→b2
    port 1 variables v1 → i1:   0       2R1/Γ    4R1/Γ    2R1/Γ    4R1/Γ    0       2R1/Γ
h (xc ) = 0 is solved (i.e., finding a suitable xc ) using Newton– ized by four network variables: vin , iin , vout , and iout . In some
(0)
Raphson iteration [13] by first providing an initial guess xc and circuits, op amps may exhibit non-ideal voltage clipping behav-
iterating according to ior, which may be modeled with a tanh (·)-based function [4]. In
−1 some circuits op amps are configured to act as voltage compara-
x(k+1)
c = xc(k) − J xc(k) f xc(k) (19) tors, which may be modeled with a sgn (·)-based function. As
nonlinear two-ports, op amps are modeled by
where index k is the iteration count and J(·) is the Jacobian opera- !
(0)
tor. A typical initial guess at time step n is xc (n) = xc (n − 1); f1 (iout , vin )
vout =
i
= f out . (20)
an alternative is given in [13]. iin vin
f2 (iout , vin )
Given a set of incident waves ae on the root, (18) can be solved
using Newton–Raphson iteration [13] to yield xc . y c = f (xc ) is yc xc
evaluated to find y c , and y c and ae are finally used to find be .
In the new generalized framework, elements of the vectors Typically we have f2 (iout , vin ) = 0, but f1 (iout , vin ) can take a
of independent root variables xc and dependent root variables y c number of forms, e.g.
may be voltages, currents, or any linear combination. Earlier we
discussed the implications of that for forming C. Of course, this clipper: vout = Vmax tanh (Avin ) (21)
also affects the definition of f (·). comparator: vout = Vmax sgn (vin ) , (22)
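Since (18) itself falls outside this excerpt, the sketch below assumes the residual implied by (8)–(9), h(x) = x − F f(x) − E a_e, and applies the Newton–Raphson update (19) to a hypothetical one-port tanh nonlinearity; all numeric values (E, F, a_e) are illustrative, not taken from a real circuit.

```python
import numpy as np

def newton_root(h, J, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson iteration (19): x <- x - J(x)^{-1} h(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(J(x), h(x))
        x = x - step
        if np.max(np.abs(step)) < tol:
            break
    return x

# Hypothetical one-port example with f(x) = tanh(x); h is the residual
# assumed from (8)-(9): h(x) = x - F f(x) - E a_e.
E = np.array([[0.5]])
F = np.array([[-0.3]])
a_e = np.array([1.0])

f = lambda x: np.tanh(x)
h = lambda x: x - F @ f(x) - E @ a_e
J = lambda x: np.eye(1) - F * (1.0 - np.tanh(x) ** 2)  # Jacobian of h

x_c = newton_root(h, J, a_e)  # previous-sample value as initial guess
y_c = f(x_c)
```

The initial guess mirrors the text's suggestion of reusing the previous time step's solution, which keeps the iteration count low when the signal evolves smoothly.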
[Figure 4 panels not reproduced: (a) schematic, (b) schematic rearranged, (c) SPQR tree, (d) Wave Digital Filter.]
Figure 4: Parallel clipper schematic and derivation of Wave Digital Filter structure.
Table 3: Determinant of (I − C22 S11) for different variable choices for the parallel clipper (Fig. 4). Δ = G1 + G2 + G3 and Gx = 1/Rx.

    port 2 variables x2 → y2:   v2→i2    v2→b2         i2→v2    i2→b2         a2→v2    a2→i2    a2→b2
    port 1 variables v1 → i1:   4G3/Δ    2(G2+G3)/Δ    4G2/Δ    2(G2+G3)/Δ    4G2/Δ    4G3/Δ    2(G2+G3)/Δ
…using the stamps from Tab. 1 and the procedure from §2.3. Using (13), this yields the appropriate C matrix

        [C11  C12]   [−R_in    0         1    0      ]
    C = [C21  C22] = [ 0      −1/R_out   0    1/R_out]
                     [−2R_in   0         1    0      ]
                     [ 0       2         0   −1      ].

This choice of C in the proposed general framework enables the simulation of the relaxation oscillator.^4

4. SECOND CLASS OF PROBLEMATIC CIRCUITS

A second class of circuits that require the generalization presented in this paper are circuits that include cutsets composed entirely of nonlinearities (and current sources). In circuit theory, it is forbidden to have any cutset in a circuit graph composed entirely of current sources [41]. The presence of a cutset of current sources in a circuit creates the potential for a violation of Kirchhoff's current law: the sum of currents entering a node must equal the sum of currents leaving that node. A dual restriction is that no loop in a circuit graph may be composed entirely of voltage sources [41]. The presence of a loop of all voltage sources creates the potential for a violation of Kirchhoff's voltage law: the sum of voltages around any loop in the circuit must equal zero. This is a circuit-theoretic argument, but it is also expressed in the mathematics of the root topology. In the case of violating either the loop or the cutset criteria, it appears that in the WDF context the consequence is that the matrix (I − C22 S11) which needs to be inverted is made singular. Examples will be given in the next section.

To fix this class of circuit, start with the standard choice of independent and dependent network variables for each nonlinearity. Identify each loop and cutset in the circuit by hand or using the graph theorems discussed in §5. If any problematic cutsets or loops exist, choose new dependent variables for some of the nonlinearities that make up that cutset or loop to avoid the issue.

4.1. Examples of Second Class

To discuss these issues, we consider three circuits: the "series diode clipper" shown in Fig. 3a, the "parallel diode clipper" shown in Fig. 4a, and the "series–parallel diode clipper" shown in Fig. 5.

Consider the series diode clipper and its WDF structure derivation shown in Fig. 3. Notice that inside the root topology S1 in Fig. 3b, we have a node with only diodes connected to it. If the diodes were written in the form i = f(v), a current-source-only cutset is created. A remedy for this circuit would be to write either or both of the diodes in the form v = f(i), or more broadly in any form that does not have current as the dependent variable. In Tab. 2, the determinant of the matrix (I − C22 S11) is shown for the 49 different possible choices of x_n and y_n at port 1 (diode D1) and port 2 (diode D2). Notice that for the combinations that are forbidden according to circuit theory (both diodes with current as the dependent variable), the matrix which needs to be inverted becomes singular (its determinant is 0)—hence the circuit cannot be simulated using those variables.

It is not always a valid solution to choose voltage as the dependent variable. Consider the parallel diode clipper and its WDF structure derivation shown in Fig. 4. Notice that inside the root topology R1 in Fig. 4b there is a loop composed only of diodes. If the diodes were both written in the form v = f(i), a voltage-source-only loop is created, (I − C22 S11) becomes singular, and…

^4 Anti-aliasing and multi-rate methods can be used to improve the simulation of the relaxation oscillator; these are described in Olsen et al. [33].
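In practice the singularity described above can be detected with a simple guard before attempting to resolve the root. The sketch below is a hypothetical check; the matrices are contrived so that C22 S11 has an eigenvalue of exactly 1, mimicking a forbidden variable choice, and are not computed from the clipper circuits.

```python
import numpy as np

def root_is_solvable(C22, S11, tol=1e-12):
    """Guard for the condition in the text: the root cannot be resolved
    when (I - C22 S11) is singular (determinant ~ 0)."""
    M = np.eye(C22.shape[0]) - C22 @ S11
    return abs(np.linalg.det(M)) > tol

# Contrived 2-port values where C22 S11 has an eigenvalue of exactly 1:
C22 = np.eye(2)
S11 = np.diag([1.0, 0.5])
assert not root_is_solvable(C22, S11)          # singular: det = 0
assert root_is_solvable(C22, 0.5 * np.eye(2))  # a solvable alternative
```

When the guard fails, the remedy described in the text applies: switch the dependent variable of one or more nonlinearities in the offending cutset or loop and rebuild C.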
[Schematic not reproduced.] Figure 5: Series–parallel clipper schematic.

[Figure 6 not reproduced: (a) parallel diode clipper schematic; (b) mapping between electrical components and graph edges; (c) circuit graph with nodes {va, vb, vc} and edges {e1, e2, e3, e4, e5}.]
5. ENUMERATING LOOPS AND CUTSETS ON GRAPHS

In the previous section we discussed circuits whose cutsets and loops were easily identifiable at a glance. Unfortunately, in general, circuits may be very complex and involve too many loops and cutsets to enumerate at a glance. To handle problematic circuits in general, it is necessary to examine all loops and cutsets in the circuit graph and choose the dependent variables of the nonlinearities to avoid current-source-only cutsets and voltage-source-only loops. Here we review dual graph theorems for enumerating loops and cutsets in a graph (an alternative to identifying them by visual inspection as in the previous section) and demonstrate their application to two circuits discussed in §4.

5.1. Find All Loops in a Circuit

Here we review a theorem for enumerating the set of all loops in an electrical circuit and give example applications of the theorem for the parallel diode clipper. The theorem is stated as follows [42, p. 50]:

1. Let the circuit be represented by a nonoriented, connected graph G with v vertices and e edges, where each vertex represents a node in the circuit and each edge represents a one-port electrical element in the circuit.

2. Choose a tree T on G. Form a set of fundamental loops with respect to T by reinstating each edge of the cotree T′ one at a time to create fundamental loops, each involving one edge and a tree path formed by some or all of the branches of T. The fundamental loops are represented mathematically by a matrix Bf, where each of the e − v + 1 fundamental loops is a row and the e edges of G are the columns.

Consider the parallel diode clipper whose schematic is shown in Fig. 6a. Using the mapping between electrical components and graph edges shown in Fig. 6b, a graph G of the parallel clipper is formed in Fig. 6c. G has 3 nodes {va, vb, vc} and 5 edges {e1, e2, e3, e4, e5}, i.e., v = 3 and e = 5.

To enumerate all loops on G, we choose a tree T = {e2, e3} and corresponding co-tree T′ = {e1, e4, e5}. Combining each edge c ∈ T′ with other edges chosen from T yields e − v + 1 = 3 fundamental loops {c_e1, c_e4, c_e5}, encoded in a matrix Bf:

           e1  e2  e3  e4  e5
    c_e1 [  1   1   1   0   0 ]  c1
    c_e4 [  0   0   1   1   0 ]  c2
    c_e5 [  0   0   1   0   1 ]  c3

We take all possible ring sums among the rows of Bf to find the intermediate matrix B1:

                          e1  e2  e3  e4  e5
    c_e1               [   1   1   1   0   0 ]  c1
    c_e4               [   0   0   1   1   0 ]  c2
    c_e5               [   0   0   1   0   1 ]  c3
    c_e1 ⊕ c_e4        [   1   1   0   1   0 ]  c4
    c_e1 ⊕ c_e5        [   1   1   0   0   1 ]  c5
    c_e4 ⊕ c_e5        [   0   0   0   1   1 ]  c6
    c_e1 ⊕ c_e4 ⊕ c_e5 [   1   1   1   1   1 ]  redundant

The last ring sum represents an invalid loop, so it is discarded in forming Ba. The final loop matrix Ba and the graphical representation of the six loops {c1, c2, c3, c4, c5, c6} are given by…
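The ring-sum procedure can be sketched directly on the parallel-clipper example. The edge endpoints below are inferred from the fundamental loops quoted in the text (e1: va–vb; e2: vb–vc; e3, e4, e5 all in parallel between va and vc); treat them as illustrative. A ring sum is a symmetric difference of edge sets, and a candidate is kept only if it forms a single loop (every touched vertex has degree 2 and the edges are one connected component).

```python
from itertools import combinations

# Edge endpoints inferred from the fundamental loops in the text.
edges = {'e1': ('va', 'vb'), 'e2': ('vb', 'vc'),
         'e3': ('va', 'vc'), 'e4': ('va', 'vc'), 'e5': ('va', 'vc')}

def is_single_loop(edge_set):
    """True iff every touched vertex has degree 2 and the edges form
    one connected component (i.e., the set is a single loop)."""
    deg = {}
    for e in edge_set:
        for v in edges[e]:
            deg[v] = deg.get(v, 0) + 1
    if not edge_set or any(d != 2 for d in deg.values()):
        return False
    seen, frontier = set(), {edges[next(iter(edge_set))][0]}
    while True:
        step = {e for e in edge_set if e not in seen
                and frontier & set(edges[e])}
        if not step:
            break
        seen |= step
        for e in step:
            frontier |= set(edges[e])
    return seen == set(edge_set)

# Fundamental loops with respect to the tree T = {e2, e3} (rows of Bf):
Bf = [frozenset({'e1', 'e2', 'e3'}),  # c_e1
      frozenset({'e3', 'e4'}),        # c_e4
      frozenset({'e3', 'e5'})]        # c_e5

loops = set()
for r in range(1, len(Bf) + 1):
    for rows in combinations(Bf, r):
        ring_sum = frozenset()
        for row in rows:              # ring sum = symmetric difference
            ring_sum = ring_sum ^ row
        if is_single_loop(ring_sum):
            loops.add(ring_sum)

print(len(loops))  # -> 6: the loops c1..c6; the all-edge sum is discarded
```

This reproduces the matrix B1 above: six valid loops survive, and the triple ring sum c_e1 ⊕ c_e4 ⊕ c_e5 (all five edges, where node va would have degree 4) is rejected.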
…other benefits of flexibility in dependent variable choice for nonlinearities, the insights of this paper could be used in the other WDF formulations [18–23], which currently use wave variables only but also currently should not produce problematic cutsets or loops.

7. REFERENCES

[1] D. T. Yeh, J. S. Abel, and J. O. Smith III, "Automated physical modeling of nonlinear audio circuits for real-time audio effects—Part I: Theoretical development," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 4, pp. 728–737, 2010.
[2] D. T. Yeh, "Automated physical modeling of nonlinear audio circuits for real-time audio effects—Part II: BJT and vacuum tube examples," IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 4, pp. 1207–1216, 2012.
[3] M. Holters and U. Zölzer, "Physical modelling of a wah-wah effect pedal as a case study for application of the nodal DK method to circuits with variable parts," in Proc. 14th Int. Conf. Digital Audio Effects, Paris, France, 2011.
[4] M. Holters and U. Zölzer, "A generalized method for the derivation of non-linear state-space models from circuit schematics," in Proc. 23rd European Signal Process. Conf., Nice, France, 2015.
[5] A. Falaize-Skrzek and T. Hélie, "Simulation of an analog circuit of a wah pedal: a port-Hamiltonian approach," in Proc. 135th Conv. Audio Eng. Soc., New York, NY, 2013.
[6] A. Falaize and T. Hélie, "Passive guaranteed simulation of analog audio circuits: A port-Hamiltonian approach," Appl. Sci., vol. 6, no. 10, 2016, Article #273.
[7] A. Falaize, Modélisation, simulation, génération de code et correction de systèmes multi-physiques audios: Approche par réseau de composants et formulation Hamiltonienne à ports, Ph.D. thesis, Université Pierre et Marie Curie, Paris, France, July 2016.
[8] A. Falaize and T. Hélie, "Passive simulation of the nonlinear port-Hamiltonian modeling of a Rhodes piano," J. Sound Vibration, vol. 390, pp. 289–309, 2017.
[9] A. Fettweis, "Wave digital filters: Theory and practice," Proc. IEEE, vol. 74, no. 2, pp. 270–327, 1986.
[10] G. De Sanctis and A. Sarti, "Virtual analog modeling in the wave-digital domain," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 4, pp. 715–727, May 2010.
[11] K. J. Werner, J. O. Smith III, and J. S. Abel, "Wave digital filter adaptors for arbitrary topologies and multiport linear elements," in Proc. 18th Int. Conf. Digital Audio Effects, Trondheim, Norway, 2015.
[12] K. J. Werner, V. Nangia, J. O. Smith III, and J. S. Abel, "Resolving wave digital filters with multiple/multiport nonlinearities," in Proc. 18th Int. Conf. Digital Audio Effects, Trondheim, Norway, 2015.
[13] M. J. Olsen, K. J. Werner, and J. O. Smith III, "Resolving grouped nonlinearities in wave digital filters using iterative techniques," in Proc. 19th Int. Conf. Digital Audio Effects, Brno, Czech Republic, 2016.
[14] K. J. Werner, W. R. Dunkel, M. Rest, M. J. Olsen, and J. O. Smith III, "Wave digital filter modeling of circuits with operational amplifiers," in Proc. 24th European Signal Process. Conf., Budapest, Hungary, 2016.
[15] K. J. Werner, W. R. Dunkel, and F. G. Germain, "A computational model of the Hammond organ vibrato/chorus using wave digital filters," in Proc. 19th Int. Conf. Digital Audio Effects, Brno, Czech Republic, 2016.
[16] A. Fettweis, "Pseudo-passivity, sensitivity, and stability of wave digital filters," IEEE Trans. Circuit Theory, vol. 19, no. 6, pp. 668–673, Nov. 1972.
[17] K. Meerkötter and R. Scholz, "Digital simulation of nonlinear circuits by wave digital filter principles," in IEEE Int. Symp. Circuits Syst., Portland, OR, 1989.
[18] T. Schwerdtfeger and A. Kummert, "A multidimensional signal processing approach to wave digital filters with topology-related delay-free loops," in IEEE Int. Conf. Acoust., Speech Signal Process., Florence, Italy, 2014.
[19] T. Schwerdtfeger and A. Kummert, "A multidimensional approach to wave digital filters with multiple nonlinearities," in Proc. 22nd European Signal Process. Conf., Lisbon, Portugal, 2014.
[20] T. Schwerdtfeger and A. Kummert, "Newton's method for modularity-preserving multidimensional wave digital filters," in Proc. IEEE Int. Work. Multidimensional Syst., Vila Real, Portugal, 2015.
[21] T. Schwerdtfeger, Modulare Wellendigitalstrukturen zur Simulation hochgradig nichtlinearer Netzwerke, Ph.D. thesis, Bergische Universität Wuppertal, Wuppertal, Germany, Dec. 2016.
[22] A. Bernardini and A. Sarti, "Dynamic adaptation of instantaneous nonlinear bipoles in wave digital networks," in Proc. 24th European Signal Process. Conf., Budapest, Hungary, 2016.
[23] A. Bernardini and A. Sarti, "Biparametric wave digital filters," IEEE Trans. Circuits Syst. I: Reg. Papers, 2017, to be published, DOI: 10.1109/TCSI.2017.2679007.
[24] K. J. Werner, V. Nangia, J. O. Smith III, and J. S. Abel, "A general and explicit formulation for wave digital filters with multiple/multiport nonlinearities and complicated topologies," in Proc. IEEE Work. Appl. Signal Process. Audio Acoust., New Paltz, NY, 2015.
[25] D. Fränken, J. Ochs, and K. Ochs, "Generation of wave digital structures for networks containing multiport elements," IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 52, no. 3, pp. 586–596, Mar. 2005.
[26] S. Petrausch and R. Rabenstein, "Wave digital filters with multiple nonlinearities," in Proc. 12th European Signal Process. Conf., Vienna, Austria, 2004.
[27] Ó. Bogason and K. J. Werner, "Modeling circuits with operational transconductance amplifiers using wave digital filters," in Proc. 20th Int. Conf. Digital Audio Effects, Edinburgh, UK, 2017.
[28] Ó. Bogason, "Digitizing analog circuits containing op amps using wave digital filters," Blog post, Mar. 20, 2016. Online: http://obogason.com/emulating-op-amp-circuits-using-wdf-theory/.
[29] M. Verasani, A. Bernardini, and A. Sarti, "Modeling Sallen-Key audio filters in the wave digital domain," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., New Orleans, LA, 2017.
[30] M. Rest, W. R. Dunkel, K. J. Werner, and J. O. Smith, "RT-WDF—a modular wave digital filter library with support for arbitrary topologies and multiple nonlinearities," in Proc. 19th Int. Conf. Digital Audio Effects, Brno, Czech Republic, 2016.
[31] W. R. Dunkel, M. Rest, K. J. Werner, M. J. Olsen, and J. O. Smith III, "The Fender Bassman 5F6-A family of preamplifier circuits—a wave digital filter case study," in Proc. 19th Int. Conf. Digital Audio Effects, Brno, Czech Republic, 2016.
[32] M. Rest, J. D. Parker, and K. J. Werner, "WDF modeling of a KORG MS-50 based non-linear diode bridge VCF," in Proc. 20th Int. Conf. Digital Audio Effects, Edinburgh, UK, 2017.
[33] M. J. Olsen, K. J. Werner, and F. G. Germain, "Network variable preserving step-size control in wave digital filters," in Proc. 20th Int. Conf. Digital Audio Effects, Edinburgh, UK, 2017.
[34] K. J. Werner, Virtual Analog Modeling of Audio Circuitry Using Wave Digital Filters, Ph.D. diss., Stanford Univ., Stanford, CA, USA, Dec. 2016.
[35] K. J. Werner, V. Nangia, A. Bernardini, J. O. Smith III, and A. Sarti, "An improved and generalized diode clipper model for wave digital filters," in Proc. 139th Conv. Audio Eng. Soc., New York, NY, 2015.
[36] S. Petrausch and R. Rabenstein, "Interconnection of state space structures and wave digital filters," IEEE Trans. Circuits Syst. II: Expr. Briefs, vol. 52, no. 2, pp. 90–93, Feb. 2005.
[37] A. Sarti and G. De Sanctis, "Systematic methods for the implementation of nonlinear wave-digital structures," IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 56, no. 2, pp. 460–472, Feb. 2009.
[38] G. Borin, G. De Poli, and D. Rocchesso, "Elimination of delay-free loops in discrete-time models of nonlinear acoustic systems," IEEE Trans. Speech Audio Process., vol. 8, no. 5, pp. 597–605, Sept. 2000.
[39] C.-W. Ho, A. E. Ruehli, and P. A. Brennan, "The modified nodal approach to network analysis," IEEE Trans. Circuits Syst., vol. 22, no. 6, pp. 504–509, June 1975.
[40] G. O. Martens and K. Meerkötter, "On N-port adaptors for wave digital filters with application to a bridged-tee filter," in Proc. IEEE Int. Symp. Circuits Syst., Munich, Germany, 1976.
[41] L. O. Chua, Computer-Aided Analysis of Electronic Circuits, Prentice-Hall, 1975.
[42] S.-P. Chan, Introductory Topological Analysis of Electrical Networks, Holt, Rinehart and Winston, Inc., New York, NY, 1969.
Figure 2: Signal flow graph of a nonlinear block. The blend stage is omitted in the second nonlinear block.
The nonlinear block of the digital model corresponding to the power stage of the guitar amplifier is slightly different from the first nonlinear block. Instead of a polynomial mapping function, a concatenation of three hyperbolic tangents is used, which allows the positive and negative half-waves to be shaped separately. This function was already used in [9, 11] to model distortion audio circuits. The blend stage was omitted in this nonlinear block.

3. MEASUREMENT SETUP

In this work, all measurements are made with a digital audio interface. First, the interface is calibrated with a digital oscilloscope: the output gain is adjusted until a sine wave with a digital amplitude of ±1 corresponds to a voltage of ±1 V at the output of the interface.

As Fig. 4 illustrates, output 1 of the interface is connected to the input of the guitar amplifier under test, and the output of the amplifier is connected to a power attenuator, which matches the impedance of the amplifier output and provides a line-out that is connected to input 1 of the audio interface. The direct con-

Figure 4: Measurement Setup.

The method used in this work to measure the linear part of the reference system is the same as described in [6]. An exponential sine sweep is sent through the reference device and the recorded output is convolved with an inverse filter. The resulting impulse response contains the linear impulse response as well as separate impulse responses for the higher-order harmonics. In this work, only the impulse response corresponding to the linear part of the circuit is used.

To adapt the filters of the model, several steps are used. First, the small-signal impulse response is measured with an exponential sine sweep from 10 Hz to 21 kHz and an amplitude of 0.01 V. This yields the filter h_low(n). Afterwards the same measurement is repeated with an amplitude of 1 V, resulting in h_high(n).

The low-amplitude sweep is not exposed to the nonlinear behavior of the reference system and contains the influence of all of its filters. The high-amplitude sine sweep is distorted, and the influence of some of the reference system's filters is removed by its nonlinear parts. This behavior is depicted in Fig. 5: the preceding filter H(z) alters the amplitude of a sine wave, which then passes through a nonlinear 'block' and is amplified back to the maximum amplitude, thus negating the influence of the preceding filter. The high-amplitude sweep is therefore distorted and contains only the influence of the last filter of the reference system.

The obtained impulse responses are transformed into the frequency domain with a discrete Fourier transform using 16384 samples to obtain a high frequency resolution. Afterwards, the small-signal frequency response is divided by the large-signal frequency response,
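The sweep measurement described above (following [6]) can be sketched in a few lines. The sampling rate and band edges below are illustrative values chosen for a quick demonstration, not the paper's 10 Hz–21 kHz measurement settings:

```python
import numpy as np
from scipy.signal import fftconvolve

def exp_sweep_and_inverse(f1, f2, duration, fs):
    """Exponential sine sweep (Farina method) and its amplitude-compensated
    inverse filter, whose convolution with the sweep is impulse-like."""
    n = int(duration * fs)
    t = np.arange(n) / fs
    rate = np.log(f2 / f1)
    sweep = np.sin(2 * np.pi * f1 * duration / rate
                   * (np.exp(t * rate / duration) - 1.0))
    # -6 dB/octave envelope on the time-reversed sweep flattens the spectrum
    inverse = sweep[::-1] * np.exp(-t * rate / duration)
    return sweep, inverse

# Illustrative parameters (the paper sweeps 10 Hz - 21 kHz at low amplitude)
fs = 8000
sweep, inverse = exp_sweep_and_inverse(20.0, 3000.0, 1.0, fs)

# Deconvolving the sweep with itself concentrates the linear response in a
# single peak; harmonic responses would appear *before* that peak in time.
ir = fftconvolve(sweep, inverse)
peak = int(np.argmax(np.abs(ir)))
```

In a real measurement the sweep is played through the amplifier and the recorded output (rather than the sweep itself) is convolved with `inverse`; the segment around `peak` is then kept as the linear impulse response.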
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
chosen empirically. The optimization algorithm approximates the derivative of the model output with respect to each parameter (in this case the 256 coefficients) using finite differences, leading to time-consuming calculations.

4.2. Nonlinear Blocks

After the small-signal frequency response has been adapted, the filter coefficients of the FIR filter are not changed anymore; only the parameters of the nonlinear blocks can be altered.

Each nonlinear block features a multiplication of the input and output signals with a variable gain (pre- and post-gain). The intensity of the bias-point shift mentioned in Section 2.1 is adaptable for each nonlinear block, as is the 'blend' stage, where dry and wet signal can be mixed by an adaptable parameter.

The first nonlinear block of the digital model also uses the parameters mentioned in Section 2.1. To limit the number of parameters, only the first 40 overtones k0–k40 can be adapted. The optimization routine only alters the kn parameters, from which the polynomial coefficients an are computed. If the polynomial coefficients were used as parameters directly, too many unsuitable (or unstable) solutions would be possible and the optimization routine would not converge. The typical fundamental frequency region of an electric guitar in standard tuning ranges from 80 Hz to 1100 Hz, depending on the number of frets. 40 harmonics do not cover the whole frequency region for the tones with the lowest pitch, but the contribution of the 40th harmonic to the overall spectrum is usually negligible.

The nonlinear parameters are adapted for input signals of different complexity and with different cost functions to ensure convergence of the parameters to their global minimum. The algorithm used is always the Levenberg–Marquardt optimization routine, as described in [9], with different cost functions:

• First, a grid search for the pre- and post-gain of the power-amplifier nonlinearity is performed, because these parameters have the most influence on the shape of the output envelope of the digital model. The cost function calculates the difference of the envelopes of the output signals, and the gain combination with the lowest error is chosen. The envelope is calculated by low-pass filtering the absolute value of the signal; the cut-off frequency of the low-pass filter is fc = 10 Hz.

• Afterwards, all nonlinear parameters are adapted at the same time. The cost function in this optimization step calculates the sum of squared differences between digital model and reference system,

C(p) = Σ_n (y(n) − ŷ(n, p))²,   (4)

with y(n) the (digitized) output of the reference system, ŷ(n, p) the output of the digital model, and p the parameter vector of the digital model. In this optimization step, all filters except the input filter H1 are turned off during optimization. The chosen input signal is a 1000 Hz sine wave with amplitudes from 1 V down to 0.001 V. This step helps to find a set of parameters which can be used as initial parameters in the next optimization step, where all filters in the digital model are turned back on.

• The next optimization step is done with a multi-frequency sine wave. The phase shift between the frequencies is chosen such that the peak factor of the sum of the different frequencies is minimal and the signal has a flat power spectrum [13]. The cost function in this case calculates the linear spectrogram of both signals and compares only the magnitude spectrograms, disregarding the phases. The spectrogram is initially calculated with the Fourier transform, but the frequency bins are merged into a semitone spectrum starting from f0 = 27.5 Hz. Additionally, the magnitude spectrogram is weighted with the inverted absolute threshold of hearing. Afterwards, the two spectrograms are subtracted from one another and all values are squared and summed to calculate the error value.

• For guitar amplifiers with nonlinear behavior it has proven beneficial to add a last step, where the same (spectrogram) cost function is used but a recorded guitar track, consisting of a combination of guitar tones and chords, is used as input to further refine the parameters of the digital model.

5. RESULTS

Evaluating the quality of the adapted model is not a trivial task. There are very few perceptually motivated objective scores, and they are neither suited for virtual analog modeling evaluation nor available for free. Perceptually motivated scores like PEAQ or PEMO-Q [14, 15] exist, but they were designed for a different purpose and are therefore not suited for quality assessment of virtual analog modeling.

For this reason, the adapted digital model is rated with different methods. First, it is evaluated with objective measures. If these objective metrics are close to zero, the result of the modeling process is always good. In some cases, however, the error is relatively high although the quality of the adapted model is quite good from a perceptual point of view. This is why a listening test was conducted to assess the quality of the adapted models perceptually. The files used in the listening test were also used to calculate the objective scores.

The results are evaluated for different amplifier models and for different guitar signals. Some amplifiers are tested in multiple settings, creating distortion with the pre-amplifier, the power-amplifier, or both at once. Other amplifiers are tested in an artist-preferred setting, where the user controls of the amplifiers are not altered from the settings the artists used in the rehearsal room. Figure 6 shows the amplifiers in the artist-preferred setting. All of these amplifiers produced very little distortion in the output signal. The first amplifier (top), the Ampeg VT-22, did not feature separate controls for pre-amplifier and power-amplifier, and the reverb was turned off. The Fender Bassman 100 (middle) and Fender Bassman 300 (bottom) were set up to introduce almost no distortion, as can be seen from the gain and volume controls.

Different input signals are used for each amplifier. Input signals from three guitars with different pick-ups are tested:

1. single coil pick-up (SC)
2. humbucker pick-up with medium output (HM1)
3. humbucker pick-up with high output (HM2)

Only the amplifiers which were modeled in the 'artist preferred' setting were set up to have a clean sound, introducing very little distortion in the output signal. The amplifiers which introduced a lot of distortion in the output signal were modeled in multiple settings:

1. High gain and low volume (pre-amp distortion)
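The envelope-matching grid search of the first step can be sketched as follows. The `tanh` stand-in for the amplifier and the one-dimensional gain grid are illustrative simplifications (the paper searches over both pre- and post-gain of its actual nonlinear block):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def envelope(x, fs, fc=10.0):
    """Envelope via low-pass filtering the absolute value (fc = 10 Hz)."""
    b, a = butter(2, fc / (fs / 2))
    return filtfilt(b, a, np.abs(x))

fs = 44100
t = np.arange(int(0.25 * fs)) / fs
x = np.sin(2 * np.pi * 1000 * t) * np.linspace(1.0, 0.0, t.size)  # decaying sine

g_true = 2.0
reference = np.tanh(g_true * x)     # stand-in for the measured amplifier output

# Grid search over the pre-gain: keep the gain with the lowest envelope error
grid = [0.5, 1.0, 2.0, 4.0]
errors = [np.sum((envelope(np.tanh(g * x), fs) - envelope(reference, fs)) ** 2)
          for g in grid]
best_gain = grid[int(np.argmin(errors))]
```

Because the envelope discards fine waveform detail, this coarse search only positions the gains roughly; the subsequent Levenberg–Marquardt steps refine all parameters jointly.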
Table 2: Objective scores for the VT-22 with very little distortion.

Figure 6: Settings for amplifiers in clean setting. 1.) Ampeg VT-22 (top) 2.) Fender Bassman 100 (middle) 3.) Fender Bassman 300 (bottom)

…ered as random variables and the correlation coefficient is calculated according to,
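The two objective scores reported in the tables, ESR and ρ, can be computed with the standard error-to-signal-ratio and Pearson correlation definitions. A sketch under that assumption (the paper's exact normalization may differ):

```python
import numpy as np

def esr(y_ref, y_model):
    """Error-to-signal ratio: energy of the error over energy of the reference."""
    return np.sum((y_ref - y_model) ** 2) / np.sum(y_ref ** 2)

def correlation(y_ref, y_model):
    """Pearson correlation coefficient between the two output signals."""
    return np.corrcoef(y_ref, y_model)[0, 1]

t = np.arange(1000) / 1000.0
ref = np.sin(2 * np.pi * 5 * t)
model = 0.9 * ref + 0.05   # slightly mis-scaled, offset model output

# A perfect model gives ESR = 0 and rho = 1; scaling and offset errors
# raise the ESR but leave rho at 1, which is why both scores are reported.
```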
Figure 8: (a) Fender Bassman 100 (clean setting). (b) Ampeg VT-22 (clean setting). (c) Fender Bassman 300 (clean setting). (d) Madamp A15Mk2 with different distortion settings.

5.2. Listening Test

A listening test was conducted to see how well the adapted models perform for human test subjects. The listening test aimed at rating the adapted model in relation to the analog reference device. The test subjects were presented with a reference item and two test items. The items were to be rated according to how similar they sound to the reference, where 100 represents no detectable difference between the item and the reference and 0 represents a very annoying difference. One of the test items was a hidden reference, which was the same audio file as the reference item.

The listening test featured 20 listening examples with hidden reference and digital model output and is currently still ongoing. So far, 15 participants have taken the test, of which 8 were experienced listeners, 4 were musicians and 2 were inexperienced listeners. At the end of the test, each participant had the option to comment on the test. The framework for the listening test was the 'BeaqleJS' framework, described in [17]. It features example configurations for ABX- and MUSHRA-style listening tests. The MUSHRA configuration was adapted to fit the needs of the model–reference comparison.

Madamp A15Mk2            ESR      ρ
Single coil (SC)         0.4372   0.7762
Humbucker medium (HM1)   0.3709   0.8104
Humbucker high (HM2)     0.3145   0.8393

Table 3: Objective scores for the A15Mk2 with power-amp distortion (low gain, high volume).

Madamp A15Mk2            ESR      ρ
Single coil (SC)         0.8727   0.5383
Humbucker medium (HM1)   0.6596   0.6552
Humbucker high (HM2)     0.8236   0.5570

Table 4: Objective scores for the A15Mk2 with pre-amplifier distortion (high gain, low volume).

Marshall JCM900          ESR      ρ
Single coil (SC)         1.5277   0.4603
Humbucker medium (HM1)   1.5022   0.2486
Humbucker high (HM2)     1.5488   0.2237

Table 5: Objective scores for the JCM900 with maximum distortion (high gain, high volume).

The test results have been cleaned by deleting the ratings where the hidden reference was rated with a score lower than 75. Only one test subject was removed from the evaluation completely, because for 13 of 20 items this subject rated the hidden reference much lower than 75. Figures 8a–8e show the results of the listening test. The bar displays the 50% quantile (median) for each item; the lower and upper bounds of the box represent the 25% and 75% quantiles, respectively. Outliers are depicted as crosses.

Figures 8a and 8b show the results for the reference amplifiers in the clean setting and the adapted models. The results show that the digital model is always rated in the same range as the analog reference device for the test items 'Bass', 'Single Coil (SC)', 'Humbucker 1 (HM1)' and 'Humbucker 2 (HM2)'. These results confirm that the model is very well adapted, as the objective scores mentioned in Section 5.1 suggest.

The results for the Ampeg VT-22 are similar to those of the Fender Bassman 100. In some cases there was an unwanted 'crackling' noise in the recording of the reference amplifier which was not reproduced by the digital model. This made it possible to identify the difference between the hidden reference and the digital model output.

The last amplifier in an almost clean setting was the Fender Bassman 300 (Fig. 8c). The HM2 (humbucker with high output voltage) test item had a nearly identical rating to the hidden reference. Only for the input signal from an electric bass were there minor audible differences in the output signal. These results are in agreement with the comments from the participants: several stated that they could not perceive any difference when the amps were in a 'clean' or 'almost clean' setting.

The results of the optimization routine for distorted reference amplifiers are not as good as those for the clean ones. The more nonlinear the amplifier becomes (more distortion), the higher the perceivable difference between digital model and analog reference device. This assumption was already made based on the objective scores from Section 5.1, and it is confirmed by the results of the listening test.

The clipping power-amplifier of the Madamp A15Mk2 is rated worse than the clipping pre-amplifier, as shown by the single-coil (SC) items in Fig. 8d. This does not agree with the objective scores, since the error energy for the clipping power-amplifier is half that for the clipping pre-amplifier. In the listening test, the clipping pre-amplifier was rated with ≈ 90 (median) and the clipping power-amplifier with ≈ 50 (median), in comparison with the hidden reference, which had a median of 100 in both cases. This suggests that these objective scores are not suitable for modeling amplifiers with a lot of distortion, and that a psychoacoustically motivated cost function would drastically improve the virtual analog modeling results for distorted amplifiers.

The listening test results for the last amplifier confirm the assumption that a reference device with highly nonlinear behavior is not identified as well as a system with little nonlinear behavior. The Marshall JCM 900 was rated worse when both pre- and power-amplifier were at high values, in comparison to the first two test items, where only the pre-amplifier was set to a high value. A common comment from the participants was that a difference in the noise floor between digital model and reference device made it possible to distinguish the reference from the model.
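The post-screening rule used above (discard single ratings whose hidden reference scored below 75; discard a subject entirely when that happens on most of their trials) can be sketched as follows. The data structure and threshold parameters are hypothetical:

```python
def clean_ratings(subjects, threshold=75, reject_fraction=0.5):
    """MUSHRA-style post-screening: drop trials where the hidden reference
    scored below `threshold`; drop a subject entirely when that happens on
    more than `reject_fraction` of their trials."""
    cleaned = {}
    for name, trials in subjects.items():
        failed = [tr for tr in trials if tr["hidden_ref"] < threshold]
        if len(failed) > reject_fraction * len(trials):
            continue  # subject removed from the evaluation completely
        cleaned[name] = [tr for tr in trials if tr["hidden_ref"] >= threshold]
    return cleaned

# Toy data: s1 fails one of two trials, s2 fails both and is removed,
# mirroring the subject rejected for rating the hidden reference low
# on 13 of 20 items.
subjects = {
    "s1": [{"hidden_ref": 100, "model": 92}, {"hidden_ref": 60, "model": 80}],
    "s2": [{"hidden_ref": 40, "model": 90}, {"hidden_ref": 30, "model": 70}],
}
result = clean_ratings(subjects)
```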
Figure 2: Circuit diagram for a single folding cell, cf. Fig. 1.

…folding branches. Component values for the circuit are given in Table 1. Indices have been used to indicate the branch number, e.g. R5,2 denotes resistor R2 in the fifth branch.
Figure 4: Comparison of the input–output relationship of a SPICE simulation of the Buchla 259 timbre circuit and the proposed digital model for (a) a DC voltage sweep and (b) 100-Hz sinusoidal input with 5-V peak gain.

…means that we can derive a digital model using discrete memoryless mappings of the voltage relationships derived in the previous section. First, we define our discrete-time sinusoidal input as

Vin[n] = A sin(2πf0 nT),   (12)

where n is the sample index, A is the peak amplitude, f0 is the fundamental frequency and T is the sampling period, i.e. T = 1/fs. From (9) we can then define explicit discrete-time expressions for the output of each folding branch. To facilitate their implementation, terms containing resistor values have been evaluated and replaced by their corresponding approximate scalar values:

V1[n] = 0.8333 Vin[n] − 0.5000 s[n]  if |Vin[n]| > 0.6000, else 0   (13)
V2[n] = 0.3768 Vin[n] − 1.1281 s[n]  if |Vin[n]| > 2.9940, else 0   (14)
V3[n] = 0.2829 Vin[n] − 1.5446 s[n]  if |Vin[n]| > 5.4600, else 0   (15)
V4[n] = 0.5743 Vin[n] − 1.0338 s[n]  if |Vin[n]| > 1.8000, else 0   (16)
V5[n] = 0.2673 Vin[n] − 1.0907 s[n]  if |Vin[n]| > 4.0800, else 0   (17)

where s[n] = sgn(Vin[n]). From these branches we can then define a global summing stage:

3.1. Filtering Stage

The filter at the output of the system is a one-pole lowpass filter, given by

H(s) = wc / (s + wc),   (19)

where wc = 2πfc and fc represent the cutoff frequency in radians and Hz, respectively [44, 45]. From Fig. 1 the cutoff of the filter is derived as

fc = 1 / (2π RF2 C) ≈ 1.33 kHz.   (20)

The filter simply acts as a fixed tone control, attenuating the perceived brightness of the output by introducing a gentle 6-dB/octave roll-off. Equation (19) can be discretized using the bilinear transform, which results in the z-domain transfer function

H(z) = (b0 + b1 z⁻¹) / (1 + a1 z⁻¹),   (21)

where

b0 = b1 = wc T / (2 + wc T)   and   a1 = (wc T − 2) / (wc T + 2).

Due to the low cutoff frequency, the warping effects of the bilinear transform can be neglected. This transfer function can be implemented digitally, e.g. using the Direct Form II Transposed structure [44].

3.2. Antialiasing

Given the highly nonlinear nature of wavefolding, audio-rate implementations of the proposed model using (12)–(18) will suffer from excessive aliasing distortion. This problem can be attributed to the corners or edges introduced by the folding cells of the system (cf. Fig. 4). These corners indicate that the first derivative of the signal is discontinuous and, as such, has infinite frequency content. In the discrete-time domain, frequency components that exceed the Nyquist limit are reflected into the audio band as aliases.

To ameliorate this condition we propose the use of the BLAMP method, which has previously been used in the context of ideal nonlinear operations such as signal clipping and rectification [36, 37]. This method consists of replacing the corners with bandlimited versions of themselves. It is an extension of the bandlimited step (BLEP) method used in subtractive synthesis [33–35], which is in turn based on the classic bandlimited impulse train (BLIT) synthesis method [32].

The BLAMP function is a closed-form expression that models a bandlimited discontinuity in the first derivative of a signal. It is derived from the second integral of the bandlimited impulse [35], or sinc, function and is defined as

R_BL(t) := t (1/2 + (1/π) Si(πfs t)) + cos(πfs t) / (π² fs),   (22)
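The folding-branch equations (13)–(17) and the bilinear-transformed output filter of (21) translate directly to code. This trivial (non-bandlimited) sketch omits the summing-stage weights of Eq. (18) and the polyBLAMP correction:

```python
import numpy as np

# Folding-branch (threshold, slope, offset) triples from Eqs. (13)-(17)
BRANCHES = [
    (0.6000, 0.8333, 0.5000),
    (2.9940, 0.3768, 1.1281),
    (5.4600, 0.2829, 1.5446),
    (1.8000, 0.5743, 1.0338),
    (4.0800, 0.2673, 1.0907),
]

def branch_outputs(v_in):
    """Evaluate V1..V5 for one input sample, Eqs. (13)-(17)."""
    s = np.sign(v_in)
    return [(a * v_in - b * s) if abs(v_in) > th else 0.0
            for th, a, b in BRANCHES]

def onepole_lowpass_coeffs(fc, fs):
    """Bilinear-transform coefficients of Eq. (21) for H(s) = wc/(s + wc)."""
    wcT = 2.0 * np.pi * fc / fs
    b0 = b1 = wcT / (2.0 + wcT)
    a1 = (wcT - 2.0) / (wcT + 2.0)
    return b0, b1, a1

v = branch_outputs(1.0)                              # only branch 1 is active
b0, b1, a1 = onepole_lowpass_coeffs(1330.0, 44100.0)
```

Combining the branches into the final output requires the weights of the global summing stage (Eq. (18)), which is not reproduced here; a useful sanity check on the filter is its exactly unity DC gain, (b0 + b1)/(1 + a1) = 1.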
Figure 9: Block diagram of the proposed digital Buchla 259 wavefolder. The LPF block is the lowpass filter at the output of the system.

The digital model of the wavefolder is derived using nonlinear memoryless mappings based on the input–output voltage relationships within the circuit. In an effort to minimize the high levels of aliasing distortion caused by the inherent frequency-expanding behavior of the system, the use of the BLAMP method has been proposed, more specifically in its two-point polynomial form. This method reduces the oversampling requirements of the system, allowing us to accurately process sinusoidal waveforms with fundamental frequencies up to 5 kHz at a sample rate of 352.8 kHz, which is eight times the standard audio rate. The proposed model is free from perceivable aliasing and can be implemented as part of a real-time digital music synthesis environment.
6. ACKNOWLEDGMENTS

The work of Fabián Esqueda is supported by the Aalto ELEC Doctoral School. Part of this work was conducted during Julian Parker's visit to Aalto University in March 2017.

Figure 10: Waveform and magnitude spectrum of an 890-Hz sinewave (A = 5) processed using the proposed model (a)–(b) at audio rate (fs = 44.1 kHz), (c)–(d) at audio rate with the two-point polyBLAMP method, (e)–(f) using 64 times oversampling (fs = 2.82 MHz) and (g)–(h) with 8 times oversampling (fs = 352.8 kHz) and the two-point polyBLAMP method.

Figure 11: Measured SNRs for the proposed wavefolder model under sinusoidal input (A = 5) implemented trivially (fs = 44.1 kHz), with the two-point polyBLAMP method, with oversampling by factor 64 (fs = 2.82 MHz) and with oversampling by factor 8 and the two-point polyBLAMP method (fs = 352.8 kHz).
…variable step-size simulation on an equivalent circuit where the inductor was replaced with a network-equivalent gyrator and capacitor pair. They stated that the equivalent circuit corresponded to discretization using the Gauss (midpoint) method and demonstrated that it exhibited the expected behavior of the reference circuit. Since their method depends on the replacement of inductors with network-equivalent devices, it is unclear whether the method is guaranteed to work for all possible reference circuits.

Figure 1: RLC Circuit Schematic.

In this section we introduce a new technique which allows WDFs discretized with any passive discretization technique, in particular the trapezoidal method, to be simulated with variable step sizes. The development of the new technique was motivated by the fact that it was not obvious how the technique from [28] could be used with the relaxation oscillator circuit presented in the case study in Section 3. In Section 2.1, the general technique is introduced and described. Later, in Section 3.4, it is shown how varying the step size fails to work with the example circuit and how the new technique can be used to mitigate the encountered issues.

2.1. Variable Step-Size WDF Simulation Technique

Varying the step size of a WDF discretized with the trapezoidal rule (or another passive discretization technique) requires careful consideration due to the intricate relationship between the sampling rate and the reactive network elements. Looking at the definition of the wave variables (Equation (1)) and the port resistance values for reactive elements (Table 1), it is clear that the wave variables of these elements are directly coupled to the sampling rate.

The stored energy of a passive network satisfies [28]:

w_{k+1}ᵀ G w_{k+1} − w_kᵀ G w_k ≤ x_kᵀ G_x x_k,   (2)

where w_k is the vector of state values at time k, G is the diagonal matrix of the corresponding port conductances, x_k is the vector of source values and G_x is the diagonal matrix of source port conductances. In the case of zero input, Equation (2) implies that the stored energy monotonically decreases towards zero. While Equation (2) is guaranteed to hold for passive networks under a fixed step size, potential issues arise when one varies the step size, which changes the corresponding port resistances and the meaning of the stored energy.

Therefore, when changing the step size it is not sufficient to just update the sampling rate, the port resistances and the matrices relating to the R-type adaptor. It is also necessary (unless the technique from [28] can be used) to convert the wave variables to the representation that is consistent with the network variables and the new step size before continuing the simulation.

The new step-size variation technique is based on maintaining the consistency of the network variables across step-size changes. Doing so ensures that the stored energy and the instantaneous power remain consistent as well. To this end, after changing the step size and updating the port resistances, we convert the stored energy by performing a wave variable transformation. Similar to the development of generalized C matrices in [29], the process is as follows, converting wave quantities a, b from sampling period T_k and port resistance R_k to new wave quantities â, b̂ with sampling period T_{k+1} and port resistance R_{k+1}. From the definition of voltage waves (Equation (1)), voltage and current are given in terms of the original wave quantities and port resistance by:

v = (a + b) / 2
i = (a − b) / (2 R_k)

We then define the new wave quantities in terms of voltage, current and the new port resistance by:

â = v + R_{k+1} i
b̂ = v − R_{k+1} i

Combining these two sets of equations in matrix form yields the following conversion matrix:

[â]     1      [ R_k + R_{k+1}   R_k − R_{k+1} ] [a]
[b̂] = ----- · [ R_k − R_{k+1}   R_k + R_{k+1} ] [b]   (3)
       2 R_k

In practice, however, b̂ is not used, as the reflected waves of both inductors and capacitors depend only on â. Through this transformation, both the power and the energy stored in the unit delays are preserved across step sizes.

As a demonstration of the validity of this method, it was run on the example RLC circuit used in [28] (Figure 1). The inductor incident waves from the simulation are shown in Figure 2, computed without correcting the wave quantities (as in the original paper), computed with our technique, and computed with a fixed step size. As the fixed step-size simulation trace shows, the traces should be exponentially decaying sinusoids. This correct behavior is seen in the trace of the simulation using the proposed technique. When the wave quantities are not corrected, however, the trace grows exponentially.

To reiterate the process of correcting the wave variables, the following steps must be taken any time the sampling rate is altered:

1. Store the wave variables for all reactive elements and elements that appear above them in shared subtrees.
2. Readapt those network elements with the port resistance corresponding to the new sampling rate.
3. Convert the stored wave variables using the conversion matrix of Equation (3).
4. Update the relevant network elements with the converted wave variables.
5. If the structure contains an R-type adaptor, update S, H, E, F, M and N using the updated port resistance values.
6. If wave quantities need to be output, convert them to a single unified representation (such as the one relating to the audio rate).
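The conversion of Equation (3) is a 2×2 matrix applied per stored wave pair. A minimal sketch, with a check that the underlying Kirchhoff variables (voltage and current) are unchanged by the transformation:

```python
def convert_waves(a, b, Rk, Rk1):
    """Convert wave variables (a, b) at port resistance Rk to the equivalent
    pair at the new port resistance Rk1, per Eq. (3)."""
    a_new = ((Rk + Rk1) * a + (Rk - Rk1) * b) / (2.0 * Rk)
    b_new = ((Rk - Rk1) * a + (Rk + Rk1) * b) / (2.0 * Rk)
    return a_new, b_new

# The transformation preserves v = (a + b)/2 and i = (a - b)/(2R), so the
# network variables stay consistent across the step-size change.
a, b, Rk, Rk1 = 1.5, -0.25, 100.0, 230.0
a_new, b_new = convert_waves(a, b, Rk, Rk1)
v_old, i_old = (a + b) / 2.0, (a - b) / (2.0 * Rk)
v_new, i_new = (a_new + b_new) / 2.0, (a_new - b_new) / (2.0 * Rk1)
```

As the text notes, b_new is never consumed by inductor and capacitor elements (their reflected waves depend only on the incident wave), but computing both makes the invariance easy to verify.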
[Figure 2: Inductor incident waves from RLC circuit [28].]

3. RELAXATION OSCILLATOR CIRCUIT CASE STUDY

3.1. Circuit Description

The circuit to be modeled is called a relaxation oscillator [30] and is also known as an astable multivibrator circuit [31]. It contains the following components: a resistor R1, a capacitor C1, a voltage divider (R2 and R3) and an op amp. Throughout this paper it is assumed that R2 = R3 and that the op amp operates rail-to-rail. The circuit schematic is shown in Figure 3.

[Figure 4: Traces of the op amp input voltage, op amp output voltage and the capacitor voltage.]

The basic operation of the circuit is to behave as a comparator whose output swings high or low based on whether the input voltage to the op amp is above or below a certain threshold value. The comparator is configured as a Schmitt trigger by means of the positive feedback path between vout and vin. Assuming that the op amp is powered with a rail voltage of ±Vmax and that it begins in positive saturation, the capacitor charges towards +Vmax with the time constant defined by R1C1. When the capacitor reaches Vmax/2, the op amp flips over to negative saturation and the capacitor begins discharging towards −Vmax. This behavior is illustrated in Figure 4 starting at sample number 50.

The period and frequency of oscillation are given by the following equations [31]:

T0 = 2 log(3) R1 C1,   F0 = 1/T0.   (4)

The component values used during all simulations are given in Table 2. The capacitor value was used to control the frequency of oscillation and was thus set based on the desired simulation frequency F0 per Equation (4).

3.1.1. Op Amp Modeling

The op amp model comprises two criteria: the first is the comparator transfer function vout = Vmax sgn(vin) and the second is the no-input-current criterion iin = 0. We call vout, iin the dependent variables and vin, iout the independent variables and let

xC = [vin iout]^T and yC = [iin vout]^T.   (5)

From this we define the nonlinear relationship as

yC = f(xC) = [0, Vmax sgn(vin)]^T.   (6)

This particular comparator model was chosen to enforce an instantaneous transition between ±Vmax, which exactly matches the step discontinuity that the polyBLEP method is designed to work with (see Section 3.6).
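Equation (4) can also be inverted to choose C1 for a desired oscillation frequency, as the text describes. A small sketch (the component values below are illustrative, not those of Table 2):

```python
import math

def osc_frequency(R1, C1):
    """Relaxation oscillator frequency per Equation (4): F0 = 1/(2 ln(3) R1 C1)."""
    return 1.0 / (2.0 * math.log(3.0) * R1 * C1)

def cap_for_frequency(R1, F0):
    """Invert Equation (4) to choose C1 giving a desired F0."""
    return 1.0 / (2.0 * math.log(3.0) * R1 * F0)

# Hypothetical values: a 10 kOhm resistor and a 440 Hz target frequency.
C1 = cap_for_frequency(R1=10e3, F0=440.0)
```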
3.2. WDF Model of the Circuit

We form the WDF model using the techniques from [3, 4]. Either by visual inspection or via graph theoretic techniques, the schematic is rearranged as in Figure 5a so that all series and parallel subtrees are separated from the two-port nonlinearity by the remaining complex connections, which are collectively called the R-type adaptor.

The techniques of [3] are used to determine the scattering behavior of the R-type adaptor, which is represented by the matrix S. The following system of equations fully describes the relationship between the R-type adaptor and the two-port nonlinearity:

yC = f(xC)
xC = E aE + F yC   (7)
bE = M aE + N yC

with

E = C12 (I + S11 H C22) S12
F = C12 S11 H C21 + C11
M = S21 H C22 S12 + S22
N = S21 H C21

where

S = [S11 S12; S21 S22],   H = (I − C22 S11)^{−1},

and aE = [a3 a4 a5 a6] is the vector of incident waves from the R-type adaptor.

However, the nonlinear relationship is modeled here as a step discontinuity, for which iterative techniques struggle to converge quickly, if at all. Through inspection of the WDF equations representing the nonlinearity, it was found that they are formed in such a way that a quasi-analytic solution can be determined.

From Equation (7), we get the following complete description of the nonlinear system of equations to be solved:

xC = E aE + F f(xC)   (8)

where, for this particular circuit, E is a 2 × 4 matrix and F is a 2 × 2 matrix. We use domain knowledge of the circuit to simplify the nonlinear system of equations. Since a3, a4 and a6, the incident waves relating to the resistors, are always zero, and since iin is also zero, the following simplified expression is obtained:

vin = e13 a5 + f12 Vmax sgn(vin)
iout = e23 a5 + f22 Vmax sgn(vin).

The values of vin and iout depend on the value and sign of vin and the constant values from the E and F matrices. Therefore, the nonlinearity can be solved algorithmically using the following pseudocode:
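The quasi-analytic solve can be sketched as follows: assume a sign for vin, evaluate the corresponding branch of the simplified expression, and keep the branch whose result is sign-consistent. This is a hedged reconstruction of the missing pseudocode, with scalar coefficients e13, e23, f12, f22 as in the simplified expression above:

```python
def solve_comparator(e13, e23, f12, f22, a5, Vmax):
    """Quasi-analytically resolve vin = e13*a5 + f12*Vmax*sgn(vin)."""
    vin = e13 * a5 + f12 * Vmax      # assume sgn(vin) = +1
    s = 1.0
    if vin < 0.0:                    # inconsistent: take the negative branch
        vin = e13 * a5 - f12 * Vmax
        s = -1.0
    iout = e23 * a5 + f22 * Vmax * s
    vout = Vmax * s                  # comparator transfer function
    return vin, iout, vout
```

In general, both branches (or neither) can be self-consistent depending on the coefficient signs; the domain knowledge of this circuit is what guarantees a unique consistent branch.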
[Figure 6: (a) Capacitor voltage error when the simulation frequency is less than the desired frequency; (b) capacitor voltage error when the simulation frequency is greater than the desired frequency. Each panel shows the desired and WDF-simulated voltages and their error versus time in samples.]
Figure 6b shows the case in which the simulation runs at a higher frequency than desired. In both Figure 4 and Figure 6, the simulation was run at a 44.1 kHz sampling rate with a desired period of 101 samples, the average period of the simulations being 102 and 100 samples for Figures 6a and 6b, respectively.

3.3.2. Simulation Frequency

It was determined experimentally that the overshoot/undershoot error identified in Section 3.3.1 causes the fundamental period T0 of the simulation to become "phase-locked" to an even-integer-length period. In the lucky case that T0 was an even integer, it was found that the fundamental period of simulation T̂0 was equal to T0 and so the simulation did not contain any frequency error. For any other choice of T0, however, two possible values of T̂0 were observed:

T̂0+ = floor(T0) − (floor(T0) mod 2)   (9a)
T̂0− = ceil(T0) + (ceil(T0) mod 2)   (9b)

The phase-locking caused by these errors occurred immediately, such that T̂0 was the observed period of oscillation for every period of the simulation beginning with the first complete period.

Since it is not possible to derive an analytic expression for the actual simulation frequency error, we derive an expression for the worst case error in order to investigate the general behavior of the maximum possible frequency error. The worst case frequency error curve in cents can be derived using the following formula:

ΔFmax = 1200 max( |log2(F̂0−/F0)|, |log2(F̂0+/F0)| )   (10)

where

F̂0+ = Fs/T̂0+,   F̂0− = Fs/T̂0−

and Fs is the sampling rate. Equation (10) gives the worst case frequency error estimate as determined by the largest possible rounding error occurring in the simulation period.

A portion of the resulting curve is shown in Figure 7a. The minimum points on the curve, corresponding to an error of one sample, occur at odd integer periods, which are equidistant from the nearest even integer periods. From those points, the error curve increases in either direction, approaching a theoretical limit of two samples (indicated by open circles). Finally, at the even integer period values there is no error in the period, which is notated by the solid black dots. The dotted red line shows the maximum error limit, which was used in the computation of the frequency error curves shown in Figure 7b.

Hypothetically assuming that a frequency error of six cents is imperceptible at all frequencies, the worst case error curves (Figure 7b) give an idea of the level of oversampling necessary for the fixed step-size simulation error to fall below that level. It is also evident that the maximum possible simulation error is frequency dependent, clearly increasing with frequency. Thus, the amount of oversampling necessary to mitigate the worst case error is coupled to the desired frequency of simulation.

3.4. Variable Step-size Simulation of the Oscillator WDF

Variable step-size ODE solvers typically utilize an automatic step-size selection algorithm [28, 32]. The step-size is determined by estimating the amount of simulation error that would result from continuing to use the current step-size, and then either increasing or decreasing the step-size based on the result of that error calculation.

Since the majority of the simulation error in the fixed step-size simulation of the relaxation oscillator occurs in the region of the discontinuities in the simulation voltages, and since the accuracy of the simulation frequency is coupled to the amount of oversampling performed, it would be highly desirable to limit oversampling to just those regions and to also have control over the rate of oversampling. This would have the benefit of reducing the computational cost as compared to just oversampling the simulation with a
fixed step-size, and would allow the user to control the maximum possible frequency error in the simulation. To accomplish this, domain knowledge of the circuit is used to predict the occurrence of discontinuities, and oversampling is performed only in that region.

[Figure 7: Worst case period and frequency error.]

It should be noted that anti-aliasing is not performed at the decimation stage. This is because anti-alias filtering of vout would change the dynamics of the system: modifying the output voltage affects the energy stored in the capacitor and therefore the frequency of oscillation. In order for the simulation frequency to be as accurate as possible, which corresponds to the simulation error being as small as possible, we allow the aliasing to occur within the simulation. How to suppress the aliasing inherent in the output voltage is the subject of Section 3.6.
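The phase-locked periods of Equations (9a)-(9b) and the worst-case error of Equation (10) are straightforward to sketch numerically (a direct transcription, taking the absolute cent deviations):

```python
import math

def phase_locked_periods(T0):
    """Even-integer periods the oscillator can lock to, Eqs. (9a)-(9b)."""
    Tp = math.floor(T0) - (math.floor(T0) % 2)   # round down to even
    Tm = math.ceil(T0) + (math.ceil(T0) % 2)     # round up to even
    return Tp, Tm

def worst_case_error_cents(T0, Fs=44100.0):
    """Worst-case frequency error in cents, Eq. (10)."""
    Tp, Tm = phase_locked_periods(T0)
    F0, Fp, Fm = Fs / T0, Fs / Tp, Fs / Tm
    return 1200.0 * max(abs(math.log2(Fm / F0)), abs(math.log2(Fp / F0)))
```

At even integer periods the error is zero, and near odd integer periods it reaches the one-sample minimum described in the text.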
[Figure: output amplitude of the corrected, uncorrected, and fixed step-size simulations.]

The fractional delay d is estimated from the samples xK surrounding the discontinuity, where i and K are as defined in Algorithm 2.

Depending on the chosen polyBLEP method, a number of vout samples before and after the step discontinuity have to be altered to include the polyBLEP residual amount for the chosen scheme. This is done exactly the same way as in the original paper.
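As a reference point, the simplest two-point polynomial BLEP residual (a standard scheme, not necessarily the exact one used here) alters one sample on each side of the step; for a step of height h the residual is scaled by h:

```python
def polyblep_residual(t):
    """Two-point polyBLEP residual for a unit (0 -> 1) step.
    t is the time since the discontinuity in samples, on (-1, 1)."""
    if -1.0 < t < 0.0:
        return (t + 1.0) ** 2 / 2.0      # sample just before the step
    if 0.0 <= t < 1.0:
        return -((t - 1.0) ** 2) / 2.0   # sample just after the step
    return 0.0                           # samples further away are untouched
```

Adding this residual to the naive step replaces its instantaneous transition with a bandlimited one, which is why an accurate fractional delay d improves the result.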
[Figure: stored energy of the corrected, uncorrected, and fixed step-size simulations.]

Due to the effect that the polyBLEP residual has on the dynamics of the simulation, the residual is applied to a separate output signal path so that it does not affect the internal dynamics of the simulation. Thus, the modified output voltage is
In [28], the variable step-size simulation of WDFs discretized with the trapezoidal rule was identified as an open problem. By identifying the relationship between wave variables and network variables, it has been shown how the correct behavior can be achieved in a WDF model which directly discretizes the reactive elements and does not use network component equivalencies. The validity of the technique holds regardless of the discretization technique used for the reactive network elements; with WDFs, however, it is well known that passive discretization techniques must be used.

Through the case study of a relaxation oscillator circuit, the need for the new variable step-size technique was demonstrated. The variable step-size WDF simulation technique was used to mitigate frequency error related to the discontinuities in the device voltages. An extrapolation formula was used to estimate the occurrence of future zero crossings in the input voltage of the op amp. This allowed the system to be switched to the higher simulation rate for a short duration of three base-rate samples encompassing the occurrence of the step discontinuity in the op amp output voltage. It was also demonstrated how physical modeling and antialiasing techniques can be combined to yield an accurate and alias-reduced simulation of the relaxation oscillator circuit. The polyBLEP antialiasing technique was used, and the oversampling from the variable step-size technique had the additional benefit of improving the estimate of the fractional delay, further improving the accuracy of the polyBLEP method.

5. REFERENCES

[1] A. Fettweis, "Wave digital filters: Theory and practice," Proc. IEEE, vol. 74, no. 2, pp. 270–327, Feb. 1986.
[2] G. De Sanctis and A. Sarti, "Virtual analog modeling in the wave-digital domain," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 4, pp. 715–727, May 2010.
[3] K. J. Werner, V. Nangia, J. O. Smith III, and J. S. Abel, "Resolving wave digital filters with multiple/multiport nonlinearities," in Proc. 18th Int. Conf. Digit. Audio Effects, Trondheim, Norway, Nov. 30 – Dec. 3 2015, pp. 387–394.
[4] K. J. Werner, J. O. Smith III, and J. S. Abel, "Wave digital filter adaptors for arbitrary topologies and multiport linear elements," in Proc. 18th Int. Conf. Digit. Audio Effects, Trondheim, Norway, Nov. 30 – Dec. 3 2015, pp. 379–386.
[5] M. J. Olsen, K. J. Werner, and J. O. Smith III, "Resolving grouped nonlinearities in wave digital filters using iterative techniques," in Proc. 19th Int. Conf. Digit. Audio Effects, Brno, Czech Republic, Sept. 5–9 2016, pp. 279–286.
[6] D. T. Yeh, J. S. Abel, and J. O. Smith, "Automated physical modeling of nonlinear audio circuits for real-time audio effects, part I: Theoretical development," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 4, pp. 728–737, 2010.
[7] M. Holters and U. Zölzer, "A generalized method for the derivation of non-linear state-space models from circuit schematics," in Proc. 23rd European Signal Process. Conf., Nice, France, Aug. 31 – Sept. 4 2015, pp. 1073–1077.
[8] M. Holters and U. Zölzer, "A k-d tree based solution cache for the non-linear equation of circuit simulations," in Proc. 24th European Signal Process. Conf., Budapest, Hungary, Aug. 29 – Sept. 2 2016, pp. 1028–1032.
[9] V. Duindam, A. Macchelli, S. Stramigioli, and H. Bruyninckx, Modeling and Control of Complex Physical Systems: The Port-Hamiltonian Approach, Springer Berlin Heidelberg, 2009.
[10] D. Fränken, J. Ochs, and K. Ochs, "Generation of wave digital structures for networks containing multiport elements," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 3, pp. 586–596, Mar. 2005.
[11] T. Schwerdtfeger and A. Kummert, "A multidimensional signal processing approach to wave digital filters with topology-related delay-free loops," in IEEE Int. Conf. Acoust., Speech, Signal Process., Florence, Italy, May 4–9 2014, pp. 389–393.
[12] S. Petrausch and R. Rabenstein, "Wave digital filters with multiple nonlinearities," in Proc. 12th European Signal Process. Conf., Vienna, Austria, Sept. 6–10 2004, vol. 12, pp. 77–80.
[13] T. Schwerdtfeger and A. Kummert, "Newton's method for modularity-preserving multidimensional wave digital filters," in IEEE 9th Int. Workshop Multidimensional (nD) Syst. (nDS), Vila Real, Portugal, Sept. 7–9 2015.
[14] R. C. D. Paiva, S. D'Angelo, J. Pakarinen, and V. Välimäki, "Emulation of operational amplifiers and diodes in audio distortion circuits," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 10, pp. 688–692, Oct. 2012.
[15] D. T. Yeh and J. O. Smith, "Simulating guitar distortion circuits using wave digital and nonlinear state-space formulations," in Proc. 11th Int. Conf. Digit. Audio Effects, Espoo, Finland, Sept. 1–4 2008, pp. 19–26.
[16] K. J. Werner, V. Nangia, A. Bernardini, J. O. Smith III, and A. Sarti, "An improved and generalized diode clipper model for wave digital filters," in Proc. 139th Audio Eng. Soc. Conv., New York, NY, Oct. 29 – Nov. 1 2015.
[17] W. R. Dunkel, M. Rest, K. J. Werner, M. J. Olsen, and J. O. Smith III, "The Fender Bassman 5F6-A family of preamplifier circuits: a wave digital filter case study," in Proc. 19th Int. Conf. Digit. Audio Effects, Brno, Czech Republic, Sept. 5–9 2016, pp. 263–270.
[18] J. Pakarinen and M. Karjalainen, "Enhanced wave digital triode model for real-time tube amplifier emulation," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 4, pp. 738–746, May 2010.
[19] M. Karjalainen and J. Pakarinen, "Wave digital simulation of a vacuum-tube amplifier," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Toulouse, France, May 14–19 2006, vol. 5.
[20] S. D'Angelo, J. Pakarinen, and V. Välimäki, "New family of wave-digital triode models," IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 2, pp. 313–321, Feb. 2013.
[21] K. J. Werner, W. R. Dunkel, M. Rest, M. J. Olsen, and J. O. Smith III, "Wave digital filter modeling of circuits with operational amplifiers," in Proc. 24th European Signal Process. Conf., Budapest, Hungary, Aug. 29 – Sept. 2 2016, pp. 1033–1037.
[22] T. Stilson and J. Smith, "Alias-free digital synthesis of classic analog waveforms," in Proc. Int. Comput. Music Conf., Hong Kong, China, Aug. 19–24 1996, pp. 332–335.
[23] E. Brandt, "Hard sync without aliasing," in Proc. Int. Comput. Music Conf., Havana, Cuba, Sept. 17–23 2001, pp. 365–368.
[24] V. Välimäki, J. Pekonen, and J. Nam, "Perceptually informed synthesis of bandlimited classical waveforms using integrated polynomial interpolation," J. Acoust. Soc. Amer., vol. 131, no. 1, pp. 974–986, 2012.
[25] F. Esqueda and V. Välimäki, "Rounding corners with BLAMP," in Proc. 19th Int. Conf. Digit. Audio Effects, Brno, Czech Republic, Sept. 5–9 2016, pp. 121–128.
[26] J. D. Parker, V. Zavalishin, and E. Le Bivic, "Reducing the aliasing of nonlinear waveshaping using continuous-time convolution," in Proc. 19th Int. Conf. Digit. Audio Effects, Brno, Czech Republic, Sept. 5–9 2016, pp. 137–144.
[27] S. Bilbao, F. Esqueda, J. D. Parker, and V. Välimäki, "Antiderivative antialiasing for memoryless nonlinearities," IEEE Signal Process. Lett., 2017, to be published, DOI 10.1109/LSP.2017.2675541.
[28] D. Fränken and K. Ochs, "Automatic step-size control in wave digital simulation using passive numerical integration methods," AEU – Int. J. Electron. Commun., vol. 58, no. 6, pp. 391–401, 2004.
[29] K. J. Werner, M. J. Olsen, M. Rest, and J. D. Parker, "Generalizing root variable choice in wave digital filters with grouped nonlinearities," in Proc. 20th Int. Conf. Digit. Audio Effects, Edinburgh, UK, Sept. 5–9 2017.
[30] P. Horowitz and W. Hill, The Art of Electronics, Cambridge University Press, Cambridge, UK, second edition, 1980.
[31] A. S. Sedra and K. C. Smith, Microelectronic Circuits, Oxford University Press, New York, NY, USA, 6th edition, 2009.
[32] R. L. Burden and D. J. Faires, Numerical Analysis, Brooks/Cole, Cengage Learning, Boston, MA, 9th edition, 2010.
ABSTRACT

The accuracy of the Distribution Derivative Method (DDM) [1] is evaluated on mixtures of chirp signals. It is shown that accurate estimation can be obtained when the sets of atoms for which the inner product is large are disjoint. This amounts to designing atoms with windows whose Fourier transform exhibits low sidelobes but which are once-differentiable in the time-domain. A technique for designing once-differentiable approximations to windows is presented, and the accuracy of these windows in estimating the parameters of sinusoidal chirps in mixture is evaluated.

1. INTRODUCTION

Additive synthesis using a sum of sinusoids plus noise is a powerful model for representing audio [2], allowing for the easy implementation of many manipulations such as time-stretching [3] and timbre-morphing [4]. In these papers [2–4], the phase evolution of the sinusoid is assumed linear over the analysis frame; only the phase and frequency of the sinusoids at the analysis points are used to fit a plausible phase function after the analysis points are connected to form a partial [5]. Recently, there has been interest in using higher-order phase functions [6], as the estimation of their parameters has been made possible by a new set of techniques of only moderate computational complexity using signal derivatives [7]. The use of higher-order phase models allows for accurate description of highly modulated signals, for example in the analysis of birdsong [8]. The frequency modulation information has also been used in the regularization of mathematical programs for audio source separation [9].

The sinusoidal model approximating signal s typically considered is

s̃(t) = exp(a0 + Σ_{q=1}^{Q} aq t^q) + η(t)   (1)

where s̃ is the approximating signal, t the variable of time, aq ∈ C the coefficients of the argument's polynomial, and η(t) white Gaussian noise. Although this technique can be extended to describe a single sinusoid of arbitrary complexity simply by increasing Q, it remains essential to consider signals featuring a sum of P such components, whether they represent the harmonic structure of a musical sound or the union of partials resulting from a mixture of multiple signal sources (e.g., recordings of multiple speakers or performers), i.e.,

x(t) = Σ_{p=1}^{P} xp(t) + η(t)   (2)

with

xp(t) = exp(ap,0 + Σ_{q=1}^{Q} ap,q t^q).   (3)

As regards the design and evaluation of signal-derivatives analysis techniques, previous work has generally assumed signals containing a single component, i.e., P = 1, or assumed the influence of other components to be negligible. Later we will refine when this assumption can be made. In [10] the authors provide a comprehensive evaluation of various signal-derivatives analysis methods applied to a single-component signal. In [11] the extent to which two components in mixture can corrupt estimations of the frequency slope (ℑ{a0,2} and ℑ{a1,2}) is investigated in the context of the reassignment method, one of the signal-derivatives techniques, but the corruption of the other parameters is not considered.

In this paper, we revisit the quality of signal-derivatives estimation of all the aq when analyzing a mixture of components. We focus on the DDM [1] analysis method for its convenience: it can simply be considered an atomic decomposition (see Sec. 2), and it does not require computing derivatives of the signal to be analysed.

The DDM does, however, require a once-differentiable analysis window. As we are interested in windows with low sidelobes in order to better estimate the parameters of sinusoidal chirp signals in mixture, we seek windows that combine these two properties. For this, a technique to design once-differentiable approximations to arbitrary symmetrical windows is proposed and presented along with a design example for a high-performance window. Finally, we evaluate the performance of various once-differentiable windows in estimating the parameters aq.

2. ESTIMATING THE PARAMETERS aq

We will now show briefly how the DDM can be used to estimate the aq. Based on the theory of distributions [12], the DDM makes use of "test functions" or atoms ψ. These atoms must be once differentiable with respect to the time variable t and be non-zero only on a finite interval [−Lt/2, Lt/2]. First, we define the inner product

⟨x, ψ⟩ = ∫_{−∞}^{∞} x(t) ψ(t) dt   (4)

and the operator

T^α : (T^α x)(t) = t^α x(t).   (5)

Consider the weighted signal

f(t) = x(t) ψ(t).   (6)

∗ Sound Processing and Control Laboratory
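A mixture of polynomial-phase components as in Eqs. (1)-(3) can be synthesized directly; the parameter values below are illustrative only:

```python
import numpy as np

def component(a, t):
    """One component of Eq. (3): exp(a_0 + a_1 t + ... + a_Q t^Q), a_q complex."""
    return np.exp(np.polyval(a[::-1], t))   # polyval wants the highest power first

t = np.linspace(-0.5, 0.5, 512, endpoint=False)
a0 = [0.0, 2j * np.pi * 50.0, -4.0 + 1j * 300.0]    # a_{0,q}, q = 0..2
a1 = [0.0, 2j * np.pi * 120.0, 2.0 - 1j * 200.0]    # a_{1,q}
x = component(a0, t) + component(a1, t)             # Eq. (2), noiseless case
```

The real parts of the coefficients shape the log-amplitude envelope and the imaginary parts the phase, so ℑ{a_{p,2}} is the frequency slope discussed above.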
or, using the operator T^α,

Σ_{q=1}^{Q} q aq ⟨T^{q−1} x, ψ⟩ = −⟨x, dψ/dt⟩.   (9)

Estimating the coefficients aq, 1 ≤ q ≤ Q, simply requires R atoms ψr with R ≥ Q to solve the linear system of equations

Σ_{q=1}^{Q} q aq ⟨T^{q−1} x, ψr⟩ = −⟨x, dψr/dt⟩   (10)

for 1 ≤ r ≤ R. The parameter a0 is then computed as

a0 = log(⟨x, γ⟩) − log(⟨γ, γ⟩).   (15)

From this we see that if ⟨T^{q−1} xp, ψr⟩ and ⟨xp, dψr/dt⟩ are small for all but p = p∗ and a subset of R atoms, we can simply estimate the parameters ap∗,q using

Σ_{q=1}^{Q} q ap∗,q ⟨T^{q−1} xp∗, ψr⟩ = −⟨xp∗, dψr/dt⟩   (18)

for 1 ≤ r ≤ R. To compute ap∗,0 we simply use

γp∗(t) = exp( Σ_{q=1}^{Q} ap∗,q t^q )   (19)

in place of γ in Eq. (15), minimizing some function.

As will be seen in subsequent sections, the DDM typically involves taking the discrete Fourier transform (DFT) of the signal windowed by both an everywhere once-differentiable function of finite support (e.g., the Hann window) and this function's derivative. A small subset of atoms corresponding to the peak bins in the DFT are used in Eq. 10 to solve for the parameters aq.²

² Notice, however, that this is an approximation of the inner product and should not be interpreted as yielding the Fourier series coefficients of a properly sampled signal x periodic in Lt. This means that other evaluations of the inner product that yield more accurate results are possible. For example, the analytic solution is possible if x is assumed zero outside of [−Lt/2, Lt/2] (the ψ are in general analytic). In this case the samples of x are convolved with the appropriate interpolating sinc functions and the integral of this function's product with ψ is evaluated.
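The linear system of Eq. (10) can be assembled and solved numerically. The sketch below assumes Hann atoms ψr(t) = w(t) exp(jωr t) and approximates the inner-product integrals by plain sums over the samples (the footnote above notes that more accurate evaluations are possible):

```python
import numpy as np

def ddm_estimate(x, t, omegas, Q):
    """Least-squares solve of Eq. (10) for a_1..a_Q, one atom per
    angular frequency in `omegas`."""
    L = t[-1] - t[0]
    w = 0.5 * (1.0 + np.cos(2.0 * np.pi * t / L))     # Hann, zero at t = +/- L/2
    dw = -(np.pi / L) * np.sin(2.0 * np.pi * t / L)   # its time derivative
    A = np.zeros((len(omegas), Q), dtype=complex)
    b = np.zeros(len(omegas), dtype=complex)
    for r, om in enumerate(omegas):
        carrier = np.exp(-1j * om * t)                # conjugate of the atom's phase
        for q in range(1, Q + 1):
            A[r, q - 1] = q * np.sum(t ** (q - 1) * x * w * carrier)
        b[r] = -np.sum(x * (dw - 1j * om * w) * carrier)
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a                                          # [a_1, ..., a_Q]
```

With atoms placed near the component's spectral peak, this recovers the complex amplitude- and frequency-slope parameters of a single chirp to high accuracy.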
[Figure 1: Comparing the main-lobe and asymptotic power spectrum characteristics of the continuous 4-term Nuttall window, the digital prolate window with W = 0.008, and the continuous approximation to the digital prolate window. (Left panel: asymptotic behaviour; right panel: main lobe; power in dB versus bin number.)]

The spectra of the atoms are given by the Fourier transform of the modulated window w [14]. Therefore, choosing ψ amounts to a filter design problem under the constraints that the impulse response of the filter be differentiable in t and of finite length. To minimize the influence of all but one component, provided the components' energy concentrations are sufficiently separated in frequency, we desire impulse responses whose magnitude response gives maximum out-of-band rejection or, equivalently, windows whose Fourier transform exhibits the lowest sidelobes.

In all the publications on the DDM reviewed for this paper, the window used was the Hann window, which is once-differentiable everywhere in the time-domain. In [11], a publication on the reassignment method, windows other than the Hann are considered, but those windows must be twice-differentiable. Nuttall [15] has designed windows with lower sidelobes than the canonical Hann window which are everywhere at least once-differentiable. It is also possible to design approximations to arbitrary symmetrical window functions using harmonically related cosines, as discussed in the following section.

5. DIFFERENTIABLE APPROXIMATIONS TO WINDOWS

A differentiable approximation to a symmetrical window can be designed in a straightforward way. In [16] and [17] it is shown how to design optimal windows of length N samples using a linear combination of M harmonically related cosines,

w̃(n) = Σ_{m=0}^{M−1} bm cos(2πmn/N) R(n/N)   (22)

where R is the rectangle function. This function is discontinuous at n = ±N/2, and therefore not differentiable there, unless

Σ_{m=0}^{M−1} bm cos(±πm) = 0.   (23)

Rather than designing based on an optimality criterion, such as the height of the highest sidelobe [17], a once-differentiable approximation to an existing window w is desired. To do this, we choose the bm so that the window w̃'s squared approximation error to w is minimized while having w̃(±N/2) = 0, i.e., we find the solution {b∗m} to the mathematical program

minimize   Σ_{n=0}^{N−1} ( w(n) − Σ_{m=0}^{M−1} bm cos(2πmn/N) )²   (24)
subject to   Σ_{m=0}^{M−1} bm cos(πm) = 0   (25)

which can be solved using constrained least-squares, a standard numerical linear algebra routine [18, p. 585].

6. A CONTINUOUS WINDOW DESIGN EXAMPLE

As a design example we show how to create a continuous approximation to a digital prolate spheroidal window. Digital prolate spheroidal windows are a parametric approximation to functions whose Fourier transform's energy is maximized in a given bandwidth [19]. These can be tuned to have extremely low sidelobes, at the expense of main-lobe width. Differentiation of these window functions may be possible but is not as straightforward as differentiation of the sum-of-cosine windows above. Furthermore, the windows do not generally have end-points equal to 0. In the following we demonstrate how to approximate a digital prolate spheroidal window with one that is everywhere at least once-differentiable.

In [20] it was shown how to construct digital prolate spheroidal windows under parameters N, the window length in samples, and W, a parameter choosing the (normalized) frequency range in which the proportion of the main lobe's energy is to be maximized. We chose N = 512 based on the window length chosen in [1] for ease of comparison. The W parameter's value was chosen by synthesizing windows with W ranging between 0.005 and 0.010 at a
resolution of 0.001. The window with the closest 3 dB bandwidth to the 4-term Nuttall window was obtained with W = 0.008. Its magnitude response is shown in Fig. 1. We see that this window's asymptotic falloff is 6 dB per octave and that it therefore has a discontinuity somewhere in its domain [15].

We designed an approximate window using Eq. 24 for M varying between 2 and N/8 to find the best approximation to the digital prolate window's main lobe using a small number of cosines. The M giving the best approximation was 5. The magnitude response of the approximation is shown in Fig. 1 and its coefficients are listed in Tab. 1; the temporal shape is very close to a digital prolate spheroidal window with W = 0.008 and is therefore omitted for brevity.

Table 1: The coefficients of the once-differentiable approximation to a digital prolate window designed in Sec. 6.

b0 = 3.128 × 10−1
b1 = 4.655 × 10−1
b2 = 1.851 × 10−1
b3 = 3.446 × 10−2
b4 = 2.071 × 10−3

It is seen that a lower highest-sidelobe level than those of the Nuttall and prolate windows is obtained by slightly sacrificing the narrowness of the main lobe. More importantly, in Fig. 1 we observe that the falloff of the window is 18 dB per octave because it is once-differentiable at all points in its domain.

[Figure 2: The estimation variance of random polynomial phase sinusoids averaged over K1 = 1000 trials using atoms generated from various windows, for the real and imaginary parts of a0, a1 and a2. C is the Cramér-Rao lower bound, N3 and N4 are the 3- and 4-cosine-term continuous Nuttall windows, H is the Hann window, and P5 is the continuous 5-cosine-term approximation to a digital prolate window as described in Sec. 6.]

7.1. Signals with single component

To compare the average estimation error variance with the theoretical minimum given by the Cramér-Rao bound, we synthesized K1 random chirps using Eq. 1 with Q = 2 and parameters chosen from uniform distributions justified in [1]. The original Hann window, the windows proposed by Nuttall and the new digital-prolate-based window were used to synthesize the atoms as described in Sec. 4, and their estimation error variances were compared (see Fig. 2). After performing the DFT to obtain inner products with the
atoms, the three atoms whose inner products were greatest were could be constructed, we synthesized a set C of K2 components for
used in the estimations, i.e., R = 3 in Eq. 10. The windows with which the energy is maximized in bin 0. Test signals were obtained
the lowest sidelobes only give the lowest error variance at very by choosing a pair of unique components from this set and modu-
favourable SNRs, at real-world SNRs the original Hann window lating one to give the desired frequency and amplitude difference.
still performs best at estimating the parameters of a single compo- This was carried out as follows: the atom r∗ for which the inner
nent signal. product was maximized was determined for each unmixed chirp
∗
and the chirp was modulated by exp(−2π rNn j) for 0 ≤ n < N
Synthesize K2 single- in order to move this maximum to r = 0. Then for each desired
component signals. difference d, with 0 ≤ d < D (for the evaluation D = 40), two
unique chirps were selected from C and one chirp was modulated
Modulate so that peak by exp(2π nd N
j) for 0 ≤ n < N in order to give the desired dif-
bin is at frequency ference between maxima. This component was also scaled by a
0 for all signals. constant to give a desired signal power ratio from set S with the
other component (the power ratios S tested were 0 dB and -30 dB).
Set d = 0. As we assume perfect peak-atom selection for this evaluation no
inner-product maximizing r∗ is chosen, rather atoms with angular
dˆ
Choose K2 (K2 − 1) frequencies ω = 2π N for dˆ ∈ {d − 1, d, d + 1} in Eq. 20 (again,
pairs of signals. R = 3) were chosen to carry out the estimation. d was incre-
mented by ∆d = 0.25 and so dˆ was not generally integral valued
Scale one in each in this case. The parameters of the unmodulated component were
dˆ
pair to give desired estimated using angular frequencies ω = 2π N for dˆ ∈ {−1, 0, 1}
power from S and in Eq. 20. The squared estimation error for each parameter was
modulate to peak summed and divided by K2 (K2 − 1) (the number of choices of
bin according to d. two unique components) to give the averaged squared estimation
error for each parameter at each difference d. The procedure is
Add each signal summarized in Fig. 4.
pair together. The behaviour of the windows when used to analyse mixtures
of non-stationary signals is similar to the behaviour of windows
For each pair, try used for harmonic analysis in the stationary case [16]; here we
If signal power estimating parameters obtain further insight into how the estimation of each coefficient
ratios in S of unmodulated of the polynomial in Eq. 1 is influenced by main-lobe width and
remaining to component with atoms If d < D. sidelobe height and slope. In Fig. 3 we see that there is generally
be evaluated. at bins {−1, 0, 1}. less estimation error for components having similar signal power.
This is to be expected as there will be less masking of the weaker
For each pair, signal in these scenarios. The estimation error is large when the
try estimating atoms containing the most signal energy for each component are
parameters of not greatly separated in frequency. This is due to the convolu-
modulated component tion of the Fourier transform of the window with the signal, and
with atoms at bins agrees with what was predicted by Eq. 17: indeed windows with a
{d − 1, d, d + 1}. larger main lobe exhibit a larger “radius” (bandwidth) in which the
error of the parameter estimation will be high. However, for sig-
Sum estimation errors nals where local inner-product maxima are from atoms sufficiently
of each parameter and separated in frequency, windows with lower sidelobes are better at
divide by K2 (K2 − 1). attenuating the other component and for these the estimation error
is lowest.
Increment d by ∆d .
8. CONCLUSIONS
Figure 4: The evaluation procedure for 2-component signals. Motivated by the need to analyse mixtures of frequency- and am-
plitude-modulated sinusoids (Eq. 3), we have shown that the DDM
7.2. Signals with 2 components can be employed under a single-component assumption when com-
ponents have roughly disjoint sets of atoms for which their inner
To evaluate the performance of the various windows when esti- products take on large values. This indicates the need for windows
mating the parameters of components in mixture we synthesized whose Fourier transform exhibits low sidelobes. We developed
signals using Eq. 3 with P = 2 and Q = 2 and parameters chosen windows whose sidelobes are minimized while remaining every-
from the uniform distributions specified in [1]. We desired to see where once-differentiable: a requirement to generate valid atoms
how the accuracy of estimation is influenced by the difference (in for the DDM. These windows were shown to only improve param-
bins) between the locally maximized atoms and the difference in eter estimation of P = 1 component with argument-polynomial of
signal power between the two components. To obtain a set of com- order Q = 2 in low amounts of noise. However, for P = 2 com-
ponents from which test signals exhibiting the desired differences ponents of the same order in mixture without noise, granted the
DAFX-5
DAFx-212
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
components exhibited reasonable separation in frequency between sound signal model,” in Audio Engineering Society Con-
the atoms for which the inner product was maximized, these new ference: 45th International Conference on Applications of
windows substantially improved the estimation of all but the first Time-Frequency Processing in Audio. Audio Engineering So-
argument-polynomial coefficient. ciety, 2012.
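To make the evaluation procedure concrete, the demodulate-then-remodulate step of Sec. 7.2 can be sketched as follows. This is an illustrative sketch only, not the authors' evaluation code: N, the chirp parameters, and the d and power-ratio values below are arbitrary stand-ins, not the values drawn in the paper's evaluation.

```python
import numpy as np

N = 1024
n = np.arange(N)

def demodulate_to_bin0(x):
    """Find the peak DFT atom r* and modulate by exp(-j 2 pi r* n / N),
    moving the maximum inner product to r = 0 (cf. Sec. 7.2)."""
    r_star = np.argmax(np.abs(np.fft.fft(x)))
    return x * np.exp(-2j * np.pi * r_star * n / N)

# Two complex chirps with quadratic argument polynomials (Q = 2).
c1 = demodulate_to_bin0(np.exp(1j * (0.30 * n + 1e-5 * n ** 2)))
c2 = demodulate_to_bin0(np.exp(1j * (1.10 * n - 2e-5 * n ** 2)))

# Move the second component up by d bins and scale it to a -30 dB power ratio.
d, ratio_db = 7.25, -30.0
c2 = c2 * np.exp(2j * np.pi * n * d / N)
c2 = c2 * np.sqrt(10.0 ** (ratio_db / 10.0)
                  * np.sum(np.abs(c1) ** 2) / np.sum(np.abs(c2) ** 2))

mix = c1 + c2  # one test signal for the pair-wise evaluation
```

Since the scaling factor is computed from the energies after modulation, the resulting mixture has exactly the requested power ratio between its two components.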
Further work should evaluate these windows on sinusoids of different orders, i.e., Q ≠ 2. Optimal main-lobe widths for windows should be determined depending on the separation of local maxima in the power spectrum. It should also be determined if these windows improve the modeling of real-world acoustic signals.

9. ACKNOWLEDGMENTS

This work was partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada awarded to Philippe Depalle (RGPIN-262808-2012).

10. REFERENCES

[1] Michaël Betser, "Sinusoidal polynomial parameter estimation using the distribution derivative," IEEE Transactions on Signal Processing, vol. 57, no. 12, pp. 4633–4645, 2009.
[2] Xavier Serra, A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic plus Stochastic Decomposition, Ph.D. thesis, Stanford University, 1989.
[3] Sylvain Marchand and Martin Raspaud, "Enhanced time-stretching using order-2 sinusoidal modeling," in Proceedings of the 7th International Conference on Digital Audio Effects (DAFx-04), 2004, pp. 76–82.
[4] Lippold Haken, Kelly Fitz, and Paul Christensen, "Beyond traditional sampling synthesis: Real-time timbre morphing using additive synthesis," in Analysis, Synthesis, and Perception of Musical Sounds, pp. 122–144. Springer, 2007.
[5] Robert J. McAulay and Thomas F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.
[6] Wen Xue, "Piecewise derivative estimation of time-varying sinusoids as spline exponential functions," in Proceedings of the 19th International Conference on Digital Audio Effects (DAFx-16), 2016, pp. 255–262.
[7] Brian Hamilton and Philippe Depalle, "A unified view of non-stationary sinusoidal parameter estimation methods using signal derivatives," in Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 369–372.
[8] Dan Stowell, Sašo Muševič, Jordi Bonada, and Mark D. Plumbley, "Improved multiple birdsong tracking with distribution derivative method and Markov renewal process clustering," in Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 468–472.
[9] Elliot Creager, "Musical source separation by coherent frequency modulation cues," M.A. thesis, McGill University, 2016.
[10] Brian Hamilton and Philippe Depalle, "Comparisons of parameter estimation methods for an exponential polynomial sound signal model," in Audio Engineering Society Conference: 45th International Conference on Applications of Time-Frequency Processing in Audio. Audio Engineering Society, 2012.
[11] Axel Röbel, "Estimating partial frequency and frequency slope using reassignment operators," in International Computer Music Conference, 2002, pp. 122–125.
[12] Laurent Schwartz, Théorie des distributions, vol. 2, Hermann, Paris, 1959.
[13] Steven M. Kay, Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory, Prentice Hall, 1993.
[14] Jont B. Allen and Lawrence R. Rabiner, "A unified approach to short-time Fourier analysis and synthesis," Proceedings of the IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.
[15] Albert Nuttall, "Some windows with very good sidelobe behavior," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 1, pp. 84–91, 1981.
[16] Fredric J. Harris, "On the use of windows for harmonic analysis with the discrete Fourier transform," Proceedings of the IEEE, vol. 66, no. 1, pp. 51–83, 1978.
[17] L. Rabiner, Bernard Gold, and C. McGonegal, "An approach to the approximation problem for nonrecursive digital filters," IEEE Transactions on Audio and Electroacoustics, vol. 18, no. 2, pp. 83–106, 1970.
[18] Gene H. Golub and Charles F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 3rd edition, 1996.
[19] David Slepian, "Prolate spheroidal wave functions, Fourier analysis, and uncertainty V: The discrete case," Bell System Technical Journal, vol. 57, no. 5, pp. 1371–1430, 1978.
[20] Tony Verma, Stefan Bilbao, and Teresa H. Y. Meng, "The digital prolate spheroidal window," in Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 1996, vol. 3, pp. 1351–1354.
[Figure 3: panels of log10 MSE versus the difference between maxima in bins (0–35), one panel per estimated parameter: the real and imaginary parts of â0,q and â1,q for components q = 1, 2, e.g. mean{(ℜ{â0,1} − ℜ{a0,1})² + (ℜ{â1,1} − ℜ{a1,1})²}. The legend lists N3 and P5 at 0 dB and −30 dB signal power ratios.]
Figure 3: The mean squared estimation error for each parameter in an analysis of two components in mixture. A set of K2 = 10 chirps
was synthesized and each unique pair used for maximum bin differences 0 ≤ d < 40, with d varied in 0.25 bin increments. The signal
power ratio between components is indicated with colours and the corresponding ratio in decibels is indicated in the plot legend. The names
indicate the windows used to generate the atoms for estimation: N3 and N4 are the 3- and 4-cosine-term continuous Nuttall windows, H is
the Hann window, and P5 is the continuous 5-cosine-term approximation to a digital prolate window as described in Sec. 6.
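The sum-of-cosines windows named in the caption can be reconstructed directly from their coefficients. A minimal sketch follows; the alternating-sign form is an assumption based on the convention of the Nuttall family [15], with the Tab. 1 coefficients used for P5. The near-zero window edges it produces are consistent with the once-differentiability and 18 dB/octave falloff discussed in Sec. 7.

```python
import numpy as np

def cosine_sum_window(b, N):
    """w[n] = sum_k (-1)^k b[k] cos(2*pi*k*n/N): the alternating-sign
    sum-of-cosines form used for Nuttall-type windows (assumed here)."""
    n = np.arange(N)
    return sum(((-1.0) ** k) * bk * np.cos(2 * np.pi * k * n / N)
               for k, bk in enumerate(b))

# Tab. 1 coefficients of the P5 approximation to the digital prolate window.
b_p5 = [3.128e-1, 4.655e-1, 1.851e-1, 3.446e-2, 2.071e-3]
# The Hann window in the same form, for comparison.
b_hann = [0.5, 0.5]

w = cosine_sum_window(b_p5, 1024)
# The coefficients nearly alternate-sum to zero, so w[0] is close to 0: the
# window goes smoothly to zero at its edges instead of jumping there.
```

In this form the value at the window centre equals the plain sum of the coefficients, which is a quick sanity check on transcribed coefficient tables.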
NICHTNEGATIVEMATRIXFAKTORISIERUNGNUTZENDESKLANGSYNTHESENSYSTEM (NiMFKS): EXTENSIONS OF NMF-BASED CONCATENATIVE SOUND SYNTHESIS
modify H after the NMF update at iteration l of L according to

    [H]km ← { [H]km,                 if [H]km = max{[Hᵀek]m−r:m+r}
            { [H]km (1 − (l+1)/L),   else                                (1)

where ek is the kth standard basis vector, and r ∈ {0, 1, . . . , m} describes the number of columns to the right and left of the mth activation column over which one wishes to reduce repeated non-zero values.
To suppress non-zero values in each column of H, i.e., discourage too many templates from being active simultaneously, they modify H according to

    [H]km ← { [H]km,                 if [H]km ∈ maxp{Hem}
            { [H]km (1 − (l+1)/L),   else                                (2)

where maxp{·} returns the subset of the p ∈ {0, 1, . . . , m} largest values of its vector argument.
To promote diagonal structures, i.e., continuity of the corpus units, Driedger et al. [3] process the activation matrix after its full update in iteration l by filtering. This can be expressed as a two-dimensional convolution between H and a diagonal kernel

    H ← H ∗ G                                                            (3)

where ∗ is convolution, and G = Ic (the identity matrix of size c) in Driedger et al. [3].
The application of these constraints removes the NMF algorithm's convergence guarantee. Thus the sequence of updates (1) to (3) is repeated until a user-specified stopping criterion, such as a number of iterations, is met. In summary, this NMF approach to CSS requires the specification of the following parameters: the total number of iterations L, the horizontal neighbourhood r, the vertical magnitude neighbourhood p, the filter kernel size c, and finally the parameters involved in computing the STFT and its inverse.

3. EXTENSIONS

We now present our three extensions to the application of NMF to CSS: 1) time-domain synthesis; 2) sparse NMF; and 3) pruning large corpora.

3.1. Time-domain synthesis

Given a magnitude spectrogram, or the product WH in the present case, the Griffin-Lim algorithm [18] can be used to project back to the time domain. If we have the corpus as a time-domain waveform, then we can simply window, scale, and add the corresponding waveforms to create the synthesis. Since the mth row of H specifies the activation of the mth column of W, the time-domain synthesis is achieved by concatenating the corresponding portions of the corpus waveform scaled by their activations (i.e., elements of H). This approach to synthesis circumvents the restrictions on window shape and overlap for invertibility inherent to the Griffin-Lim algorithm [18], and also makes possible the use of other kinds of features, e.g., chroma. It can be parallelised to make it faster.

3.2. Sparse NMF

The generalised β-divergence offers a family of divergences that are parametrised by β [19, 20]. It is defined as follows:

    Dβ(x||y) = { x^β/(β(β−1)) + y^β/β − x y^(β−1)/(β−1),   β ∈ R\{0, 1}
               { x log(x/y) − x + y,                        β = 1         (4)
               { x/y − log(x/y) − 1,                        β = 0

When β = 2, this becomes the Euclidean distance; the Kullback-Leibler (KL) divergence for β = 1; and the Itakura-Saito (IS) divergence for β = 0. In our case, we denote the β-divergence of V and WH as Dβ(V||WH). In this case, we apply (4) to each element of the two matrices, and then sum over all elements. Févotte and Idier [21] show that one of the possible multiplicative update rules using the β-divergence is given by

    H ← H ⊙ [Wᵀ(V ⊙ (WH)^(β−2))] / [Wᵀ(WH)^(β−1)]                        (5)

where ⊙ denotes element-wise multiplication, and exponentiation as well as division are element-wise.
As an extension of Driedger et al. [3], we implement sparsity constraints in the NMF update, which effectively restrict both the polyphony and the corpus diversity within the NMF framework. There exist a number of algorithms for applying sparsity constraints on the activation matrix, because sparsity has proven useful for a variety of audio and music tasks; see for example [22, 23]. In our implementation, we use the penalty introduced in [17], notated Υ here, so that the cost function we minimise with respect to H, W is:

    Dβ(V||WH) + λΥ                                                       (6)

where λ is a parameter to control the weight of the penalty. In order for the regulariser Υ to enforce a sparsity constraint, it is common to set 0 < β ≤ 1. Typically, the closer β is to 0, the stronger the sparsity constraint will be. We set β = 0.5 in our application by default, as it enforces strong sparsity constraints, so that the templates that only account for little variance in the target matrix, and their corresponding activations, tend to zero during the NMF updates. By virtue of the multiplicative updates, components of zero magnitude remain zero for the remaining updates and are therefore effectively pruned out of the model [17]. The corresponding update rule for H is [17]:

    H ← H ⊙ ( [Wᵀ(V ⊙ (WH)^(−3/2))] / [Wᵀ(WH)^(−1/2) + λΨH] )^φ          (7)

where ΨH = H^(−1/2) is a term resulting from the application of the penalty Υ with β = 0.5, and φ = 2/3 is a term ensuring the descent of the cost function at each iteration.

3.3. Working with Large Corpora

Optimisation approaches applied to CSS [3, 4, 7] carry far higher computational costs than "greedy" approaches, e.g., Caterpillar [2] and MATConcat [5]. With the templates W fixed, NMF need only iteratively update the activations H. Using the Euclidean distance, the computational complexity of each iteration is then O(K²M). The KL divergence carries a higher complexity. These increase
even more when we include the constraints detailed in section 2. Hence, the complexity is linear in the duration of the target, but quadratic in the duration of the corpus. This imposes limits on the size of the corpus one can work with in NMF for CSS, which can affect the quality of the results.
To address this complexity issue, we propose preprocessing a corpus by pruning. Figure 1 depicts the basic concept. Frames in the target (left) and corpus (right) are colour coded by their content. Pruning keeps only the corpus frames that are measured similar enough to the target frames. We then execute NMF with a W

Figure 1: The corpus pruning algorithm removes those corpus frames that are dissimilar from any of the frames in the target.

    d(v, w) := 1 − vᵀw / (‖v‖₂ ‖w‖₂).                                    (8)

Algorithm 1 describes the procedure. The parameter γ ≥ 0 controls the severity of the pruning procedure, with smaller values producing a higher degree of pruning. For γ = 0, only one corpus frame will be kept for each target frame considered. The parameter θ > 0 controls the number of comparisons between target frames and corpus frames. As θ → 0, every target frame is considered, but this can be unnecessary when there is high similarity between target frames. As θ grows, we potentially consider fewer and fewer target frames in the pruning. Finally, the parameter ρmin ∈ [0, 1) controls a preprocessing pruning procedure, removing the indices of the corpus and target frames that have norms that are too small. The effects of this are bypassed if ρmin = 0; and if ρmin → 1, only the frame with the largest norm is kept.

Algorithm 1: Corpus template pruning. Given two sequences of vectors in the same space ℝᴺ — W from the corpus and V from the target — and three user-specified parameters — γ, ρmin, θ — produce a subset of the index into W according to the dissimilarity function d : ℝᴺ × ℝᴺ → ℝ₊.
 1  function Prune(W, V, γ, ρmin, θ);
    Input:
      1. W = (wk ∈ ℝᴺ)k∈K, K = {1, . . . , K} (sequence of corpus frames)
      2. V = (vm ∈ ℝᴺ)m∈M, M = {1, . . . , M} (sequence of target frames)
      3. γ ≥ 0 (similarity pruning parameter)
      4. ρmin ∈ [0, 1) (relative energy pruning parameter)
      5. θ > 0 (comparisons parameter)
    Output: Kpruned ⊆ K (pruned index into W)
 2  Kremain ← {k ∈ K : ‖wk‖₂ > maxl∈K ‖wl‖₂ ρmin};
 3  Mremain ← {m ∈ M : ‖vm‖₂ > maxl∈M ‖vl‖₂ ρmin};
 4  Kpruned ← ∅;
 5  while Mremain is not empty do
 6      m ← Mremain(1) (first element of set);
 7      D ← (d(vm, wk) : k ∈ Kremain) (dissimilarities of the target frame to remaining corpus frames);
 8      J ← {k ∈ Kremain : d(vm, wk) < (1 + γ) min D} (indices of corpus frames acceptably similar to the target frame);
 9      Kpruned ← Kpruned ∪ J (store indices of corpus frames);
10      Kremain ← Kremain \ J (remove indices of corpus frames from further consideration);
11      Mremain ← {m′ ∈ Mremain : d(vm, vm′) > θ} (indices of target frames that are acceptably dissimilar from the current target frame);
12  end
13  return Kpruned;

After pruning, we use the columns of W indexed by Kpruned to create a W′, and then use NMF as described above to find an H′ such that V ≈ W′H′. We then upsample the activation matrix H′ to form H having a size commensurate with the original problem. This involves distributing the rows of H′ according to the indices Kpruned, and leaving everything else zero. We can then apply modifications to H, such as continuity enhancement and polyphony restriction, but at the conclusion of the NMF procedure.

4. THE MATLAB APPLICATION NiMFKS

Driedger et al. [3] do not supply a reproducible work package, but their procedure is described such that it can be reproduced. We have done so by implementing a GUI application in MATLAB, named NiMFKS. We organise the interface into four panes. Figure 2 shows an example screenshot of the interface.
The "Sound Files" pane, top left, lets the user load the audio files to be used as corpus and target. In this example figure, the user has selected multiple audio files, and a target instrumental recording of "Mad World" by Gary Jules. If several audio files are specified by the user, they are concatenated to form the corpus. If the sampling rates of the target and corpus are different, NiMFKS resamples the target signal to have the same sampling rate as the corpus.
The pane labeled "Analysis" lets the user set the parameters for computing features of both the corpus and target audio. In the example of Fig. 2, the user has specified the STFT feature to be computed using a Hann window of duration 400 ms with 50% overlap.
In the "Synthesis" pane, the user may select the method used for synthesising the output audio waveform and set the related parameters. Building on this observation and the work reported in [16], where Su et al. demonstrate the benefits of synthesis approaches alternative to the ISTFT, we make both Time-Domain (cf. section 3.1) and ISTFT synthesis available in NiMFKS. The "Synthesis Method" drop-down menu lets the user choose between these.
The NMF sub-pane lets the user set all the parameters that influence the factorisation. The "Prune" parameter, γ (cf. section
Figure 2: This screenshot shows the main window of NiMFKS. We see the target is "Mad World", and the corpus is multiple samples of steel drums. The user presses the "Run" button in the "Analysis" pane to compute STFT features using a Hann window of duration 400 ms with an overlap of 50%. When the user presses the "Run" button in the "Synthesis" pane, 10 iterations of NMF with the Euclidean distance will be performed with the additional constraints (see Sec. 2). No pruning is performed in this case. For synthesis, "Template Addition" is specified. On the right we see a visualisation of the activation matrix. Next to the slider is the play button to hear the resulting synthesis. Circled in blue at top are the "Activation Sketching" and "Activation Eraser" features (see Sec. 5). Clicking one of these buttons will reveal the "Sketching" pane, in which the user can choose the "Paint Brush" to perform the actual sketching. If new activations are added or removed by hand, the buttons "Resynthesise" and "Re-run" repeat the synthesis and NMF procedure, respectively, using the new activation matrix.
3.3), allows tuning how aggressively the corpus loaded by the user should be pruned before it is fed to the NMF optimisation. Currently, NiMFKS defines ρmin to be −60 dB relative to the maximum observed in the corpus and target, and θ = 0.1.
The NMF updates stop after a given number of iterations, or when the cost function reaches a convergence criterion. In the example of Fig. 2, NMF will perform no more than 10 iterations, but could exit earlier if the relative decrease in cost between subsequent iterations is less than 0.005.
NiMFKS relies on β-NMF update rules (cf. section 2) and provides three presets accessible from the "Cost Metric" drop-down menu: Euclidean (β = 2), Kullback-Leibler divergence (β = 1), and Sparse NMF with β = 1/2 as described in (4).
Finally, the remaining parameters control the modifications to be applied to the activation matrix H, either at each iteration of the NMF optimisation or only after convergence if the "Endtime Modifications" parameter is toggled. The user can set the parameters of the filtering kernel, as described in section 2.
The pane labeled "Plots" displays various results from the procedure, accessible from the drop-down menu at top right. These include the sound waveforms, their sonograms, and the activation matrix. In the example of Fig. 2, the user has selected to view the resulting activation matrix, and can adjust the contrast of the image by the slider labeled "Max. Activation dB".
Finally, in order to enable the "Activation Sketching" feature (cf. section 5.2), the user must select the "Activation Sketching" item from the "Tools" menu, which enables the buttons located at the very top left of the application window and circled in Figure 2. The sub-pane labeled "Sketching" then lets the user specify the settings to be used when sketching activations.

5. ARTISTIC EXTENSIONS

We now describe a few artistic extensions that we have implemented in NiMFKS.

5.1. Activation Sketching

NiMFKS allows one to treat the activation matrix as an image on which the user can directly erase, draw, inflate or deflate activations. This feature is labeled "Activation Sketching" in the GUI. These can be resynthesized on the fly to see any effects. This proved to be a useful experimentation tool to understand NMF and the synthesis methods in action. Figure 3 shows an example of an
Figure 3: Image of an activation matrix (templates versus time) produced by NMF using a specific corpus and target, but on which we have sketched additional activations (circled).

Figure 5: For several values of the pruning parameter γ (0.1–0.5), the percentage of the corpus remaining given a target. About 70% of the corpus remains after pruning by energy (keeping frames within 60 dB of the most energetic frame).
6. DEMONSTRATION
[Figure: the NMF cost over 30 iterations for several settings; panel (a) Euclidean distance; vertical axes: Cost (Euclidean) and Cost (Divergence); horizontal axis: NMF iteration.]

[3] J. Driedger, T. Prätzlich, and M. Müller, "Let It Bee – towards NMF-inspired audio mosaicing," in Proc. Int. Conf. Music Information Retrieval, Malaga, Spain, 2015, pp. 350–356.
[4] J.-J. Aucouturier and F. Pachet, "Jamming with plunderphonics: Interactive concatenative synthesis of music," J. New Music Research, vol. 35, no. 1, 2006.
[5] B. L. Sturm, "Adaptive concatenative sound synthesis and its application to micromontage composition," Computer Music J., vol. 30, no. 4, pp. 46–66, Dec. 2006.
[6] G. Bernardes, Composing Music by Selection: Content-Based Algorithmic-Assisted Audio Composition, Ph.D. thesis, Faculty of Engineering, University of Porto, 2014.
[7] G. Coleman, Descriptor Control of Sound Transformations and Mosaicing Synthesis, Ph.D. thesis, Universitat Pompeu Fabra, 2016.
[8] J. J. Burred, "Factorsynth: A Max tool for sound
[19] R. Kompass, "A generalized divergence measure for nonnegative matrix factorization," Neural Computation, vol. 19, no. 3, pp. 780–791, 2007.
[20] A. Cichocki, R. Zdunek, and S. Amari, "Csiszár's divergences for non-negative matrix factorization: Family of new algorithms," in Proc. Int. Conf. on Independent Component Analysis and Signal Separation, 2006, pp. 32–39.
[21] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the β-divergence," Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.
[22] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," J. Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
[23] J. Eggert and E. Körner, "Sparse coding and NMF," in Proc. of the Int. Joint Conference on Neural Networks, 2004, vol. 4, pp. 2529–2533.
So we fill the Hankel matrices $H_n^{(r)}(k)$ with the samples $f_{ik}$ instead of the samples $f_i$, $i = 0, \ldots, 2n-1$. To avoid confusion we denote the latter ones by

$$H_n^{(r)}(k) := \begin{pmatrix} f_{rk} & \cdots & f_{(r+n-1)k} \\ \vdots & & \vdots \\ f_{(r+n-1)k} & \cdots & f_{(r+2n-2)k} \end{pmatrix}.$$

Again these Hankel matrices can be extended to rectangular ones of size $m \times n$ with $m > n$. From $\lambda_i^k = \exp(k\phi_i\Delta)$ the imaginary part of $\phi_i$ can no longer be retrieved uniquely, because now

$$|\Im(k\phi_i\Delta)| < k\pi.$$

So aliasing may have occurred: because of the periodicity of the function $\exp(i2\pi k\omega_i\Delta)$, a total of $k$ values in the $2k\pi$ wide interval can be identified as plausible values for $2\pi\omega_i$. Note that when the original $\lambda_i$ are clustered, the powered $\lambda_i^k$ may be distributed quite differently and unclustered. Such a relocation of the generalized eigenvalues may significantly improve the conditioning of the Hankel matrices involved.

What remains is to investigate how to solve the aliasing problem in the imaginary parts $2\pi\omega_i$ of the $\phi_i$. So far, from $\lambda_i^k$, we only have aliased values for $\omega_i$. But this aliasing can be fixed at the expense of a small number of additional samples. In what follows, $n$ can everywhere be replaced by $\nu > n$ when using $\nu - n$ additional terms to model the noise, and all square $\nu \times \nu$ matrices can be replaced by rectangular $\mu \times \nu$ counterparts with $\mu > \nu$. To fix the aliasing, we add $n$ samples to the set $\{f_0, f_k, \ldots, f_{(2n-1)k}\}$, namely at the shifted points

$$t_{jk+\kappa} = jk\Delta + \kappa\Delta, \qquad j = r, \ldots, r+n-1, \quad 0 \le r \le n.$$

An easy choice for $\kappa$ is a small number relatively prime with $k$ (for the most general choice allowed, we refer to [28]). With the additional samples we proceed as follows.

From the samples $\{f_0, f_k, \ldots, f_{(2n-1)k}, \ldots\}$ we compute the generalized eigenvalues $\lambda_i^k = \exp(\phi_i k\Delta)$ and the coefficients $\alpha_i$ going with $\lambda_i^k$ in the model

$$\phi(jk\Delta) = \sum_{i=1}^{n} \alpha_i \exp(\phi_i jk\Delta) = \sum_{i=1}^{n} \alpha_i \lambda_i^{jk}, \qquad j = 0, \ldots, 2n-1. \tag{7}$$

So we know which coefficient $\alpha_i$ goes with which generalized eigenvalue $\lambda_i^k$, but we just cannot identify the correct $\Im(\phi_i)$ from $\lambda_i^k$. The samples at the additional points $t_{jk+\kappa}$ satisfy

$$\phi(jk\Delta + \kappa\Delta) = \sum_{i=1}^{n} \alpha_i \exp\big(\phi_i(jk+\kappa)\Delta\big) = \sum_{i=1}^{n} (\alpha_i\lambda_i^{\kappa})\,\lambda_i^{jk}, \qquad j = r, \ldots, r+n-1, \tag{8}$$

which can be interpreted as a linear system with the same coefficient matrix as (7), but now with a new left hand side and unknowns $\alpha_1\lambda_1^{\kappa}, \ldots, \alpha_n\lambda_n^{\kappa}$ instead of $\alpha_1, \ldots, \alpha_n$. And again we can associate each computed $\alpha_i\lambda_i^{\kappa}$ with the proper generalized eigenvalue $\lambda_i^k$. Then, by dividing the $\alpha_i\lambda_i^{\kappa}$ computed from (8) by the $\alpha_i$ computed from (7), for $i = 1, \ldots, n$, we obtain from $\lambda_i^{\kappa}$ a second set of $\kappa$ plausible values for $\omega_i$. Because we choose $\kappa$ and $k$ relatively prime, the two sets of plausible values for $\omega_i$ have only one value in their intersection [29]. Thus the aliasing problem is solved.

5. ADDING VALIDATION TO THE ANALYSIS METHOD

The Padé view from Section 3 can now nicely be combined with the regularization technique from Section 4. The downsampling option uses only $1/k$ of the overall sequence of samples $f_0, f_1, \ldots, f_{2\nu-1}, \ldots$ taken at equidistant points $t_j = j\Delta$, $j = 0, 1, 2, \ldots$, plus an additional set of samples at shifted locations to get rid of some possible aliasing effect. So for a fixed $k > 0$ the full sequence of sample points can be divided into $k$ downsampled subsequences

$$t_0, t_k, t_{2k}, \ldots$$
$$t_1, t_{k+1}, t_{2k+1}, \ldots$$
$$\vdots$$
$$t_{k-1}, t_{2k-1}, t_{3k-1}, \ldots$$

For each downsampled subsequence $t_{\ell+kj}$, $\ell = 0, \ldots, k-1$, a sequence of shifted sample points $t_{\ell+kj+\kappa}$ can also be extracted from the original full sequence $t_j$, as long as $\gcd(k, \kappa) = 1$. Actually, the computation of $\lambda_i^{\kappa}$ can be improved numerically by considering the following shift strategy instead of a single shift. After the first shift by $\kappa$ of the sample points to $t_{\ell+kj+\kappa}$, multiples of the shift $\kappa$ can be considered. By sampling at $t_{\ell+kj+h\kappa}$, $h = 1, \ldots, H = \lfloor(N-k-1-2nk)/\kappa\rfloor$, the values $\alpha_i\lambda_i^{\kappa}, \alpha_i\lambda_i^{2\kappa}, \ldots, \alpha_i\lambda_i^{H\kappa}$ are obtained, which can be considered as the samples $\psi(1), \psi(2), \ldots, \psi(H)$ of a mono-exponential analysis problem

$$\psi(h) = \alpha_i\lambda_i^{\kappa h}. \tag{9}$$

From these we can compute $\lambda_i^{\kappa}$ using (2) for a single exponential term. For this, the Hankel matrices are again best enlarged to rectangular matrices. This sparse interpolation replaces the division $\alpha_i\lambda_i^{\kappa}/\alpha_i$ by a more accurate procedure.

By repeating the computation of $\lambda_i^k$ from (7) and $\lambda_i^{\kappa}$ from (9) for each $\ell = 0, \ldots, k-1$, we obtain $k$ independent problems of the form (8) instead of just one. In each of these – assuming that we overshoot the true number of components $n$ in (7) and (8) by considering $\nu - n$ – the true parameters $\phi_i$ from the model (1) appear as $n$ stable poles in the Padé-Laplace method, while the $\nu - n$ spurious noisy poles behave in an unstable way. In fact, each downsampled sequence can be seen as a different noise realization while the underlying function $\phi(t)$ remains the same. So the generalized eigenvalues related to the signal $\phi(t)$ cluster near the true $\lambda_i^k$, while the other generalized eigenvalues belong to independent noise realizations and do not form clusters anywhere [23, 24].

The gain of considering $k$ downsampled multi-exponential analysis problems rather than one large problem is twofold:

• A cluster analysis algorithm can detect the number of clusters in the complex plane, and hence deliver the number $n$ of components in the model of the form (1). So there is no need to estimate the number $n$ separately, by means of an SVD of $H_\nu^{(0)}$ with $\nu > n$ for instance.
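The coprime-shift disambiguation used above can be illustrated with a small numeric sketch. This is a toy, not the paper's implementation: it assumes $\Delta = 1$, a single noiseless exponential, and exactly known $\lambda^k$ and $\lambda^\kappa$; the function names are ours. From $\lambda^k$ only $k\omega \bmod 1$ is known, giving $k$ aliased candidates for the normalized frequency $\omega$, and likewise $\kappa$ candidates from $\lambda^\kappa$; with $\gcd(k, \kappa) = 1$ the two candidate sets intersect in exactly one value.

```python
import cmath
from math import gcd

def candidates(theta, s):
    # From lambda**s only s*omega mod 1 (= theta) is known, so there are
    # s aliased candidates for the normalised frequency omega in [0, 1).
    return {round(((theta + m) / s) % 1.0, 9) for m in range(s)}

def resolve_aliasing(omega, k=4, kappa=3):
    # Toy version of the intersection argument, with k and kappa coprime.
    assert gcd(k, kappa) == 1
    theta_k = (cmath.phase(cmath.exp(2j * cmath.pi * k * omega)) / (2 * cmath.pi)) % 1.0
    theta_kappa = (cmath.phase(cmath.exp(2j * cmath.pi * kappa * omega)) / (2 * cmath.pi)) % 1.0
    # The two candidate sets share exactly one value when gcd(k, kappa) = 1.
    return candidates(theta_k, k) & candidates(theta_kappa, kappa)

print(resolve_aliasing(0.35, k=4, kappa=3))
```

With $k = 4$ the candidates for $\omega = 0.35$ are $\{0.1, 0.35, 0.6, 0.85\}$ and with $\kappa = 3$ they are $\{0.0167, 0.35, 0.6833\}$; only the true frequency survives the intersection.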
[...] DBSCAN [30]. Since the cluster radii may vary, we typically per- [...] run allows to detect the less dense clusters of generalized eigenvalues.

After obtaining a center of gravity as approximation for the $\lambda_i^k$ and a center of gravity as approximation for the $\lambda_i^{\kappa}$ associated to the $\lambda_i^k$, the intersection of the solution sets for $\omega_i$ can be taken. We simply build a distance matrix and look for the closest match between [...] and [...]

We point out and emphasize that the above technique can be combined with any implementation to solve problem (1), more precisely (7) and (8), popular methods being [31, 10, 11].

To illustrate the validation aspect and how robust it is in the presence of outliers, we consider 400 audio samples of the sustained part of an A4 note played by a trumpet, corrupted by an outlier as shown in Figure 1. We put $k = 4$ and $\kappa = 3$ and compare the validation to a standard implementation of ESPRIT. As can be seen in the ESPRIT reconstruction in Figure 1 (top), it suffers from the presence of the outlier, as any parametric method would. The new method, illustrated in Figure 1 (bottom), deals with $k$ decimated signals instead of the full signal and is more robust. In both approaches we choose $n = 20$. While the ESPRIT algorithm deals with a Hankel matrix of size $260 \times 141$, the cluster analysis only needs Hankel matrices of size $70 \times 30$. When the recording is corrupted by an outlier, here only one of the $k$ decimated signals is affected. The decimated sample set that contains the outlier does not contribute to the formed clusters. But the cluster algorithm still detects clusters composed of at least $k - 1$ eigenvalues at the correct locations $\lambda_i^k$. Since one easily identifies the decimated signal that did not contribute to all clusters, the equations coming from [...]

Figure 1: Corrupted trumpet recording reconstructed by ESPRIT (top, circles) and by the new method (bottom, crosses).

[...] the ESPRIT algorithm [31]. After superimposing the $k = 5$ analyses of the downsampled audio windows, a cluster analysis using DBSCAN is performed for each window, thus retrieving the generalized eigenvalues $\lambda_i^k$ most accurately.

To illustrate the regularization effect on the rectangular $143 \times 61$ analogon of (2) from choosing $k > 1$, we show in Figure 2 the distribution of the generalized eigenvalues $\lambda_i$, $i = 1, \ldots, n$ of the full, not downsampled, 60-th window starting at $t_{15104}$, opposed to that of the $\lambda_i^5$, $i = 1, \ldots, n$ of the downsampled set of samples from the same window.
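The cluster-based validation can be mimicked with a small self-contained experiment. This is a sketch under simplified assumptions, not the authors' code: the generalized eigenvalues are synthetic, a minimal DBSCAN replaces a production implementation, and the choice `min_pts = k - 1` reflects the rule that signal clusters contain at least $k - 1$ recurring eigenvalues. The number of clusters then directly delivers the model order $n$, without a separate SVD-based estimate.

```python
import cmath
import random

def dbscan(pts, eps, min_pts):
    # Minimal DBSCAN (Ester et al. [30]) over points in the complex plane.
    UNSEEN, NOISE = -2, -1
    lab = [UNSEEN] * len(pts)
    nbrs = lambda i: [j for j in range(len(pts)) if abs(pts[i] - pts[j]) <= eps]
    cid = -1
    for i in range(len(pts)):
        if lab[i] != UNSEEN:
            continue
        seeds = nbrs(i)
        if len(seeds) < min_pts:
            lab[i] = NOISE
            continue
        cid += 1
        lab[i] = cid
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if lab[j] == NOISE:
                lab[j] = cid          # border point adopted by the cluster
            if lab[j] != UNSEEN:
                continue
            lab[j] = cid
            if len(jn := nbrs(j)) >= min_pts:
                queue.extend(jn)      # j is a core point: keep expanding
    return lab

# Simulated generalized eigenvalues from k = 5 decimated analyses: three
# "signal" eigenvalues recur (slightly perturbed) in every analysis, while
# two spurious "noise" eigenvalues per analysis land anywhere.
random.seed(7)
k = 5
signal = [cmath.exp(2j * cmath.pi * f) for f in (0.12, 0.27, 0.41)]
evals = []
for _ in range(k):
    evals += [s + complex(random.gauss(0, 0.005), random.gauss(0, 0.005)) for s in signal]
    evals += [random.uniform(1.2, 1.5) * cmath.exp(2j * cmath.pi * random.random())
              for _ in range(2)]
labels = dbscan(evals, eps=0.05, min_pts=k - 1)
n_components = len({c for c in labels if c >= 0})
print("detected components:", n_components)
```

The recurring eigenvalues form three clusters, so the detected model order is 3; the ten spurious eigenvalues never recur and are labeled as noise.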
Figure 3: Cluster of $\lambda_{15}^5$ values (left) and $\lambda_{15}^7$ values (right).

[...] additional step can be performed to estimate the base frequency (per window) more accurately. Once the stable frequencies $\phi_i$, $i = 1, \ldots, n$ are retrieved, we look for a harmonic relation between them. We divide every detected harmonic partial $\phi_i$ by the integers $j = 1, \ldots, 40$ (which is the largest number of partials expected) and we add these quotients to the discovered $\phi_i$, in this way creating a new larger cluster at the base frequency, which we call $\phi_1$. The center of gravity of this larger cluster estimates the lowest partial of the harmonics. Using this estimate of the base frequency $\phi_1$, all higher harmonic partials $j\phi_1$, $j = -60, \ldots, 60$ are reconstructed and substituted in one large rectangular $512 \times 121$ Vandermonde system (4), which serves as the coefficient matrix for the computation of the $\alpha_i$, $i = 1, \ldots, 121$.

While moving from one window to the next over the course of the 9 seconds, the higher harmonic partials that are detected become weaker and fewer. So $n$ decreases with time. We refer to Figure 4, where we again show the generalized eigenvalues $\lambda_i$ and $\lambda_i^5$, before and after regularization, now for one of the middle audio windows. Fortunately, the number of partials remains large enough during the whole audio fragment to rebuild the harmonics as described. Since the final reconstructed guitar sound only makes use of the $\phi_i$ from the stable generalized eigenvalues, the reconstruction, which can be downloaded from https://www.uantwerpen.be/br-cu-ea-val-17/, does not suffer from the electrical hum anymore.

Figure 4: Generalized eigenvalues $\lambda_i$ (left) versus $\lambda_i^5$ (right) from a middle window.

7. CONCLUSION

We present an approach that can be embedded in several parametric methods. We exploit the aliasing phenomenon caused by sub-Nyquist sampling and the connections with sparse interpolation and Padé approximation. The result is a parametric method that is able to discern between the stable and unstable components of a multi-exponential model. It can thus remove outliers and/or the noisy part of the signal. We illustrate our approach on a harmonic sound where the validated output is used to refine the estimation of the lowest partial and reconstruct the signal, thus eliminating the [...] have not fully exploited the potential of the presented method yet. In the future we plan to apply the techniques described in Section 5 to decaying signals and signals with modulated amplitudes. Our [...]tions and connections. Actually, every audio algorithm that makes use of a parametric method may benefit from the ideas presented here.

8. REFERENCES

[1] S. Marchand, "Fourier-based methods for the spectral analysis of musical sounds," in 21st European Signal Processing Conference (EUSIPCO 2013), Sept. 2013, pp. 1–5.

[2] R. Badeau, G. Richard, and B. David, "Performance of ESPRIT for estimating mixtures of complex exponentials modulated by polynomials," IEEE Transactions on Signal Processing, vol. 56, no. 2, pp. 492–504, Feb. 2008.

[3] J. Nieuwenhuijse, R. Heusens, and E. F. Deprettere, "Robust exponential modeling of audio signals," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, May 1998, vol. 6, pp. 3581–3584.

[4] J. Jensen, R. Heusdens, and S. H. Jensen, "A perceptual subspace approach for modeling of speech and audio signals with damped sinusoids," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 2, pp. 121–132, March 2004.

[5] Kris Hermus, Werner Verhelst, Philippe Lemmerling, Patrick Wambacq, and Sabine Van Huffel, "Perceptual audio modeling with exponentially damped sinusoids," Signal Processing, vol. 85, no. 1, pp. 163–176, 2005.

[6] Romain Couillet, "Robust spiked random matrices and a robust G-MUSIC estimator," Journal of Multivariate Analysis, vol. 140, pp. 139–161, 2015.

[8] Annie Cuyt and Wen-shin Lee, "Smart data sampling and data reconstruction," Patent PCT/EP2012/066204.

[9] Annie Cuyt and Wen-shin Lee, "Sparse interpolation and rational approximation," in Contemporary Mathematics, vol. 661, pp. 229–242, American Mathematical Society, 2016.

[10] Yingbo Hua and Tapan K. Sarkar, "Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise," IEEE Trans. Acoust. Speech Signal Process., vol. 38, pp. 814–824, 1990.

[11] Delin Chu and Gene H. Golub, "On a generalized eigenvalue problem for nonsquare pencils," SIAM Journal on Matrix Analysis and Applications, vol. 28, no. 3, pp. 770–787, 2006.

[12] P. Henrici, Applied and Computational Complex Analysis I, John Wiley & Sons, New York, 1974.
[13] Erich Kaltofen and Wen-shin Lee, "Early termination in sparse interpolation algorithms," J. Symbolic Comput., vol. 36, no. 3–4, pp. 365–400, 2003, International Symposium on Symbolic and Algebraic Computation (ISSAC 2002) (Lille).

[14] Walter Gautschi, "Norm estimates for inverses of Vandermonde matrices," Numer. Math., vol. 23, pp. 337–347, 1975.

[15] B. Beckermann, G. H. Golub, and G. Labahn, "On the numerical condition of a generalized Hankel eigenvalue problem," Numer. Math., vol. 106, no. 1, pp. 41–68, 2007.

[16] Zeljko Bajzer, Andrew C. Myers, Salah S. Sedarous, and Franklyn G. Prendergast, "Padé-Laplace method for analysis of fluorescence intensity decay," Biophys. J., vol. 56, no. 1, pp. 79–93, 1989.

[17] G. A. Baker, Jr. and P. Graves-Morris, Padé Approximants (2nd ed.), vol. 59 of Encyclopedia of Mathematics and its Applications, Cambridge University Press, 1996.

[18] J. Nuttall, "The convergence of Padé approximants of meromorphic functions," J. Math. Anal. Appl., vol. 31, pp. 147–153, 1970.

[19] Ch. Pommerenke, "Padé approximants and convergence in capacity," J. Math. Anal. Appl., vol. 41, pp. 775–780, 1973.

[20] J. L. Gammel, "Effect of random errors (noise) in the terms of a power series on the convergence of the Padé approximants," in Padé Approximants, P. R. Graves-Morris, Ed., 1972, pp. 132–133.

[21] J. Gilewicz and M. Pindor, "Padé approximants and noise: a case of geometric series," J. Comput. Appl. Math., vol. 87, pp. 199–214, 1997.

[22] J. Gilewicz and M. Pindor, "Padé approximants and noise: rational functions," J. Comput. Appl. Math., vol. 105, pp. 285–297, 1999.

[23] P. Barone, "On the distribution of poles of Padé approximants to the Z-transform of complex Gaussian white noise," Journal of Approximation Theory, vol. 132, no. 2, pp. 224–240, 2005.

[24] L. Perotti, T. Regimbau, D. Vrinceanu, and D. Bessis, "Identification of gravitational-wave bursts in high noise using Padé filtering," Phys. Rev. D, vol. 90, pp. 124047, Dec. 2014.

[25] D. Bessis, "Padé approximations in noise filtering," J. Comput. Appl. Math., vol. 66, pp. 85–88, 1996.

[26] Pedro Gonnet, Stefan Güttel, and Lloyd N. Trefethen, "Robust Padé approximation via SVD," SIAM Rev., vol. 55, pp. 101–117, 2013.

[27] O. L. Ibryaeva and V. M. Adukov, "An algorithm for computing a Padé approximant with minimal degree denominator," J. Comput. Appl. Math., vol. 237, no. 1, pp. 529–541, 2013.

[28] Annie Cuyt and Wen-shin Lee, "An analog Chinese Remainder Theorem," Tech. Rep., Universiteit Antwerpen, 2017.

[29] Annie Cuyt and Wen-shin Lee, "How to get high resolution results from sparse and coarsely sampled data," Tech. Rep., Universiteit Antwerpen, 2017.

[30] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), 1996, pp. 226–231, AAAI Press.

[31] R. Roy and T. Kailath, "ESPRIT – estimation of signal parameters via rotational invariance techniques," IEEE Trans. Acoust. Speech Signal Process., vol. 37, no. 7, pp. 984–995, July 1989.

[32] Frederic Font, Gerard Roma, and Xavier Serra, "Freesound technical demo," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), New York, NY, USA, 2013, pp. 411–412, ACM.

[33] Xavier Serra and Julius Smith, "Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition," Computer Music Journal, vol. 14, no. 4, pp. 12, 1990.

[34] R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986.
[...] (PVC) from the degraded signal. Then, a time-varying resampling algorithm is applied on the PVC. One of the first proposed methods following this guideline is [4, 5], where the curve is estimated via a statistical procedure from the spectrogram of the degraded signal and then used to resample it. A drawback of this method is that it is quite computationally intensive. This same idea was also explored in [6, 7], where an improved method for estimating the [...]

∗ The authors thank the CAPES and CNPq Brazilian agencies for funding this work.
1 http://www.celemony.com/en/capstan

[...] where $A_j[n]$ and $\Psi_j[n]$ represent the time-varying envelope and the phase modulations of each component $j$, respectively.

The main goal is to estimate the parameters $A_j[n]$ and $\Psi_j[n]$ from the respective audio excerpt. With this set of parameters in hand, several tasks could be performed, for example, feature extraction or some modification and posterior re-synthesis of the signal [9]. In our case, this framework is used to estimate the PVC, based on the idea that all the frequencies present in an audio signal [...]
Figure 1: Main steps of the proposed method: $x[n]$ → Sinusoidal Analysis → Pitch Variation Curve → Non-Uniform Resampling → $x_{res}[n]$.

[...] degraded by speed variation must contain some degree of deviation, as discussed further in Section 5. Therefore, in this work it is more important to estimate the frequencies and amplitudes present within short excerpts of the audio signal than other quantities, and this estimation is performed as follows (see [10] for more details):

1. The whole signal $x[n]$ is segmented, with each segment being multiplied by a window function $w[n]$ (here the Hann window was employed) of length $N_W$, and contiguous segments have an overlap of $N_H$ samples. Denote the $n$-th sample of the $b$-th block as $x_b[n]$, for $n = 1, \ldots, N_W$;

2. The $N_{FFT}$-point DFT of each segment is computed, its result being denoted by $X(k, b)$, representing the coefficient of the $k$-th frequency bin in the $b$-th block;

3. The most prominent peaks of each block that are more likely to be related to tonal components of the underlying signal are estimated via a procedure described in Section 3, based on [11];

4. Finally, in order to form tracks with the peaks estimated in the previous step, a simple algorithm is applied, where essentially a peak in some particular block is associated with the closest peak in the following block. This is a modification of the MQ algorithm [12], and is explained in more detail in Section 4. The frequency of the $i$-th track in the $b$-th block of the signal is denoted by $f_i[b]$.

Now, using a weighted average of the previously obtained tracks, the PVC is estimated via a simple procedure described in detail in Section 5. Finally, the PVC is given, together with the degraded signal, as input to a non-uniform resampling algorithm, detailed in Section 6. In the next sections, the aforementioned steps are thoroughly discussed.

3. PEAK DETECTION

Since the presented method for estimating the PVC is based on sinusoidal analysis, it is important to employ a peak detection algorithm that rejects noise-induced peaks. Several methodologies have been proposed to detect tonal peaks in audio signals, including threshold-based methods [9, 13] and statistical analysis [14].

To separate genuine peaks from spurious ones, we adopt the Tonalness Spectrum, introduced in [11], since it is an easily extensible and flexible framework for sinusoidal analysis. This is a non-binary representation which indicates the likelihood of a spectral bin being a tonal or non-tonal component. In this metric, a set of spectral features $V = \{v_1, v_2, \ldots, v_V\}$ is computed from the signal spectrum and combined to produce the overall tonalness spectrum:

$$T(k, b) = \left(\prod_{i=1}^{V} t_i(k, b)\right)^{1/\eta}, \tag{2}$$

where $t_i(k, b) \in [0, 1]$, which is referred to as the specific tonal score, is calculated for each extracted feature $v_i$ according to:

$$t_i(k, b) = \exp\left(-[\epsilon_i \cdot v_i(k, b)]^2\right). \tag{3}$$

This measure can be explained as the probability of the given feature presenting a tonal component in the bin $k$ of block $b$. The factor $\epsilon_i$ is a normalization constant which ensures that all specific features will contribute equally to the tonalness spectrum when combining them, and it is obtained by setting in Eq. 3 the specific tonal score of the median of the feature in each block to 0.5. This yields the following expression for the normalization constant:

$$\epsilon_i = \frac{\sqrt{\log(2)}}{\overline{m_{v_i}}}, \tag{4}$$

where $m_{v_i}(b)$ is the median value of feature $i$ in the block $b$ and $\overline{m_{v_i}}$ is the mean over all blocks of all files (in the case that a dataset is being analyzed).

The feature set comprises simple and established features, including a few that are purely based on information from the current magnitude spectrum of a block, such as frequency deviation, peakiness and amplitude threshold; some that are based on spectral changes over time, such as amplitude and frequency continuity; and one feature that is based on the phase and amplitude of the signal, which is the time window center of gravity.

Although the proposed method for estimating speed variations is based on a time-frequency representation, the peak detection stage is characterized by analyzing each frame individually. Therefore we take into account only those features which extract information from a single block.

The evaluation of results in [11] showed that, although the combination of features intuitively and empirically performs better than individual scores, combinations of more than three features did not achieve better representations. Moreover, it was also reported that combination with a simple product, that is, setting $\eta$ to 1 in Eq. 2, yielded better results than with the distorted geometric mean.

Taking this information into account, our tonalness spectrum is formed by the combination of the amplitude threshold and peakiness features, since in [15] the combination of these features achieved good results on the detection of fundamental frequencies and their harmonics in music signals. Therefore, in our case, the expression in Eq. 2 can be simplified to:

$$T(k, b) = t_{PK}(k, b) \cdot t_{AT}(k, b), \tag{5}$$

with $t_{PK}(k, b)$ and $t_{AT}(k, b)$ being the specific tonal scores of the peakiness and amplitude threshold, respectively.

The peakiness feature measures the height of a spectral sample in relation to its neighboring bins, and it is defined as:

$$v_{PK}(k, b) = \frac{|X(k+p, b)| + |X(k-p, b)|}{|X(k, b)|}, \tag{6}$$

where the distance $p$ to the central sample should approximately correspond to the spectral main lobe width of the adopted window function when segmenting the signal, so side lobe peaks can be avoided.

The amplitude threshold feature measures the relation of the magnitude spectrum to an adaptive magnitude threshold, and it is defined as:

$$v_{AT}(k, b) = \frac{r_{TH}(k, b)}{|X(k, b)|}, \tag{7}$$
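The tonalness computation for a single block can be sketched in a few lines of pure Python. This is an illustrative toy, not the paper's implementation: the normalization uses the per-block median in place of the dataset-wide mean of medians of Eq. 4, the spectrum edges wrap around for simplicity, and the helper names (`specific_score`, `tonalness`) are ours.

```python
from math import exp, log, sqrt

def specific_score(v, eps_i):
    # Eq. (3): t_i(k, b) = exp(-(eps_i * v_i(k, b))^2)
    return exp(-(eps_i * v) ** 2)

def tonalness(mag, p=2, beta=0.9):
    """Toy tonalness spectrum for one block: the product of the peakiness
    and amplitude-threshold scores (Eq. 5). `mag` is a magnitude spectrum."""
    K = len(mag)
    # Eq. (6): peakiness, with neighbours at distance p from the centre bin.
    v_pk = [(mag[(k + p) % K] + mag[(k - p) % K]) / max(mag[k], 1e-12)
            for k in range(K)]
    # Eq. (8): recursive smoothing over frequency, forward then backward.
    r = mag[:]
    for k in range(1, K):
        r[k] = beta * r[k - 1] + (1 - beta) * mag[k]
    for k in range(K - 2, -1, -1):
        r[k] = beta * r[k + 1] + (1 - beta) * r[k]
    # Eq. (7): amplitude-threshold feature.
    v_at = [r[k] / max(mag[k], 1e-12) for k in range(K)]
    # Eq. (4), per-block stand-in: eps_i = sqrt(log 2) / median(v_i), so the
    # score of the median feature value is exactly 0.5.
    def eps(v):
        return sqrt(log(2)) / max(sorted(v)[len(v) // 2], 1e-12)
    e_pk, e_at = eps(v_pk), eps(v_at)
    # Eq. (5): T(k, b) = t_PK(k, b) * t_AT(k, b).
    return [specific_score(v_pk[k], e_pk) * specific_score(v_at[k], e_at)
            for k in range(K)]

# A flat spectrum with one strong tonal bin: the tonal bin scores near 1,
# while flat "noise" bins stay well below it.
mag = [1.0] * 64
mag[10] = 20.0
T = tonalness(mag)
```

For the tonal bin the peakiness and amplitude-threshold features are both small, so both specific scores are close to 1; a flat bin sits near the median of each feature and scores about 0.5 per feature.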
where $r_{TH}(k, b)$ is a recursively smoothed version of the magnitude spectrum:

$$r_{TH}(k, b) = \beta \cdot r_{TH}(k-1, b) + (1-\beta) \cdot |X(k, b)|, \tag{8}$$

with this filter being applied in both the forward and backward directions in order to adjust the group delay, and $\beta \in [0, 1]$ being a factor that is empirically tweaked.

3.1. Peak Selection

After computing the tonalness spectrum, all peaks $k_i$ are selected in each block, and those which do not fulfill the following criterion are discarded:

$$T(k_i, b) \ge T_{TH}, \tag{9}$$

where $T_{TH} \in [0, 1]$ is an empirically adjusted likelihood threshold.

Moreover, since our tonalness spectrum measurement evaluates peaky components and their surroundings independently of their absolute magnitude amplitudes, some small and insignificant peaks could present a high tonalness likelihood and thus be selected. Hence the following criterion is employed in each block to discard such irrelevant peaks:

$$|X(k_i, b)| \ge \gamma \cdot \max_k |X(k, b)|, \tag{10}$$

where $\gamma$ is a percentage factor, and satisfactory peak selections were obtained by setting this factor to around 1%.

Figure 2 illustrates the peak selection stage in a block of an audio signal containing a single note played by a flute. The criteria in Eq. 9 and 10 were set to 0.8 and 1%, respectively. It can be seen that the tonalness spectrum is a powerful representation of tonal components when comparing it with the original magnitude spectrum.

Figure 2: Illustration of the peak detection stage. The thicker and brighter line represents the tonalness spectrum of a block from an audio signal, whose magnitude spectrum is indicated by the thinner and darker line. The likelihood threshold is represented by the horizontal dashed line and the selected peaks are marked with asterisks.

4. PARTIAL TRACKING

To conclude the sinusoidal analysis stage, the spectral peaks selected in each frame are grouped into time-changing trajectories in which both frequency and amplitude can vary. This process is referred to as partial tracking.

Several methods have been proposed to track spectral peaks. Relevant works include the classical McAulay & Quatieri (MQ) algorithm [16], its extended version using linear prediction [17], and a solution via recursive least-squares (RLS) estimation [18].

In this work, a modified version of the MQ algorithm is employed, which is described in [12], since this is a computationally efficient technique that is easily implementable and achieves a good representation of sinusoidal tracks.

In this algorithm, a track can be marked with three different labels: it emerges when a peak is not associated with any existing track, remains active while it is associated with peaks, and vanishes when it finds no compatible peak to incorporate. Defining $f_{i,b}$ and $A_{i,b}$ as the frequency and magnitude amplitude of the $i$-th detected peak in frame $b$, the algorithm can be explained as follows:

1. For each peak $f_{j,b+1}$ a search is performed to find a peak $f_{i,b}$ from a track which had remained active until the frame $b$, satisfying the condition $|f_{i,b} - f_{j,b+1}| < \Delta f_{i,b}$. The parameter $\Delta f_{i,b}$ controls the maximum frequency variation, and is set to a quarter tone around $f_{i,b}$.

2. If the peak $f_{j,b+1}$ finds a corresponding track in the previous frame satisfying the condition described in step 1, it associates with this track, which remains active. If two or more peaks satisfy the condition, the peak that minimizes

$$J = (1-\kappa)\frac{|f_{i,b} - f_{j,b+1}|}{f_{i,b}} + \kappa\frac{|A_{i,b} - A_{j,b+1}|}{A_{i,b}} \tag{11}$$

is selected, where $\kappa \in [0, 1]$ is a weighting parameter that controls the influence of the relative frequency and amplitude differences in the cost function in Eq. 11.

3. When the peak of a track in $b$ is not associated with any peak in $b+1$ satisfying the condition, it is marked as vanishing and a virtual peak with the same frequency and amplitude is created in $b+1$. When the track reaches $D$ consecutive virtual peaks, it is then terminated.

Except for the first frame, where all the peaks invariably start new tracks, these steps are performed in all frames, until all peaks are labeled. Lastly, short tracks whose length is less than a number $E$ of frames are removed.

5. GLOBAL PITCH VARIATION CURVE

A consequence of speed variations in musical signals is the deviation of all their frequencies by the same percentage factor, that is, a pitch modification. Therefore, a curve which combines the variations of the main frequency components of the signal is a suitable metric for estimating the defect.

In [4, 5] a Bayesian procedure is proposed to estimate the pitch variation curve, but it is quite computationally expensive. An alternative is to determine the distortion using only the most prominent spectral component of the signal, a technique proposed in [8]. However, this method is not practical, and such a spectral component may contain frequency modulations not corresponding to speed variations, interfering with the correct curve estimation.

Our proposed approach consists of calculating a weighted average curve from the computed tracks, thus exhibiting an overall [...]
[...] the detected peaks. The motivation for weighting the curve is that the most important tracks should contribute more to the PVC than the less prominent ones. This metric can be interpreted as an extended version of the method proposed in [8]. Since this metric takes into account more frequency components and is refined by their magnitude amplitudes, it is expected that this overall pitch variation curve can be more reliable than taking only one component.

For aesthetic reasons, the presence of notes played with ornaments, such as vibrato, is quite common in music recordings. This is often seen in bowed string or woodwind instruments, and in the singing voice as well. Thus, when tracking an instrument recording with such an effect, parts where vibrato occurred could be erroneously detected as speed variations. However, if the instrument is in a recording with other musical instruments, which would not necessarily be synchronized with its vibrato, that effect would be attenuated by computing an average of the tracks. Nevertheless, the method presented in this paper can be implemented in a way that the user may select the specific parts of the audio signal to be restored.

5.1. Extraction of the weighted global average of the tracks

In the first step of the global pitch variation curve extraction, the means of all tracks are shifted to zero, and then each track is normalized by its respective frequency average in time, this way obtaining the percentage average of each sinusoidal trajectory. For a track $i$, this stage is mathematically described as:

$$\tilde{f}_i[b] = \frac{f_i[b] - \bar{f}_i}{\bar{f}_i}, \tag{12}$$

with $f_i[b]$ being the frequency of the $i$-th track in the frame $b$, and $\bar{f}_i$ being the arithmetic mean of all frequencies in the $i$-th track.

Following that, the weighted average of each frame $b$ is calculated:

$$\tilde{f}'[b] = \frac{\sum_i A_i[b]^{\alpha}\,\tilde{f}_i[b]}{\sum_i A_i[b]^{\alpha}}, \tag{13}$$

where $A_i[b]$ represents the magnitude amplitude of the $i$-th track in the frame $b$ and $\alpha \in [0, 1]$ is a parameter that controls the influence of large amplitudes over smaller ones, which we will from now on refer to as the weighting factor. The motivation for this parameter is that the magnitude amplitudes of the harmonic components of tonal peaks drop sharply as the harmonic order increases, and one may desire to increase the participation of such medium and small amplitude harmonics in the computation of the pitch variation curve.

It can be noticed that when $\alpha = 0$, Eq. 13 reduces to the arithmetic mean of the tracks, indicating that all tracks will equally contribute to the PVC; analogously, when $\alpha = 1$, the PVC [...]

[...] where $N_{MA}$ is the order of the filter, and the final PVC is then the vector $\tilde{\mathbf{f}}$, whose components are $\tilde{f}[b]$, for $b = 1, \ldots, N_B$, with the latter term being the total number of blocks.

It is then expected that the pitch variation curve of a signal exhibits values around zero. For a better interpretation of this curve, it is interesting to normalize it, which is realized by shifting its average to 1. This is a better notation when it is necessary to adopt a frequency reference. For example, curves related to parts in which there were no speed variations now exhibit values around 1, indicating that the frequencies of their tracks have all been multiplied by 1. If in a frame transition there is a deviation of 0.6%, this is now represented in the curve as 1.006, that is, all frequencies in this transition were on average multiplied by 1.006.

6. NON-UNIFORM RESAMPLING

In this section, it is described how non-uniform sampling rate conversion can be employed, starting from the PVC, to compensate speed variations in digital audio signals.

When an audio signal is played back at a rate different from that at which it was originally sampled, the perceived pitch is modified and its time duration is distorted. It turns out, as explained in Section 1, that speed variations yield exactly pitch and time variations. Therefore sampling rate conversion is a suitable technique for compensating such defects in digital versions of degraded recordings.

Furthermore, since speed variations are not constant in time, this sampling rate conversion must be non-uniform, which can also be referred to as time-varying resampling. In [3] different methods for "wow" reduction are compared, and the sinc-based interpolation [19] achieved the least distortion in reconstructed signals. Therefore, an algorithm was implemented that receives as input the digital version of the degraded signal and its PVC, and outputs the reconstructed signal using sinc-based time-varying resampling.

The sinc-based interpolation is closely related to the analog interpretation of resampling, which is the reconstruction of the continuous version from the discrete signal, followed by resampling it at the new desired rate. The expression for the converted signal $x_{rec}[n']$ with new rate $F_s'$ using the sinc-based resampling is given by [19]:

$$x_{rec}[n'] = \sum_{n=-N_T}^{N_T} x[n]\,\mathrm{sinc}\!\left(\pi\left(\frac{n'}{f_r} - n\right)\right), \tag{15}$$

where $f_r = F_s'/F_s$ is the resampling factor, i.e. the ratio between the desired and original sampling rates, and $2N_T + 1$ is the number of samples used in the interpolation. The reconstruction of a sample is illustrated in Figure 3.
is set so the small-amplitude peaks influence less in its estimation.
To minimize the effects of false tracks, the weighting factor would 6.1. Implementation
naturally be set to a value close to 1, and empirical tests indicate
that setting α between 0.7 and 1 achieve satisfying results. Since the pitch variation curve f̃ from Eq. 14 is normalized to 1,
The curve f˜0 [b] may present undesired small irregularities, each of its elements correspond exactly to the term fr in Eq. 15.
which are caused by frequency inaccuracies and the tracking of However, each element is associated to a block of the signal, which
non-tonal peaks which do not correspond to frequency compo- was segmented during the STFT procedure. Hence the transition
nents. Hence, this curved is smoothed by a moving average filter, between blocks must be modeled sample by sample.
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
1. [...] central sample of its respective block;

2. [...] by the sample in $n'$ in Figure 3);

   (a) This process inevitably involves numerical approximations, which in this context translate into phase deviations. Such approximations must be taken into account, and therefore compensated, in order to eliminate audible artifacts.

3. Finally, the expression in Eq. 15 is applied for each new sample.

Figure 4: Restoration of the signal 'orchestra'. In the first graph, the solid line and dotted line represent the estimated and true PVCs, respectively. The PVC of the restored signal is shown in the second graph.
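The resampling procedure can be sketched as follows. This is our own illustrative reading of Eq. (15), assuming the PVC has already been interpolated to one resampling factor $f_r$ per output sample, and omitting the phase-deviation compensation mentioned in step 2(a):

```python
import numpy as np

def resample_nonuniform(x, fr_per_sample, NT=100):
    """Sinc-based time-varying resampling after Eq. (15), a sketch.

    x: degraded signal; fr_per_sample: one resampling factor fr per
    output sample; NT: half-width of the interpolation, 2*NT + 1 taps.
    """
    y = np.zeros(len(fr_per_sample))
    t = 0.0                               # read position n'/fr on the input axis
    for i, fr in enumerate(fr_per_sample):
        center = int(round(t))
        n = np.arange(max(0, center - NT), min(len(x), center + NT + 1))
        # np.sinc(u) = sin(pi*u)/(pi*u), i.e. the sinc(pi(.)) of Eq. (15)
        y[i] = np.sum(x[n] * np.sinc(t - n))
        t += 1.0 / fr                     # the read head advances by 1/fr
    return y

# with fr = 1 everywhere, the signal is reproduced (to numerical precision)
x = np.sin(2 * np.pi * 0.05 * np.arange(200))
y = resample_nonuniform(x, np.ones(200), NT=20)
assert np.allclose(y, x, atol=1e-9)
```

Since $n'/f_r$ in Eq. (15) is the input-time position of output sample $n'$, a varying $f_r$ is handled by advancing the read position by $1/f_r$ per output sample.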
It is also worth mentioning that the correction of speed variations can be performed offline, so $N_T$ can be chosen large enough not to compromise the quality of the reconstructed signal. Experiments in [3] showed that using $N_T = 100$ achieved inaudible distortions. Moreover, our experiments with this number of samples, or even fewer, also did not degrade the signal.

7. RESULTS

This section presents the evaluation of the proposed algorithm. Several degraded signals were investigated, and two of them are shown in this paper: one from an artificially degraded recording, and one from a real recording. The audio signals presented here, as well as all code implemented in this work, are available in [20]. All tests were performed with the same set of parameters, which are summarized in Table 1.

The first test was performed on an excerpt of orchestral music containing long notes played by bowed instruments. An artificial sinusoidal pitch variation was imposed on this signal and the proposed method was applied, so that the true and estimated PVCs could be compared. Although the curves differ slightly in amplitude, as can be seen in the first graph of Fig. 4, a satisfactory correspondence can be observed between them. As can be seen in the second graph of Fig. 4, the pitch variation curve of the resampled signal shows only slight variations, which can be explained by the amplitude difference mentioned before; however, informal listening tests reported no audible pitch variations. It is also worth mentioning that there are no precise objective measurements with which to compare the original signal with the restored one.

The second test was performed on a piano recording that presents genuine speed variations. Figure 5 shows the PVCs of the degraded and restored signals, respectively. What stands out from the graphs is that the proposed method estimated the shape of the variations with considerable accuracy, and the result was therefore considered satisfying. However, the PVC of the restored signal still presents some slight variations, which can be explained by inaccuracies in the sinusoidal analysis stage, more precisely during the partial tracking stage.

8. CONCLUSION

This work has presented a computationally efficient system requiring minimal user interaction for the estimation and correction of speed variations in the playback of musical recordings. The stage of estimating the pitch variation curve was based purely on sinusoidal analysis, and the good results indicate that the proposed framework can serve, for example, as the core of a more sophisticated and robust professional tool for audio restoration.

The partial tracking stage appears to be the least robust stage in the system, and it may have the biggest influence on the inaccuracies reported in Figures 4 and 5. Therefore, future work includes the development of a more robust algorithm for peak tracking, possibly operating simultaneously with a group
of blocks, as proposed in [17]. Other potential improvements to be implemented include the development of a better method for converting the set of tracks into the pitch variation curve, and the assessment of results via subjective tests.

As stated in [8], a fully automatic system, although tempting, is not feasible. Since the nature of pitch variations in old music recordings is wide-ranging [5], this class of audio degradation requires a human operator to apply the restoration algorithms. However, we believe that such systems can be made minimally and friendly interactive, so that non-professional users can restore their old domestic recordings on their own.

Figure 5: Restoration of the signal 'piano'. The estimated PVC is depicted in the first graph, and the restored PVC is shown in the second one.

9. ACKNOWLEDGMENTS

The authors wish to express their sincere gratitude to Prof. Luiz W. P. Biscainho for the motivation and ideas for this work. They also would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

10. REFERENCES

[1] U. R. Furst, "Periodic variations of pitch in sound reproduction by phonographs," Proceedings of the IRE, vol. 34, no. 11, pp. 887–895, Nov 1946.

[2] P. E. Axon and H. Davies, "A study of frequency fluctuations in sound recording and reproducing systems," Proceedings of the IEE – Part III: Radio and Communication Engineering, vol. 96, no. 39, pp. 65–75, Jan 1949.

[3] P. Maziewski, "Wow defect reduction based on interpolation techniques," in Proceedings of the 4th National Electronics Conference, Darlowko Wschodnie, Poland, Jun 2005, pp. 481–486.

[4] S. J. Godsill and P. J. W. Rayner, "The restoration of pitch variation defects in gramophone recordings," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct 1993, pp. 148–151.

[5] S. J. Godsill, "Recursive restoration of pitch variation defects in musical recordings," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'94), Apr 1994, vol. 2, pp. 233–236.

[6] J. Nichols, "An interactive pitch defect correction system for archival audio," in Proceedings of the AES 20th International Conference, Budapest, Hungary, Oct 2001.

[7] J. Nichols, "A high-performance, low-cost wax cylinder transcription system," in Proceedings of the AES 20th International Conference, Budapest, Hungary, Oct 2001.

[8] A. Czyzewski, A. Ciarkowski, A. Kaczmarek, J. Kotus, M. Kulesza, and P. Maziewski, "DSP techniques for determining wow distortion," Journal of the Audio Engineering Society, vol. 55, no. 4, pp. 266–284, 2007.

[9] J. Smith and X. Serra, "PARSHL: An analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation," in International Computer Music Conference, Urbana, Illinois, USA, Aug 1987, pp. 290–297.

[10] P. A. A. Esquef and L. W. P. Biscainho, "Spectral-based analysis and synthesis of audio signals," in Advances in Audio and Speech Signal Processing: Technologies and Applications, H. M. Pérez-Meana, Ed., chapter 3. Idea Group, Hershey, 2007.

[11] S. Kraft, A. Lerch, and U. Zölzer, "The tonalness spectrum: feature-based estimation of tonal components," in Proceedings of the 16th International Conference on Digital Audio Effects (DAFx'13), Maynooth, Ireland, Sep 2013.

[12] L. Nunes, L. Biscainho, and P. Esquef, "A database of partial tracks for evaluation of sinusoidal models," in Proceedings of the 13th International Conference on Digital Audio Effects (DAFx'10), Graz, Austria, Sep 2010.

[13] N. Laurenti, G. De Poli, and D. Montagner, "A nonlinear method for stochastic spectrum estimation in the modeling of musical sounds," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 531–541, Feb 2007.

[14] D. J. Thomson, "Spectrum estimation and harmonic analysis," Proceedings of the IEEE, vol. 70, no. 9, pp. 1055–1096, Sep 1982.

[15] S. Kraft and U. Zölzer, "Polyphonic pitch detection by matching spectral and autocorrelation peaks," in Proceedings of the 23rd European Signal Processing Conference (EUSIPCO'15), Nice, France, Aug 2015, pp. 1301–1305.

[16] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, Aug 1986.

[17] M. Lagrange, S. Marchand, and J. B. Rault, "Using linear prediction to enhance the tracking of partials," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04), May 2004, vol. 4, pp. 241–244.

[18] L. Nunes, R. Merched, and L. Biscainho, "Recursive least-squares estimation of the evolution of partials in sinusoidal analysis," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'07), Apr 2007, vol. 1, pp. 253–256.

[19] F. Marvasti, Nonuniform Sampling: Theory and Practice, Kluwer Academic Publishers, New York, USA, 2001.

[20] WOW, "WOW companion webpage," available at http://www.smt.ufrj.br/~luis.carvalho/WOW/, 2017.
ABSTRACT

Gradient-based optimizations are commonly found in areas where Fourier transforms are used, such as in audio signal processing. This paper presents a new method of converting any gradient of a cost function with respect to a signal into, or from, a gradient with respect to the spectrum of this signal: it thus allows gradient descent to be performed indiscriminately in the time or frequency domain. For efficiency purposes, and because the gradient of a real function with respect to a complex signal does not formally exist, this work is carried out using Wirtinger calculus. An application to sound texture synthesis then experimentally validates this gradient conversion.

1. INTRODUCTION

Mathematical optimization is a recurring theme in audio signal processing: many sound synthesis algorithms, for instance physics-based synthesis [1] or sound texture synthesis [2, 3], require the solving of non-linear equations, which in turn becomes a function optimization problem. Formally, this means seeking the minimum of a real-valued cost function C over the ensemble of its inputs.

From here, several solvers exist, including the widely used gradient descent algorithms. As the name implies, this kind of algorithm requires the computation of the gradient of C to iteratively progress toward a minimum. But a problem arises if the parameters of C are complex-valued: from the Cauchy-Riemann equations it follows that a real-valued non-constant function with complex parameters is not differentiable. This means that if the parameters of our cost function C are complex-valued, for example when the cost is evaluated from a spectrum, the gradient of C does not exist.

Recently, this situation was encountered in the context of sound texture synthesis. In this synthesis algorithm the sub-band envelopes of a white noise signal are adapted so that a set of their statistics reach some desired values, previously extracted from a sound texture. Current implementations of this algorithm either perform the optimization over each sub-band envelope individually while performing additional steps in parallel to compute a corresponding time signal [2], or perform it over the magnitude of the short-term Fourier transform of the white noise signal, combined with a phase retrieval algorithm that recreates the synthesized time signal [3]. But both methods require additional steps to recreate the output signal, be it reconstruction from the sub-bands or phase retrieval, which in turn tend to damage the optimization performed over the sub-band envelopes.

Such a problem could be avoided altogether if the optimization were performed directly over the base time signal, although this would mean that the gradient of the cost function would have to be calculated with respect to said time signal. Since this cost is defined over statistics of the sub-band signal envelopes, computing its gradient with respect to those envelopes is usually straightforward, but converting this gradient into a gradient with respect to the base time signal means going back up the whole envelope extraction process. Because this procedure involves manipulating spectra, and because the sub-band signals are complex-valued, this situation typically falls into the case mentioned above where the formal gradients of the cost function with respect to those complex-valued signals do not exist.

Hence the goal of the work presented in this paper is twofold: establishing a relation between the gradient of a cost function with respect to a signal and with respect to its spectrum, all the while using a formalism that both allows the manipulation of those otherwise non-existent complex gradients and does so in the most efficient way possible.

The formalism chosen is Wirtinger calculus, first introduced in [4], which offers an extension of complex differentiability in the form of Wirtinger derivatives and Wirtinger gradients. Even though it can be encountered in articles such as [5] (although without any mention of the name Wirtinger), it has been scarcely used in signal processing since then. Because this formalism is not well known, this paper starts off with an introduction to it, followed by our work on gradient conversion between the time and frequency domains, and then the application and experimental validation of this conversion in the case of sound texture synthesis.

2. WIRTINGER CALCULUS AND NOTATIONS

2.1. General mathematical notations

Let us first introduce the mathematical notations that are used throughout this paper.

Arrays are denoted by bold lower case letters such as $\mathbf{c}$, while their $m$-th element is denoted $c_m$.

For a differentiable function $f: \mathbb{R} \to \mathbb{C}$, its real derivative at $a \in \mathbb{R}$ is denoted:

$$\frac{\partial f}{\partial x}(a) \qquad (1)$$

For a differentiable function $f: \mathbb{R}^M \to \mathbb{C}$, its real gradient at $\mathbf{a} \in \mathbb{R}^M$ is denoted:

$$\frac{\partial f}{\partial x}(\mathbf{a}) = \left(\frac{\partial f}{\partial x_1}(\mathbf{a}), \frac{\partial f}{\partial x_2}(\mathbf{a}), \ldots, \frac{\partial f}{\partial x_M}(\mathbf{a})\right) \qquad (2)$$
The discrete Fourier transform (DFT) is denoted by $\mathcal{F}: \mathbb{C}^M \to \mathbb{C}^N$, with $M \leq N$, while the inverse DFT (iDFT) is denoted by $\mathcal{F}^{-1}: \mathbb{C}^N \to \mathbb{C}^M$. For a vector $\mathbf{c} \in \mathbb{C}^M$, its spectrum is denoted by a bold upper case letter $\mathbf{C} \in \mathbb{C}^N$. As such we have:

$$C_n = \mathcal{F}(\mathbf{c})_n = \sum_{m \in [0, M-1]} c_m\, e^{-j\frac{2\pi mn}{N}} \qquad (3)$$

And:

$$c_m = \mathcal{F}^{-1}(\mathbf{C})_m = \frac{1}{N} \sum_{n \in [0, N-1]} C_n\, e^{j\frac{2\pi mn}{N}} \qquad (4)$$

2.2. Wirtinger Calculus

We now introduce Wirtinger calculus and summarize some of its more useful properties, as can be found in [5, 6, 7].

2.2.1. Wirtinger derivatives and gradients

Any complex number $c \in \mathbb{C}$ can be decomposed as $a + jb$ with $(a, b) \in \mathbb{R}^2$ its real and imaginary parts. Similarly, any function $f: \mathbb{C} \to \mathbb{C}$ can be considered as a function of $\mathbb{R}^2 \to \mathbb{C}$ with $f(c) = f(a, b)$. If it exists, the derivative of $f$ at $c$ with respect to the real part of its input is denoted by:

$$\frac{\partial f}{\partial x}(c) \qquad (5)$$

This concords with our previous notations, since this derivative is also the real derivative of $f$ when its domain is restricted to $\mathbb{R}$. If it exists, the derivative of $f$ at $c$ with respect to the imaginary part of its input is denoted by:

[...] one expression to the other if need be: since the partial derivatives are enough to optimize a function, so is the Wirtinger derivative.

The case can then be extended to functions of several variables, such as arrays: if $f$ denotes a function of $\mathbb{C}^M \to \mathbb{C}$ and is differentiable in the real sense, meaning here differentiable with respect to the real and imaginary parts of all of its inputs, we define the W- and W*-gradients of $f$ similarly to their real counterparts:

$$\frac{\partial f}{\partial z}(\mathbf{c}) = \left(\frac{\partial f}{\partial z_1}(\mathbf{c}), \frac{\partial f}{\partial z_2}(\mathbf{c}), \ldots, \frac{\partial f}{\partial z_M}(\mathbf{c})\right) \qquad (9)$$

$$\frac{\partial f}{\partial z^*}(\mathbf{c}) = \left(\frac{\partial f}{\partial z_1^*}(\mathbf{c}), \frac{\partial f}{\partial z_2^*}(\mathbf{c}), \ldots, \frac{\partial f}{\partial z_M^*}(\mathbf{c})\right) \qquad (10)$$

Once more, and as shown in [6], in the case of a real-valued function (such as a cost function) knowing either the W- or W*-gradient of the function is sufficient to minimize it.

2.2.2. Link with C-differentiability

If $f$ denotes a function of $\mathbb{C}^M \to \mathbb{C}$ the following property holds:

$$f \text{ is } \mathbb{C}\text{-differentiable} \iff \begin{cases} f \text{ is differentiable in the real sense} \\ \dfrac{\partial f}{\partial z^*} = 0 \end{cases} \qquad (11)$$

In the case where $f$ is $\mathbb{C}$-differentiable, both its complex and W-gradients are equal: this is what makes Wirtinger calculus a proper extension of $\mathbb{C}$-differentiability.

2.2.3. Linearity
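Before continuing, the DFT conventions fixed in Eqs. (3) and (4) can be pinned down numerically. The short numpy sketch below is our own, not part of the paper: with $M \leq N$, Eq. (3) is an $N$-point FFT of the zero-padded signal, and Eq. (4), which carries the $1/N$ factor, recovers $\mathbf{c}$ from the first $M$ samples of the $N$-point inverse FFT.

```python
import numpy as np

M, N = 8, 16
rng = np.random.default_rng(0)
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)

# Eq. (3): C_n = sum_m c_m exp(-j 2*pi*m*n / N) for n in [0, N-1]
mm = np.arange(M)[None, :]
nn = np.arange(N)[:, None]
C = (c[None, :] * np.exp(-2j * np.pi * mm * nn / N)).sum(axis=1)
assert np.allclose(C, np.fft.fft(c, n=N))   # an N-point FFT of zero-padded c

# Eq. (4): c_m = (1/N) sum_n C_n exp(+j 2*pi*m*n / N) for m in [0, M-1]
c_rec = np.fft.ifft(C)[:M]                  # first M samples of the N-point iDFT
assert np.allclose(c_rec, c)
```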
According to the chain rule of Wirtinger calculus stated in 2.2.4 we have, for $n \in [0, N-1]$:

$$\frac{\partial \tilde{E}}{\partial z_n}(\mathbf{C}) = \sum_{m \in [1, M]} \left[ \frac{\partial E}{\partial z_m}\big(\mathcal{F}^{-1}(\mathbf{C})\big) \times \frac{\partial \mathcal{F}_m^{-1}}{\partial z_n}(\mathbf{C}) + \frac{\partial E}{\partial z_m^*}\big(\mathcal{F}^{-1}(\mathbf{C})\big) \times \frac{\partial (\mathcal{F}_m^{-1})^*}{\partial z_n}(\mathbf{C}) \right] \qquad (22)$$

But $\mathcal{F}^{-1}$ is $\mathbb{C}$-differentiable, meaning that according to the property of Wirtinger calculus stated in 2.2.2 its W*-gradient is null. From (18) it then follows that the W-gradient of $(\mathcal{F}^{-1})^*$ is also null, which leaves us with:

$$\frac{\partial \tilde{E}}{\partial z_n}(\mathbf{C}) = \sum_{m \in [1, M]} \frac{\partial E}{\partial z_m}(\mathbf{c}) \times \frac{\partial \mathcal{F}_m^{-1}}{\partial z_n}(\mathbf{C}) \qquad (23)$$

From (4) we easily derive:

$$\frac{\partial \mathcal{F}_m^{-1}}{\partial z_n}(\mathbf{C}) = \frac{1}{N} \frac{\partial}{\partial z_n}\!\left(\sum_k z_k\, e^{j\frac{2\pi mk}{N}}\right)_{\mathbf{z}=\mathbf{C}} = \frac{1}{N}\, e^{j\frac{2\pi mn}{N}} \qquad (24)$$

Thus, it is now possible to straightforwardly convert the gradient of a cost function between the time and frequency domains: this also means that any gradient descent can be chosen to be performed on a signal or on its spectrum, independently of which was used in the cost computations. Additionally, since this conversion is performed using only DFTs or iDFTs, we can also make full use of efficient Fourier transform algorithms such as the Fast Fourier Transform (FFT).

We now apply and experimentally validate the present results in a sound synthesis algorithm.

4. APPLICATION

As mentioned in the introduction, the conversion of a gradient between the time and frequency domains can be put to use in the case of sound texture synthesis through statistics imposition.

During the synthesis algorithm a base signal such as white noise is converted to the frequency domain, where it is filtered using a Fourier multiplier, then taken back to the time domain to yield an analytic sub-band of the base signal. The goal of the algorithm is then to impose a given set of values on some selected statistics of those
sub-band magnitudes. Since imposing the statistics via gradient descent directly over the sub-bands and then recreating the corresponding time signal would damage the imposition, we seek to perform the gradient descent directly over the base time signal. This requires a conversion of the gradient of the cost function, easily defined with respect to the sub-bands, into a gradient with respect to the base signal.

The case can be formulated this way: we call $\mathbf{s}$ a real-valued array of length $M$ representing the base sound signal we wish to alter using gradient descent, and $\mathbf{S}$ its DFT of length $N$. $\mathbf{S}$ is then windowed in the frequency domain by a function $G$ defined as:

$$G: \mathbb{C}^N \to \mathbb{C}^N, \quad \mathbf{S} \mapsto \mathbf{W}.\mathbf{S} \qquad (30)$$

with $\mathbf{W} \in \mathbb{C}^N$ the spectral weighting array used for band-filtering $\mathbf{S}$, and "." the element-wise product. We denote the filtered spectrum $\mathbf{B} = G(\mathbf{S})$, and $\mathbf{b} = \mathcal{F}^{-1}(\mathbf{B})$ its inverse DFT of length $M$: $\mathbf{b}$ is thus the complex sub-band of $\mathbf{s}$ centered around a frequency chosen via $\mathbf{W}$. In this example we want to impose a given value $\gamma$ on the third-order moment of the envelope of $\mathbf{b}$, so the cost function $C$ is chosen as the squared distance between the third-order moment of the envelope $|\mathbf{b}|$ of the sub-band and $\gamma$:

$$C: \mathbb{C}^M \to \mathbb{R}, \quad \mathbf{b} \mapsto \left( \sum_{k \in [0, M-1]} |b_k|^3 - \gamma \right)^{\!2} \qquad (31)$$

For simplicity's sake we will consider the raw moment instead of the usual standardized moment that can be found in [2]. Using the rules of Wirtinger calculus detailed in Section 2.2.4 we straightforwardly obtain the W*-gradient of $C$ with respect to the sub-band at $\mathbf{b}$:

$$\frac{\partial C}{\partial z^*}(\mathbf{b}) = 3 \left( \sum_{k \in [0, M-1]} |b_k|^3 - \gamma \right) \mathbf{b}.|\mathbf{b}| \qquad (32)$$

But since the gradient descent is made over the base signal $\mathbf{s}$, we need to convert the gradient at $\mathbf{b}$ into a gradient at $\mathbf{s}$. Mathematically speaking, we wish to know the W*-gradient of the function $\tilde{C}$ defined as:

$$\tilde{C}: \mathbb{C}^M \to \mathbb{R}, \quad \mathbf{s} \mapsto C\big( \mathcal{F}^{-1}(G(\mathcal{F}(\mathbf{s}))) \big) \qquad (33)$$

Since we know the gradient of $C$, all we need to do is convert it to the frequency domain, compose it with $G$, and bring it back to the time domain using the rules of Wirtinger calculus and time-frequency gradient conversion.

Using the time-to-frequency domain conversion expression in (28) we obtain the value of the W*-gradient of $C \circ \mathcal{F}^{-1}$ at $\mathbf{B}$ as:

$$\frac{\partial\, (C \circ \mathcal{F}^{-1})}{\partial z^*}(\mathbf{B}) = \frac{1}{N} \mathcal{F}\!\left( \frac{\partial C}{\partial z^*}(\mathbf{b}) \right) \qquad (34)$$

From here we need to obtain the gradient of $C \circ \mathcal{F}^{-1} \circ G$. Since $G$ is a simple element-wise product, and since as stated in Section 2.2.3 Wirtinger derivation is linear, we directly obtain:

$$\frac{\partial\, (C \circ \mathcal{F}^{-1} \circ G)}{\partial z^*}(\mathbf{S}) = \mathbf{W}.\frac{\partial\, (C \circ \mathcal{F}^{-1})}{\partial z^*}(\mathbf{B}) \qquad (35)$$

All that is left now is the conversion from the frequency to the time domain to obtain the W*-gradient at $\mathbf{s}$. Using the result in (29) gives:

$$\frac{\partial \tilde{C}}{\partial z^*}(\mathbf{s}) = N\, \mathcal{F}^{-1}\!\left( \frac{\partial\, (C \circ \mathcal{F}^{-1} \circ G)}{\partial z^*}(\mathbf{S}) \right) \qquad (36)$$

Combining (34), (35) and (36) then gives:

$$\frac{\partial \tilde{C}}{\partial z^*}(\mathbf{s}) = \mathcal{F}^{-1}\!\left( \mathbf{W}.\mathcal{F}\!\left( \frac{\partial C}{\partial z^*}(\mathbf{b}) \right) \right) \qquad (37)$$

Using this relation we can now convert the W*-gradient at $\mathbf{b}$ expressed in (32) into a W*-gradient at $\mathbf{s}$, the base signal.

In addition, since $\mathbf{s}$ is a purely real-valued signal, it is also straightforward to use the previous result to obtain the more common real gradient of $\tilde{C}$ at $\mathbf{s}$. Indeed, from the definition of the W*-derivative in (8) and since $\tilde{C}$ is real-valued, we have that if $\mathbf{s} \in \mathbb{R}^M$:

$$\frac{\partial \tilde{C}}{\partial x}(\mathbf{s}) = 2\,\Re\!\left( \frac{\partial \tilde{C}}{\partial z^*}(\mathbf{s}) \right) \qquad (38)$$

This results in the final expression of the real gradient of the cost function with respect to the real base time signal:

$$\frac{\partial \tilde{C}}{\partial x}(\mathbf{s}) = 6 \left( \sum_{k \in [0, M-1]} |b_k|^3 - \gamma \right) \Re\!\left( \mathcal{F}^{-1}\big( \mathbf{W}.\mathcal{F}( \mathbf{b}.|\mathbf{b}| ) \big) \right) \qquad (39)$$

We now have all the tools required to use a gradient descent algorithm over the base signal $\mathbf{s}$ in order to minimize the cost function $C$ and thus impose a given statistic on a selected sub-band of it, without having to first perform the descent over the sub-band and then resort to a potentially damaging reconstruction.

Additionally, it is now possible to impose a given set of statistics on several sub-bands at once. Indeed, if we define a general cost function as the sum of several sub-band cost functions such as the one expressed in (31), then the linearity of derivation gives that the global cost gradient with respect to the base time signal is simply the sum of the sub-band cost gradients established in (39): using this global cost in the gradient descent will then tend to lower the sub-band cost functions, thus imposing the desired values on the statistics of all sub-bands at once.

So as to act both as an example and as an experimental validation, such a simultaneous imposition was made over a 10-second-long white noise signal sampled at 48 kHz. The statistics chosen as goals are the raw third moments extracted from a rain sound texture over 3 sub-bands arbitrarily chosen at 20 Hz, 1765 Hz and 10.3 kHz, while the algorithm used is a conjugate gradient descent. The evolution of the cost functions, both global and band-wise, over the iterations of the gradient descent is plotted in Figure 1. Because the band-wise cost functions decrease during the gradient descent, the optimization successfully outputs a time signal possessing the exact statistics we meant to impose on it, without requiring a separate imposition for each sub-band or a possibly harmful reconstruction of the time signal from the sub-band signals.
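The gradient expression of Eq. (39) lends itself to a direct numerical check. The sketch below is our own construction (the signal length, weighting array W and target γ are arbitrary, and M = N so that the 1/N of Eq. (34) and the N of Eq. (36) cancel exactly as in Eq. (37), matching numpy's fft/ifft normalization); it compares the closed-form real gradient with central finite differences of the cost.

```python
import numpy as np

rng = np.random.default_rng(0)
M = N = 64
s = rng.standard_normal(M)             # real base signal
W = np.zeros(N); W[4:10] = 1.0         # crude band-selecting weight array
gamma = 0.1                            # target raw third moment

def subband(s):
    # b = iDFT(G(DFT(s))), the complex sub-band of Eq. (33)
    return np.fft.ifft(W * np.fft.fft(s))

def cost(s):
    # Eq. (31): squared distance of the raw third moment to gamma
    b = subband(s)
    return (np.sum(np.abs(b) ** 3) - gamma) ** 2

def grad(s):
    # Eq. (39): 6 (sum_k |b_k|^3 - gamma) Re{ iDFT( W . DFT( b.|b| ) ) }
    b = subband(s)
    return 6 * (np.sum(np.abs(b) ** 3) - gamma) * \
        np.real(np.fft.ifft(W * np.fft.fft(b * np.abs(b))))

# central finite differences agree with the closed-form gradient
eps = 1e-6
for k in (0, 17, 40):
    e = np.zeros(M); e[k] = eps
    fd = (cost(s + e) - cost(s - e)) / (2 * eps)
    assert np.isclose(grad(s)[k], fd, rtol=1e-4, atol=1e-8)
```

Summing such gradients over several weighting arrays W gives the multi-band global gradient described above, which can then be fed to any off-the-shelf (conjugate) gradient descent routine.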
Figure 1: Value of the cost functions over the number of iterations made during the gradient descent: (a) global error over the sub-bands; (b) separate errors over each sub-band. The total gradient is obtained by summing the gradients of 3 arbitrary sub-bands (20 Hz, 1765 Hz and 10.3 kHz) as expressed in (39).
6. REFERENCES
ABSTRACT ing two live sources to be convolved with each other in a stream-
ing fashion. Our current implementation allows this kind of live
This paper describes a method for doing convolution of two live
streaming convolution with minimal buffering. As we shall see,
signals, without the need to load a time-invariant impulse response
this also allows for parametric transformations of the IR. Examples
(IR) prior to the convolution process. The method is based on
of useful transformations might be pitch shifting, time stretching,
stepwise replacement of the IR in a continuously running convolu-
time reversal, filtering, all done in a time-variant manner.
tion process. It was developed in the context of creative live elec-
tronic music performance, but can be applied to more traditional
use cases for convolution as well. The process allows parametriza- 3. CONVOLUTION METHODS
tion of the convolution parameters, by way of real-time transfor-
mations of the IR, and as such can be used to build parametric Convolution of two finite sequences x(n) and h(n) of length N is
convolution effects for audio mixing and spatialization as well. defined as:
N
X −1
1. INTRODUCTION y(n) = x (n) ∗ h (n) = x (n − k) h (k) (1)
k=0
Convolution has been used for filtering, reverberation, spatializa-
tion and as a creative tool for cross-synthesis ([1], [2] and [3] to This is a direct, time-domain implementation of a FIR filter struc-
name a few). Common to most of them is that one of the inputs is a ture with filter length N . A few observations can be made at this
time-invariant impulse response (characterizing a filter, an acoustic stage:
space or similar), allocated and preprocessed prior to the convolu- • Filter length is the only parameter involved.
tion operation. Although developments have been made to make
the process latency free (using a combination of partitioned and direct convolution [4]), the time-invariant nature of the impulse response (IR) has inhibited parametric modulation of the process. Modifying the IR has traditionally implied the need to stop the audio processing, load the new IR, and then restart processing using the updated IR. This paper will briefly present a context for the work before presenting the most common convolution methods. A number of observations along the way will motivate a strategy for convolution with time-variant impulse responses based on stepwise replacement of IR partitions. Finally, we present a number of use cases for the method with a focus on live performance.

2. CONTEXT

The current implementation was developed in the context of our work with cross-adaptive audio processing (see [5] and also the project blog http://crossadaptive.hf.ntnu.no/) and live processing (previous project blog at [6] and the released album "Evil Stone Circle" at [7]). Our previous efforts on live streaming convolution are described in [8, 9]. We also note that flexible convolution processing is an active field of research in other environments (for example [10], and in some respects also [11] and [12], due to the parametric approach to techniques closely related to convolution).

Our primary goal has been to enable the use of convolution as a creative tool for live electronic music performance, by allowing ...

∗ Thanks to NTNU for generously providing a sabbatical term, within which substantial portions of this research have been done.

• There is no latency: y(0) = x(0)h(0).
• The output length N_y is equal to 2N − 1.
• Computational complexity is of the order O(N²).

Eq. 1 can be interpreted as a weighted sum of N time-shifted copies of the input sequence x(n). The weights are given by the coefficients of the filter h(n). In acoustic simulations, a room impulse response is a typical example of such a filter, with wall reflections represented as non-zero coefficients. The result of running a source signal through this filter is an accumulation of delayed reflections, into what we recognize as a reverberant sound.

If instead the filter h(n) is live-sampled from a musical performance, the filter coefficients will typically form continuous sequences of non-negligible values. Hence, when the sound of another musical instrument is convolved with this filter, the output will exhibit a dense texture of overlays, leading to a characteristic time smearing of the signals involved.

Figure 1 illustrates what happens when a simple sinusoidal signal is convolved with itself (x(n) = h(n)). As expected, the output is stretched out to almost double length. More interesting are the ramp characteristics caused by the gradual overlap of the sequences x(n) and h(n) in Eq. 1. The summation has a single term at n = 0, reaches a maximum of N terms at n = N − 1, and returns to a single term again at n = 2N − 2. This inherent "fade in/fade out" property of convolution will be exploited in our approach.

The computational complexity of direct convolution becomes prohibitive as the filter length increases. We normally work with filters of duration 1-2 seconds or more, which can amount to 96000
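The bulleted properties of direct convolution can be checked directly; a minimal NumPy sketch (variable names are ours):

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(N)   # input segment
h = rng.standard_normal(N)   # impulse response of the same length

y = np.convolve(x, h)        # direct (linear) convolution

assert y.shape[0] == 2 * N - 1          # output length N_y = 2N - 1
assert np.isclose(y[0], x[0] * h[0])    # no latency: y(0) = x(0) h(0)
```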
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
with:

    x_i(n) = \begin{cases} x(n), & \text{if } n \in [i N_P, (i+1) N_P - 1] \\ 0, & \text{otherwise} \end{cases}    (5)

and:

    h_j(n) = \begin{cases} h(n), & \text{if } n \in [j N_P, (j+1) N_P - 1] \\ 0, & \text{otherwise} \end{cases}    (6)

Then it can be proven from the linear and commutative properties of convolution that Eq. 1 can be rewritten as:

    y(n) = x(n) * h(n) = \sum_{i=0}^{P-1} \sum_{j=0}^{P-1} x_i(n) * h_j(n)    (7)
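Eq. 7 can be verified numerically; a small NumPy sketch (sizes and names are ours) checks that the double sum of zero-padded partition convolutions reproduces the direct convolution of the full sequences:

```python
import numpy as np

NP, P = 8, 4                     # partition length and count (assumed)
rng = np.random.default_rng(1)
x = rng.standard_normal(NP * P)
h = rng.standard_normal(NP * P)

# Zero-padded partitions x_i(n) and h_j(n) as in Eqs. (5) and (6)
def part(s, i):
    p = np.zeros_like(s)
    p[i*NP:(i+1)*NP] = s[i*NP:(i+1)*NP]
    return p

# Eq. (7): the double sum of partitioned convolutions equals
# the direct convolution of the full sequences.
y = sum(np.convolve(part(x, i), part(h, j))
        for i in range(P) for j in range(P))
assert np.allclose(y, np.convolve(x, h))
```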
4. RESEARCH GOALS AND PREVIOUS WORK

Traditionally, convolution is an asymmetric operation where the two inputs have different status: a continuous stream of input samples x(n) and an impulse response (IR) h(n) of finite length N. Typically the latter is a relatively short segment, a time-invariant representation of a system (e.g. a filter or a room response) onto which the input signal is applied.

Instead we are searching for more flexible tools for musical interplay and have previously presented a number of strategies for dynamic convolution [9]. Our aim has been to:

• Attain dynamic parametric control over the convolution process in order to increase playability.
• Investigate methods to avoid or control dense and smeared output.
• Provide the ability to use two live audio sources as inputs to a continuous real-time convolution process.
• Provide the ability to update/change the impulse responses in real-time without glitches.

With existing tools we have not been able to get around the time-invariant nature of the impulse response in an efficient way. In order to make dynamic updates of the IR during convolution without audible artifacts, we had to use two (or more) concurrent convolution processes and then crossfade between them whenever the IR should be modified. The IR update could be triggered explicitly, at regular intervals, or based on the dynamics of the input signal (i.e. transient detection). Every update of the IR triggered a reinitialization of the convolution process.

We have also done work on live convolution of two audio signals of equal status: neither signal has a subordinate status as filter for the other [8]. Instead, both are continuous audio streams segmented at intervals triggered by transient detection. Each pair of segments is convolved in a separate process. With frequent triggering, the number of concurrent processes could grow substantially due to the long convolution tail (P − 1 partitions) of each process.

5. TIME-VARIANT IMPULSE RESPONSE

In this paper we present a simple but efficient implementation of convolution with a time-variant impulse response h(n) of finite length N. In this case the coefficients of h(n) are no longer constant throughout the convolution process. At a point n_T in time, the current impulse response, denoted h_A(n), is updated with a new filter h_B(n).

Wefers and Vorländer [16] point at two possible solutions to time-varying FIR filtering: instantaneous switching or crossfading. The former can be expressed as:

    y(n) = \begin{cases} x(n) * h_A(n), & \text{if } n < n_T \\ x(n) * h_B(n), & \text{if } n \geq n_T \end{cases}    (11)

It is a cheap implementation, but the hard switching is known to cause audible artifacts from discontinuities in the output waveform.

The second option, crossfading, avoids these artifacts by filtering x(n) with both impulse responses, h_A(n) and h_B(n), in a transition interval and then crossfading between the two outputs to smooth out discontinuities. The disadvantage of this method (assuming fast convolution in the frequency domain) is the added cost of an extra spectral convolution and inverse transform, as well as the operations consumed by the crossfade. A modified approach with crossfading in the frequency domain [16] saves the extra inverse transform.

Our goal is to be able to dynamically update the impulse response without reinitializations and crossfades of parallel convolution processes. In the following we assume a uniformly partitioned convolution scheme implemented with FFTs and multiplication in the frequency domain. The key to our approach is the inherent "fade in/fade out" characteristic of convolution (illustrated in figure 1) and the previously noted properties of partitioned convolution:

Property 1. The partitions x_i(n) and h_j(n) contribute to output interval Y_k only if i, j ≤ k.

An immediate consequence of Property 1 is that we can load the impulse response partition by partition in parallel with the input. A fully loaded IR is not necessary to initiate the convolution process, since the later partitions initially do not contribute to the convolution output. This drastically reduces the latency, e.g. for the application of live sampled impulse responses. Convolution can start during sampling, as soon as a single partition of the IR is available. In addition, the computational load associated with FFT calculations on the IR is conveniently spread out in time, hence avoiding bursts of CPU usage when loading long impulse responses.

Property 2. The partitions x_i(n) and h_j(n) contribute to output interval Y_{P+k} only if i, j ≥ k.

From Property 2 it is clear that we can also unload the impulse response partition by partition without audible artifacts, even while the convolution process is running. The reason is that the earlier partitions no longer contribute to the output. Unloading a partition simply means clearing it to zero, and it should progress in the same order as the loading process.

The shortest possible load/unload interval is a single partition, which is equivalent to segmenting out a single partition of the input signal and convolving it with the full impulse response. Figure 3 illustrates the effect.

Figure 3: Example of minimal load/unload interval

The combination of the two strategies, stepwise loading and unloading, naturally leads to stepwise replacement of the impulse response. If we return to the direct form, we can express the convolution with stepwise IR replacement starting at time n_T as:
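A partition-level NumPy sketch of stepwise IR replacement (a time-domain stand-in for the frequency-domain implementation; all sizes and names are our own): swapping partition j from h_A to h_B at output block k_T + j makes the result agree, in this idealized form, with the merged output of two time-invariant convolutions split at n_T:

```python
import numpy as np

NP, P = 64, 4                        # partition length and IR partition count (assumed)
N = NP * P                           # IR length
rng = np.random.default_rng(0)
x  = rng.standard_normal(2 * N)      # eight input partitions
hA = rng.standard_normal(N)          # IR before the switch
hB = rng.standard_normal(N)          # IR after the switch
kT = 3                               # output block where replacement starts

hAp = [hA[j*NP:(j+1)*NP] for j in range(P)]
hBp = [hB[j*NP:(j+1)*NP] for j in range(P)]
n_in = len(x) // NP                  # number of input partitions

# Uniformly partitioned convolution: output block k sums x_m * h_j over
# all m + j = k.  Stepwise replacement swaps partition j from hA to hB
# at block kT + j, i.e. a term uses hB exactly when its input block
# m = k - j arrived at or after the switch point.
y = np.zeros((n_in + P) * NP + NP)
for k in range(n_in + P):
    for j in range(P):
        m = k - j
        if 0 <= m < n_in:
            hj = hBp[j] if m >= kT else hAp[j]
            y[k*NP:k*NP + 2*NP - 1] += np.convolve(x[m*NP:(m+1)*NP], hj)
y = y[:len(x) + N - 1]

# Reference: two time-invariant convolutions merged at the switch point.
nT = kT * NP
ref = np.zeros_like(y)
ref[:nT + N - 1] += np.convolve(x[:nT], hA)
ref[nT:] += np.convolve(x[nT:], hB)
assert np.allclose(y, ref)
```

In this idealized partition-aligned form the two outputs agree to numerical precision; the finite SNR reported later for the real implementation comes from practical details such as block alignment, not from the scheme itself.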
Figure 5: Convolution demonstration. Top: The two impulse responses, h_A(n) and h_B(n). Bottom: The input signal x(n).

In the following, all convolution operations are running with partition length N_P = 1024 and FFT block size N_B = 2048. At time n_T ≈ 4.5 seconds, a switch between impulse responses h_A(n) and h_B(n) is initiated (the exact time is slightly less in order to align it with the partition border).

First we show the output of two separate time-invariant convolution processes, using traditional partitioned convolution (figure 6). The input signal is split in two non-overlapping segments at the transition time n_T. The first segment, defined by n < n_T, is convolved with impulse response h_A(n) only, and the output is shown immediately below (Convolution 1). The second input segment, defined by n ≥ n_T, follows next. It is convolved with

Figure 7: Convolution demonstration: Comparing the time-variant method with two overlapping time-invariant convolutions. Top: The merged result of the two time-invariant convolutions in figure 6. Middle: The output when using stepwise IR replacement. Bottom: Difference between the two.

If we merge together the two convolution outputs in figure 6, we get the signal shown at the top of figure 7. If, on the other hand, we run the entire operation as a single time-variant convolution on the original signal x(n), where the impulse response is stepwise replaced beginning at time n_T, we get the output signal shown in the middle of figure 7. The difference between the two signals is displayed at the bottom. It is uniform and for all practical purposes negligible. An estimate of the signal-to-noise ratio (SNR) returns 67.2 dB. From this demonstration it should be clear that
convolution with stepwise IR replacement preserves the behavior of time-invariant convolution, while at the same time allowing filter updates within a single process at no extra cost.

A special case appears when the IRs are sampled from the same continuous audio stream, one partition apart. Then we get a result similar to cross-synthesis (pairwise multiplication of partitions from two parallel audio streams), but still with the effect of long convolution filters. It should be added that there are more efficient ways to implement this particular condition, but that will be deferred to another paper.

The interval between IR updates is provided as a parameter available for performance control, which increases the playability of convolution. There is still a risk of large dynamic variations due to varying degrees of frequency overlap between input signal and impulse response. The housekeeping scheme introduced to maintain the IR update processes could be exploited for smarter amplitude scaling. This is ongoing work. We would also like to look at integrating some of the extended convolution techniques proposed by Donahue et al. [10].

Figure 8: Plugin signal flow overview

The signal on the IR record input channel is continuously written to a circular buffer. When we want to replace the current IR, we read from this buffer and replace the IR partition by partition. The main reason for the circular buffer is to enable transformation (for example time reversal) of the audio before making the IR. To visualize where in time the IR is taken from, we use a circular colouring scheme to display the circular input buffer. We also represent the IR using the same colours. Time (of the input buffer) is thus represented by colour. As the colour of "now" continuously and gradually changes from red to green to blue, it is possible to identify time lag as an azimuthal distance on the colour circle (see https://en.wikipedia.org/wiki/Color_wheel). Figure 9 shows the plugin GUI, with a representation of the coloured input buffer and a live sampled IR. Similarly, a time-reversed IR is shown in figure 10. Note the direction of colour change (green to red) of the reversed IR as compared with the colour change in the normal (forward) IR (blue to red).

The liveconvolver plugin has been used for several studio sessions and performances from late 2016 onwards. Reports and sound examples from some of these experimental sessions can be found at [19] and [20]. These reports also contain reflections on the performative difference experienced in the roles of recording the IR
and playing through it. It is notable that a difference is perceived, since in theory the mathematical output of the convolution process is identical regardless of which signal is used as the IR. The buffering process allows continuous updates to the IR with minimal latency, and several methods for triggering the IR update have been explored. This should ideally facilitate a seamless merging of the two input signals. However, the IR update still needs to be triggered, and the IR will be invariant between triggered updates. In this respect, the instrument providing the source for the IR will be subject to recording, while the live input to the convolution process is allowed a continuous and seamless flow. It is natural for a performer to be acutely aware of the distinction between being recorded and playing live. Similar distinctions can presumably be made in the context of all forms of live sampling performance, and in many cases of live processing as an instrumental practice. The direct temporal control of the musical energy resides primarily with the instrument playing through the effects processing, and to a lesser degree with the instrumental process creating the effects process (recording the IR in our case here).

9. OTHER USE CASES, FUTURE WORK

The flexibility gained by our implementation allows parametric control of convolution also in more traditional effects processing applications. One could easily envision, for example, a parametric convolution reverb with continuous pitch and filter modulation. Parametric modulation of the IR can be done either by applying transformations to the audio being recorded to the IR, or directly to the spectral data. Such changes to the IR could have been made with traditional convolution techniques too, by preparing a set of transformed IRs and crossfading between them. The possibilities inherent in the incremental IR update we have described allow direct parametric experimentation, and are thus much more immediate. It also allows for automated time-variant modulations (using LFOs and random modulators), all without introducing artifacts due to IR updates.

10. CONCLUSIONS

We have shown a technique for incremental update of the impulse response for convolution purposes. The technique provides time-variant filtering by continuously updating the IR from a live input signal. It also opens up possibilities for direct parametric control of the convolution process, and as such enhances the general flexibility of the technique. We have implemented a simple VST plugin as proof of concept of the live streaming convolution, and documented some practical musical exploration of its use. Further applications within more traditional uses of convolution have also been suggested.

11. ACKNOWLEDGMENTS

We would like to thank the Norwegian Artistic Research Programme for support of the research project "Cross-adaptive audio processing as musical intervention", within which the research presented here has been done. Thanks also to the University of California, and performers Kyle Motl and Jordan Morton for involvement in the practical experimentation sessions.

12. REFERENCES

[1] Mark Dolson, "Recent advances in musique concrète at CARL," in Proceedings of the 1985 International Computer Music Conference, ICMC 1985, Burnaby, BC, Canada, August 19-22, 1985.
[2] Curtis Roads, "Musical sound transformation by convolution," in Opening a New Horizon: Proceedings of the 1993 International Computer Music Conference, ICMC 1993, Tokyo, Japan, September 10-15, 1993.
[3] Trond Engum, "Real-time control and creative convolution," in 11th International Conference on New Interfaces for Musical Expression, NIME 2011, Oslo, Norway, May 30 - June 1, 2011, pp. 519–522.
[4] William G. Gardner, "Efficient convolution without input-output delay," Journal of the Audio Engineering Society, vol. 43, no. 3, pp. 127, 1995.
[5] Øyvind Brandtsegg, "A toolkit for experimentation with signal interaction," in Proceedings of the 18th International Conference on Digital Audio Effects (DAFx-15), 2015, pp. 42–48.
[6] Øyvind Brandtsegg, Trond Engum, Andreas Bergsland, Tone Aase, Carl Haakon Waadeland, Bernt Isak Wærstad, and Sigurd Saue, "T-emp communication and interplay in an electronically based ensemble," https://www.researchcatalogue.net/view/48123/48124/10/10, 2013.
[7] Øyvind Brandtsegg, Trond Engum, Tone Aase, Carl Haakon Waadeland, and Bernt Isak Wærstad, "Evil stone circle," https://www.cdbaby.com/cd/temptrondheimelectroacou, 2015.
[8] Lars Eri Myhre, Antoine H Bardoz, Sigurd Saue, Øyvind Brandtsegg, and Jan Tro, "Cross convolution of live audio signals for musical applications," in International Symposium on Computer Music Multidisciplinary Research, 2013, pp. 878–885.
[9] Øyvind Brandtsegg and Sigurd Saue, "Experiments with dynamic convolution techniques in live performance," in Linux Audio Conference, 2013.
[10] Chris Donahue, Tom Erbe, and Miller Puckette, "Extended convolution techniques for cross-synthesis," in Proceedings of the International Computer Music Conference 2016, 2016, pp. 249–252.
[11] Jonathan S. Abel, Sean Coffin, and Kyle S. Spratt, "A modal architecture for artificial reverberation," Journal of the Acoustical Society of America, vol. 134, no. 5, pp. 4220, 2013.
[12] Jonathan S. Abel and Kurt James Werner, "Distortion and pitch processing using a modal reverberator architecture," in International Conference on Digital Audio Effects (DAFx-15), Trondheim, Norway, November 2015.
[13] H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms, Springer, 1982.
[14] Frank Wefers, Partitioned Convolution Algorithms for Real-Time Auralization, Logos Verlag Berlin GmbH, 2015.
[15] Thomas G. Stockham, "High-speed convolution and correlation," in Proceedings of the April 26-28, 1966, Spring Joint Computer Conference, 1966, pp. 229–233.
2.1. Modal Approach

We use modal analysis to represent each bell of the Lurie Carillon as a sum of exponentially decaying sinusoids

    x = \sum_{m=1}^{M} a_m e^{j 2\pi \omega_m t} e^{-t/\tau_m} ,    (1)

where a_m is the complex amplitude, ω_m the frequency, and τ_m the decay rate for each mode m. An analysis block diagram can be seen in Fig. 1, and the steps for estimating the parameters a, ω, and τ are as follows:

1. Perform peak picking in the frequency domain
2. Form band-pass filters around each peak
3. Compute Root-Mean-Squared (RMS) energy envelopes of the band-passed signals
4. Estimate the decay rate on the energy envelopes
5. Estimate the initial complex amplitudes

2.2. Estimating Frequency

The first step in our analysis is to estimate the modal frequencies of each bell. We use a peak picking algorithm in the frequency domain to identify candidate frequencies. Before taking the Fourier Transform for each bell, we high-pass filter the time domain signal half an octave below the hum tone for that bell (see Table 1 for a detailed list of carillon bell modes). The carillon is located outdoors in a high-noise environment, and this reduces the likelihood of picking spurious peaks that are simply background noise. Additionally, we use a portion of the time domain bell recording beginning 10 ms after the onset so the bell's noisy transient does not produce a large number of false peaks in the FFT. We use a length-2^14 Hann window, which produces side-lobes that roll off approximately 18 dB per octave.

Table 1: Partial name and intervallic relationship to the fundamental (prime)

Partial Name      Partial Interval
Hum               Octave (below)
Prime             Fundamental
Tierce            Minor Third
Quint             Perfect Fifth
Nominal           Octave
Deciem            Major Third
Undeciem          Fourth
Duodeciem         Twelfth
Double Octave     Octave
Upper Undeciem    Upper Fourth
Upper Sixth       Major Sixth
Triple Octave     Octave

Our algorithm identifies peaks around points in the frequency domain signal where the slope changes sign from both sides. We then integrate the energy in the FFT bins surrounding the identified peaks and discard peaks that have low power or fall too close to one another. Finally, we pick the N highest candidate peaks that are above some absolute threshold (set by inspection for each bell). For the lowest bells we estimate around 50 modes, and for the highest bells we estimated fewer than ten.

While longer FFTs provide better frequency resolution, the fact that we are estimating decaying sinusoids runs counter to this argument. With a long FFT and modes that decay quickly, we will amplify noise that occurs later in the recording. Therefore, the signal-to-noise ratio is best at the beginning of the recordings. Experimentally, we found 2^14 samples to be an ideal length across the sixty bells of the carillon, and Fig. 2 shows the results of the peak picking algorithm for several bells.

2.3. Estimating Exponential Decay

We use each frequency found in §2.2 as the center frequency for a fourth-order Butterworth band-pass filter. We find the energy envelope for each partial by averaging the band-pass filtered signals using a 10 ms RMS procedure. We then perform a linear fit to the amplitude envelope using least squares to estimate the decay rate of each partial. The region over which the linear fit is performed was found by hand, as the bell recordings had large variance of partial signal level and noise floor. The result of the slope fitting can be seen in Fig. 3.

2.4. Estimating Complex Amplitude

Once we have estimated the frequency and decay rate of each mode, we estimate the initial amplitude of each partial required to reconstruct the original bell recording. To do this, we form a matrix where each column holds each partial independently, as in

    M = \begin{bmatrix} 1 & \cdots & 1 \\ e^{(j\omega_1 - \tau_1)} & \cdots & e^{(j\omega_M - \tau_M)} \\ \vdots & \ddots & \vdots \\ e^{(j\omega_1 - \tau_1)T} & \cdots & e^{(j\omega_M - \tau_M)T} \end{bmatrix} ,    (2)

where ω_m are the frequencies, τ_m the decay rates, and T is the length of the time vector. We use least squares to find the complex amplitudes

    a = (M^H M)^{-1} M^H x ,    (3)

where x is the original bell recording and a the vector of complex amplitudes.

2.5. Results

As a result of our analysis, we have estimated the frequencies, decay rates, and initial complex amplitudes necessary to model the bells as they were recorded. Fig. 4 plots these parameters for three bells throughout the range of the instrument.

3. RESYNTHESIS

We can resynthesize a "noiseless" copy of the original carillon recordings by directly plugging the estimated amplitudes, frequencies, and decay rates into Eq. (1). We can also implement the bell as a filter with the transfer function

    \frac{X(z)}{U(z)} = \sum_{m=1}^{M} \frac{2\Re\{a_m\} - 2 e^{-1/\tau_m} \Re\{a_m e^{-j\omega_m}\} z^{-1}}{1 - 2 e^{-1/\tau_m} \cos(\omega_m) z^{-1} + e^{-2/\tau_m} z^{-2}} ,    (4)
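Eqs. (2) and (3) translate directly into a least-squares solve; a NumPy sketch with a synthetic two-mode signal (all parameter values are our own, not measured bell data) recovers the complex amplitudes:

```python
import numpy as np

fs = 1000                                  # sample rate (assumed)
t = np.arange(fs) / fs                     # 1 s time vector
omega = np.array([100.0, 250.0])           # modal frequencies in Hz
tau   = np.array([0.3, 0.1])               # decay rates in seconds
a_true = np.array([1.0 + 0.5j, 0.3 - 0.2j])

# Eq. (1): sum of exponentially decaying complex sinusoids
x = sum(a * np.exp(2j*np.pi*w*t) * np.exp(-t/tc)
        for a, w, tc in zip(a_true, omega, tau))

# Eq. (2): basis matrix M, one decaying sinusoid per column
M = np.exp(2j*np.pi*omega*t[:, None]) * np.exp(-t[:, None]/tau)

# Eq. (3): least-squares solution a = (M^H M)^-1 M^H x
a_est = np.linalg.lstsq(M, x, rcond=None)[0]
assert np.allclose(a_est, a_true)
```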
Figure 1: Block diagram detailing the analysis process for finding the modal frequencies (ω), decay rates (τ), and complex amplitudes (a).
Figure 2: Frequency spectrum of three bells from the Lurie Carillon showing the results of the peak picking algorithm (in blue).

Figure 3: Estimated decay rates for three partials of bell seven. Each plot shows the recorded bell signal band-pass filtered around the partial frequency (blue), the RMS smoothed energy envelope (black), and the estimated decay rate (red). Dotted lines show the region over which the linear fit was performed.
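The band-pass/RMS/linear-fit procedure of §2.3 can be sketched in NumPy; here a synthetic single decaying partial (our own parameter values) stands in for the band-pass filtered recording, and the slope of the log-envelope gives −1/τ:

```python
import numpy as np

fs = 8000
t = np.arange(2 * fs) / fs
tau, f0 = 0.4, 440.0                        # true decay (s) and partial frequency
# One decaying partial; in practice this is the recording band-pass
# filtered around f0 with a fourth-order Butterworth filter (Sec. 2.3).
band = np.cos(2*np.pi*f0*t) * np.exp(-t/tau)

# 10 ms RMS energy envelope
win = int(0.010 * fs)
env = np.sqrt(np.convolve(band**2, np.ones(win)/win, mode="same"))

# Linear least-squares fit to the log-envelope over a hand-picked region;
# the slope of ln(env) is -1/tau
region = slice(int(0.1*fs), int(1.0*fs))
slope, _ = np.polyfit(t[region], np.log(env[region]), 1)
tau_est = -1.0 / slope
assert abs(tau_est - tau) / tau < 0.05
```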
where U is the input. This is the same as a parallel sum of M standard second-order (biquad) transfer functions

    \frac{X(z)}{U(z)} = \sum_{m=1}^{M} \frac{\beta_{0,m} + \beta_{1,m} z^{-1} + \beta_{2,m} z^{-2}}{1 + \alpha_{1,m} z^{-1} + \alpha_{2,m} z^{-2}}    (5)

with coefficients

    \beta_{0,m} = 2\Re\{a_m\}
    \beta_{1,m} = -2 e^{-1/\tau_m} \Re\{a_m e^{-j\omega_m}\}
    \beta_{2,m} = 0
    \alpha_{1,m} = -2 e^{-1/\tau_m} \cos(\omega_m)
    \alpha_{2,m} = e^{-2/\tau_m} .

If U is an impulse, we reconstruct the sound of the bell, and Eq. (4) is equivalent to Eq. (1). Both synthesis models take the form of summing modes, as seen in Fig. 5.

We find that thirty modes for a low-pitched bell and ten for a high-pitched bell do a reasonable job reproducing the sound of the recordings. Using the modal data extracted from the recordings of the Lurie bells, we find the correlation coefficient between the recorded bells and synthesized bells was on average 0.837 with a standard deviation of 0.117. Fig. 6 shows the result of this resynthesis for one of the bells of the Lurie Carillon. One major advantage of the resynthesis is that the real bells ring for a significantly long time. The recordings faded out once the bell is indistinguishable from the noise floor. The resynthesis does not have this limitation and is therefore more impervious to amplitude modifications.

As a result of our analysis, we have a parameterized model rather than simply a means for synthesizing the sound of the original ...
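The parallel-biquad form can be checked against the modal sum directly; a NumPy sketch with toy parameters (our own values, ω in radians per sample, τ in samples) confirms that the impulse response of the biquad bank equals twice the real part of the complex modal sum:

```python
import numpy as np

# Toy modal parameters (ours): omega in rad/sample, tau in samples
omega = np.array([0.20, 0.55, 1.10])
tau   = np.array([400.0, 150.0, 60.0])
a     = np.array([1.0 + 0.3j, 0.5 - 0.1j, 0.2 + 0.2j])
L = 2048

def biquad_impulse(b0, b1, a1, a2, n):
    """Impulse response of (b0 + b1 z^-1) / (1 + a1 z^-1 + a2 z^-2)."""
    y = np.zeros(n)
    x = np.zeros(n); x[0] = 1.0
    for i in range(n):
        y[i] = b0*x[i] + (b1*x[i-1] if i >= 1 else 0.0) \
               - (a1*y[i-1] if i >= 1 else 0.0) \
               - (a2*y[i-2] if i >= 2 else 0.0)
    return y

# Sum the biquad sections with the coefficients of Eq. (5)
y = np.zeros(L)
for am, wm, tm in zip(a, omega, tau):
    r = np.exp(-1.0/tm)
    b0 = 2*am.real
    b1 = -2*r*(am*np.exp(-1j*wm)).real
    a1 = -2*r*np.cos(wm)
    a2 = r*r
    y += biquad_impulse(b0, b1, a1, a2, L)

# Discrete-time modal sum: twice the real part of a_m e^{(j w_m - 1/tau_m) n}
n = np.arange(L)
ref = sum(2*(am*np.exp((1j*wm - 1.0/tm)*n)).real
          for am, wm, tm in zip(a, omega, tau))
assert np.allclose(y, ref)
```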
Figure 4: Frequency vs decay rate for three analyzed bells, with marker diameter showing each partial's amplitude.

Figure 5: (Block diagrams of the two synthesis models: each mode m realized as a decaying complex sinusoid a_m e^{jω_m t} e^{−t/τ_m}, and as a biquad section with coefficients β_0, β_1, β_2 and −α_1, −α_2; the outputs of modes 1…M are summed from input u(t) to output y(t).)
Figure 6: Spectrograms showing original recording and resynthesis of bell number seven.

Figure 8: Estimated parameters of all the bells of the Lurie Carillon with common fundamental. Marker diameter corresponds to bell power integrated over the first second. The measured partial frequencies are overlaid on the theoretical 12-tone equal temperament tunings (blue lines).
Figure 10: Spectrogram of a through-composed example demonstrating scaling and modulation effects.

eight seconds, the timing between each partial's entrance decreases and the decay rate of each partial is slowly lengthened. At ten seconds, the unmodified bell is sounded. The bell is struck once more around second twelve with a long decay and a modulation applied to each partial. The modulation depth is increased until a frequency modulation effect is apparent, before slowing back to a vibrato.

Overall, this is not a comprehensive list of modal resynthesis techniques. In fact, this exploration just shows that the range of sonic possibilities is enormous, even when constrained by a simple model consisting of frequency, decay rate, and initial amplitude.

4. CONCLUSIONS

In this paper we demonstrate how a modal model for sound synthesis can be manipulated to achieve a wide range of musical effects. Using the carillon as a case study, we provide an analysis of the University of Michigan's Lurie Carillon. By modeling each bell as a sum of decaying sinusoids at each modal frequency, we provide a versatile model that lends itself well to a host of audio manipulations consisting primarily of scaling and shifting the parameters. As a tool for composers, we hope this analysis of the Lurie Carillon will help encourage new electroacoustic music for carillon. The modal data generated from analyzing the samples from [15] is freely available under a Creative Commons license and can be downloaded at https://ccrma.stanford.edu/~kermit/website/bells.html.

The method used in this paper has a limited ability to resolve very closely spaced partials, known as "doublets" in the bell context. Future work will attempt to estimate doublet parameters using nonlinear optimization. An alternate approach to modeling doublet behavior is shown in [32, 33], where groups of closely spaced modes are approximated using ARMA modeling.

5. ACKNOWLEDGMENTS

The authors thank Jonathan Abel for his helpful comments.

6. REFERENCES

[1] Stefan Bilbao, Numerical Sound Synthesis, Wiley, 2009.
[2] Vesa Välimäki, Julian D. Parker, Lauri Savioja, Julius O. Smith, and Jonathan S. Abel, "Fifty years of artificial reverberation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 5, pp. 1421–48, July 2012.
[3] Thomas Rossing and Robert Perrin, "Vibrations of bells," Applied Acoustics, vol. 20, pp. 41–70, 1987.
[4] Lutz Trautmann and Rudolf Rabenstein, Digital Sound Synthesis by Physical Modeling Using the Functional Transform Method, Springer, New York, 1st edition, 2003.
[5] Gerhard Eckel, Francisco Iovino, and René Caussé, "Sound synthesis by physical modelling with Modalys," in Proceedings of the International Symposium on Musical Acoustics, Dourdan, France, 1995.
[6] Joseph Derek Morrison and Jean-Marie Adrien, "MOSAIC: A framework for modal synthesis," Computer Music Journal, vol. 17, no. 1, pp. 45–56, Spring 1993.
[7] Maarten van Walstijn and Jamie Bridges, "Simulation of distributed contact in string instruments: A modal expansion approach," in Proceedings of the 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, August 29 – September 2 2016, pp. 1023–7.
[8] Maarten van Walstijn, Jamie Bridges, and Sandor Mehes, "A real-time synthesis oriented tanpura model," in Proceedings of the 19th International Conference on Digital Audio Effects (DAFx-16), Brno, Czech Republic, September 5–9 2016, pp. 175–82.
[9] Clara Issanchou, Stefan Bilbao, Cyril Touze, and Olivier Doare, "A modal approach to the numerical simulation of a string vibrating against an obstacle: Applications to sound synthesis," in Proceedings of the 19th International Conference on Digital Audio Effects (DAFx-16), Brno, Czech Republic, September 5–9 2016.
[10] Esteban Maestre, Gary P. Scavone, and Julius O. Smith, "Design of recursive digital filters in parallel form by linearly constrained pole optimization," IEEE Signal Processing Letters, vol. 23, no. 11, pp. 1547–50, 2016.
[11] Jonathan S. Abel, Sean Coffin, and Kyle S. Spratt, "A modal architecture for artificial reverberation," The Journal of the Acoustical Society of America, vol. 134, no. 5, pp. 4220, 2013.
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
[12] Jonathan S. Abel, Sean Coffin, and Kyle Spratt, "A modal architecture for artificial reverberation with application to room acoustics modeling," in Proceedings of the 137th Convention of the Audio Engineering Society, Los Angeles, CA, October 9–12 2014.

[13] Jonathan S. Abel and Kurt James Werner, "Distortion and pitch processing using a modal reverberator architecture," in Proceedings of the 18th International Conference on Digital Audio Effects (DAFx-15), Trondheim, Norway, November 30 – December 3 2015.

[14] Kurt James Werner and Jonathan S. Abel, "Modal processor effects inspired by Hammond tonewheel organs," Applied Sciences, vol. 6, no. 7, pp. 1421–48, 2016, Article #185.

[15] Isaac Levine, Ashton Baker, Rowena Ng, Rachael Park, Anjana Rajagopal, Tiffany Ng, and John Granzow, "Download Lurie carillon samples," https://gobluebells.wordpress.com/2016/09/21/lurie-carillon-samples/.

[16] "World carillon federation," http://www.carillon.org/.

[17] "Lurie carillon," https://www.music.umich.edu/about/facilities/north_campus/lurie/lurie.htm.

[18] "Ann & Robert H. Lurie Carillon," http://www.towerbells.org/data/MIANNAU2.HTM.

[19] Robert Perrin and T. Charnley, "A comparative study of the normal modes of various modern bells," Journal of Sound and Vibration, vol. 117, no. 3, pp. 411–420, 1987.

[20] William Hibbert, The Quantification of Strike Pitch and Pitch Shifts in Church Bells, Ph.D. thesis, The Open University, 2008.

[21] Xavier Boutillon and Bertrand David, "Assessing tuning and damping of historical carillon bells and their changes through restoration," Applied Acoustics, vol. 64, pp. 901–10, 2001.

[22] Vincent Debut, Miguel Carvalho, and Jose Antunes, "An objective approach for assessing the tuning properties of historical carillons," in Proceedings of the Stockholm Music Acoustics Conference, Stockholm, Sweden, July 2013.

[23] Wiegman van Heuven, Acoustical measurements on church-bells and carillons, Ph.D. thesis, De Gebroeders van Cleef, The Hague, Netherlands, 1949.

[24] Andre Lehr, "The system of the Hemony-carillons tuning," Acustica, vol. 3, pp. 101–104, 1951.

[25] Albrecht Schneider and Marc Leman, Studies in Musical Acoustics and Psychoacoustics, chapter Sound, Pitches and Tuning of a Historic Carillon, pp. 247–98, Springer, 2016.

[26] T. W. Parks and C. S. Burrus, Digital Filter Design, Wiley-Interscience, New York, NY, USA, 1987.

[27] Y. Hua and T. K. Sarkar, "Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 5, pp. 814–24, May 1990.

[28] Andre Lehr, "Contemporary Dutch bell-founding art," Netherlands Acoustical Society, no. 7, pp. 20–49, 1965.

[29] "The Riverside Church carillon," www.trcnyc.org/events/carillon?/.

[30] Brian Swager, A History of the Carillon: Its Origins, Development, and Evolution as a Musical Instrument, Ph.D. thesis, Indiana University, 1993.

[31] Jonathan Harvey, "Mortuos Plango, Vivos Voco: A realization at IRCAM," Computer Music Journal, vol. 5, no. 4, 1981.

[32] Matti Karjalainen, Paulo A. A. Esquef, Poju Antsalo, Aki Mäkivirta, and Vesa Välimäki, "Frequency-zooming ARMA modeling of resonant and reverberant systems," Journal of the Audio Engineering Society, vol. 50, no. 12, pp. 1012–29, December 2002.

[33] Matti Karjalainen, Vesa Välimäki, and Paulo A. A. Esquef, "Efficient modeling and synthesis of bell-like sounds," in Proceedings of the 5th International Conference on Digital Audio Effects (DAFx-02), Hamburg, Germany, September 25–28 2002, pp. 181–6.
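As a concrete illustration of the modal model summarized in the conclusions above, the following is a minimal sketch (not the authors' code; the modal data below is invented for illustration, whereas the real parameters come from the analysis of the recorded Lurie Carillon samples): each bell is a sum of exponentially decaying sinusoids, and scaling the frequency or decay parameters yields the pitch-shifting and damping manipulations described in the paper.

```python
import numpy as np

fs = 44100
t = np.arange(int(2.0 * fs)) / fs  # two seconds of audio

# Hypothetical modal data: (frequency in Hz, decay rate in 1/s, initial amplitude).
modes = [(520.0, 1.2, 1.0), (624.0, 1.5, 0.6),
         (780.0, 2.0, 0.5), (1040.0, 2.6, 0.4)]

def bell(modes, t, pitch_scale=1.0, decay_scale=1.0):
    """Sum of decaying sinusoids; scaling the parameters gives audio effects."""
    return sum(a * np.exp(-decay_scale * d * t) * np.sin(2 * np.pi * pitch_scale * f * t)
               for f, d, a in modes)

y = bell(modes, t)                               # nominal bell tone
y_fifth = bell(modes, t, pitch_scale=2**(7/12))  # all partials shifted up a fifth
```

Any such transformation (scaling, shifting, rescaling decay) operates directly on the three parameters per mode, which is what makes the representation attractive for composition.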
ABSTRACT

Using bandlimited impulse train (BLIT) synthesis, it is possible to generate waveforms with a configurable number of harmonics of equal amplitude. In contrast to the sinc-pulse, which is typically used for bandlimiting in BLIT and only allows the cutoff frequency to be set, a Hammerich pulse can be tuned by two independent parameters for cutoff frequency and stop band roll-off. By replacing the perfect lowpass sinc-pulse in BLIT with a Hammerich pulse, it is possible to directly synthesise a multitude of signals with an adjustable lowpass spectrum.

1. INTRODUCTION

Subtractive sound synthesis with analogue synthesisers requires generic and spectrally rich harmonic waveforms. Traditionally, these are rectangular, triangular and sawtooth waveforms. Their flat spectrum is successively shaped by filters until it matches the expectations of the musician. These filters usually have a lowpass characteristic with a variable cutoff frequency, an adjustable resonance peak and a slope of 12 or 24 dB per octave. One major problem of digital subtractive synthesis is the fact that trivial implementations of the required basic waveforms lead to massive aliasing. The creation of bandlimited oscillators has been an active research topic ever since digital sound synthesis became of interest. An extensive summary and discussion of various methods can be found in [1] and [2]. The latter also introduced the PolyBLEP approach (detailed in [3]), which today is a standard method to create high-quality bandlimited waveforms due to its low computational cost and easy implementation.

Subtractive synthesis is a straightforward approach when building analogue synthesisers. Today, many digital synthesisers mimic an analogue subtractive workflow, mainly because musicians have been used to it for decades and the use of filters to shape the sound is at the same time intuitive, simple and versatile. However, for the performing musician the exact synthesis method does not matter. It is more important that the synthesiser allows the creation of musical-sounding signals which can be intuitively controlled by only a few but powerful parameters [4]. From an engineering point of view it does not make sense to put a lot of effort into the generation of aliasing-free waveforms with harmonics up to half the sampling frequency and then to remove most of the high-frequency content with a lowpass filter. A method to directly synthesise the desired signal spectra would alleviate the effort for anti-aliasing strategies, as such a signal will contain less high-frequency content from the beginning. Additive synthesis as a discrete summation of amplitude-weighted sine waves would offer full control over the creation of perfectly bandlimited signals, but due to the resulting computational complexity it is rarely used in practice.

The discrete summation formulas (DSF) from Moorer [5] are a set of closed-form solutions for discrete harmonic series (infinite or finite). They allow a direct synthesis of harmonic signals with a specified number of partials and an exponentially increasing or decreasing spectral envelope. By combining differently parametrised DSF one can create further complex spectral envelopes. Although the DSF are more efficient than a discrete additive synthesis, they still require a considerable amount of trigonometric function evaluations.

Bandlimited impulse train (BLIT) synthesis [1] is another approach to create lowpass signals and relies on the fact that an impulse train exhibits a flat spectrum with an infinite amount of harmonics. When the impulse train is convolved with a sinc-function, a perfectly bandlimited signal is obtained, whereby the frequency of the sinc-pulse determines the cutoff frequency. In practical applications, the infinite-length sinc-function has to be windowed and limited to a reasonable length, and the computationally expensive convolution is replaced by a summation of overlapping sinc-pulses with a pulse distance proportional to the fundamental frequency (sum of windowed sinc-function BLIT synthesis [1]). A windowed sinc-function will lose its perfect lowpass characteristic and stop-band ripple occurs. The selection and parametrisation of the window and its length influence the final lowpass shape [6]. BLIT synthesis was further developed and optimised in various aspects (e.g. in [7, 8]), but the created signals have always been used as a bandlimited input for subtractive workflows. To the knowledge of the authors, a direct BLIT synthesis of signals with real-time configurable lowpass spectra has not been investigated so far.

The Hammerich pulse [9] was introduced as a pulse shape filter [10] for transmission systems, and its spectral shape can be tuned by two independent parameters for cutoff frequency and stop band roll-off. Replacing the sinc-pulses in BLIT with Hammerich pulses allows a direct synthesis of signals with adjustable lowpass spectra. The fundamental frequency, cutoff frequency and stop band roll-off are monotonic parameters and can be modulated smoothly. In a sound synthesis application, the few parameters and the inherent limitation to lowpass spectra reduce the complexity of the user interface and offer the musician intuitive and less technical access. Nevertheless, a wide variety of sounds similar to subtractive synthesisers can be created without additional filtering. Moreover, unique sounds can be generated, for example by modulating the filter roll-off, which would not be possible with classical analogue synthesisers.

The idea to create a versatile oscillator with Hammerich pulse shapes was already briefly described in [11]. In this paper, the focus will be on a detailed discussion of such a lowpass bandlimited impulse train (LP-BLIT) oscillator in the context of sound synthesis.
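The sum-of-windowed-sinc BLIT scheme outlined above can be sketched as follows. This is an illustrative sketch only: the sample rate, cutoff and window choice are assumptions, and nearest-sample pulse placement stands in for the fractional-delay or phase-offset techniques used in practical implementations.

```python
import numpy as np

fs, F0, fc = 48000, 250.0, 6000.0    # assumed sample rate, fundamental, cutoff
half = 128
n = np.arange(-half, half + 1)
# Windowed lowpass sinc pulse: h(n) = sinc(2*fc*n/fs), Hann-windowed
pulse = np.sinc(2 * fc / fs * n) * np.hanning(2 * half + 1)

N = fs // 4                          # a quarter of a second
y = np.zeros(N + 2 * half + 1)
pos = 0.0
while pos < N:                       # one pulse per fundamental period
    k = int(round(pos))              # nearest-sample placement (no fractional delay)
    y[k:k + 2 * half + 1] += pulse
    pos += fs / F0
x = y[half:half + N]                 # harmonics of F0 up to roughly fc
```

The resulting signal carries harmonics of F0 with roughly equal amplitude up to fc; the residual stop-band ripple reflects the window choice, as discussed above.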
… create more complex spectra by combination of several oscillator outputs, leaky integration and parameter modulation, before Section 4 provides a short discussion and conclusion.

The impulse train d(t) with period T0 exhibits a spectrum

    D(jω) = Σ_{l=−∞}^{∞} δ(ω − l·ω0),   ω0 = 2π/T0   (2)

with harmonics at multiples of ω0. The fact that the harmonic interval ω0 is directly related to T0 can be utilised to synthesise a signal with an infinite amount of harmonics and a fundamental frequency F0 = 1/T0. Sampling of d(t) for digital implementations would lead to massive aliasing due to its infinite bandwidth. Applying an ideal lowpass filter with a cutoff frequency ωc = 2π·fc

    h(t) = sin(ωc·t)/(ωc·t)  ↔  H(jω) = rect(ω/(2·ωc))   (3)

results in a bandlimited signal

    x(t) = d(t) ∗ h(t)   (4)

which can then be sampled at a sample rate fs > 2·fc without aliasing [1]. In the frequency domain, the convolution from Eq. 4 will create a harmonic spectrum

    X(jω) = H(jω) · D(jω) = H(jω) · Σ_{l=−∞}^{∞} δ(jω − l·ω0)   (5)

weighted with the spectral shape H of the filter. By replacing or modifying the filter, it is possible to synthesise any bandlimited harmonic signal with an arbitrary spectral shape.

Instead of a computationally expensive time-domain convolution, a direct summation of time-shifted impulse responses

    x(t) = d(t) ∗ h(t) = Σ_{l=−∞}^{∞} h(t − l·T0)   (6)

will lead to an identical result. This is particularly useful if the impulse response can simply be calculated with a closed-form equation, as is the case with the sinc-function.

Other pulses with a lowpass characteristic, for example the raised cosine pulse, offer a more detailed adjustment of their frequency response than the sinc-function. The Hammerich pulse was introduced in [10] as a pulse shape filter for transmission systems. Its impulse and frequency response is given by

    hH(t) = α·sin(ωc·t) / sinh(α·ωc·t)   (7)

    HH(jω) = (1/(4·fc)) · [1 − tanh((|ω| − ωc)/(4·α·fc))]   (8)

and permits an intuitive adjustment of the lowpass characteristic by two independent parameters for cutoff frequency ωc and stop band slope α. Reasonable parameter ranges are 0 < α < 10 and ωc > 2π·F0. For small values of α, the impulse response

    lim_{α→0} hH(t) = sin(ωc·t)/(ωc·t) = sinc(ωc·t)   (9)

converges towards a sinc-function, and for α → ∞ the resulting stop band slope as well as the pulse width converge to zero. Exemplary impulse and frequency responses are depicted in Fig. 1 for selected values of α. The −6 dB point is fairly accurately set by ωc, whereas the linear slope beyond this point is only controlled by α. Figure 2 depicts a time-domain pulse train after convolution with a Hammerich impulse response together with the corresponding spectrum. The parameters of the Hammerich filter were chosen as fc = 4·F0 and α = 0.4. All harmonics follow the shape of the pulse spectrum, as expected based on Eq. 5.

Figure 1: Hammerich pulse for various values of α ((a) time domain hH(t); (b) frequency domain, |HH(f)| in dB over f/fc).

2.1. Discrete-time implementation with finite pulse length

By sampling the Hammerich pulse from Eq. 7 with a sample rate fs, the discrete-time pulse

    hH(n) = α·sin(Ωc·n) / sinh(α·Ωc·n),   Ωc = 2π·fc/fs,   (10)

is obtained. The cutoff frequency fc = Nh·F0 can also be expressed as a multiple of the fundamental frequency, whereas Nh ≥ 1 determines the number of harmonics before the filter starts to roll off. This yields a pulse

    hH(n) = α·sin(Nh·Ω0·n) / sinh(α·Nh·Ω0·n),   Ω0 = 2π·F0/fs,   (11)
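A direct per-sample evaluation of Eq. (11), with the n = 0 sample set to the limit value 1, might look as follows (a sketch; the parameter values are arbitrary). The last lines check the sinc limit of Eq. (9) numerically for a small α:

```python
import numpy as np

def hammerich(n, alpha, Nh, F0, fs):
    """Discrete-time Hammerich pulse, Eq. (11); n may be a signed index array."""
    W0 = 2 * np.pi * F0 / fs
    out = np.ones(np.asarray(n).shape)          # limit value 1 at n = 0
    nz = n != 0
    out[nz] = alpha * np.sin(Nh * W0 * n[nz]) / np.sinh(alpha * Nh * W0 * n[nz])
    return out

n = np.arange(-256, 257)
h = hammerich(n, alpha=0.4, Nh=4, F0=250.0, fs=48000.0)

# For alpha -> 0 the pulse converges to the sinc-function of Eq. (9):
h_small = hammerich(n, alpha=1e-4, Nh=4, F0=250.0, fs=48000.0)
wc_n = 4 * 2 * np.pi * 250.0 / 48000.0 * n
sinc_ref = np.ones_like(h_small)
sinc_ref[n != 0] = np.sin(wc_n[n != 0]) / wc_n[n != 0]
assert np.max(np.abs(h_small - sinc_ref)) < 1e-4
```

Because the pulse is given in closed form, it can be recomputed sample by sample while α, Nh and F0 are being modulated, which is the property the paper exploits.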
Figure 2: Pulse train d(t), Hammerich pulse hH(t) and resulting signal x(t) over t/T0.

Figure: magnitude error in dB, comparing single- and double-precision evaluation.

… where the number of evaluated terms K determines the accuracy. For the periodic sine, the argument has to be wrapped to the range x ∈ [−π/2, π/2] to minimize the required order of the Taylor …

… yields a signal where every i-th harmonic is cancelled. For i = 2, which is equivalent to using a bipolar impulse train as source signal [1], a signal xO(n) with only odd harmonics remains (Fig. 4 b).
Figure 4: Unipolar pulse train with full number of harmonics (a), bipolar pulse train with only odd harmonics (b) and signal with even harmonics (c) as a sum of two different unipolar pulse trains.

Figure 5: Sawtooth (a), rectangular (b) and triangular waveforms (c) created by a combination of oscillators and leaky integration.
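The constructions behind Figures 4 and 5 (cancelling every i-th harmonic by subtracting a half-period-shifted pulse train, then shaping the result with a leaky integrator) can be sketched as follows. This is illustrative only: ideal impulses stand in for the bandlimited Hammerich pulses, and the parameter values are assumptions.

```python
import numpy as np

fs, F0 = 48000, 250.0
N = fs                                   # one second; period is fs/F0 samples
period = fs / F0

def pulse_train(offset):
    x = np.zeros(N)
    for k in np.arange(offset, N, period):
        x[int(round(k))] += 1.0
    return x

# Subtracting a train shifted by half a period cancels every 2nd harmonic,
# giving a bipolar impulse train with only odd harmonics (cf. Fig. 4 b):
x_bipolar = pulse_train(0.0) - pulse_train(period / 2)

# A leaky integrator y[n] = lam*y[n-1] + x[n] turns the bipolar train into an
# approximately rectangular waveform (cf. Fig. 5 b):
lam = 0.999
y = np.zeros(N)
acc = 0.0
for i, v in enumerate(x_bipolar):
    acc = lam * acc + v
    y[i] = acc
```

The same pattern with other shift fractions and signs yields the even-harmonic signal of Fig. 4 c and, after integration, the sawtooth and triangular waveforms of Fig. 5.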
Figure 6: Example spectrograms showing a modulation of the fundamental frequency and number of harmonics ((b) step-modulated number of harmonics, NH = 3, 5, 9, 13; axes: time in seconds, frequency in kHz, magnitude in dB).

4. CONCLUSION

Usually, in bandlimited impulse train (BLIT) synthesis, sinc-pulses are used to filter a pulse train and to obtain a spectrum with a defined number of harmonics of equal magnitude. In this paper it was proposed to replace the sinc-pulse with a Hammerich pulse, as its spectral shape can be directly controlled by two independent parameters for cutoff frequency and filter roll-off. The closed-form equation for the Hammerich pulse can be evaluated per sample, does not require the creation of a wavetable, and allows immediate modulation of the pulse parameters. As all harmonics in a pulse are in phase, differently configured oscillators can be easily combined to create more complex spectral shapes. It was shown how to synthesise spectra with odd or even harmonics, and together with a leaky integrator, various standard waveforms (rectangular, triangular, sawtooth) can be created. The Hammerich pulse considerably expands the BLIT principle to become a full-featured synthesis procedure, and despite the restriction to lowpass spectra, a wide variety of useful sounds and waveforms can be created without additional filtering.

The possibilities and limitations of this new waveform generation algorithm still have to be explored in practical musical applications. First tests with a real-time modulation of the pulse parameters were quite promising. In particular, the simple interface with only a few but very expressive parameters supports an intuitive and creative workflow. For the future it might be particularly interesting to find further pulse shapes which can be calculated and parametrised in a similar fashion as the Hammerich pulse but exhibit a different frequency response, e.g. highpass, bandpass or resonant lowpass.

5. ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their extensive and valuable feedback which helped to improve this paper in many aspects.

6. REFERENCES

[3] …, "Perceptually informed synthesis of bandlimited classical waveforms using integrated polynomial interpolation," The Journal of the Acoustical Society of America, vol. 131, no. 1, pp. 974, 2012.

[4] John Lazzaro and John Wawrzynek, "Subtractive Synthesis without Filters," in Audio Anecdotes II – Tools, Tips, and Techniques for Digital Audio, pp. 55–64, 2004.

[5] James A. Moorer, "The Synthesis of Complex Audio Spectra by Means of Discrete Summation Formulas," Journal of the Audio Engineering Society, vol. 24, no. 9, pp. 717–727, 1976.

[6] Jussi Pekonen, Juhan Nam, Julius O. Smith, Jonathan S. Abel, and Vesa Välimäki, "On Minimizing the Look-Up Table Size in Quasi-Bandlimited Classical Waveform Oscillators," in Proc. of the 13th Int. Conference on Digital Audio Effects, 2010.

[7] Juhan Nam, Jonathan S. Abel, and Julius O. Smith, "Efficient Antialiasing Oscillator Algorithms Using Low-Order Fractional Delay Filters," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 4, pp. 773–785, 2010.

[8] Stéphan Tassart, "Band-limited impulse train generation using sampled infinite impulse responses of analog filters," IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 3, pp. 488–497, 2013.

[9] Edwin Hammerich, "A Generalized Sampling Theorem for Frequency Localized Signals," Sampling Theory in Signal and Image Processing, vol. 8, no. 2, pp. 127–146, 2007.

[10] Edwin Hammerich, "Design of Pulse Shapes and Digital Filters Based on Gaussian Functions," 2009.

[11] Udo Zölzer, "Pitch-based digital audio effects," in Proc. of the 5th Int. Symposium on Communications, Control and Signal Processing (ISCCSP), 2012.

[12] Eli Brandt, "Hard Sync without Aliasing," in Proc. of the Int. Computer Music Conference (ICMC), 2001.
Gianpaolo Evangelista
Institute for Composition, Electroacoustics and Sound Engineering Education
MDW University of Music and Performing Arts
Vienna, Austria
evangelista@mdw.ac.at
Linear representations play a central role in sound synthesis, digital audio effects, feature extraction, coding and music information retrieval. In most of these representations, one of the coordinates in the transformed domain is time-shift, as in the case of time-frequency representations [1], e.g., based on the STFT (Short-Time Fourier Transform, also known as the phase vocoder), or as in the case of time-scale representations based on the WT (Wavelet Transform) [2, 3]. Other examples based on different signal transformations, such as the MDCT (Modified Discrete Cosine Transform) [4, 5] and the windowed Hankel (Fourier-Bessel) transform [6, 7], to name a few, are available.

In this paper we are mostly concerned with audio signal representations that are exact, i.e., for which an inverse or pseudo-inverse exists such that perfect reconstruction of the original signal is possible from the analysis data. Other analysis-only methods, such as the constant-Q transformation [8], may enjoy easier computation but may introduce data loss or have inefficient reconstruction algorithms, so they are not directly suitable for synthesis from analysis.

Generally speaking, mathematical transforms operate on a one-dimensional audio signal and produce a multidimensional representation. For example, the STFT yields a 2D signal representation in which one coordinate can be interpreted as time (more …

… where ⟨f, g⟩ denotes the scalar product in the signal space¹. We call "time-something" the generic representation whose representative elements are obtained by cascading time-shift with another operator O acting on a prototype function g.

In the STFT based representations the operator O corresponds to modulation

    [Mν s](t) = e^(j2πνt) s(t),   (4)

while in the WT based representations the operator O corresponds to dilation

    [Da s](t) = (1/√a) s(t/a),   (5)

which we wrote here as a unitary operator.

Notice that in some definitions the time-shift and the O operators appear in reverse order with respect to that in (3). However, with the introduction of a phase change or of a modification of the transform coordinates, in most useful cases one can redefine the transformation as in (3). For example, in the STFT, we have:

    [Tτ Mν g](t) = e^(j2πν(t−τ)) g(t−τ) = e^(−j2πντ) [Mν Tτ g](t).   (6)

¹ Typically this is the space L²(R) of square-integrable functions (finite-energy signals) on the real line; however, the choice might depend on the signal representation we are considering.
The phase factor e^(−j2πντ) is irrelevant: it cancels out when analysis and synthesis are performed and does not need to be included in the computation. For the wavelet transform, one can show that:

    Da T(τ/a) = Tτ Da,   (7)

with the obvious reinterpretation of the scaled time-shift τ/a.

Most of the mentioned representations can be studied in terms of the time and frequency spreading of the transform nucleus

    Kτ,σ(t) = [Tτ Oσ g](t),   (8)

as the parameters τ and σ, which are the coordinates of the transformed domain, vary. Directly or indirectly (via the relationship between the coordinate σ and the frequency), this leads to a tiling of the time-frequency plane, in which the specific form of the time-frequency tiles carries a loose meaning in terms of the signal components that get detected or trapped in a certain time-frequency zone.

However, the tiling is often dictated by mathematical simplifications for the definition of the operations used in generating the transform nucleus. Thus, for example, in the STFT based time-frequency representations time and frequency resolutions are both uniform, and in the time-scale WT based representations time-shift is uniform in each scale. Also, in the WT, scale and time are interrelated so that the uncertainty product of the representative elements is constant throughout the tiling. In applications one would often like to achieve a higher degree of flexibility in allocating the time-frequency spread of the representative elements in order to produce a tiling prescribed, e.g., by psychoacoustic or physical characteristics.

In previous work, frequency warping has been considered in conjunction with signal representations in order to increase the flexibility and adaptivity of the analysis-synthesis scheme [10, 11, 12, 13, 14]. Since frequency warping introduces dispersion in time-localization, methods for reducing or eliminating this effect are desired. In recent work [15, 16, 17], we proposed techniques, the so-called redressing methods, that are directly suitable for the STFT based time-frequency representation once the warping map is defined. These basically consist in the application of suitable operators aimed at linearizing the phase of the dispersive delays resulting from the application of warping to the time-shift operator.

Since, over the continuum, a 2D representation of a 1D signal is generally redundant, sampling can be introduced by selecting a grid of points in the transform domain. Provided that certain conditions on the grid and on the window or wavelet are satisfied, the original signal can be reconstructed from the transform samples evaluated at the grid points only. This leads to a signal expansion in terms of a set of functions constituting a frame, or otherwise to a bi-orthogonal / orthogonal and complete set for the signal space.

Sampling introduces some complications in the application of the redressing methods and may impose limitations on the choice of the window in the generalized Gabor frames considered in [15, 16, 17]. There, discrete-time frequency warping operators were applied in the sampled-time STFT domain in order to partially eliminate dispersion. The results show that for bandlimited windows the dispersion elimination procedure can be made exact and leads to the same results as ad-hoc methods for the construction of non-stationary Gabor frames [18, 19, 20].

Transform-domain discrete-time frequency warping also leads to approximated algorithms for the computation of the generalized warped Gabor frames [21, 22, 23]. On the other hand, the computation of discrete-time frequency warping as a digital audio effect can be eased by the use of the STFT [24].

In this paper we extend our methods to other generic time-something representations, of which wavelet expansions are an example.

The paper is organized as follows: in Section 2 we discuss the concept of unitary equivalence, applying it to unitary warping in Section 3. In Section 4 we discuss redressing methods for the generic unsampled and sampled time-something representations, with examples for the wavelet expansion. Finally, in Section 5 we draw our conclusions.

Examples and experimental code will be made available at the author's web page: http://members.chello.at/~evangelista/.

2. UNITARY EQUIVALENCE

In order to generalize the linear signal representations and be able to adapt the representing elements to important physical or perceptual characteristics, one can resort to a further transformation using invertible operators. This is shown in Figure 1, where the analysis and synthesis blocks allow for perfect reconstruction. However, in the scheme in the figure, the synthesis of the signal s is achieved by further operating with an invertible operator U. In other words, if in the original analysis-synthesis scheme the signal was synthesized in terms of a continuous or discrete set of functions ψα(t), in the new scheme the signal is synthesized in terms of the functions [Uψα](t), where α is either the variable or the index of superposition for the synthesis, e.g.,

    s(t) = Σ_α Cα [Uψα](t),   (9)

in which Cα denote the coefficients of the expansion. Here, in order to simplify the notation, the variable α incorporates all of the transform coordinates α1, ..., αN, such as time-shift and frequency or time-shift and scale.

Clearly, in order to provide the correct analysis algorithm, in the new scheme one needs to operate on the signal with the inverse operator U⁻¹ and obtain the analysis coefficients Cα in (9) from the signal [U⁻¹s](t). However, in general, perfect reconstruction is not guaranteed, and one needs to prove properties like the norm convergence of the expansion on the right-hand side of (9) to the signal on the left-hand side.

Perfect reconstruction is guaranteed, with no further proof, if the operator U is unitary, i.e., if U is a bounded linear operator satisfying

    U U† = U† U = I,   (10)

where I is the identity operator and U† represents the adjoint of U, i.e., the operator satisfying

    ⟨Uf, g⟩ = ⟨f, U† g⟩   (11)

for any pair of functions f and g belonging to the signal space. In this case, the operator U† is both the left and right inverse of U, and U realizes a surjective isometry in the function space. Therefore, properties such as orthogonality, norm and norm convergence are preserved, so one can simply invoke unitary equivalence to verify perfect reconstruction [11].
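The unitary-equivalence argument can be checked numerically in finite dimensions. The sketch below (an illustration, not part of the paper) uses the normalized DFT matrix as an example of U: unitarity (10) gives perfect reconstruction (9) and norm preservation with no further proof.

```python
import numpy as np

N = 64
U = np.fft.fft(np.eye(N)) / np.sqrt(N)        # normalized DFT as an example unitary U

assert np.allclose(U @ U.conj().T, np.eye(N)) # U U† = U† U = I, Eq. (10)

rng = np.random.default_rng(0)
s = rng.standard_normal(N)

# New basis functions: U e_k. Analysis with the inverse operator U†, then resynthesis:
C = U.conj().T @ s                            # coefficients C_k = (U† s)_k
s_rec = U @ C                                 # s = sum_k C_k * (U e_k), Eq. (9)

assert np.allclose(s_rec, s)                  # perfect reconstruction
assert np.isclose(np.linalg.norm(C), np.linalg.norm(s))  # norm preserved (isometry)
```

Replacing U by a merely invertible (non-unitary) matrix breaks the norm-preservation assertion, which is exactly why the text requires extra proof in that case.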
3.1. Warping Operators

Warping operators perform function composition in order to remap the abscissa. Thus, a time warping operator Wγ remaps the time axis by means of the following action:

    stw = Wγ s = Cγ s = s ∘ γ,   (14)

where stw is the time-warped version of the signal s, γ is the time warping map, Cγ is the composition-by-γ operator and ∘ denotes function composition.

… where ν denotes frequency. We assume henceforth that all warping maps are almost everywhere increasing, so that the magnitude sign can be dropped from the derivative under the square root.

The unitary warping operator (17) is a special case of a weighted composition operator [28]. It turns out that the weight function (the derivative of the map, in our case) helps to release some constraints on the maps that guarantee the continuity of the weighted composition operator. A classical example [28] is the map γ(x) = x² in the space L²([0, 1]). The map does not define a bounded composition operator (γ′ is not bounded from below).
However, with the weight √(γ′(x)) = √(2x) one obtains a bounded (unitary) weighted composition operator. More general conditions for the definition of unitary weighted composition operators can be found in [29].
Incidentally, note that the dilation operator introduced in (5) is a particular case of unitary time warping operator Uγ where the warping map γ(t) = t/a is linear. By the Fourier scaling theorem, this is also equivalent to a frequency warping unitary operator Uθ̃ with map θ(ν) = aν.
When a unitary frequency warping operator Uθ̃ acts as modifier of a time-shift based signal representation, the time-shift operator is affected by a similarity transformation as in (13). By a calculation similar to the one conducted in the Appendix one can conclude that

[Uθ̃ Tτ Uθ̃⁻¹ s](t) = ∫_{−∞}^{+∞} dν e^{j2πνt} e^{−j2πθ(ν)τ} ŝ(ν),  (18)

which shows how, by the effect of warping, the delay τ converts into a dispersive delay represented, in the frequency domain, by the factor e^{−j2πθ(ν)τ}.
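A discrete illustration of (18): multiplying the spectrum by the all-pass factor e^{−j2πθ(ν)τ} implements a dispersive delay. The sketch below is an illustration under stated assumptions, not the paper's algorithm: the DFT serves as the frequency representation, and with the identity map θ(ν) = ν the operation degenerates to an ordinary circular delay.

```python
import numpy as np

def dispersive_delay(s, theta, tau, fs=1.0):
    """Discrete sketch of (18): multiply the spectrum of s by
    exp(-j*2*pi*theta(nu)*tau).

    theta : frequency warping map, applied to the signed DFT frequencies
    tau   : delay in seconds; fs is the sampling rate.
    For theta(nu) = nu this is an ordinary (circular) delay by tau.
    """
    nu = np.fft.fftfreq(len(s), d=1.0 / fs)    # signed frequency axis
    S = np.fft.fft(s)
    return np.fft.ifft(S * np.exp(-2j * np.pi * theta(nu) * tau))

# Identity map: a pure delay. With tau an integer number of samples,
# the result is a circular shift of the input.
s = np.random.default_rng(0).standard_normal(64)
y = dispersive_delay(s, lambda nu: nu, tau=3.0)   # 3-sample circular delay
```

A nonlinear θ makes the delay frequency dependent, which is precisely the dispersion that the redressing methods aim to remove.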
3.2. Example: Warping Map for Change of Scale in Sets of Wavelets

Figure 2: Frequency warping map to convert octave-band to halftone-band wavelets. The circles in the plot denote the halftone mapping points.
Sets of wavelets for signal expansions are obtained by choosing a grid of points for sampling the WT in time-scale space of the type (aⁿm, aⁿ), where m and n are integers and a is a constant scale factor. It is well known that the dyadic scheme, which fixes a = 2 and octave band resolution, is the easiest way to generate orthogonal and complete sets of wavelets. For sound and music processing scopes, it is highly interesting to improve the frequency resolution, for example to half-tones. One can achieve this by means of a warping map θ that allows for a change of the scale factor from a to b < a. In other words, we are requiring the map to satisfy:

a θ(ν) = θ(bν),  (19)

so that, by repeated applications of (19), one can show that

√(θ′(ν)) √(aⁿ) ψ̂(aⁿ θ(ν)) = √(θ′(bⁿν)) √(bⁿ) ψ̂(θ(bⁿ ν)),  (20)

where ψ̂ is the Fourier transform of a scale-a mother wavelet. This shows that the frequency warped wavelet ψ̃, whose Fourier transform is

ψ̂̃(ν) = √(θ′(ν)) ψ̂(θ(ν)),  (21)

properly scales through powers of b to the warped versions of the a-scaled versions of the original mother wavelet ψ:

√(bⁿ) ψ̂̃(bⁿ ν) = √(θ′(ν)) √(aⁿ) ψ̂(aⁿ θ(ν)).  (22)

A particular map that satisfies the generalized homogeneity condition (19) is given by the exp-log function:

θ(ν) = C sgn(ν) a^{log_b |ν|},  (23)

where sgn(ν) is the signum function and C is an arbitrary constant that can be set, for example, by constraining the warping map to fix a specific frequency. This capability is important in sampled systems in order to preserve the sampling rate, in which case we set:

C = ν_N a^{−log_b ν_N},  (24)

where ν_N is the Nyquist frequency, so that θ(ν_N) = ν_N and the overall bandwidth is preserved.
Notice that for the map in (23) we have:

dθ/dν = C log_b(a) a^{log_b |ν|} / |ν|,  (25)

so that, for the case of interest where a > b > 1, which corresponds to higher frequency resolution warped wavelets, we have lim_{ν→0} θ′(ν) = 0, so that θ′ is not essentially bounded from below and therefore the corresponding composition operator may not be bounded. However, in representations for sound and music signals, we are not interested in extreme accuracy around zero frequency. In order to obtain a bounded composition operator, one can therefore replace the warping map (23) with one that is identical to it except that, on a small interval around zero frequency, the original map is replaced by a linear segment continuously going from 0 to the value of the original map at the extremes of the interval. Warping with this map will still yield an exact representation, with the desired frequency band allocation.
A map suitable for the conversion of octave-band wavelets into halftone-band wavelets is shown in Figure 2. On a log-log scale the map would appear as linear. The Fourier transforms of the corresponding frequency warped wavelets based on an octave-band pruned tree-structured FIR Daubechies filter bank [3] of order 31 are shown in Figure 3.
The dispersion pattern of 1/3-octave resolution frequency warped wavelets is shown in Figure 4, where each area delimited by the band edge limits (horizontal lines) and two adjacent group delay curves roughly represents the time-frequency resolution or, more precisely, the uncertainty zone of the corresponding wavelet. The fact that the time boundaries are curved is not good news for the time localization of signal components or features. Fortunately, redressing methods, which we will illustrate in the re-

Figure 3: Magnitude Fourier transform of frequency warped wavelets with halftone frequency resolution.

Figure 4: Uncertainty zones delimited by the group delay dispersion profiles of 1/3-octave warped wavelets without redressing.

[Block diagram: warped analysis-synthesis scheme, with the signal s(t) passed through warping operators Uθ̃ surrounding the Analysis and Synthesis stages.]
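The exp-log map (23) with the Nyquist-preserving constant (24) is easy to check numerically. The sketch below is an illustration only (Hz units and a 22.05 kHz Nyquist frequency are assumptions); it verifies the generalized homogeneity condition (19) and the fixed point θ(ν_N) = ν_N for the octave-to-halftone case a = 2, b = 2^(1/12).

```python
import numpy as np

def theta(nu, a, b, nu_N):
    """Exp-log warping map (23), with C chosen as in (24) so theta(nu_N) = nu_N."""
    C = nu_N * a ** (-np.log(nu_N) / np.log(b))    # C = nu_N * a^(-log_b nu_N)
    return C * np.sign(nu) * a ** (np.log(np.abs(nu)) / np.log(b))

# Octave-band (a = 2) to halftone-band (b = 2**(1/12)) conversion.
a, b, nu_N = 2.0, 2.0 ** (1.0 / 12.0), 22050.0
nu = np.linspace(100.0, nu_N, 500)

# Generalized homogeneity (19): a * theta(nu) == theta(b * nu)
homogeneity_gap = np.max(np.abs(a * theta(nu, a, b, nu_N) - theta(b * nu, a, b, nu_N)))
# Nyquist fixed point from (24): theta(nu_N) == nu_N
nyquist_value = theta(nu_N, a, b, nu_N)
```

On a log-log scale this map is a straight line of slope log_b(a), which is why the map in Figure 2 "would appear as linear" there.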
Figure 6: Segment of the map ϑ2 for the redressing of 1/3-octave warped wavelets obtained from dyadic ones.

with total variation 1/2. The remaining part of each map ϑr is then extended by odd symmetry about 0 frequency and modulo 1 periodicity. Actually, in this specific example, in view of property (19), the maps are the same for any r.
Of course, in general, the detailed approximation margins depend on the particular time-something representation at hand. However, the flexibility in the choice of the sampling of the time-shift variable in the transform domain can lead to good results as in the case of generalized Gabor frames [23, 21].

5. CONCLUSIONS

In this paper we have revisited recently introduced redressing methods for mitigating dispersion in warped signal representations, with the objective of generalizing the procedure to generic time-something representations like wavelet expansions.
We showed that redressing methods can be extended to generalized settings in both unsampled and sampled transform domains, introducing notation to simplify their application to signal analysis-synthesis schemes.
The phase linearization techniques lead to approximated schemes for the computation of the warped and redressed signal representations, where one can pre-warp the analysis window or mother wavelet and compute the transform without the need for further warping.

6. REFERENCES

[1] K. Gröchenig, Foundations of Time-Frequency Analysis, Applied and Numerical Harmonic Analysis, Birkhäuser Boston, 2001.
[2] S. Mallat, A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way, Academic Press, 3rd edition, 2008.
[3] I. Daubechies, Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.
[4] J. P. Princen, A. W. Johnson, and A. B. Bradley, "Subband transform coding using filter bank designs based on time domain aliasing cancellation," in IEEE Proc. Intl. Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1987, pp. 2161–2164.
[5] H. S. Malvar, "Modulated QMF filter banks with perfect reconstruction," Electronics Letters, vol. 26, no. 13, pp. 906–907, June 1990.
[6] S. Ghobber and S. Omri, "Time-frequency concentration of the windowed Hankel transform," Integral Transforms and Special Functions, vol. 25, no. 6, pp. 481–496, 2014.
[7] C. Baccar, N. B. Hamadi, and H. Herch, "Time-frequency analysis of localization operators associated to the windowed Hankel transform," Integral Transforms and Special Functions, vol. 27, no. 3, pp. 245–258, 2016.
[8] J. Brown, "Calculation of a constant Q spectral transform," Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, January 1991.
[9] Y. A. Abramovich and C. D. Aliprantis, An Invitation to Operator Theory, vol. 50 of Graduate Studies in Mathematics, American Mathematical Society, 2002.
[10] A. V. Oppenheim, D. H. Johnson, and K. Steiglitz, "Computation of spectra with unequal resolution using the Fast Fourier Transform," Proc. of the IEEE, vol. 59, pp. 299–301, Feb. 1971.
[11] R. G. Baraniuk and D. L. Jones, "Unitary equivalence: A new twist on signal processing," IEEE Transactions on Signal Processing, vol. 43, no. 10, pp. 2269–2282, Oct. 1995.
[12] G. Evangelista and S. Cavaliere, "Frequency warped filter banks and wavelet transform: A discrete-time approach via Laguerre expansions," IEEE Transactions on Signal Processing, vol. 46, no. 10, pp. 2638–2650, Oct. 1998.
[13] G. Evangelista and S. Cavaliere, "Discrete frequency warped wavelets: Theory and applications," IEEE Transactions on Signal Processing, vol. 46, no. 4, pp. 874–885, Apr. 1998, special issue on Theory and Applications of Filter Banks and Wavelets.
[14] G. Evangelista, "Dyadic warped wavelets," Advances in Imaging and Electron Physics, vol. 117, pp. 73–171, Apr. 2001.
[15] G. Evangelista, M. Dörfler, and E. Matusiak, "Phase vocoders with arbitrary frequency band selection," in Proceedings of the 9th Sound and Music Computing Conference, Copenhagen, Denmark, 2012, pp. 442–449.
[16] G. Evangelista, M. Dörfler, and E. Matusiak, "Arbitrary phase vocoders by means of warping," Music/Technology, vol. 7, no. 0, 2013.
[17] G. Evangelista, "Warped frames: dispersive vs. non-dispersive sampling," in Proceedings of the Sound and Music Computing Conference (SMC-SMAC-2013), Stockholm, Sweden, 2013, pp. 553–560.
[18] P. Balazs, M. Dörfler, F. Jaillet, N. Holighaus, and G. A. Velasco, "Theory, implementation and applications of nonstationary Gabor frames," Journal of Computational and Applied Mathematics, vol. 236, no. 6, pp. 1481–1496, 2011.
[19] G. A. Velasco, N. Holighaus, M. Dörfler, and T. Grill, "Constructing an invertible constant-Q transform with nonstationary Gabor frames," in Proceedings of the Digital Audio Effects Conference (DAFx-11), Paris, France, 2011, pp. 93–99.
[20] N. Holighaus and C. Wiesmeyr, "Construction of warped time-frequency representations on nonuniform frequency scales, Part I: Frames," ArXiv e-prints, Sept. 2014.
[21] T. Mejstrik and G. Evangelista, "Estimates of the reconstruction error in partially redressed warped frames expansions," in Proceedings of the Digital Audio Effects Conference (DAFx-16), Brno, Czech Republic, 2016, pp. 9–16.
[22] T. Mejstrik, "Real time computation of redressed frequency warped Gabor expansion," Master thesis, University of Music and Performing Arts Vienna, tommsch.com/science.php, 2015.
[23] G. Evangelista, "Approximations for online computation of redressed frequency warped vocoders," in Proceedings of the Digital Audio Effects Conference (DAFx-14), Erlangen, Germany, 2014, pp. 1–7.
[24] G. Evangelista and S. Cavaliere, "Real-time and efficient algorithms for frequency warping based on local approximations of warping operators," in Proceedings of the Digital Audio Effects Conference (DAFx-07), Bordeaux, France, Sept. 2007, pp. 269–276.
[25] R. K. Singh and A. Kumar, "Characterizations of invertible, unitary, and normal composition operators," Bulletin of the Australian Mathematical Society, vol. 19, pp. 81–95, 1978.
[26] R. K. Singh and J. S. Manhas, Composition Operators on Function Spaces, vol. 179, North Holland, 1993.
[27] R. K. Singh, "Invertible composition operators on L²(λ)," Proceedings of the American Mathematical Society, vol. 56, no. 1, pp. 127–129, April 1976.
[28] J. T. Campbell and J. E. Jamison, "On some classes of weighted composition operators," Glasgow Mathematical Journal, vol. 32, no. 1, pp. 87–94, 1990.
[29] V. K. Pandey, Weighted Composition Operators and Representations of Transformation Groups, Ph.D. thesis, University of Lucknow, India, 2015.
[30] O. Christensen, "Frames and pseudo-inverses," Journal of Mathematical Analysis and Applications, vol. 195, no. 2, pp. 401–414, 1995.
[31] P. W. Broome, "Discrete orthonormal sequences," Journal of the ACM, vol. 12, no. 2, pp. 151–168, Apr. 1965.
[32] L. Knockaert, "On orthonormal Muntz-Laguerre filters," IEEE Transactions on Signal Processing, vol. 49, no. 4, pp. 790–793, Apr. 2001.

7. APPENDIX: PROOF THAT Uθ̃^(τ) Uθ̃ Tτ = Tτ Wθ̃

In this Appendix we prove an important result on which the redressing methods for the warped time-shift operators are based. We show that the application of the same warping operator Uθ̃, but with respect to the time-shift τ instead of time t, which we denote by Uθ̃^(τ), results in the commutation of the warping operator Uθ̃ with the time-shift operator Tτ in its non-unitary form.
We are going to prove our result under the assumption that the map θ is almost everywhere increasing and has odd parity:

θ(−ν) = −θ(ν).  (38)

In this case, the first derivative

θ′(ν) = dθ/dν  (39)

is almost everywhere positive and has even parity:

θ′(−ν) = θ′(ν).  (40)

From (17) we have:

[Uθ̃ s](t) = ∫_{−∞}^{+∞} dα Kθ(t, α) s(α),  (41)

where

Kθ(t, α) = ∫_{−∞}^{+∞} dν √(dθ/dν) e^{j2π(νt − θ(ν)α)} = ∫_{−∞}^{+∞} dν √(dθ⁻¹/dν) e^{j2π(θ⁻¹(ν)t − να)},  (42)

where we have used the fact that the map θ is invertible almost everywhere, so that

θ⁻¹(θ(ν)) = ν  (43)

for almost all ν ∈ ℝ, from which it follows that

1 = d[θ⁻¹(θ(f))]/df = [dθ⁻¹/dα]_{α=θ(f)} · dθ/df  (44)

almost everywhere.
Thus, given any finite energy signal s(t), we have

[Uθ̃^(τ) Uθ̃ Tτ s](t) = ∫_{−∞}^{+∞} dα Kθ(τ, α) ∫_{−∞}^{+∞} dβ Kθ(t, β) s(β − α)
= ∫_{−∞}^{+∞} dα Kθ(τ, α) ∫_{−∞}^{+∞} dη Kθ(t, η + α) s(η).  (45)

Using (42), by direct calculation, it is easy to show that

∫_{−∞}^{+∞} dα Kθ(τ, α) Kθ(t, η + α) = Lθ(t − τ, η),  (46)

where Lθ(t, α) is the nucleus of the non-unitary warping operator Wθ̃, i.e.,

Lθ(t, α) = ∫_{−∞}^{+∞} dν e^{j2π(νt − θ(ν)α)} = ∫_{−∞}^{+∞} dν (dθ⁻¹/dν) e^{j2π(θ⁻¹(ν)t − να)}.  (47)

Thus, (45) becomes

[Uθ̃^(τ) Uθ̃ Tτ s](t) = ∫_{−∞}^{+∞} dη Lθ(t − τ, η) s(η) = [Tτ Wθ̃ s](t),  (48)

which is what we needed to prove.
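The inversion identity (44) can be checked numerically with finite differences. In the sketch below (an illustration only, not from the paper), θ(ν) = sgn(ν)|ν|³ serves as a stand-in almost-everywhere increasing odd map with the explicit inverse θ⁻¹(ν) = sgn(ν)|ν|^(1/3); the product of the two derivatives should be 1.

```python
import numpy as np

# Finite-difference check of (44): d(theta^-1)/dalpha evaluated at
# alpha = theta(f), times d(theta)/df, equals 1 almost everywhere.
# theta(nu) = sign(nu)*|nu|**3 is a stand-in increasing odd map
# (an assumption for illustration; it is not a map from the paper).
theta = lambda nu: np.sign(nu) * np.abs(nu) ** 3
theta_inv = lambda nu: np.sign(nu) * np.abs(nu) ** (1.0 / 3.0)

def deriv(f, x, h=1e-6):
    """Central finite-difference derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

f = 1.7
product = deriv(theta_inv, theta(f)) * deriv(theta, f)   # ~ 1 by (44)
```

The same cancellation of Jacobians is what turns the weighted kernel (42) into its change-of-variables form.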
2. ATOM PROPERTIES

2.1. General Form

We generalize the form of a causal asymmetric atom as

x[n] = E[n] e^{iωc n},  (1)

where

E[n] = A[n] e^{−αn} u[n]  (2)

is the atom's envelope, α ∈ ℝ≥0 is the damping factor, ωc = 2πfc is the normalized angular frequency of oscillation (0 ≤ fc ≤ 1/2), u[n] is the unit step function, n is discrete time, and A[n] is an attack envelope that distinguishes each atom (A[n] ∈ ℝ≥0 ∀ n ∈ ℕ). We introduce the atoms as discrete time signals because, in practice, dictionaries are composed of finite sampled (discretized) atoms. We establish mathematical properties of the atoms (e.g., their derivatives) from their continuous time counterparts.
In the literature, the formant-wave-function and gammatone are typically defined with real sinusoids rather than complex ones. We choose to adopt a complex form for mathematical ease of manipulation and concise representation in transform domains. Moreover, it is necessary to parametrize phase for real valued but not for complex valued atoms: expansion coefficients from a signal decomposition using complex atoms will be complex and will thus provide both the magnitude and phase [5].

Figure 1: An example envelope of the form (2) overlaid with an exponential envelope (blue), where nI is the influence time and nm is the time location of the envelope maximum.

2.2. Desired Atom Properties

Our comparison criteria are grouped into three categories: time-frequency properties, control & flexibility, and algorithmic efficiency.

2.2.1. Time-Frequency Properties

A dictionary of atoms with varying degrees of time and frequency concentration is important for creating a sparse representation overall. For example, a sustained piano note begins with a short attack, which is best represented with concentrated time (spread frequency) resolution, followed by a long decay, which requires an atom with long time support and a concentrated spectrum. Multi-resolution analysis involves decomposing a signal onto a set of analyzing functions whose time-frequency tiling is non-uniform [11, 12]. We are going one step further by considering that some sounds require excellent time localization in the transient region and concentrated frequency resolution in the decay region. We aim at representing both regions with atoms whose envelopes are closer to those of natural sounds. We quantify concentration in time and frequency by the time spread, T, and frequency spread, B, respectively [13]. The Heisenberg-Gabor inequality states BT ≥ 1.
Moreover, we prefer an atom that has a unimodal spectrum: a spectrum X(ω) is unimodal if |X(ω)| is monotonically increasing for ω ≤ ωc and monotonically decreasing for ω ≥ ωc. A function that is truncated in time with a rectangular window admits a non-unimodal spectrum because the truncation is equivalent to convolving the spectrum with a sinc function whose oscillations introduce multiple local maxima/minima [14]. Multiple local maxima/minima in the spectrum can complicate spectral parameter estimation. We prefer an infinitely differentiable atom (i.e., of class C∞, as defined in [15]) because its spectrum is unimodal.

2.2.2. Control & Flexibility

We modulate the damped sinusoid with A to enhance the atom's adaptability to natural sounds. A damped sinusoid's damping factor α indirectly controls its T and B. Smoothing the damped exponential's initial discontinuity with A concentrates its frequency localization in exchange for a more spread time localization. We want a parametrization of A that enables precise control over its time and frequency characteristics, controllability being an essential aspect of audio synthesis. Furthermore, the attack portion of an audio signal often contains dense spectral content that allows humans to characterize its source [16].
Influence time has a major effect on the atom's overall perceived sound as it controls the degree to which the initial discontinuity is smoothed [16]. We define influence time nI as the duration that A influences the atom: nI is the largest value of n for which |e^{−αn}(A[n] − 1)| > δ is true (in this paper δ = .001, see Figure 1). The effects of varying influence time are intuitively linked to T and B. In the frequency domain, influence time mostly controls the spectral envelope far from its center frequency (skirt width as defined in [3]). Increasing influence time spreads the atom's time localization and concentrates its spectrum.
An important quantity to compare between the atoms is
the time ∆I = nI − nm, where nm is the time location of E's maximum. nm is often called a temporal envelope's attack time in sound synthesis [17]. We find nm by setting E's continuous time derivative equal to zero and solving for n. For a continuous E whose α > 0, nm precedes nI (i.e., A influences E even after nm). To compare atoms along this criterion, we equalize their nm values then compare their ∆I values. ∆I indicates the amount of influence that varying the skirt width will have on the bandwidth. We prefer an atom with a small ∆I value because its 3 dB bandwidth (set through α) is not affected much by the structure of A. An envelope with a small ∆I also reflects those produced by many acoustic instruments: an exciter increases the system's energy and then releases (at nI), which results in a freely decaying resonance.
We do not want to complicate the definition of the atom when modulating the damped sinusoid by A either; we encourage time-domain simplicity. The damped sinusoid's simple definition enables us to solve for its parameters algebraically. Classic parametric estimation techniques can be used to adapt the damped sinusoid to an arbitrary signal [18]. We want to retain these desirable properties even after introducing A. An atom's time-domain simplicity will depend on how its A marries with the complex damped sinusoid. Finally, after modulating the damped sinusoid with A, we want the atom's envelope to match well with those in actual musical signals.

2.2.3. Algorithmic Efficiency

Fast algorithms are one of the focuses of sparse representations research, as they aim to make sparse decomposition processes more tractable. Amid publications dedicated to creating faster algorithms, some reported techniques have become widely adopted [10]. Specifically, certain analytic formulas are known to increase the algorithm speed because they avoid some of the algorithm's most time consuming numerical calculations (e.g., the inner product).
An envelope shape that enables the inner product of two atoms to be expressed as an analytic formula is required for a fast matching pursuit algorithm [1]. A summary of the relevant algorithm steps is included for justification.
In matching pursuit, a dictionary D is compared with a signal s by calculating and storing the inner product ⟨s, gγ⟩ of each atom gγ and the signal. The atom that forms the largest inner product gq is picked as the signal's best fit. Then ⟨gq, gγ⟩ is computed and subtracted from ⟨s, gγ⟩. This continues until some stopping criterion is met.
Dictionary inner products can be calculated and stored once when the dictionary is static. However, when atom parameters are refined within the iterative loop these inner products cannot be precomputed and, therefore, must be computed at each iteration. Numerical calculations of many inner products at every iteration prohibit speed. Analytic formulas make the process tractable.
Another way to increase the efficiency of a sparse decomposition program is to use parametric atoms, then refine atom parameters using an estimator. Finding a more adapted atom at every iteration may require fewer iterations overall. Developing parametric estimation techniques sometimes relies on having an analytic discrete Fourier transform (DFT) formula. For example, in derivative methods, two spectra are divided to solve for one or more variables [18]. We include each atom's analytic DFT formula in [19] and [20].
Finally, [5] explains how recursion may be exploited to calculate the convolution of damped sinusoidal atoms with a signal: since the impulse response of a complex one-pole filter is a damped complex exponential sinusoid, a recursive filter can efficiently calculate the correlation. We provide each atom's Z-transform to indicate its causal filter simplicity and therefore practicality for calculating the correlation. Besides, the Z-transform is useful for source-filter synthesis and auditory filtering.

3. EXISTING FUNCTIONS

3.1. Damped Sinusoid

3.1.1. Background

The damped sinusoid (DS) is essential in audio as it represents a vibrating mode of a resonant structure. The use of a DS model in the context of analysis dates back to Prony's method [21], to our knowledge, and the DS was the first asymmetric atom used in the context of sparse representations [5].

3.1.2. Properties

Staying with the predefined generic atom expression (1):

A_DS[n] = 1  (3)

and thus nm = nI = 0. Its continuous-time Fourier transform is well known,

X_DS(ω) = 1 / (α + i(ω − ωc)),  (4)

as is the DFT [20], and finally, the Z-transform,

X_DS[z] = 1 / (1 − e^{−α+iωc} z^{−1}).  (5)

The DS' spectrum is unimodal but not concentrated.

3.2. Gammatone

3.2.1. Background

Auditory filter models are designed to emulate cochlea processing and are central to applications like perceptual audio
coding, where auditory filters are used to determine which sounds should be coded or not according to auditory masking principles. Auditory filter modeling has a variety of applications in bio-mechanics and psychoacoustic research.
The most popular auditory filter model is the gammatone (GT) filter due to its heritage and simple time domain expression. Originally described in 1960 as a fitting function for basilar displacement in the human ear [8], the gammatone filter was later found to precisely describe human auditory filters, as proven from psychoacoustic data [22]. [9] shows that atoms learned optimally from speech and natural sounds resemble gammatones. Designing gammatone filters remains a focus in audio signal processing [23].
More recently, filter models closely related to the gammatone filter have been proposed, such as the all-pass gammatone filter and the cascade family [24]. Added features of these variants do not overlap with our criteria so they are not included for comparison.

3.2.2. Properties

We assign the gammatone as the prototypical auditory filter model. A single variable polynomial envelope function shapes the gammatone:

A_GT[n] = n^p.  (6)

In the literature, p + 1 is called the filter order. A_GT does not converge (its derivative is strictly positive), and thus admits the largest ∆I in this study, as nm = p/α and nI > 2nm. No part of the gammatone is, strictly speaking, a freely decaying sinusoid (excluding when p = 0, in which case it is a DS), though it asymptotically approaches a DS as n → ∞.
We demonstrate the filter order's effect by applying the Fourier transform frequency differentiation property to express its spectrum parametrized by p:

X_GT(ω) = p! / (α + i(ω − ωc))^{p+1}.  (7)

From its frequency representation, we see that the filter order determines the denominator polynomial order. Finally, referencing the convolution property of the Fourier transform, the gammatone impulse response is a DS convolved with itself p times.
Frequency spread B decreases with respect to the model order, while the time spread T increases. A gammatone of order four (p = 3) correlates best with auditory models [23]. The gammatone's spectrum is unimodal and concentrated.
The attack envelope is not parametrized, and therefore cannot be controlled independently of α. After setting p, controlling the atom is solely through α and ωc. Influence time (or skirt width) is not directly controllable, so one cannot tune the atom to have time concentration in exchange for frequency spread. Thus, the adaptability of this model to a range of sound signal behavior is limited.
We establish an analytic formula for the gammatone's Z-transform that supports an arbitrary integer p > 0:

X_GT[z] = [Σ_{r=1}^{p} ⟨p, r−1⟩ a^r] / (1 − a)^{p+1},  (8)

where a = e^{−α+iωc} z^{−1}, and the Eulerian number ⟨p, r−1⟩ = Σ_{j=0}^{r} (−1)^j (p+1 choose j) (r − j)^p. The gammatone's analytic inner product formula is complicated and described in [20].

3.3. Formant-Wave-Function

3.3.1. Background

In the source-filter model, an output sound signal is considered to be produced by an excitation function sent into a (resonant) filter, referred to as a source-filter pair [3]. Most acoustic instruments involve an exciter, either forced or free, and a resonator [4]. When an instrument's exciter and resonator are independent, or have only a small effect on one another, its sound production mechanism may be described sufficiently with a source-filter model. An example is the voice production system, where excitations produced by glottal pulses are filtered within the vocal tract.
Source-filter synthesis involves sending an excitation function through one or more resonant filters in parallel. The filters are typically one or two pole and defined by their auto-regressive filter coefficients. The excitation function can be an impulse, but is more often an impulse smoothed by a window to emulate natural excitation. The window shape affects the transient portion of the time-domain output from the system, and the skirts of the spectral envelope. The filter coefficients control the shape of the spectral envelope near the resonant peak.
Time-domain formant-wave-function synthesis describes the output of the source-filter model by a single function in the time domain. The amplitude envelope of the function is designed to generically match the output envelope of a source-filter pair: a damped exponential (filter) for which the initial discontinuity is smoothed (excitation). The advantage of this approach is twofold: direct parametrization of the spectral envelope, and efficient synthesis by table lookup [3].
Creating a sustained sound (e.g., a voice) from this model involves filtering a sparse excitation signal made of a (possibly) periodic sequence of short duration signals. Likewise, synthesizing a percussive sound (e.g., a piano) involves summing the output of several resonant filters with comparably long decay times from a single excitation. In the time-domain method, this means that the model synthesizes a signal s as a linear combination of time-shifted resonant filter impulse responses (i.e., time-frequency atoms). Formally,
we express this as s[n] = Σ_{λ,τ} hλ[n − τ] vλ,τ = Dv, where D is a dictionary of atoms hλ,τ[n] = hλ[n − τ] that are indexed by λ and time shift τ, and v contains their amplitude coefficients vλ,τ. Thus, the source-filter model is a sparse synthesis representation.

3.3.2. Properties

The formant-wave-function (FOF) is ubiquitous with time-domain wave-function synthesis. It was introduced for its desirable properties: a concentrated spectral envelope that can be controlled rather precisely using two parameters. The FOF's A is defined as:

AFOF[n] = { (1/2)(1 − cos(βn))  for 0 ≤ n ≤ π/β,
            1                   for π/β < n,        (9)

where β ∈ R>0 controls influence time. Decreasing β increases influence time, nI ≈ π/β, and the time location of the maximum,

nm = (1/β) cos⁻¹((α² − β²)/(α² + β²))        (10)

∆I and α/β are positively correlated.

A raised cosine is an excellent attack shape in terms of concentration; however, since it is piecewise (its value must be held at one after half of a period), some other design criteria suffer.

XFOF(ω) = (β²/2) · (1 + e^(−(π/β)(α+i(ω−ωc)))) / ((α + i(ω − ωc))((α + i(ω − ωc))² + β²))        (11)

The FOF's spectrum is non-unimodal when the piecewise transition occurs within the window of observation. Moreover, it is difficult to estimate the FOF's parameters and its analytic inner product formula is complicated [7].

We establish the FOF's DFT and Z-transform by converting the cosine function into a sum of complex exponentials and using the linearity property:

XFOF[z] = (1/2)(1 − a^N1)/(1 − a) − (1/4)[(1 − (a e^(iβ))^N1)/(1 − a e^(iβ)) + (1 − (a e^(−iβ))^N1)/(1 − a e^(−iβ))]        (12)

where a = e^(−α+iωc) z⁻¹, and N1 = [π/β]. From (12), we see that a FOF filter may be implemented as a sum of three complex pole-zero filters. The time-varying input delay complicates controlling attack shape.

3.5. Recapitulation

Each existing function is missing several desired properties. While the gammatone's unimodal frequency spectrum and time-domain simplicity are appealing, expressing its DFT and inner product is complicated. Most importantly, without a parameter to control influence time, the gammatone is not flexible enough to sparsely represent a variety of signal features. On the other hand, the FOF's attack function enables precise control over its spectral envelope; however, its piecewise construction is problematic: spectral ripples result from a truncation in time, refining its parameters is difficult, and its frequency, Z-transform, and inner product expressions are complicated.

3.6. Towards a New Atom

The starting goal of this paper was to design a C∞ A that is similar to AFOF. While piecewise construction is the reason for the FOF's shortcomings, approximating the raised cosine with a C∞ function does not necessarily improve the situation, because many functions admit complicated frequency-domain and Z-domain formulas once a unit step is introduced. For example, e^(−βn²) has a compact bell shape that seems to be, at first inspection, a good candidate to replace the raised cosine. However, when a unit step is introduced, it admits a non-algebraic Fourier transform expression (a special function defines the imaginary part). Many bell-shaped functions have the same problem (e.g., tanh(βn)²).

On the other hand, there are A options that are simple but have ∆I that are large compared to the FOF's for equal nm. In fact, any C∞ function will have a larger ∆I than the FOF's for equal nm. Therefore, our goal became more specific: define a C∞ A that admits simple mathematical expressions when married with a complex damped exponential, and whose ∆I is close to that of the FOF's for equal nm. After an exhaustive search, we resolved that designing a function to satisfy all of the design criteria is difficult.

4. THE NEW ASYMMETRIC ATOM

We have designed a new atom specifically for sparse audio representations. All the aforementioned criteria were kept in mind when constructing this new atom.
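The FOF expressions above lend themselves to a quick numerical sanity check. The following sketch is ours, not the paper's code; it builds the raised-cosine attack of (9), applies the exponential decay, and compares the discrete envelope peak against the closed-form maximum location (10):

```python
import numpy as np

# FOF magnitude envelope: A_FOF[n] * exp(-alpha * n), with A_FOF the
# raised-cosine attack of eq. (9). The peak location should match eq. (10).
alpha, beta = 0.05, 0.3  # example values chosen for illustration
n = np.arange(4096)
attack = np.where(n <= np.pi / beta, 0.5 * (1.0 - np.cos(beta * n)), 1.0)
envelope = attack * np.exp(-alpha * n)

n_max_numeric = int(np.argmax(envelope))
n_max_formula = (1.0 / beta) * np.arccos(
    (alpha**2 - beta**2) / (alpha**2 + beta**2)
)
print(n_max_numeric, round(n_max_formula, 2))
```

The discrete argmax lands within one sample of the continuous-time formula, which also confirms that the maximum falls inside the attack region 0 ≤ n ≤ π/β for these parameter values.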
Figure 2: REDS filter diagram, where ar = e^(−α−rβ+iωc) and br = (−1)^r (p choose r).

Figure 3: XFOF[ωk] (red) and XREDS[ωk] (blue) for fixed β, where p = 2 on the left, p = 10 on the right, and α = .05. (Axes: amplitude [dB] versus frequency [cycles/sample].)

are achieved. The main idea is that the linearity property of the Fourier transform and Z-transform can be exploited, and each exponential has a transform that is simple and well known.

4.2. Properties

We define the REDS concisely in the time domain by expressing AREDS[n] polynomially as (1 − e^(−βn))^p:

x[n] = (1 − e^(−βn))^p e^(n(−α+iωc)) u[n]        (14)

where β controls the influence time (or skirt width) and p + 1 is the order. Then

nm = (1/β) log(1 + pβ/α)        (15)

and nI ≈ −(1/β) log(1 − (1 − δ)^(1/p)), where δ is the same as in Section 2.2.2.

As in the gammatone model, the order is often constant within an application: we may choose the order, for example, to match auditory data or to approximate a frame condition [23]. Given that the order is a constant, the number of control parameters and their effect are the same as for the FOF. To summarize, the REDS parameter set is a conflation of the source-filter and auditory filter models.

We express the REDS in binomial form to reveal its sum structure:

x[n] = Σ_{r=0}^{p} (−1)^r (p choose r) e^(n(−α−rβ+iωc)) u[n]

A sum of p + 1 complex one-pole filters in parallel will thus output a REDS (see Figure 2).

The REDS has a concentrated and unimodal spectrum. Similarly to the FOF, it is possible to precisely control the REDS' spectrum: by varying β one may exchange concentration in time for concentration in frequency, and vice versa. The FOF has greater time concentration than the REDS because the raised cosine attack function has a fast uniform transition from zero to one, while the REDS attack envelope is bell-shaped. Formally, nI,REDS > nI,FOF when nm,REDS = nm,FOF. The REDS' spectral concentration surpasses the FOF's as p increases (see Figure 3).

We established analytic inner product and convolution formulas for two REDS atoms that support the case when atoms have different lengths N (see [20]). Considering that these formulas for Gabor atoms and FOFs provide an efficiency boost in existing programs [7], and the REDS formulas are simpler than those, we assume that using the formulas is more efficient than numerical computation.

4.3. Connection to Existing Functions

A REDS is approximately equal to a GT when β is very small:

lim_{β→0} (1/β^p) (1 − e^(−βn))^p = n^p        (19)
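The definition (14), the peak formula (15), the binomial filter-bank form of Figure 2, and the gammatone limit (19) can all be checked numerically. The sketch below is ours (parameter values chosen for illustration, not taken from the paper):

```python
import numpy as np
from math import comb

alpha, beta, omega_c, p = 0.005, 0.018, 2 * np.pi * 0.1, 2
n = np.arange(4096)

# REDS in direct form, eq. (14)
x = (1 - np.exp(-beta * n)) ** p * np.exp(n * (-alpha + 1j * omega_c))

# Binomial form: sum of p + 1 complex one-pole impulse responses with
# weights b_r = (-1)^r C(p, r) and poles a_r = exp(-alpha - r*beta + 1j*omega_c)
x_bank = sum(
    (-1) ** r * comb(p, r) * np.exp(n * (-alpha - r * beta + 1j * omega_c))
    for r in range(p + 1)
)

# Envelope maximum per eq. (15)
n_m = (1.0 / beta) * np.log(1 + p * beta / alpha)

# Gammatone limit, eq. (19): (1 - e^{-beta*n})^p / beta^p -> n^p as beta -> 0
tiny = 1e-7
gt_like = (1 - np.exp(-tiny * n)) ** p / tiny ** p
```

The direct and filter-bank forms agree to machine precision, the discrete envelope peak lands within a sample of nm, and for very small β the scaled attack approaches the gammatone's n^p.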
5.1. Synthesis

Sparse synthesis via the source-filter model generally involves sending a short excitation periodically through one or more filters, typically one per formant. In this example, based on [3] (see Table 1), we set p = 2 (see (19)) for better comparison with FOFs, which have demonstrated their aptitude for singing voice synthesis. After normalization (see the normalization factor in [19] or [20]), a gain G (0 ≤ G ≤ 1) tunes the filter output amplitude. Results show the quality of the synthesized sound as well as the ability to match the spectral envelopes, even in the valleys, mainly controlled by β (see [19]).

Table 1: νc is center frequency in Hz and p = 2.

νc   260    1764   2510   3100   3600
α    .005   .006   .006   .009   .011
β    .018   .059   .034   .011   .008
G    1.0    .501   .447   .316   .056

Figure 4: Sparse representation of a vibraphone (transient part shown) from a decomposition onto 50 REDS atoms. (Axes: frequency [Hz] versus time [samples].)

Figure 5: Source-filter sparse representation of a singing voice. Atom spacing expands/contracts reflecting vibrato. (Axes: frequency [Hz] versus time [seconds].)

5.2. Decomposition

We decomposed a set of real audio signals using a standard matching pursuit algorithm¹. We selected the audio signal set to reflect a range of the source-filter model: it includes a vocal sound (sustained, with relatively high damping and a smooth attack per atom), a vibraphone (not sustained, with low damping and a short attack per atom), and a violin (an intermediate situation).

We created damped sinusoid dictionaries to fit each signal's content. Then we made dictionaries for each atom class by modulating the damped sinusoid dictionaries with a set of each atom's A[n]. We superimposed each selected atom's Wigner-Ville distribution to show their time-frequency footprint, as in [6] and [7].

We can represent a signal as time-varying sinusoidal trajectories per the additive model, or as filtered excitation sequences per the source-filter model, by decomposing it onto a dictionary of REDS atoms with constrained damping factors. We chose to demonstrate the ability of the REDS to analyze the signal set from the source-filter viewpoint. For the singing voice, if the dictionary contained atoms with small damping (long time support), then the selected atoms would represent the sinusoidal partials of the signal. We set the damping to be high and, in doing so, successfully extracted the excitation sequence of atoms whose spectral content represented vocal formants rather than the sinusoidal partials (see Figure 5). Regarding the vibraphone, we created a dictionary whose damped sinusoids had large time support with low decay rates.

For each test, the REDS dictionaries provided higher SNR values for the same number of iterations; see Table 2. For the singing voice, the gammatone and REDS were close in performance because the formant time-domain envelopes had very smooth attacks. The REDS matched the vibraphone's envelope tightly, while the gammatone caused pre-echo because of its greater amount of symmetry (see [19]). The reconstructed signal from the REDS decomposition had an SNR of 38.8 dB, and consisted of 50 atoms (.04% of the signal length) (see Figure 4). Our companion website includes audio files for each signal approximation and further details of the decomposition study results [19].

Table 2: Decomposition results. Each audio signal is sampled at 44100 Hz. ds is the signal's duration in seconds. The last four columns give SNR (dB).

         ds     Atoms   N      DS     GT     FOF    REDS
Vocal    1.0    10^4    2^8    30.7   35.9   37.8   38.8
Violin   1.6    10^4    2^9    20.0   14.6   27.7   28.0
Vibes    5.5    50      2^17   17.6   32.1   36.9   37.1

¹Standard in the sense that the dictionary was static and it did not involve fast algorithms or parameter refinements. We implemented the algorithm based on [1].
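The "standard" matching pursuit of the footnote can be summarized in a few lines. The sketch below is ours, not the implementation used in the study; it assumes a static dictionary with unit-norm columns:

```python
import numpy as np

def matching_pursuit(s, D, n_iter):
    """Greedy matching pursuit: at each step, pick the dictionary column most
    correlated with the residual, accumulate its coefficient, and subtract."""
    residual = s.copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_iter):
        corr = D.T @ residual
        k = int(np.argmax(np.abs(corr)))
        coeffs[k] += corr[k]
        residual -= corr[k] * D[:, k]
    return coeffs, residual

# Toy usage: a signal built from two atoms of a random unit-norm dictionary
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 16))
D /= np.linalg.norm(D, axis=0)
s = 2.0 * D[:, 3] + 0.5 * D[:, 7]
coeffs, residual = matching_pursuit(s, D, n_iter=20)
snr_db = 10 * np.log10(np.sum(s**2) / np.sum(residual**2))
```

By construction the model plus residual reconstructs the input exactly, and the SNR reported above is the same figure of merit as in Table 2: signal energy over residual energy after a fixed number of greedy iterations.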
DKL(X|X̂) = X ⊙ log(X ⊘ X̂) − X + X̂        (2)

Z ← Z ⊙ ([∂DKL(X|X̂)/∂Z]⁻ ⊘ [∂DKL(X|X̂)/∂Z]⁺)        (3)

Figure 1: Flowchart of the proposed method for the task of harmonic-percussive sound separation in single-channel music recordings.

3. PROPOSED METHOD

The proposed method uses standard NMF to separate percussive sounds xp(t) from harmonic sounds xh(t) in single-channel music mixtures x(t). The magnitude spectrogram X of a mixture x(t), calculated from the magnitude of the short-time Fourier transform (STFT) using an N-sample Hamming window w(n) and a J-sample hop size, is composed of F frequency bins and T frames. We assume that percussive and harmonic sounds are mixed in an approximately linear manner, that is, x(t) = xp(t) + xh(t) in the time domain or X = Xp + Xh in the magnitude frequency domain. As a result, X can be factorized into two separated spectrograms, X̂p (an estimated spectrogram only composed of percussive sounds) and X̂h (an estimated spectrogram only composed of harmonic sounds).

3.1. Obtaining activations

Standard NMF is applied to the magnitude spectrogram X using the cost function DKL(X|X̂) previously mentioned in Section 2. The update rules are defined as follows,

H ← H ⊙ ((Wᵀ(X ⊘ X̂)) ⊘ (Wᵀ 1F,T))        (5)

W ← W ⊙ (((X ⊘ X̂)Hᵀ) ⊘ (1F,T Hᵀ))        (6)
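The multiplicative updates (5)-(6) are straightforward to implement. A short sketch with our own toy sizes (not the paper's data): each step keeps W and H non-negative, and the KL cost (2) is non-increasing:

```python
import numpy as np

rng = np.random.default_rng(1)
F, T, K = 32, 40, 5
X = np.abs(rng.standard_normal((F, T))) + 1e-3  # stand-in magnitude spectrogram
W = np.abs(rng.standard_normal((F, K))) + 1e-3  # random non-negative init
H = np.abs(rng.standard_normal((K, T))) + 1e-3
ones = np.ones((F, T))

def kl_div(X, Xhat):
    # Elementwise KL divergence of eq. (2), summed over all bins
    return float(np.sum(X * np.log(X / Xhat) - X + Xhat))

losses = []
for _ in range(50):
    H *= (W.T @ (X / (W @ H))) / (W.T @ ones)    # eq. (5)
    W *= ((X / (W @ H)) @ H.T) / (ones @ H.T)    # eq. (6)
    losses.append(kl_div(X, W @ H))
```

The monotone decrease of the cost is what the update rule (3), specialized to the KL divergence, guarantees; it converges to a local minimum only, which is why the initialization matters later in Section 4.2.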
we normalise each ith basis function using the L2-norm, that is, W̃i = Wi / ‖Wi‖₂, so that ‖W̃i‖₂ = 1.0.

Standard NMF can only ensure convergence to local minima, which enables the reconstruction of the mixture but cannot by itself distinguish whether the ith component represents a percussive or a harmonic sound.

R̃L(τ) = RL(τ) / ‖RL(τ)‖₂        (8)

The analysis of R̃L(τ) indicates that R̃4(τ) provides reliable information to discriminate harmonic and percussive sounds, as can be observed in Figure 2 and Figure 4. Figure 2 and Figure 4 show the matrix H of activations of a music excerpt composed of harmonic and percussive sounds. It can be observed that components 4, 5, 9, 14 and 17 of Figure 2 and components 14, 15, 17 and 18 of Figure 4 represent predominant percussive sounds modeled by rhythmic patterns. Both Figure 3 and Figure 5 show that R̃L(τ) still shows a set of peaks even as the order L increases when a percussive sound is analyzed. However, this does not occur when analyzing harmonic sounds, since these tend to be represented using only one valley in a half-cycle waveform when the order L is increased. It can be observed that R̃4(τ) is optimal (see Section 4.4) to discriminate between percussive and harmonic sounds because R̃4(τ) clearly shows a set of peaks for percussive sounds and only one valley in a half-cycle waveform for harmonic sounds. As a result, R̃L<4(τ) could model harmonic sounds as percussive sounds because R̃L<4(τ) can represent harmonic sounds with more than one peak, as shown in Figures 3(b), 3(d), 3(f) and 3(h) and Figures 5(b), 5(d), 5(f) and 5(h). However, R̃L>4(τ) could model percussive sounds as harmonic sounds because R̃L>4(τ) tends to remove most of the peaks, as shown in Figures 3(k) and 3(m) and Figures 5(k) and 5(m).

Figure 3: The left column represents the percussive component 14 of Figure 2: (a) R̃0(τ), (c) R̃1(τ), (e) R̃2(τ), (g) R̃3(τ), (i) R̃4(τ), (k) R̃5(τ), (m) R̃6(τ). The right column represents the harmonic component 1 of Figure 2: (b) R̃0(τ), (d) R̃1(τ), (f) R̃2(τ), (h) R̃3(τ), (j) R̃4(τ), (l) R̃5(τ), (n) R̃6(τ).

Based on the above observation, we use R̃4(τ) as the basis for automatically discriminating between percussive and harmonic sounds. A component obtained from the NMF decomposition is classified as percussive if a set of Np or more peaks is found at the output of R̃4(τ). Preliminary results indicated that the best separation performance was obtained when Np ≥ 2. In any other case, a component is classified as harmonic.

We have developed two approaches based on the classification of the rhythmic activations, as shown in Figure 1. In the first approach, called Proposed_A, ŴpA and ĤpA represent the estimated bases and activations classified as percussive. However, ŴhA and ĤhA represent the estimated bases and activations classified as harmonic.
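The peak-counting classifier can be sketched compactly. This is our own illustration, not the authors' code, and it makes one labeled assumption: RL is taken here to be the autocorrelation applied L times to an activation row, since the paper defines RL in a section not reproduced above.

```python
import numpy as np

def r_tilde(h, L):
    """Order-L lag function: autocorrelation iterated L times (an assumption,
    see the lead-in), normalised by the L2-norm as in eq. (8)."""
    r = np.asarray(h, dtype=float)
    for _ in range(L):
        r = np.correlate(r, r, mode="full")[r.size - 1:]  # keep lags >= 0
    return r / np.linalg.norm(r)

def count_peaks(r):
    # strict interior local maxima
    return int(np.sum((r[1:-1] > r[:-2]) & (r[1:-1] > r[2:])))

def is_percussive(h, L=4, n_p=2):
    # classified percussive if R~_L shows Np or more peaks
    return count_peaks(r_tilde(h, L)) >= n_p

t = np.arange(200)
rhythmic = (t % 25 == 0).astype(float)  # repetitive, drum-like activation row
sustained = np.exp(-t / 60.0)           # single smooth, non-repetitive event
print(is_percussive(rhythmic), is_percussive(sustained))  # True False
```

The repetitive activation keeps peaks at multiples of its period through every autocorrelation pass, while the smooth decaying activation yields a monotone lag function with no interior peaks, matching the behaviour described for Figures 3 and 5.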
Figure 4: Activations H from the excerpt 'So Lonely' (Table 1), using K=20 components.

are denoted as |X̂p|² and |X̂h|², respectively, then each ideally estimated source x̂p(t) or x̂h(t) can be estimated from the mixture x(t) using a generalized time-frequency mask over the STFT domain. To ensure that the reconstruction process is conservative, Wiener filtering has been used as in [5]. In this manner, a percussive or harmonic Wiener mask represents the relative percussive or harmonic energy contribution of each type of sound with respect to the energy of the mixture, defined as follows,

X̂p = (|X̂p|² ⊘ (|X̂p|² + |X̂h|²)) ⊙ X        (9)

X̂h = (|X̂h|² ⊘ (|X̂p|² + |X̂h|²)) ⊙ X        (10)

The estimated percussive and harmonic signals x̂p(t), x̂h(t) are obtained by computing the inverse overlap-add STFT from the final estimated percussive and harmonic magnitude spectrograms X̂p, X̂h using the phase spectrogram of the mixture.
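The masking in (9)-(10) is an elementwise soft split of the mixture. A sketch with random stand-in spectrograms (ours, not the paper's data) showing the conservative property X̂p + X̂h = X:

```python
import numpy as np

rng = np.random.default_rng(2)
F, T = 16, 24
X = np.abs(rng.standard_normal((F, T)))    # mixture magnitude spectrogram
Sp = np.abs(rng.standard_normal((F, T)))   # estimated percussive magnitudes
Sh = np.abs(rng.standard_normal((F, T)))   # estimated harmonic magnitudes

den = Sp**2 + Sh**2 + 1e-12                # small constant guards against 0/0
Xp = (Sp**2 / den) * X                     # eq. (9)
Xh = (Sh**2 / den) * X                     # eq. (10)
```

Because the two masks sum to one in every time-frequency bin, no mixture energy is created or destroyed; the inverse STFT with the mixture phase then yields the time-domain estimates.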
4. EXPERIMENTAL RESULTS

4.1. Data and metrics

Evaluation has been performed using two databases, DBO and DBT. Both databases are composed of single-channel real-world music excerpts taken from the Guitar Hero game [14] [15], as can be seen in Table 1 and Table 2. Each excerpt has a duration of about 30 seconds and was converted from stereo to mono and sampled at fs = 16 kHz.

The database DBO has been used in the optimization and the database DBT has been used in testing. Note that the database used in the optimization is not the same as that used in the testing, to validate the results. The subscript ph relates to percussive and harmonic instrumental sounds without the original vocal sounds. In this case, each percussive signal xp(t) is composed of percussive sounds (drums) and each harmonic signal xh(t) is composed of harmonic instrumental sounds. However, the subscript phv relates to percussive, harmonic and vocal sounds. As a result, each percussive signal xp(t) is composed of percussive sounds (drums) and each harmonic signal xh(t) is composed of harmonic and vocal sounds.

Figure 5: The left column represents the percussive component 17 of Figure 4: (a) R̃0(τ), (c) R̃1(τ), (e) R̃2(τ), (g) R̃3(τ), (i) R̃4(τ), (k) R̃5(τ), (m) R̃6(τ). The right column represents the harmonic component 2 of Figure 4: (b) R̃0(τ), (d) R̃1(τ), (f) R̃2(τ), (h) R̃3(τ), (j) R̃4(τ), (l) R̃5(τ), (n) R̃6(τ).
TITLE                ARTIST
Hollywood Nights     Bob Seger & The Silver Bullet Band
Hotel California     Eagles
Hurts So Good        John Mellencamp
La Bamba             Los Lobos
Make It Wit Chu      Queens Of The Stone Age
Ring of Fire         Johnny Cash
Rooftops             Lostprophets
Sultans of Swing     Dire Straits
Under Pressure       Queen

The assessment of the performance of the proposed method has been carried out using the metrics Source to Distortion Ratio (SDR), Source to Interference Ratio (SIR) and Source to Artifacts Ratio (SAR) [16] [17], which are widely used in the field of sound source separation. Specifically, SDR provides information on the overall quality of the separation process. SIR is a measure of the presence of percussive sounds in the harmonic signal, and vice versa. SAR provides information on the artifacts in the separated signal from separation and/or resynthesis. Higher values of these ratios indicate better separation quality. More details can be found in [16].

4.2. Setup

An initial evaluation was performed taking into account the computation of the STFT in order to optimize the frame size N (1024, 2048 and 4096 samples) using the sampling rate fs previously mentioned. Preliminary results indicated that the best separation performance was achieved using (N, J) = (2048, 256) samples.

A random initialization of the matrices W and H was used, and the convergence of the NMF decomposition was evaluated using Niter iterations. Because standard NMF is not guaranteed to find a global minimum, the performance of the proposed method depends on the initial values of W and H, giving different results. For this reason, we repeated the factorization three times for each excerpt, and the results in the paper are averaged values.

4.3. State-of-the-art methods

Two reference state-of-the-art percussive and harmonic separation methods have been used to evaluate the proposed method: HPSS [1] and MFS [2]. Both were implemented for the evaluation in this paper. The ideal separation, called Oracle, is provided to better compare the quality of the proposed methods. The Oracle separation has been computed using the ideal soft masks extracted from the original percussive and harmonic signals applied to the input mixture.

4.4. Optimization

An optimization of the parameters K, Niter and L is performed on the databases DBOph and DBOphv, as shown in Figure 6 and Figure 7. For this purpose, a hyperparametric analysis is applied to each parameter of the proposed method, as in [5]. In this work, K = (5, 10, 20, 30, 50, 100, 150, 200), Niter = (10, 20, 30, 50, 100, 150) and L = (0, 1, 2, 3, 4, 5, 6).

Figure 6: Optimization of the parameters K and Niter (using L=4), jointly averaging percussive and harmonic SDR: (top) evaluating the database DBOph, (bottom) evaluating the database DBOphv.

Figure 6(top) shows that the parameters K=50 and Niter=100 maximize the average percussive and harmonic SDR when considering only mixtures composed of percussive and harmonic sounds without vocal sounds. It can be observed that using K<20 and Niter<30 provides the worst separation performance of the method. Results suggest that using a small number of components does not allow sufficient separation of repetitive and non-repetitive elements, thereby causing poor separation quality. Further, a small number of iterations does not allow the NMF decomposition to converge.

Figure 6(bottom) shows that K=150 and Niter=30 maximize the average percussive and harmonic SDR when considering only mixtures composed of percussive, harmonic and vocal sounds, and that a higher number of components is necessary to obtain the highest SDR. The effect of adding vocal sounds indicates the presence of a higher variety of spectral patterns, so the proposed method needs a higher number of components to represent the mixture adequately.

Using the optimal parameters on the databases DBOph and DBOphv, the optimization of the parameter L is shown in Figure 7. Comparing the separation performance over the parameter L, L=4 provides the highest robustness in discriminating harmonic and percussive sounds when evaluating audio mixtures with or without vocal sounds. The principal reason for this is that R̃4(τ) retains sufficient peaks in the repetitive basis functions to allow discrimination from the non-repetitive basis functions, which tend to have a single valley at L=4.

Figure 7: Optimization of the parameter L, jointly averaging percussive and harmonic SDR: (top) evaluating the database DBOph using the optimal parameters K=50 and Niter=100; (bottom) evaluating the database DBOphv using the optimal parameters K=150 and Niter=30.

4.5. Results

Figure 8(a) and Figure 8(b) show SDR, SIR and SAR results evaluating the databases DBTph and DBTphv for the proposed method and the reference state-of-the-art methods. Each box represents nine data points, one for each excerpt of the test database. The lower and upper lines of each box show the 25th and 75th percentiles for the database. The line in the middle of each box represents the median value of the dataset. The left (blue), center (red) and right (black) boxes relate to the estimated percussive, harmonic and average between percussive and harmonic signals.

Figure 8(a) indicates that MFS and the proposed method outperform the separation performance of HPSS in percussive and harmonic SDR, but HPSS can be considered a competitive method. Although MFS and Proposed_B exhibit a similar behavior in percussive SDR, MFS (SDR = 5.9 dB) is slightly better than Proposed_B (SDR = 5.0 dB) in harmonic SDR. However, Proposed_B (SIR = 9.8 dB) is slightly better than MFS (SIR = 9.1 dB) considering the average between percussive and harmonic SIR. HPSS obtains the highest percussive SIR at the expense of introducing a high amount of artifacts, as shown by its having the worst percussive and harmonic SAR. This does not occur with either MFS or the proposed methods. Further, HPSS loses most of the transients associated with the beginning of the harmonic sounds, thereby obtaining low harmonic SDR compared to the other methods. Although Proposed_A and Proposed_B show similar separation performance, it seems that Proposed_B is slightly better than Proposed_A. This can be explained because the semi-supervised NMF, initialized with the harmonic bases from Proposed_A, tends to converge to a better solution, obtaining a higher quality reconstruction of the estimated harmonic signals. Informal listening tests suggest that the proposed methods achieve higher polyphonic richness of the harmonic sounds compared to HPSS and MFS because they are able to capture most of the onsets of the harmonic sounds, such as onsets played by bass guitar or lead guitar. Nevertheless, a weakness of the proposed method is that it classifies as percussive those harmonic sounds, e.g., bass guitar, that exhibit a very strong rhythmic pattern, especially if the bass guitar is playing a repeated note which has been factorized together with a part of a percussive sound in the same NMF component.

Figure 8(b) shows the separation performance for the database DBTphv. Comparing with Figure 8(a), it can be observed that the addition of vocal sounds significantly worsens the SDR, SIR and SAR obtained by HPSS and MFS. However, the proposed methods still perform strongly, even with the addition of vocals. Results indicate that Proposed_A and Proposed_B show similar or even better SDR, SIR and SAR results compared to their separation performance on DBTph. Specifically, the proposed methods remove most of the vocal sounds from the estimated percussive signal, while HPSS and MFS do not. In some excerpts, this fact implies that the vocal sounds are unintelligible in the estimated harmonic signal. The promising behavior of the proposed methods with respect to vocal sounds can be explained by the fact that vocal sounds tend to exhibit less repetition, both in terms of melody and in terms of modulations, than other sound sources.

5. CONCLUSIONS AND FUTURE WORK

This paper presents a novel approach for separating harmonic and percussive sounds using only temporal information extracted from activations by means of non-negative matrix factorization. The basic idea is to classify automatically those activations that exhibit rhythmic and non-rhythmic patterns. We assume that a rhythmic pattern models repetitive events, which are typically associated with percussive sounds, whereas a non-rhythmic pattern models non-repetitive events, typically associated with harmonic or vocal sounds. Some of the advantages of this approach are (i) simplicity; (ii) no prior information about the spectral content of the musical instruments; and (iii) no prior training.

Evaluating instrumental mixtures without vocals, results indicate that the proposed methods obtain promising audio separation. The performance of Proposed_B is slightly better than Proposed_A because the update of the semi-supervised NMF allows convergence to a better solution, which provides higher quality reconstruction of the estimated harmonic signals. Moreover, our approach obtains higher polyphonic richness of the harmonic sounds because it captures most of the onsets of the harmonic sounds. However, a weakness of the proposed method is that highly repetitive harmonic instruments can occasionally be classified as percussive.

Evaluating instrumental mixtures containing vocals, results show that the proposed method gives a more robust performance in both percussive and harmonic SDR and SIR compared to the reference state-of-the-art methods. The proposed method extracts most of the vocal sounds from the estimated percussive signal, unlike the other reference state-of-the-art methods evaluated.

Future work will focus on three directions. Firstly, we will try to improve the audio quality of the estimated sources using another time-frequency representation that provides better resolution in the low frequency bands. Secondly, we will attempt to remove residual harmonic sounds that have been factorized in percussive NMF components. Thirdly, we will address how to separate harmonic sounds, e.g., bass guitar, that show temporally repetitive characteristics from percussive sounds which have been factorized in the same NMF component, by looking into other ways of extracting the periodicity of the activations.

6. REFERENCES

[1] N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka, and S. Sagayama, "Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram," in Proceedings of the European Signal Processing Conference (EUSIPCO), 2008, pp. 25–29.
Figure 8: SDR, SIR and SAR results of the proposed methods and the state-of-the-art methods. The left (blue), center (red) and right (black) boxes relate to the estimated percussive, harmonic and average between percussive and harmonic signals. (Panels, top to bottom: SDR, SIR and SAR in dB, for HPSS, MFS, Proposed_A, Proposed_B and Oracle.)
[2] D. Fitzgerald, "Harmonic/percussive separation using median filtering," in Proceedings of Digital Audio Effects (DAFx), 2010.

[3] J. Yoo, M. Kim, K. Kang, and S. Choi, "Nonnegative matrix partial co-factorization for drum source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010, pp. 1942–1945.

[4] M. Kim, J. Yoo, K. Kang, and S. Choi, "Blind rhythmic source separation: Nonnegativity and repeatability," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010.

[5] F. Canadas, P. Vera, N. Ruiz, J. Carabias, and P. Cabanas, "Percussive/harmonic sound separation by non-negative matrix factorization with smoothness/sparseness constraints," Journal on Audio, Speech, and Music Processing, vol. 2014, no. 26, pp. 1–17, 2014.

[6] D. Fitzgerald, A. Liutkus, Z. Rafii, B. Pardo, and L. Daudet, "Harmonic/percussive separation using kernel additive modelling," in 25th IET Irish Signals and Systems Conference, 2014.

[7] J. Driedger, M. Müller, and S. Disch, "Extending harmonic-percussive separation of audio signals," in 15th International Society for Music Information Retrieval Conference (ISMIR), 2014.

[8] J. Park and K. Lee, "Harmonic-percussive source separation using harmonicity and sparsity constraints," in 16th International Society for Music Information Retrieval Conference (ISMIR), 2015.

[9] F. Canadas-Quesada, P. Vera-Candeas, N. Ruiz-Reyes, A. Munoz-Montoro, and F. Bris-Penalver, "A method to separate musical percussive sounds using chroma spectral flatness," in The First International Conference on Advances in Signal, Image and Video Processing (SIGNAL), 2016.

[10] D. Lee and S. Seung, "Algorithms for non-negative matrix factorization," in Proceedings of Advances in Neural Information Processing Systems, 2000, pp. 556–562.

[11] S. Raczynski, N. Ono, and S. Sagayama, "Multipitch analysis with harmonic nonnegative matrix approximation," in 8th International Conference on Music Information Retrieval (ISMIR), 2007.

[12] C. Févotte, N. Bertin, and J. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence with application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.

[13] B. Zhu, W. Li, R. Li, and X. Xue, "Multi-stage non-negative matrix factorization for monaural singing voice separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2096–2107, 2013.

[14] Activision Neversoft, "Guitar Hero 5," http://gh5.guitarhero.com, September 2009.

[15] Activision Neversoft, "Guitar Hero World Tour," http://worldtour.guitarhero.com/us/, November 2008.

[16] C. Févotte, R. Gribonval, and E. Vincent, "BSS_EVAL toolbox user guide - revision 2.0," Technical Report 1706, IRISA, 2005.

[17] E. Vincent, C. Févotte, and R. Gribonval, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
y = s_S + s_T + e = Φα + Ψβ + e,  (2)

which can be stated as a dual-layer sparse regression problem [14],

min_{α,β} { (1/2) ‖y − Φα − Ψβ‖₂² + λ‖α‖₁ + µ‖β‖₁ },  (3)

with sparsity parameters λ, µ > 0. The solution of this problem is approximated by the Iterative Shrinkage/Thresholding Algorithm (ISTA) [13, 19], and more specifically its multi-layered extension [14, 15]:

α^(n+1) = S_λ( α^(n) + Φ*(y − Φα^(n) − Ψβ^(n)) ),
β^(n+1) = S_µ( β^(n) + Ψ*(y − Φα^(n) − Ψβ^(n)) ),  (4)

with iteration index n and initializations α^(0) = β^(0) = 0. Here and in the following, the notation (x)₊ = max(x, 0) denotes the positive part of any x ∈ ℝ, and all operations are understood component-wise, i.e., per time-frequency index γ = (k, l). For a complex-valued vector α, the operator S refers to a shrinkage operation that is usually called soft-thresholding,

S_λ(α) = e^{i arg α} (|α| − λ) for |α| ≥ λ, and 0 for |α| < λ; equivalently, S_λ(α) = α (1 − λ/|α|)₊.  (5)

Whereas the optimization problem in (3) together with ISTA in (4) provides a theoretical footing for sparse multilayer decomposition, this approach has several drawbacks in practical situations.

‖α‖²_{?,γ} = (w ∗ |α|²)_γ.  (8)

The generic choice τ = 1 leads to the Windowed Group Lasso [16], which constitutes a natural extension of the classic Least Absolute Shrinkage/Selection Operator (LASSO) [20], for which ‖·‖_? = |·|. Here, we focus on τ = 2, which withdraws less energy from the signal compared to τ = 1 and has been called the empirical Wiener operator [21] or non-negative garotte shrinkage [22]. For single-layered expansions, previous research has shown that both the inclusion of neighbourhoods that extend across a few coefficients in time and the choice of τ = 2 significantly improve audio noise removal [23] and declipping [24].

2.3. Structured shrinkage via modulation-filtering

Instead of defining the neighbourhood weights directly, they can also be defined in terms of their effect on the modulation spectrum of the shrinkage operation in (6) [25]. Let F₂ denote the two-dimensional discrete Fourier transform on C^{K×L}; then (8) can be directly reformulated as

‖α‖²_? = F₂⁻¹[ F₂(w) · F₂(|α|²) ].  (9)

The x-axis of the resulting modulation spectrum F₂(|α|²) corresponds to temporal modulation measured in Hz, the y-axis to spectral modulation with unit cycles per Hz. This means the choice of the neighborhood w is equivalent to the choice of desired temporal and spectral modulation frequencies to be captured by the
modulation filter W := F₂(w). For example, the usage of a rectangular neighbourhood corresponds to modulation filtering with a two-dimensional sinc function centered at zero, i.e., a form of modulation low-pass filtering.

We here follow previous suggestions [26] and use an additional log-nonlinearity for computing the modulation spectrum:
ii) The level of the transient relative to the stationary component was adjusted between −30 and 0 dB.

iii) For the stationary component, harmonic tone complexes comprised 1 to 50 sinusoids (i.e., corresponding to the sparsity of the stationary signal).

The Gamma envelope was of the form

e(t) = t^(Q−1) exp(−2πbt),

where the order Q = 4 was fixed and t denotes time in seconds. The scale parameter was set as b = (Q − 1)/(2πη), where η is the underlying Gamma distribution's mode (or the tone's rise time). From this, it can be inferred that the effective duration of any resulting sound (with a threshold of −60 dB) amounts to 4.9η. Transients were delayed by 10 ms in order to ensure, even for short transients, a significant overlap with the energy of the slower-rising envelope of the stationary components (i.e., to prohibit the possibility of overly simplistic separation). The additive combination of the stationary and transient components was used in conjunction with a white Gaussian noise floor at a signal-to-noise ratio of 40 dB. Throughout all numerical simulations reported in this paper, audio signals were sampled at 44.1 kHz. Fig. 1 shows an illustration of the stationary and transient components of a test signal.

4.2. Algorithm settings

We chose Gabor dictionaries with a tight Hann (raised cosine) window and a hop size of a quarter of the window length. In order to capture stationary components, Φ was chosen with a window length of 2048 samples (46 ms). For capturing transient components, Ψ was chosen with a window length of 128 samples (3 ms). All simulations were performed with shrinkage exponent τ = 2.

As outlined in Sec. 3.2, three strategies for choosing the thresholds λ^(n) and µ^(n) were considered: i) a fixed threshold at the 80% quantile of each layer's initial analysis coefficients (denoted as fix), ii) a sequence of quantiles, linearly decreasing from the 99% quantile to the 80% quantile of the initial analysis coefficients (quant), and iii) dynamically decreasing thresholds (dyn).

The utility of exploiting inter-coefficient structure was investigated by comparing neighbourhood-based shrinkage (denoted as Neigh.) and modulation-based shrinkage (Modul.) to shrinkage with independent coefficients (Indep.). For shrinkage of the stationary layer S_λ, neighbourhoods comprised two coefficients forward and two coefficients backward of the centre coefficient. For shrinkage of the transient layer S_µ, the neighbourhood extended three coefficients above and three coefficients below the centre coefficient. For modulation-based shrinkage, we chose W as a separable two-dimensional Gaussian distribution centred at the origin of the modulation spectrum. In order to tune in on spectral information for shrinkage of the stationary layer, the Gaussian's standard deviations were set to (1, 0.1) for the spectral scale and temporal rate axes, respectively, and the distribution was evaluated across the range [−1, 1] × [−1, 1]. For the transient layer, we chose the reverse settings (0.1, 1), mainly tuning in on temporal information.

4.3. Results

We measured the performance in terms of the estimation accuracy of the 100th iterate of the above-presented algorithms using the signal-to-distortion ratio (SDR). For the stationary layer s_S, for instance, this corresponds to

SDR(s_S, Φα̂) = 20 log₁₀ ( ‖s_S‖ / ‖s_S − Φα̂‖ ).

We present results in terms of the SDR improvement compared to the unprocessed signal, i.e., ∆SDR = SDR(s_S, Φα̂) − SDR(s_S, y).

Fig. 2 shows the mean ∆SDR values for all 2 × 3 × 3 algorithmic variants, averaged across the three stimulus factors of transient duration, level, and number of sinusoids in the stationary carrier signal (each with four levels, such that every data point corresponds to a mean across 64 test signals). Both for the stationary and transient components, the ISTA and ICS methods yield similar performance in conjunction with fixed thresholds. At the same time, fixed thresholds yield negative ∆SDR values for the stationary layer, which indicates that this strategy has significant shortcomings. Whereas ISTA works best in conjunction with the quantile-based update of the thresholds and fails for the dynamic update, ICS seems to particularly profit from this latter strategy. Furthermore, ∆SDR is generally higher with neighbourhoods compared to the independent handling of coefficients, and best performance is reached for modulation filtering.

Overall, the figure clearly illustrates the superior performance of ICS dyn, with gains of around 10 dB SDR compared to the best-performing variant of ISTA (quant) for both stationary and transient layers. Compared to the initial reference algorithm, the dual-layered ISTA with independent coefficients originally proposed in [14, 15], we were thus able to achieve improvements of more than 20 dB SDR.

In order to additionally compare our algorithm to an orthogonal transform, we used the dual-layered expansion of the Modified Discrete Cosine Transform, which also serves as a dual-layer decomposition example in the Large Time Frequency Analysis Toolbox [29]. This configuration achieved an average ∆SDR_S = −2.8 dB and ∆SDR_T = 4.5 dB, thus markedly worse for transient estimation compared to the most rudimentary variant of our Gabor-dictionary-based ISTA approach with fixed thresholds and independent coefficients.

Fig. 3 depicts a more detailed picture of the performance of the three best-performing algorithmic variants: ISTA and ICS with quantile-based thresholds, and ICS with dynamic threshold update. These figures demonstrate that performance drops as the durations of the transients grow, which is expected, given that the differences in time scale of stationary and transient components increasingly vanish. Yet, most algorithmic variants are still able to achieve a substantial SDR improvement even for the longest transients of 49 ms, which span more than a whole Gabor atom of the stationary layer (46 ms length). The dependency of ∆SDR on the transient level appears to be fairly linear. As expected, the biggest improvements of the stationary ∆SDR occur for conditions with the strongest transients, and vice versa for the transient ∆SDR. Finally, when it comes to the number of sinusoidal components in the stationary carrier signal, that is, its sparsity, the dynamic ICS appears to be particularly robust, whereas the rest of the algorithms yield a sharp drop in performance already for four sinusoids.

Regarding the iterative behavior of the algorithms, Fig. 4 shows an example of the three best-performing algorithmic variants (using modulation filtering) across iterations, as well as the corresponding evolution of thresholds (rightmost panel). It is clearly
[Box plots of mean ∆SDR (dB) for the stationary (left) and transient (right) layers, comparing ISTA and ICS under the threshold sequences fix, quant and dyn.]

Figure 2: Mean SDR improvement across all acoustic conditions for different algorithms (ISTA, ICS), different shrinkage operators (Indep., Neigh., Modul.), and different threshold update strategies (fix, quant, dyn).
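For reference, the dual-layered ISTA baseline of (4)-(5) compared in this figure can be sketched in a few lines of NumPy. This is our own illustrative reimplementation, not the authors' code: Phi and Psi stand for synthesis operators given here as plain matrices, and the stacked dictionary [Phi Psi] is assumed to have operator norm at most one, so that a unit-step iteration is stable.

```python
import numpy as np

def soft_threshold(a, lam):
    """Soft-thresholding, Eq. (5): a * (1 - lam/|a|)_+, element-wise."""
    mag = np.maximum(np.abs(a), 1e-12)
    return a * np.maximum(1.0 - lam / mag, 0.0)

def dual_layer_ista(y, Phi, Psi, lam, mu, n_iter=100):
    """Dual-layered ISTA, Eq. (4): a gradient step on the common residual
    followed by per-layer shrinkage with thresholds lam and mu."""
    alpha = np.zeros(Phi.shape[1])
    beta = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        resid = y - Phi @ alpha - Psi @ beta
        alpha = soft_threshold(alpha + Phi.conj().T @ resid, lam)
        beta = soft_threshold(beta + Psi.conj().T @ resid, mu)
    return alpha, beta
```

The fix, quant and dyn strategies then amount to different schedules for lam and mu across iterations.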
[Six panels of ∆SDR (dB) for the stationary (top row) and transient (bottom row) layers, plotted against transient duration (ms), transient level (dB), and number of sinusoids, for ISTA quant, ICS quant and ICS dyn.]

Figure 3: SDR improvement for estimates of stationary and transient components as a function of the algorithm type (indicated by symbols and line color, see left legend) and inter-coefficient structure (indicated by line type, see right-hand-side legend).
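The stimulus envelope and the evaluation metric behind these curves are straightforward to reproduce. The following is our own minimal sketch, with illustrative parameter values:

```python
import numpy as np

def gamma_envelope(t, eta, Q=4):
    """Gamma envelope e(t) = t^(Q-1) exp(-2 pi b t), with the scale
    b = (Q - 1) / (2 pi eta) chosen so that the mode (rise time) is t = eta."""
    b = (Q - 1) / (2 * np.pi * eta)
    return t ** (Q - 1) * np.exp(-2 * np.pi * b * t)

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio: 20 log10(||s|| / ||s - s_hat||), in dB."""
    return 20.0 * np.log10(np.linalg.norm(reference)
                           / np.linalg.norm(reference - estimate))
```

The plotted ∆SDR is then sdr_db(s, s_hat) − sdr_db(s, y) for the unprocessed mixture y.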
[Three panels: ∆SDR across iterations for the stationary (left) and transient (center) layers, and the evolution of thresholds (right), with curves for ISTA quant, ICS quant and ICS dyn.]

Figure 4: Exemplary iteration. Left and center panels: SDR improvement across 100 iterations for modulation filtering and the different operator types (see color legend). The sound's stationary part comprised 4 sinusoids, and the transient was of 10 ms duration at an RMS level of 0 dB. Rightmost panel: evolution of thresholds for stationary components (solid lines) and transient components (dotted lines). The threshold update is color-coded, see legend.
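The modulation-filtering operators used in Figs. 2-4 rest on the identity (9): circular 2-D convolution of the coefficient energy with the neighbourhood weights w equals pointwise multiplication in the modulation domain. Below is our own NumPy sketch of this step and of the subsequent τ = 2 (empirical Wiener) shrinkage, for a K × L coefficient array A:

```python
import numpy as np

def smoothed_energy(A, w):
    """Eq. (9): F2^{-1}[F2(w) . F2(|A|^2)], i.e. circular 2-D convolution
    of the coefficient energy |A|^2 with the neighbourhood weights w."""
    E = np.abs(A) ** 2
    return np.real(np.fft.ifft2(np.fft.fft2(w) * np.fft.fft2(E)))

def empirical_wiener(A, energy, lam):
    """tau = 2 shrinkage: scale each coefficient by (1 - lam^2/energy)_+."""
    return A * np.maximum(1.0 - lam ** 2 / np.maximum(energy, 1e-12), 0.0)
```

A Gaussian w elongated along the time axis retains temporally persistent (stationary) structure; the transposed choice favours transients.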
visible how, after every 10 iterations, decreases of thresholds are accompanied by increases in ∆SDR. Notably, ICS with the dynamic threshold update appears to converge much faster, likely due to a steeper slope of thresholds across iterations for the dynamic update. Also note that the dynamic threshold update reaches the same absolute threshold values for both layers, which stands in contrast to the other two methods, for which the 80% quantile of the stationary layer is much smaller than that of the transient layer.

Due to their rather heuristic nature, our findings motivate further mathematical and experimental inquiries into the formal roots of the proposed algorithms. For instance, it is currently unclear why ISTA appears to be incompatible with the dynamic threshold update whereas ICS profits substantially from it. It is also questionable whether this gain in performance (from ICS quant to ICS dyn) is due to the update's actual adaptivity or solely based on a steeper decrease of thresholds over iterations. Finally, it is left completely open whether it could be beneficial to replace the STFT dictionary with short window lengths by a wavelet basis, which may be better suited to account for the stochastic nature of transients [10]. On the other hand, the proposed algorithms appear to be robust enough already to be useful for stationary/transient separation in practical situations, as described in the next section.

5. EXAMPLES WITH RECORDED AUDIO

We considered natural recorded instrumental tones produced by a violoncello, a vibraphone, and a harpsichord. Each sound was of 500 ms duration and fundamental frequency 311 Hz, and all sounds were of equal perceptual loudness, as used in [30]. Due to its continuous mode of excitation, the violoncello is a quasi-harmonic sound without a marked attack transient, yet with low-energy noise components stemming from the bow. As an impulsively excited sound, the vibraphone features a strong attack transient at its onset. Interestingly, the harpsichord comprises one transient component at its onset, but also one at the release of its hopper, while other sinusoidal components sustain. Fig. 5 presents the waveform of these three sounds in sequence, their spectrograms, as well as the separation provided by ICS with dynamic threshold update and modulation filtering.¹

For this example, the stationary layer of the ISTA algorithm swallows the transient layers, i.e., the separation fails. On the contrary, the ICS algorithm provides non-zero estimates of transients for all three sounds, even for the onset noise components of the cello bow. As visible in the figure, the vibraphone contains the strongest transient component, which is clearly separated from the remaining sinusoidal components. Finally, the two transients of the harpsichord sound are well separated, even though they fully overlap in time. The latter sound once again illustrates that separation performance does not rely on temporal separation at all, but on the distinct spectrotemporal shapes of stationary and transient signal components.

6. CONCLUSION

In this paper, we presented novel strategies for stationary/transient signal separation. Several shrinkage operators were defined by considering either the energy of time-frequency neighbourhoods or modulation spectrum regions instead of individual coefficients. These shrinkage operators were specifically tuned to the presumed sparsity and persistence properties of stationary and transient components, exploiting the basic observation that stationary components are sparse in frequency and persistent over time, and vice versa for transients. This step extends the usage of structured shrinkage operators to the context of dual-layer decomposition. We also proposed a novel iteration scheme, Iterative Cross-Shrinkage, which appears to work particularly well in conjunction with a dynamic update of the thresholds. In experiments with artificial test signals, the proposed scheme improved stationary/transient separation by surprisingly large margins of about 10 dB SDR compared to the dual-layered ISTA with neighbourhood/modulation persistence. Compared to the dual-layered ISTA with independent coefficients, we were able to achieve improvements of more than 20 dB SDR. In addition, the application of Iterative Cross-Shrinkage to recorded sounds from acoustic musical instruments provided a perceptually convincing separation of transients.

¹ These and additional audio examples can be accessed via http://www.uni-oldenburg.de/en/mediphysics-acoustics/sigproc/research/audio-demos/.

7. REFERENCES

[1] J. W. Beauchamp, "Analysis and synthesis of musical instrument sounds," in Analysis, Synthesis, and Perception of Musical Sounds, J. W. Beauchamp, Ed. Springer, 2007, pp. 1–89.

[2] S. J. Godsill and P. J. Rayner, Digital Audio Restoration. New York, NY: Springer, 2002.

[3] M. Müller, D. P. Ellis, A. Klapuri, and G. Richard, "Signal processing for music analysis," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1088–1110, 2011.

[4] S. Handel, "Timbre perception and auditory object identification," in Hearing, ser. Handbook of Perception and Cognition, B. C. Moore, Ed. San Diego, CA: Academic Press, 1995, vol. 2, pp. 425–461.

[5] C. Laroche, M. Kowalski, H. Papadopoulos, and G. Richard, "A structured nonnegative matrix factorization for source separation," in Proc. of the 23rd European Signal Processing Conference (EUSIPCO). Nice, France: IEEE, 2015, pp. 2033–2037.

[6] C. Laroche, H. Papadopoulos, M. Kowalski, and G. Richard, "Genre specific dictionaries for harmonic/percussive source separation," in Proc. of the 17th International Society for Music Information Retrieval Conference (ISMIR), New York, NY, 2016.

[7] M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies, "Sparse representations in audio and music: from coding to source separation," Proceedings of the IEEE, vol. 98, no. 6, pp. 995–1005, 2010.

[8] G. Evangelista, "Pitch-synchronous wavelet representations of speech and music signals," IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3313–3330, 1993.

[9] P. Polotti and G. Evangelista, "Analysis and synthesis of pseudo-periodic 1/f-like noise by means of wavelets with applications to digital audio," EURASIP Journal on Applied Signal Processing, no. 1, pp. 1–14, 2001.
[Four panels: the original signal spectrogram (top left, Freq. / kHz against Time / s), the time-domain decomposition into stationary and transient layers (top right, Amplitude against Time / s), and the estimated stationary (bottom left) and transient (bottom right) magnitude coefficients over time and frequency bins.]

Figure 5: The ICS algorithm with dynamic threshold update and modulation filtering for a sequence of three recorded musical instrument tones (from left to right: violoncello, vibraphone, harpsichord). The top left panel shows the original signal spectrogram. The top right panel shows the time-domain representation of the separation. The bottom left and right panels show the estimated magnitude coefficients for the stationary (left) and transient (right) components. Axis units are in bins to highlight the different time-frequency resolutions.
[10] L. Daudet and B. Torrésani, "Hybrid representations for audiophonic signal encoding," Signal Processing, vol. 82, no. 11, pp. 1595–1617, 2002.

[11] S. Molla and B. Torresani, "A hybrid scheme for encoding audio signal using hidden Markov models of waveforms," Applied and Computational Harmonic Analysis, vol. 18, pp. 137–166, 2005.

[12] C. Févotte, B. Torresani, L. Daudet, and S. J. Godsill, "Sparse linear regression with structured priors and application to denoising of musical audio," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 1, pp. 174–185, 2008.

[13] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.

[14] G. Teschke, "Multi-frame representations in linear inverse problems with mixed multi-constraints," Applied and Computational Harmonic Analysis, vol. 22, pp. 43–60, 2007.

[15] M. Kowalski, "Sparse regression using mixed norms," Applied and Computational Harmonic Analysis, vol. 27, no. 3, pp. 303–324, 2009.

[16] M. Kowalski and B. Torresani, "Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients," Signal, Image and Video Processing, vol. 3, no. 3, pp. 251–264, 2008.

[17] M. Kowalski, K. Siedenburg, and M. Dörfler, "Social sparsity! Neighborhood systems enrich structured shrinkage operators," IEEE Transactions on Signal Processing, vol. 61, no. 10, pp. 2498–2511, 2013.

[18] M. Dörfler, "Time-frequency analysis for music signals: a mathematical approach," Journal of New Music Research, vol. 30, no. 1, pp. 3–12, 2001.

[19] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.

[20] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 58, no. 1, pp. 267–288, 1996.

[21] S. P. Ghael, A. M. Sayeed, and R. G. Baraniuk, "Improved wavelet denoising via empirical Wiener filtering," in Proceedings of SPIE, vol. 3169, San Diego, CA, 1997, pp. 389–399.

[22] A. Antoniadis et al., "Wavelet methods in statistics: Some recent developments and their applications," Statistics Surveys, vol. 1, pp. 16–55, 2007.

[23] K. Siedenburg and M. Dörfler, "Persistent time-frequency shrinkage for audio denoising," Journal of the Audio Engineering Society (AES), vol. 61, no. 1/2, pp. 29–38, 2013.

[24] K. Siedenburg, M. Kowalski, M. Dörfler et al., "Audio declipping with social sparsity," in Proc. of the IEEE Interna-
ABSTRACT

The virtual exploration of the domain of mechano-acoustically produced sound and music is a long-held aspiration of physical modelling. A physics-based algorithm developed for this purpose combined with an interface can be referred to as a virtual-acoustic instrument; its design, formulation, implementation, and control are subject to a mix of technical and aesthetic criteria, including sonic complexity, versatility, modal accuracy, and computational efficiency. This paper reports on the development of one such system, based on simulating the vibrations of a string and a plate coupled via a (nonlinear) bridge element. Attention is given to formulating and implementing the numerical algorithm such that any of its parameters can be adjusted in real-time, thus facilitating musician-friendly exploration of the parameter space and offering novel possibilities regarding gestural control. Simulation results are presented exemplifying the sonic potential of the string-bridge-plate model (including bridge rattling and buzzing), and details regarding efficiency, real-time implementation and control interface development are discussed.

1. INTRODUCTION

The first challenge that arises from this objective is one of design, i.e. determining what kind of physical model configuration is appropriate as a testbed. Drawing inspiration from several relevant DAFx studies [9, 5, 8, 10] and partly building on earlier ideas [11, 12], the proposed model takes the form of a string and a plate connected by a parameterised bridge element, with a local damper fitted on the string (see Fig. 1). The bridge can be parametrically configured to simulate different types of linear and nonlinear coupling, including mass-like behaviour, spring stiffening and contact phenomena (i.e. rattling and buzzing).
in § 4, which also reports on real-time implementation aspects and on preliminary findings with the initial control configuration. Finally, § 5 offers concluding remarks and future perspectives.

2. STRING-BRIDGE-PLATE MODEL

2.1. System Equations

Consider transverse vibrations (u) of a stiff string and a thin rectangular plate, both under simply supported boundary conditions, coupled via a bridge element of mass m_b with parameterised spring elements, and with local damping applied at z = z_d along the string axis (see Fig. 1). The string is characterised by its length L_s, mass density ρ_s, cross-sectional area A_s, Young's modulus E_s, moment of inertia I_s, and damping factor γ_s of yet to be determined wave-number dependency. The plate is of dimensions L_x × L_y × h_p, mass density ρ_p, and experiences damping according to γ_p. Two of its material properties are encapsulated in the parameter G_p = (E_p h_p³)/(12(1 − ν_p²)), where ν_p is the Poisson ratio. The dynamics of the coupled system are governed by

ρ_s A_s ∂²u_s/∂t² = T_s ∂²u_s/∂z² − E_s I_s ∂⁴u_s/∂z⁴ − γ_s ∂u_s/∂t + ψ_c(z)F₁(t) + ψ_x(z)F_x(t) + ψ_d(z)F_d(t),  (1)

m_b ∂²u_b/∂t² = −r_b ∂u_b/∂t − F₁(t) + F₂(t) + F_b(t),  (2)

ρ_p h_p ∂²u_p/∂t² = −G_p ∇⁴u_p − γ_p ∂u_p/∂t − Ψ_c(x, y)F₂(t),  (3)

where, for κ = c, x, d, ψ_κ(z) = δ(z − z_κ) and Ψ_c(x, y) = δ(x − x_c, y − y_c) are single-point spatial distributions, and ∇⁴ = ∂⁴/∂x⁴ + 2 ∂⁴/(∂x²∂y²) + ∂⁴/∂y⁴ is the biharmonic operator. It is readily seen that (1) models a beam rather than a string when T_s ≪ E_s I_s.

…and α ≥ 1 is a power-law exponent. The specific form of (4) includes various types of linear and nonlinear restoring-force based connections, allowing to parametrically remove or add the pushing or pulling force of each of the springs through adjusting k_ℓ⁺ and k_ℓ⁻, respectively; Fig. 2 shows a few example force-distance curves. Modelling the bridge in this manner facilitates the simulation of a variety of connection configurations, four example cases of which are shown in Fig. 3. Note that while the bridge mass cannot be set to zero in our model, it can nonetheless be made negligible by making it very small compared with the string mass. For audio output, we define the plate momentum at K pickup positions (x_{a,k}, y_{a,k}):

p_{a,k}(t) = ρ_p h_p ∂u_p(x_{a,k}, y_{a,k}, t)/∂t.  (5)

Figure 2: Examples of spring force-distance curves (F [N] against u [µm]). (a): linear spring [k_L = 10⁵, k_ℓ± = 0]. (b): stiffening spring [k_L = 6 × 10⁴, k_ℓ⁺ = k_ℓ⁻ = 4 × 10¹⁴, α = 3]. (c): one-sided spring [k_L = 0, k_ℓ⁺ = 0, k_ℓ⁻ = 1 × 10⁸, α = 1.5]. (d): asymmetric spring [k_L = 0, k_ℓ⁺ = 1 × 10⁸, k_ℓ⁻ = 2 × 10⁷, α = 1.5].

Figure 3: Bridge configuration examples. (a): massless bridge [m_b ≪ ρ_s A_s L_s, k₁± = k₂±]. (b): mass-spring bridge [k₁± = k₂±]. (c): 'flat' bridge [k_L = 0, k₁⁺, k₂⁺, k₂⁻ > 0, k₁⁻ = 0]. (d): rattling bridge [k_L = 0, k₁⁺, k₂⁺ > 0, k₁⁻ = k₂⁻ = 0].

2.2. Modal Expansion

The displacement of each of the two distributed linear sub-systems can be written as a modal expansion:

u_s(z, t) = Σ_{i=1}^{M_s} v_{s,i}(z) ū_{s,i}(t),  (6)

…the string (κ = s) and the plate (κ = p), the modal displacements (ū_κ) in (6) and (7) are governed by second-order differential equations of the form

m_κ ∂²ū_{κ,l}/∂t² = −k_{κ,l} ū_{κ,l}(t) − r_{κ,l} ∂ū_{κ,l}/∂t + F̄_{κ,l}(t),  (9)

where we have introduced a new modal index l, which maps as l = i for the string and as l = (M_y − 1)i + j for the plate, and where the respective modal parameters are

m_s = ρ_s A_s L_s / 2,  k_{s,l} = (L_s/2)(E_s I_s β_l⁴ + T_s β_l²),  r_{s,l} = (L_s/2) γ_s(β_l),  (10)

m_p = ρ_p h_p L_x L_y / 4,  k_{p,l} = (L_x L_y/4) G_p β_l⁴,  r_{p,l} = (L_x L_y/4) γ_p(β_l).  (11)
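The string modal parameters above are easy to evaluate numerically. The following is our own illustrative sketch (not the authors' code), assuming the standard simply supported modal wavenumbers β_l = lπ/L_s and made-up parameter values:

```python
import numpy as np

def string_modal_params(rho, A, L, E, I, T, n_modes):
    """Modal mass and stiffness of the stiff string, per Eq. (10),
    assuming simply supported modes with beta_l = l*pi/L."""
    beta = np.arange(1, n_modes + 1) * np.pi / L
    m_s = rho * A * L / 2.0                            # modal mass
    k_sl = (L / 2.0) * (E * I * beta ** 4 + T * beta ** 2)  # modal stiffness
    f_l = np.sqrt(k_sl / m_s) / (2.0 * np.pi)          # undamped frequencies, Hz
    return m_s, k_sl, f_l
```

For a flexible (ideal) string, E I = 0, this reduces to the harmonic series f_l = (l/(2L))√(T/(ρA)); nonzero bending stiffness stretches the partials inharmonically.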
The modal frequencies are $\omega_{\kappa,l} = \sqrt{k_{\kappa,l}/m_\kappa - \zeta_{\kappa,l}^2}$, and the modal attenuation rates are parameterised here in the convenient low-parameter frequency-dependent form

$$\zeta_{\kappa,l} = \frac{r_{\kappa,l}}{2m_\kappa} = \sigma_{\kappa,0} + \sigma_{\kappa,1}\,\beta_l + \sigma_{\kappa,3}\,\beta_l^3, \quad (12)$$

which retrospectively defines the damping parameters $\gamma_s$ and $\gamma_p$ in equations (1) and (3) as wavenumber dependent. For the two sub-systems, the modal force term in (9) is

$$\bar{F}_{s,l}(t) = \int_0^{L_s} v_{s,l}(z)\left[\psi_c(z)F_1(t) + \psi_x(z)F_x(t) + \psi_d(z)F_d(t)\right]\mathrm{d}z = g_{s,l}F_1(t) + g_{x,l}F_x(t) + g_{d,l}F_d(t), \quad (13)$$

$$\bar{F}_{p,l}(t) = -\int_0^{L_x}\!\!\int_0^{L_y} v_{p,l}(x,y)\,\psi(x,y)\,F_2(t)\,\mathrm{d}y\,\mathrm{d}x = -g_{p,l}F_2(t), \quad (14)$$

where (for $\kappa = x, d$)

$$g_{s,l} = v_{s,l}(z_c), \quad g_{\kappa,l} = v_{s,l}(z_\kappa), \quad g_{p,l} = v_{p,l}(x_c, y_c). \quad (15)$$

For audio output, the plate momenta are modally expanded as

$$p_{a,k} = \rho_p h_p \sum_{l=1}^{M_p} g_{a,k,l}\,\frac{\partial \bar{u}_{p,l}}{\partial t}, \quad (16)$$

where $g_{a,k,l} = v_{p,l}(x_{a,k}, y_{a,k})$ and $M_p = M_x M_y$.

3. NUMERICAL FORMULATION

3.1. Discretisation in Time

The differential equations (9) and (2) are first re-written in first-order form:

$$\frac{\partial \bar{u}_{s,l}}{\partial t} = \frac{\bar{p}_{s,l}}{m_s}, \quad \frac{\partial \bar{p}_{s,l}}{\partial t} = -k_{s,l}\,\bar{u}_{s,l} - r_{s,l}\,\frac{\partial \bar{u}_{s,l}}{\partial t} + g_{s,l}F_1 + g_{x,l}F_x + g_{d,l}F_d, \quad (17)$$

$$\frac{\partial u_b}{\partial t} = \frac{p_b}{m_b}, \quad \frac{\partial p_b}{\partial t} = -r_b\,\frac{\partial u_b}{\partial t} - F_1 + F_2 + F_b, \quad (18)$$

$$\frac{\partial \bar{u}_{p,l}}{\partial t} = \frac{\bar{p}_{p,l}}{m_p}, \quad \frac{\partial \bar{p}_{p,l}}{\partial t} = -k_{p,l}\,\bar{u}_{p,l} - r_{p,l}\,\frac{\partial \bar{u}_{p,l}}{\partial t} - g_{p,l}F_2, \quad (19)$$

where $p_\kappa$ denotes momentum. Gridding time as $u^n = u(n\Delta_t)$, the following sum and difference operators

$$\mu u = u^{n+1} + u^n, \quad \delta u = u^{n+1} - u^n, \quad (20)$$

are then employed to discretise these equations at $t = (n+\tfrac{1}{2})\Delta_t$, which (with $F_x^{n+\frac{1}{2}} = \tfrac{1}{2}\mu F_x$, $F_b^{n+\frac{1}{2}} = \tfrac{1}{2}\mu F_b$) yields

$$\frac{\delta \bar{u}_{s,l}}{\Delta_t} = \frac{\mu \bar{p}_{s,l}}{2m_s}, \quad \frac{\delta \bar{p}_{s,l}}{\Delta_t} = -k_{s,l}^*\,\frac{\mu \bar{u}_{s,l}}{2} - r_{s,l}^*\,\frac{\delta \bar{u}_{s,l}}{\Delta_t} + g_{s,l}^* F_1^{n+\frac{1}{2}} + g_{x,l}^* F_x^{n+\frac{1}{2}} + g_{d,l}^* F_d^{n+\frac{1}{2}}, \quad (21)$$

$$\frac{\delta u_b}{\Delta_t} = \frac{\mu p_b}{2m_b}, \quad \frac{\delta p_b}{\Delta_t} = -r_b\,\frac{\delta u_b}{\Delta_t} - F_1^{n+\frac{1}{2}} + F_2^{n+\frac{1}{2}} + F_b^{n+\frac{1}{2}}, \quad (22)$$

$$\frac{\delta \bar{u}_{p,l}}{\Delta_t} = \frac{\mu \bar{p}_{p,l}}{2m_p}, \quad \frac{\delta \bar{p}_{p,l}}{\Delta_t} = -k_{p,l}^*\,\frac{\mu \bar{u}_{p,l}}{2} - r_{p,l}^*\,\frac{\delta \bar{u}_{p,l}}{\Delta_t} - g_{p,l}^* F_2^{n+\frac{1}{2}}. \quad (23)$$

The damper and spring connection forces ($\ell = 1, 2$) are discretised as

$$F_d^{n+\frac{1}{2}} = -r_d\,\frac{\delta u_d}{\Delta_t}, \quad F_\ell^{n+\frac{1}{2}} = k_L\,\frac{\mu u_\ell}{2} + \frac{\delta V_\ell}{\delta u_\ell}, \quad (24)$$

the latter utilising the nonlinear spring element potentials

$$V_\ell(u_\ell) = \frac{k_\ell^+}{\alpha+1}\,\lfloor u_\ell\rfloor^{\alpha+1} + \frac{k_\ell^-}{\alpha+1}\,\lfloor -u_\ell\rfloor^{\alpha+1}. \quad (25)$$

Note that in (21) and (23) the modal elasticity and damping constants of the string and plate have been substituted as follows:

$$k_{\kappa,l} \rightarrow k_{\kappa,l}^* = \frac{4 m_\kappa\, a_{\kappa,l}^*}{\Delta_t^2}, \quad r_{\kappa,l} \rightarrow r_{\kappa,l}^* = \frac{2 m_\kappa\, b_{\kappa,l}^*}{\Delta_t}, \quad (26)$$

where, with $R_{\kappa,l} = \exp(-\zeta_{\kappa,l}\Delta_t)$ and $\Omega_{\kappa,l} = \cos(\omega_{\kappa,l}\Delta_t)$,

$$a_{\kappa,l}^* = \frac{1 - 2R_{\kappa,l}\Omega_{\kappa,l} + R_{\kappa,l}^2}{1 + 2R_{\kappa,l}\Omega_{\kappa,l} + R_{\kappa,l}^2}, \quad b_{\kappa,l}^* = \frac{1 - R_{\kappa,l}^2}{1 + 2R_{\kappa,l}\Omega_{\kappa,l} + R_{\kappa,l}^2}. \quad (27)$$

As explained in previous work [12], this eliminates numerical dispersion and attenuation at the sub-system resonance frequencies. In addition, each modal weight $g_{\kappa,l}$ ($\kappa = x, s, p$) has been substituted in (21)-(23) with $g_{\kappa,l}^* = W_r(\tfrac{1}{2}\omega_{\kappa,l}/\pi)\, g_{\kappa,l}$, where

$$W_r(f) = \begin{cases} 1 & : f < f_r \\ (f_n - f)/(f_n - f_r) & : f_r \le f < f_n \\ 0 & : f \ge f_n \end{cases} \quad (28)$$

is a frequency window in which $f_n = \tfrac{1}{2}\Delta_t^{-1}$ is the Nyquist frequency and $f_r < f_n$ is the highest mode frequency to be rendered at full amplitude. This allows variation of system parameters over time without mode aliasing and, control rate permitting, without causing significant discontinuities in the output signal. For audio rates of 44.1 kHz or higher, a sensible choice is to set $f_r = 20$ kHz.

3.2. A Vector-Matrix Update Form

Stacking all modal states of the string and plate in column vectors $(\bar{\mathbf{u}}_s^n, \bar{\mathbf{q}}_s^n)$ and $(\bar{\mathbf{u}}_p^n, \bar{\mathbf{q}}_p^n)$, the system modal state vectors can be defined as

$$\bar{\mathbf{u}}^n = [(\bar{\mathbf{u}}_s^n)^T,\ u_b,\ (\bar{\mathbf{u}}_p^n)^T]^T, \quad \bar{\mathbf{q}}^n = [(\bar{\mathbf{q}}_s^n)^T,\ q_b,\ (\bar{\mathbf{q}}_p^n)^T]^T, \quad (29)$$

where $\bar{q}_{\kappa,l} = (\Delta_t/(2m_\kappa))\,\bar{p}_{\kappa,l}$ is a convenient change of variable. The complete discrete-time system can then be written

$$\delta\bar{\mathbf{u}} = \mu\bar{\mathbf{q}}, \quad (30)$$

$$\delta\bar{\mathbf{q}} = -\left(\mathbf{A}\mu + \mathbf{B}\delta\right)\bar{\mathbf{u}} + \Xi\left(\mathbf{G}_c\,\mathbf{F}^{n+\frac{1}{2}} + \mathbf{G}_e\,\mathbf{F}_e^{n+\frac{1}{2}} + \mathbf{G}_d\, F_d^{n+\frac{1}{2}}\right), \quad (31)$$

where $\mathbf{F}^{n+\frac{1}{2}} = [F_1^{n+\frac{1}{2}}\ F_2^{n+\frac{1}{2}}]^T$ and $\mathbf{F}_e^{n+\frac{1}{2}} = [F_x^{n+\frac{1}{2}}\ F_b^{n+\frac{1}{2}}]^T$ are force vectors, and where the matrices

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_s & \mathbf{0}_s & \mathbf{0}_{sp} \\ \mathbf{0} & 0 & \mathbf{0} \\ \mathbf{0}_{ps} & \mathbf{0}_p & \mathbf{A}_p \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{B}_s & \mathbf{0}_s & \mathbf{0}_{sp} \\ \mathbf{0} & b_b & \mathbf{0} \\ \mathbf{0}_{ps} & \mathbf{0}_p & \mathbf{B}_p \end{bmatrix}, \quad (32)$$

contain submatrices $\mathbf{A}_\kappa$ and $\mathbf{B}_\kappa$ that are $M_\kappa \times M_\kappa$ diagonal matrices with diagonal elements $A_{\kappa,l,l} = a_{\kappa,l}^*$ and $B_{\kappa,l,l} = b_{\kappa,l}^*$, respectively, and with $b_b = (r_b\Delta_t)/(2m_b)$. The other matrices in (31) are

$$\mathbf{G}_c = \begin{bmatrix} \mathbf{g}_s & \mathbf{0}_s \\ -1 & 1 \\ \mathbf{0}_p & -\mathbf{g}_p \end{bmatrix}, \quad \mathbf{G}_e = \begin{bmatrix} \mathbf{g}_x & \mathbf{0}_s \\ 0 & 1 \\ \mathbf{0}_p & \mathbf{0}_p \end{bmatrix}, \quad \mathbf{G}_d = \begin{bmatrix} \mathbf{g}_d \\ 0 \\ \mathbf{0}_p \end{bmatrix}, \quad \Xi = \begin{bmatrix} \xi_s\mathbf{I}_s & \mathbf{0}_s & \mathbf{0}_{sp} \\ \mathbf{0} & \xi_b & \mathbf{0} \\ \mathbf{0}_{ps} & \mathbf{0}_p & \xi_p\mathbf{I}_p \end{bmatrix}, \quad (33)$$

where matrices $\mathbf{G}_{c,e,d}$ feature modal column vectors $\mathbf{g}_\kappa$ which have $M_\kappa$ elements defined by (15), $\xi_\kappa = \Delta_t^2/(2m_\kappa)$, and where $\mathbf{I}_\kappa$ are $M_\kappa \times M_\kappa$ identity matrices. By setting $\bar{\mathbf{s}} = \delta\bar{\mathbf{u}} = \mu\bar{\mathbf{q}}$ from (30), equation (31) can be reworked into

$$\bar{\mathbf{s}} = \bar{\mathbf{e}} + \mathbf{H}_c\,\mathbf{F}^{n+\frac{1}{2}} + \mathbf{H}_d\, F_d^{n+\frac{1}{2}}, \quad (34)$$

where

$$\bar{\mathbf{e}} = 2\mathbf{C}\left(\bar{\mathbf{q}}^n - \mathbf{A}\bar{\mathbf{u}}^n\right) + \mathbf{H}_e\,\mathbf{F}_e^{n+\frac{1}{2}}, \quad (35)$$

and

$$\mathbf{H}_c = \begin{bmatrix} \mathbf{h}_s & \mathbf{0}_s \\ -\xi_b & \xi_b \\ \mathbf{0}_p & -\mathbf{h}_p \end{bmatrix}, \quad \mathbf{H}_e = \begin{bmatrix} \mathbf{h}_x & \mathbf{0}_s \\ 0 & \xi_b \\ \mathbf{0}_p & \mathbf{0}_p \end{bmatrix}, \quad \mathbf{H}_d = \begin{bmatrix} \mathbf{h}_d \\ 0 \\ \mathbf{0}_p \end{bmatrix}, \quad \mathbf{C} = \begin{bmatrix} \mathbf{C}_s & \mathbf{0}_s & \mathbf{0}_{sp} \\ \mathbf{0} & (1+b_b)^{-1} & \mathbf{0} \\ \mathbf{0}_{ps} & \mathbf{0}_p & \mathbf{C}_p \end{bmatrix}, \quad (36)$$

with, for $\kappa = s, p$, $\mathbf{C}_\kappa = (\mathbf{I}_\kappa + \mathbf{A}_\kappa + \mathbf{B}_\kappa)^{-1}$, and

$$\mathbf{h}_s = \xi_s\mathbf{C}_s\mathbf{g}_s, \quad \mathbf{h}_p = \xi_p\mathbf{C}_p\mathbf{g}_p, \quad \mathbf{h}_d = \xi_s\mathbf{C}_s\mathbf{g}_d, \quad \mathbf{h}_x = \xi_s\mathbf{C}_s\mathbf{g}_x. \quad (37)$$

Once known, the step vector $\bar{\mathbf{s}}$ is employed to update the modal states. Here $\bar{\mathbf{w}} = \bar{\mathbf{e}} - \theta_d\,\mathbf{H}_d\mathbf{G}_d^T\,\bar{\mathbf{e}}$ and $\Phi_c = \mathbf{H}_c - \theta_d\,\mathbf{H}_d\mathbf{G}_d^T\mathbf{H}_c$, with $\theta_d = r_d/(r_d\,\mathbf{G}_d^T\mathbf{H}_d + \Delta_t)$. Next, the modal coordinate equation (42) is transformed into an equation in the spring connection coordinates by pre-multiplying (42) with $-\mathbf{G}_c^T$, which gives

$$\mathbf{M}\,\mathbf{F}^{n+\frac{1}{2}} = \mathbf{w} - \mathbf{s}, \quad (43)$$

where $\mathbf{w} = -\mathbf{G}_c^T\bar{\mathbf{w}}$, $\mathbf{s} = \delta\mathbf{u}$ (with $\mathbf{u}^n = [u_1^n, u_2^n]^T$), and

$$\mathbf{M} = \mathbf{G}_c^T\Phi_c = \begin{bmatrix} \xi_b + \phi_s & -\xi_b \\ -\xi_b & \xi_b + \phi_p \end{bmatrix}, \quad (44)$$

with $\phi_s = \mathbf{g}_s^T\mathbf{h}_s - \theta_d\,\mathbf{g}_s^T\mathbf{h}_d\,\mathbf{g}_d^T\mathbf{h}_s$ and $\phi_p = \mathbf{g}_p^T\mathbf{h}_p$. From the second equation in (24), one may write

$$\mathbf{F}^{n+\frac{1}{2}} = \lambda(\mathbf{s}) + k_L\left(\tfrac{1}{2}\mathbf{s} + \mathbf{u}^n\right), \quad (45)$$

where $\lambda(\mathbf{s}) = [\lambda_1(s_1), \lambda_2(s_2)]^T$ with, for $\ell = 1, 2$,

$$\lambda_\ell(s_\ell) = \frac{V_\ell(s_\ell + u_\ell^n) - V_\ell(u_\ell^n)}{s_\ell}. \quad (46)$$

After substituting (45) into (43), a system of two coupled nonlinear simultaneous equations is obtained:

$$\mathbf{M}\,\lambda(\mathbf{s}) + \left(\mathbf{I} + \tfrac{1}{2}k_L\mathbf{M}\right)\mathbf{s} + \iota = \mathbf{0}, \quad (47)$$

where $\iota = k_L\mathbf{M}\mathbf{u}^n - \mathbf{w}$. These can be solved iteratively using Newton's method. Once $\mathbf{s}$ is solved, the spring forces are updated from (43), after which the step vector $\bar{\mathbf{s}}$ is calculated with (42). Besides updating the states with (38) and the output with (39), it is also required that the connection displacement vector is updated as $\mathbf{u}^{n+1} = \mathbf{s} + \mathbf{u}^n$.
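As a rough illustration of this per-sample Newton solve, the following sketch implements (46)-(47) for example parameter values. The connection matrix, offset vector and spring constants below are stand-ins (not taken from the paper), and a finite-difference Jacobian is used for brevity rather than the analytic one:

```python
import numpy as np

# Illustrative Newton solve of the 2x2 nonlinear system (47); all values
# below are placeholders, and the Jacobian is approximated numerically.
alpha, k_L = 1.3, 100.0
kplus = np.array([1e4, 1e4])                  # k_l^+ (assumed)
kminus = np.array([1e4, 1e4])                 # k_l^- (assumed)
Mmat = np.array([[2.0, -0.5], [-0.5, 2.0]])   # example connection matrix (44)
u_n = np.array([1e-4, -2e-4])                 # current connection displacements
iota = np.array([1e-3, -1e-3])                # example offset vector

def V(u):
    # nonlinear spring potentials (25), with the bracket <x> = max(x, 0)
    return (kplus * np.maximum(u, 0.0)**(alpha + 1)
            + kminus * np.maximum(-u, 0.0)**(alpha + 1)) / (alpha + 1)

def lam(s):
    # lambda_l(s_l) = (V_l(s_l + u_l^n) - V_l(u_l^n)) / s_l, eq. (46)
    s = np.where(s == 0.0, 1e-30, s)          # guard the division
    return (V(s + u_n) - V(u_n)) / s

def G(s):
    return Mmat @ lam(s) + (np.eye(2) + 0.5 * k_L * Mmat) @ s + iota

s = np.full(2, 1e-6)                          # start away from s = 0
for _ in range(40):
    g = G(s)
    if np.linalg.norm(g) < 1e-10:
        break
    J = np.empty((2, 2))                      # central-difference Jacobian
    for j in range(2):
        e = np.zeros(2); e[j] = 1e-9
        J[:, j] = (G(s + e) - G(s - e)) / 2e-9
    s = s - np.linalg.solve(J, g)
```

Because the chord function $\lambda_\ell$ of a convex potential is non-decreasing, the system is monotone and the iteration converges rapidly for reasonable states; the uniqueness argument is formalised in Section 3.4 of the paper.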
3.4. Uniqueness and Convergence
Table 1: Tunable parameters, the constraints imposed upon them for real-time operation, and example values.

STRING                    | constraint                             | Fig. 4    | Fig. 5        | Fig. 6
fundam. freq. [Hz]        | $0 < \tilde{f}_s < \pi/\Delta_t$       | 100       | 47.3          | 80
inharmonicity coeff.      | $0 \le B_s$                            | $10^{-5}$ | $10^{-5}$     | $10^{-5}$
damping [s$^{-1}$]        | $0 \le \sigma_{s,0}$                   | 1         | 2             | 0.5
damping [m/s]             | $0 \le \sigma_{s,1}$                   | $10^{-3}$ | $4\cdot10^{-4}$ | $10^{-2}$
damping [m$^3$/s]         | $0 \le \sigma_{s,3}$                   | $10^{-5}$ | $4\cdot10^{-6}$ | $10^{-4}$
damper [s$^{-1}$]         | $0 \le \sigma_d$                       | 500       | 0             | 0
connection position       | $0 < z_c' < 1$                         | 0.87      | 0.93          | 0.98
damper position           | $0 < z_d' < 1$                         | 0.99      | 0.99          | 0.99
excitation position       | $0 < z_x' < 1$                         | 0.07      | 0.5           | 0.5
BRIDGE                    |                                        |           |               |
modal mass ratio          | $10^{-4} < R_{bs}$                     | 6, 0.6    | $10^{-4}$     | 1
damping [s$^{-1}$]        | $0 \le \zeta_b$                        | 1         | 2             | $10^{-2}$

3.5. Stability

The system energy at time instant $t = n\Delta_t$ is the sum of the modal energies of the string and plate plus the kinetic energy of the bridge mass and the potential energies in the linear and nonlinear spring elements:

$$H^n = \sum_{l=1}^{M_s}\left[\frac{(\bar{p}_{s,l}^n)^2}{2m_s} + \frac{k_{s,l}^*(\bar{u}_{s,l}^n)^2}{2}\right] + \sum_{l=1}^{M_p}\left[\frac{(\bar{p}_{p,l}^n)^2}{2m_p} + \frac{k_{p,l}^*(\bar{u}_{p,l}^n)^2}{2}\right] + \frac{(p_b^n)^2}{2m_b} + \frac{k_L}{2}\left((u_1^n)^2 + (u_2^n)^2\right) + V_1(u_1^n) + V_2(u_2^n). \quad (49)$$

Given that all the individual terms in (49) are non-negative, it follows that $H^n \ge 0$. In vector-matrix form, $H^n$ can be written

$$H^n = (\bar{\mathbf{q}}^n)^T\,\Xi^{-1}\,\bar{\mathbf{q}}^n + (\bar{\mathbf{u}}^n)^T\,\Xi^{-1}\mathbf{A}\,\bar{\mathbf{u}}^n + \frac{k_L}{2}\,\|\mathbf{u}^n\|^2 + \|V(\mathbf{u}^n)\|_1.$$
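A quick numerical sanity check of the non-negativity claim (a sketch only; the state and parameter values below are random placeholders, not taken from Table 1):

```python
import numpy as np

# Sanity check that the energy (49) is non-negative for an arbitrary state.
# All parameter values and states here are random placeholders.
rng = np.random.default_rng(0)
Ms, Mp = 8, 12
m_s, m_p, m_b, k_L = 0.01, 0.05, 1e-4, 1e3
ks = rng.uniform(1e2, 1e6, Ms)          # effective modal stiffnesses k*_{s,l}
kp = rng.uniform(1e2, 1e6, Mp)          # effective modal stiffnesses k*_{p,l}

p_s, u_s = rng.normal(size=Ms), rng.normal(size=Ms)
p_p, u_p = rng.normal(size=Mp), rng.normal(size=Mp)
p_b, u1, u2 = rng.normal(), rng.normal(), rng.normal()

def V(u, kpl=1e4, kmi=1e4, alpha=1.3):
    # one-sided nonlinear spring potential (25)
    return (kpl * max(u, 0.0)**(alpha + 1)
            + kmi * max(-u, 0.0)**(alpha + 1)) / (alpha + 1)

H = (np.sum(p_s**2 / (2 * m_s) + ks * u_s**2 / 2)
     + np.sum(p_p**2 / (2 * m_p) + kp * u_p**2 / 2)
     + p_b**2 / (2 * m_b)
     + k_L / 2 * (u1**2 + u2**2) + V(u1) + V(u2))
assert H >= 0.0                          # every term in (49) is non-negative
```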
Figure 4: Example of linear coupling between the string, bridge and plate (see Table 1 for the model parameters). (a): large bridge mass ($R_{bs} = 6$), plate momentum $p_a$ [g m s$^{-1}$] over time [s]. (b): small bridge mass ($R_{bs} = 0.6$). (c): corresponding magnitude spectra [dB] over frequency (Hz). The upper curve ($R_{bs} = 0.6$) is offset by 50 dB for clarity. The vertical dotted lines indicate the positions of the string mode frequencies.

Figure 5: Example of plate-like sounds via strong string-plate coupling (see Table 1 for the model parameters). Left: plate momentum spectrogram for $\eta = 0$. Right: plate momentum spectrogram for $\eta = 1$.

Figure 6: Example of a rattling bridge (see Table 1 for the model parameters). Top: string excitation force signal $F_x$ [N]. Middle: plate momentum $p_a$ [g m s$^{-1}$] monitored at the pick-up position. Bottom: displacement at the connection point of the string, plate, and bridge mass.

certain levels which would cause an excessive amount of iterations.

The simplest configuration of the bridge is to use strictly linear spring connections (i.e. $\eta = 0$). The plots in Fig. 4 show signals and magnitude spectra of such a case when driving the string with a short smooth pulse. With a relatively large bridge mass (setting $R_{bs} = 6$), strong standing waves develop in the string, and the transfer of energy is largely unidirectional, i.e. from the string to the plate, much in the way conventional string instruments operate. As can be seen in the corresponding spectrum (lower curve in Fig. 4(c)), the output also exhibits plate modes, which are more damped than the string resonances. Setting the bridge mass to a smaller value ($R_{bs} = 0.6$) leads to increased coupling between the plate and the string, and the system energy is then dissipated more quickly (see Fig. 4(b)), yielding an output in which the string modes are less dominant over the plate modes (see upper curve of Fig. 4(c)).

The notion of effecting strong string-plate coupling can be taken to its extreme by setting $R_{bs}$ to a negligibly small value and $R_{ps}$ to unity, while using similar damping values for the plate and string. The coupling is now very much bidirectional, and any standing waves on the string no longer correspond to the string eigenmodes. Although different from that of the modelled plate, the distribution of system modes across the frequency axis is nevertheless similar to that of a plate, so the perceived sound output is plate-like. Hence in this configuration, the string is effectively merely a mechanical interface to a plate-like instrument. An interesting feature to explore with this type of connection is to introduce nonlinear behaviour by varying the $\eta$ parameter. Fig. 5 shows the spectrograms resulting for $\eta = 0$ (linear springs) and $\eta = 1$ (fully nonlinear springs), with $\alpha = 3$. As can be seen, the second case exhibits quick growth of densely spaced high-frequency partials, indicating strong nonlinear inter-mode coupling, and yielding a somewhat cymbal-like sound output.

An altogether different type of nonlinear behaviour is explored when imparting asymmetries in either or both of the spring connections, which introduces rattling effects. For example, setting $G_1^- = G_2^- = 0$ and $G_1^+ = G_2^+ = 1$ amounts to a bridge that is in contact with the plate and the string at equilibrium but that, in the absence of forces that pull it back to either object when it moves away from them, is free to rattle during vibration. Special effects are obtained when letting the string vibrate at a low frequency, thus enabling the generation of slowly evolving rattling patterns which partly play out at sub-sonic rates. An example case is demonstrated in Fig. 6, with the lower plot showing the complex motion of the bridge mass and its impactive interaction
2.2. Modal version

In order to eliminate numerical dispersion a modal framework is used. This involves using the equation

$$y(x,t) = \sum_{i=1}^{M} v_i(x)\,\bar{y}_i(t), \quad (2)$$

where $\bar{y}_i(t)$ are the modal displacements, $v_i(x) = \sin(\beta_i x)$ are the modal shapes and $\beta_i = \frac{i\pi}{L}$. For this to be a perfect representation of the string, the number of modes ($M$) would have to be infinite.

After substituting equation (2) into (1) and integrating spatially over the string, the following equation is found, which describes the dynamics of a single mode:

$$m\,\frac{\partial^2\bar{y}_i(t)}{\partial t^2} = -k_i\,\bar{y}_i(t) - r_i\,\frac{\partial\bar{y}_i(t)}{\partial t} + \sum_\star \bar{F}_{\star,i}(t), \quad (3)$$

where $m = \frac{\rho A L}{2}$, $k_i = \frac{L}{2}\left(T_0\beta_i^2 + EI\beta_i^4\right)$, $r_i = \frac{L}{2}\gamma(\beta_i)$ and $\star \in \{tm, c, b, e\}$. $\gamma(\beta_i)$ is now expressed as

$$\gamma(\beta_i) = 2\rho A\left(\sigma_0 + (\sigma_1 + \sigma_3\beta_i^2)\,|\beta_i|\right). \quad (4)$$

$\sigma_0$, $\sigma_1$ and $\sigma_3$ shape the frequency dependence to match that of a real string as closely as possible [14]. The modal driving forces $\bar{F}_{\star,i}$ are derived in different ways according to how the physical effect being considered is modelled. Each one will be explained in its relevant section.

Evaluating equation (9) and converting to the modal form, the following equations can be found:

$$\bar{F}_{z,i}(t) = g_{z,i}\,F_z(t), \quad (10)$$

and

$$y_z(t) = \sum_{i=1}^{M} g_{z,i}\,\bar{y}_i(t), \quad (11)$$

where

$$g_{z,i} = \frac{\pi^2\sin(\beta_i x_z)\cos\!\left(\frac{\beta_i w_z}{2}\right)}{\pi^2 - \beta_i^2 w_z^2}. \quad (12)$$

2.4. Discretisation in time

To solve equations (5) and (6) the following discretisation operators are used to find the system dynamics:

$$\delta\phi(t) = \phi^{n+1} - \phi^n \approx \Delta_t\left.\frac{\partial\phi(t)}{\partial t}\right|_{t=(n+\frac{1}{2})\Delta_t} \quad (13)$$

$$\mu\phi(t) = \phi^{n+1} + \phi^n \approx 2\,\phi|_{t=(n+\frac{1}{2})\Delta_t}, \quad (14)$$

using $\Delta_t = 1/f_s$. When discretising equations (5) and (6) it is more compact to use the substitution $\bar{q}_i^n = \frac{\Delta_t}{2m}\,\bar{p}_i^n$ to give the equations

$$\delta\bar{y}_i^{n+\frac{1}{2}} = \mu\bar{q}_i^{n+\frac{1}{2}} \quad (15)$$

$$\delta\bar{q}_i^{n+\frac{1}{2}} = -a_i^*\,\mu\bar{y}_i^{n+\frac{1}{2}} - b_i^*\,\delta\bar{y}_i^{n+\frac{1}{2}} + \xi\sum_\star\bar{F}_{\star,i}^{n+\frac{1}{2}} \quad (16)$$
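The modal parameters in (3)-(4) can be evaluated directly; the sketch below uses the steel C3 string values from Table 1 of this paper, and the resulting fundamental lands close to C3 ($\approx$ 130.8 Hz), which is a useful sanity check:

```python
import numpy as np

# Sketch of the modal parameters in (3)-(4), using the steel C3 string
# values from Table 1 of this paper.
L, rho, A = 1.0, 7850.0, 6.16e-8
E, I, T0 = 2.0e11, 3.02e-16, 33.1
sig0, sig1, sig3 = 0.6, 6.5e-3, 5e-6

M = 100
i = np.arange(1, M + 1)
beta = i * np.pi / L                                  # v_i(x) = sin(beta_i x)

m = rho * A * L / 2.0                                 # modal mass
k = (L / 2.0) * (T0 * beta**2 + E * I * beta**4)      # modal stiffness
r = (L / 2.0) * 2 * rho * A * (sig0 + (sig1 + sig3 * beta**2) * np.abs(beta))  # via (4)

f = np.sqrt(k / m) / (2.0 * np.pi)                    # undamped partial frequencies [Hz]
# f[0] lands close to C3 (about 130.8 Hz) with these values
```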
where $R_i = \exp(-\alpha_i\Delta_t)$ and $\Omega_i = \cos(\omega_i\Delta_t)$.

To solve equations (15) and (16) computationally they are rewritten in vector form:

$$\delta\bar{\mathbf{y}}^{n+\frac{1}{2}} = \mu\bar{\mathbf{q}}^{n+\frac{1}{2}} \quad (18)$$

$$\delta\bar{\mathbf{q}}^{n+\frac{1}{2}} = -\mathbf{A}\,\mu\bar{\mathbf{y}}^{n+\frac{1}{2}} - \mathbf{B}\,\delta\bar{\mathbf{y}}^{n+\frac{1}{2}} + \xi\sum_\star\bar{\mathbf{F}}_\star^{n+\frac{1}{2}} \quad (19)$$

In these equations $A_{ii} = a_i^*$ and $B_{ii} = b_i^*$, with all off-diagonal elements in both matrices being zero. Using the same procedure taken in previous papers [14] of setting $\bar{\mathbf{s}} = \delta\bar{\mathbf{y}}^{n+\frac{1}{2}} = \mu\bar{\mathbf{q}}^{n+\frac{1}{2}}$ and then using the relations

$$\bar{\mathbf{y}}^{n+1} = \bar{\mathbf{s}} + \bar{\mathbf{y}}^n \quad\text{and}\quad \bar{\mathbf{q}}^{n+1} = \bar{\mathbf{s}} - \bar{\mathbf{q}}^n, \quad (20)$$

equation (19) can be re-written as

$$\mathbf{G} = (\mathbf{I} + \mathbf{A} + \mathbf{B})\,\bar{\mathbf{s}} - 2\left(\bar{\mathbf{q}}^n - \mathbf{A}\bar{\mathbf{y}}^n\right) - \xi\sum_\star\bar{\mathbf{F}}_\star^{n+\frac{1}{2}} = \mathbf{0}. \quad (21)$$

Once the tension modulation, bridge, thread and input modal forces are discretised properly and put into this equation, it can be solved iteratively using Newton's method. The Jacobian required for this is

$$\mathbf{J} = \mathbf{J}_s - \xi\sum_\star\mathbf{J}_\star, \quad (22)$$

where

$$\mathbf{J}_s = \mathbf{I} + \mathbf{A} + \mathbf{B}, \quad (23)$$

and the $\mathbf{J}_\star$ are the parts of the Jacobian due to the other forces. All of the parts of the Jacobian will be derived below.

To prevent spectral fold-over in the model, the number of modes ($M$) must be limited, setting the sizes of $\bar{\mathbf{y}}$ and $\bar{\mathbf{q}}$, so that the frequency of the highest partial is below the Nyquist frequency set by the sampling frequency.

2.5. Tension modulation

2.5.1. Continuous domain

The force density present in the string due to tension modulation is represented by the following equation:

$$F_{tm}(x,t) = \frac{EA}{2L}\int_0^L\left(\frac{\partial y}{\partial x}\right)^2\mathrm{d}x\;\frac{\partial^2 y}{\partial x^2}. \quad (24)$$

Following the procedure for converting to the modal domain, this can be rewritten as

$$F_{tm}(x,t) = -\frac{EA}{4}\left[\sum_{i=1}^{M}\bar{y}_i(t)\,v_i(x)\,\beta_i^2\right]\left[\sum_{j=1}^{M}(\beta_j\bar{y}_j(t))^2\right]. \quad (25)$$

This is found using the orthogonal nature of the sinusoidal basis set within the integral. From this the desired modal force can be derived by multiplying by the modal shape and integrating over the string, giving

$$\bar{F}_{tm,i}(t) = -\Gamma\,\bar{y}_i(t)\,\beta_i^2\left[\sum_{j=1}^{M}(\beta_j\bar{y}_j(t))^2\right], \quad (26)$$

where $\Gamma = \frac{EAL}{8}$. This equation shows how the modes are mixed, with tension modulation having the effect of diffusing energy between modes. This is then rewritten in vector form to give

$$\bar{\mathbf{F}}_{tm} = -\Gamma\,(\beta^2\bar{\mathbf{y}})\,\bar{\mathbf{y}}^\intercal\beta^2\bar{\mathbf{y}}, \quad (27)$$

where $\beta^2$ is a diagonal $M\times M$ matrix with entries $\beta_i^2$ along the diagonal.

2.5.2. Discretisation of tension modulation equation in time

Equation (27) can be discretised in three ways, and equation (28) shows the option that is used. This choice was made as it preserves the uniqueness of the solution to the iterative solver; this will be proven in Section 3.1.

$$\bar{\mathbf{F}}_{tm}^{n+\frac{1}{2}} = -\frac{\Gamma}{4}\left(\mu(\bar{\mathbf{y}}^\intercal\beta^2\bar{\mathbf{y}})\right)\beta^2\,\mu\bar{\mathbf{y}}. \quad (28)$$

After the discretisation operator is applied and using equation (20), the following equation is obtained:

$$\bar{\mathbf{F}}_{tm}(\bar{\mathbf{s}}) = -\frac{\Gamma}{4}\left((\bar{\mathbf{y}}^n+\bar{\mathbf{s}})^\intercal\beta^2(\bar{\mathbf{y}}^n+\bar{\mathbf{s}}) + (\bar{\mathbf{y}}^n)^\intercal\beta^2\bar{\mathbf{y}}^n\right)\beta^2\left(2\bar{\mathbf{y}}^n+\bar{\mathbf{s}}\right). \quad (29)$$

This can then be inserted into equation (21). For ease of differentiation, (29) is rewritten as

$$\bar{\mathbf{F}}_{tm}(\bar{\mathbf{s}}) = -\frac{\Gamma}{4}\left(\alpha_2(\bar{\mathbf{s}})\,\mathbf{U}(\bar{\mathbf{s}}) + \alpha_1\,\mathbf{U}(\bar{\mathbf{s}})\right), \quad (30)$$

where

$$\mathbf{U}(\bar{\mathbf{s}}) = \beta^2\left(2\bar{\mathbf{y}}^n + \bar{\mathbf{s}}\right) \quad (31)$$

$$\alpha_2(\bar{\mathbf{s}}) = (\bar{\mathbf{y}}^n + \bar{\mathbf{s}})^\intercal\beta^2(\bar{\mathbf{y}}^n + \bar{\mathbf{s}}) \quad (32)$$

$$\alpha_1 = (\bar{\mathbf{y}}^n)^\intercal\beta^2\bar{\mathbf{y}}^n \quad (33)$$

To differentiate $\alpha_2(\bar{\mathbf{s}})$ with respect to $\bar{\mathbf{s}}$, the vector calculus rule

$$\frac{\partial\, \mathbf{g}(\mathbf{x})^\intercal\mathbf{C}\,\mathbf{h}(\mathbf{x})}{\partial\mathbf{x}} = \mathbf{g}(\mathbf{x})^\intercal\mathbf{C}\,\frac{\partial\mathbf{h}(\mathbf{x})}{\partial\mathbf{x}} + \mathbf{h}(\mathbf{x})^\intercal\mathbf{C}^\intercal\,\frac{\partial\mathbf{g}(\mathbf{x})}{\partial\mathbf{x}}, \quad (34)$$

is used. In the equation being considered, $\beta^2$ (which corresponds to $\mathbf{C}$) is diagonal, so it is equal to its own transpose. This gives

$$\frac{\partial\alpha_2(\bar{\mathbf{s}})}{\partial\bar{\mathbf{s}}} = 2\left(\bar{\mathbf{s}} + \bar{\mathbf{y}}^n\right)^\intercal\beta^2. \quad (35)$$

Differentiating $\mathbf{U}(\bar{\mathbf{s}})$ with respect to $\bar{\mathbf{s}}$ gives

$$\frac{\partial\mathbf{U}(\bar{\mathbf{s}})}{\partial\bar{\mathbf{s}}} = \beta^2. \quad (36)$$

Then using the product rule, the Jacobian of the tension-modulation part of the $\mathbf{G}$ function can be written as

$$\mathbf{J}_{tm}(\bar{\mathbf{s}}) = -\frac{\Gamma}{4}\left(\alpha_2\,\beta^2 + \mathbf{U}(\bar{\mathbf{s}})\,2\left(\bar{\mathbf{s}} + 2\bar{\mathbf{y}}^n\right)^\intercal\beta^2 + \alpha_1\,\beta^2\right). \quad (37)$$

As $\mathbf{J}_{tm}$ is a full matrix, the addition of the tension modulation significantly increases computation time, due to the linear-system solve required in each iteration.

It has been shown by Bilbao [8] that, using a similar modal formulation, significant pitch glide in the wrong direction can arise due to tension modulation if the number of modes used creates partials up to the Nyquist frequency. However, this spurious effect could not be replicated in the present model, even for extreme amplitudes.

2.6. Thread

2.6.1. Continuous domain

In the model the thread is treated as described in Section 2.3. The equation for the force is

$$F_c(t) = K_c\left(h_c - y_c(t)\right) - R_c\,\frac{\partial y_c(t)}{\partial t}, \quad (38)$$
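Stepping back to the tension-modulation expressions, the force (29) and Jacobian (37) can be exercised numerically. In the sketch below the modal state is a random placeholder and $L = 1$ is assumed for $\beta_i$; note that $-\mathbf{J}_{tm}$ comes out symmetric positive semi-definite, which is the property used in the uniqueness proof of Section 3.1:

```python
import numpy as np

# Numerical sketch of the tension-modulation force (29) and Jacobian (37).
# The modal state is a random placeholder and L = 1 is assumed for beta_i.
rng = np.random.default_rng(1)
M, Gamma = 6, 0.05
beta2 = np.diag((np.arange(1, M + 1) * np.pi)**2)     # diagonal beta^2 matrix
yn = 1e-4 * rng.normal(size=M)                        # modal state ybar^n
s = 1e-5 * rng.normal(size=M)                         # step vector sbar

def F_tm(s):
    # eq. (29)
    a = yn + s
    coeff = a @ beta2 @ a + yn @ beta2 @ yn
    return -Gamma / 4.0 * coeff * (beta2 @ (2 * yn + s))

def J_tm(s):
    # eq. (37), with U(s) = beta^2 (2 yn + s)
    U = beta2 @ (2 * yn + s)
    alpha2 = (yn + s) @ beta2 @ (yn + s)
    alpha1 = yn @ beta2 @ yn
    return -Gamma / 4.0 * (alpha2 * beta2
                           + np.outer(U, 2 * beta2 @ (s + 2 * yn))
                           + alpha1 * beta2)

F = F_tm(s)
J = J_tm(s)
assert np.allclose(J, J.T)                            # -J_tm is symmetric...
assert np.all(np.linalg.eigvalsh(-J) >= -1e-12)       # ...and positive semi-definite
```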
where $K_c$ is the spring stiffness of the thread, $y_c$ is the vertical displacement of the string at the thread point, $t$ is the time, $h_c$ is the thread equilibrium point and $R_c$ is the thread loss parameter. $h_c$ was set to zero for all modelling purposes and as such will be omitted in the following equations. When re-expressed in the modal domain this equation reads

$$\bar{F}_{c,i}(t) = -g_{c,i}\sum_{j=1}^{M} g_{c,j}\left(K_c\,\bar{y}_j(t) + R_c\,\frac{\partial\bar{y}_j(t)}{\partial t}\right). \quad (39)$$

2.6.2. Discretisation of thread equation in time

Using the discretisation operators on the force function and the relation $\bar{y}_i^{n+1} = \bar{s}_i + \bar{y}_i^n$, the modal force at time step $n+\frac{1}{2}$ can be written in vector form as

$$\bar{\mathbf{F}}_c^{n+\frac{1}{2}} = -\mathbf{g}_c\,\mathbf{g}_c^\intercal\left(\frac{K_c}{2}\left(\bar{\mathbf{s}} + 2\bar{\mathbf{y}}^n\right) + \frac{R_c}{\Delta_t}\,\bar{\mathbf{s}}\right). \quad (40)$$

To solve for $\bar{\mathbf{s}}$ the Jacobian of this vector is required. This is

$$\mathbf{J}_c = -\left(\frac{K_c}{2} + \frac{R_c}{\Delta_t}\right)\mathbf{g}_c\,\mathbf{g}_c^\intercal. \quad (41)$$

2.7. Distributed Barrier

2.7.1. Continuous domain and spatial discretisation

To model the bridge, a contact law is used to simulate the compression of the string and barrier. The contact law used is

$$F_b(x,t) = k_b\,\lfloor h_b(x) - y(x,t)\rfloor^\alpha, \quad (42)$$

where $k_b$ is the barrier stiffness per unit length, $h_b(x)$ is the height of the bridge at position $x$ and $\alpha$ is the exponent that governs the bridge interaction. Following [14] the exponent is set according to Hertzian theory as $\alpha = 1$ and the formulation is carried out with this in place. Other choices of $\alpha \ge 1$ are possible, and all of the terms required for solving related to the bridge can be derived in this general case. The force density and the string displacement in this equation are defined over $[x_b - \frac{1}{2}w_b,\ x_b + \frac{1}{2}w_b]$, which is the spatial domain of the bridge; the force density is zero outside this region. The potential energy density for the bridge will also be required:

$$V_b(x,t) = \frac{k_b}{2}\,\lfloor h_b(x) - y(x,t)\rfloor^2, \quad (43)$$

as will another definition for the force:

$$F_b(x,t) = -\frac{\partial V_b}{\partial y}. \quad (44)$$

For use in equation (21) the modal force vector $\bar{\mathbf{F}}_b$ must be found. This would usually be done by integrating over the length of the string as follows:

$$\bar{F}_{b,i}(t) = \int_0^L v_i(x)\,F_b(x,t)\,\mathrm{d}x, \quad (45)$$

but due to the non-analytic nature of $F_b(x,t)$ it is approximated as a Riemann sum [9]:

$$\bar{F}_{b,i}(t) \approx \sum_{k=1}^{K} v_{i,k}\,F_{b,k}(x_k,t)\,\Delta_x. \quad (46)$$

The bridge is defined to have $K$ points along its length; $v_{i,k}$ gives each modal amplitude at the spatial position $k$ in the domain of the barrier, and $F_{b,k}(x_k,t)$ is the contact force density at point $k$ and time $t$.

2.7.2. Discretisation of bridge equation in time

Writing the potential density at time step $n$ gives

$$V_{b,k}^n = \frac{k_b}{2}\,\lfloor h_{b,k} - y_k^n\rfloor^2, \quad (47)$$

where $y_k^n$ is taken at the bridge point corresponding to $h_{b,k}$ in each case. Equation (44) is discretised in time to give

$$F_{b,k}^{n+\frac{1}{2}} = -\frac{V_{b,k}^{n+1} - V_{b,k}^n}{y_k^{n+1} - y_k^n}. \quad (48)$$

These equations are combined to give

$$F_{b,k}^{n+\frac{1}{2}} = -\frac{k_b}{2}\,\frac{\lfloor h_{b,k} - (s_k + y_k^n)\rfloor^2 - \lfloor h_{b,k} - y_k^n\rfloor^2}{s_k}, \quad (49)$$

where again $s_k$ is taken at the barrier point corresponding to $h_{b,k}$. Following from equation (49) and vectorising gives the equation

$$\bar{\mathbf{F}}_b^{n+\frac{1}{2}} = \Delta_x\,\mathbf{V}^\intercal\,\mathbf{F}_b^{n+\frac{1}{2}}, \quad (50)$$

where $\mathbf{V}$ is a $K\times M$ matrix which holds all of the modal shapes at each point along the bridge. $\bar{\mathbf{F}}_b^{n+\frac{1}{2}}$ must be differentiated with respect to $\bar{\mathbf{s}}$ for use in Newton's method; this is done using the chain rule, giving the Jacobian for the barrier as

$$\mathbf{J}_b = \Delta_x\,\mathbf{V}^\intercal\,\frac{\partial\mathbf{F}_b^{n+\frac{1}{2}}}{\partial\mathbf{s}}\,\frac{\partial\mathbf{s}}{\partial\bar{\mathbf{s}}}, \quad (51)$$

where

$$\mathbf{y} = \mathbf{V}\bar{\mathbf{y}} \quad\text{and}\quad \mathbf{s} = \mathbf{V}\bar{\mathbf{s}}. \quad (52)$$

Clearly $\frac{\partial\mathbf{s}}{\partial\bar{\mathbf{s}}} = \mathbf{V}$. Explicitly, $\mathbf{J}_b$ is written as

$$\mathbf{J}_b = \frac{k_b\,\Delta_x}{2}\,\mathbf{V}^\intercal\,\zeta\,\mathbf{V}, \quad (53)$$

where $\zeta$ is a diagonal $K\times K$ matrix. Its diagonal entries, ranging from $\zeta_{1,1}$ at the bridge point closest to the thread to $\zeta_{K,K}$ at the bridge point furthest from the thread, are

$$\zeta_{k,k} = \frac{\lfloor h_{b,k} - (s_k + y_k^n)\rfloor^2 - \lfloor h_{b,k} - y_k^n\rfloor^2}{s_k^2} - \frac{2\,\lfloor h_{b,k} - (s_k + y_k^n)\rfloor}{s_k}. \quad (54)$$

2.7.3. Single point bridge

For comparison, the distributed bridge was made replaceable by a single-point bridge in the model. The single-point bridge has a different equation for the force density:

$$F_b(x,t) = \delta(x - x_b)\,k_b\,\lfloor h_b(x) - y(x,t)\rfloor, \quad (55)$$

where $x_b$ is the spatial position of the bridge point. When this is put into equation (45), the integral becomes exactly solvable and gives a modal force for mode $i$ of

$$\bar{F}_{b,i}(t) = v_i(x_b)\,k_b\,\lfloor h_b(x_b) - y(x_b,t)\rfloor. \quad (56)$$
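The distributed-barrier force evaluation (49)-(50) can be sketched as follows. The string state, grid resolution and stiffness below are illustrative placeholders, and the parabolic profile matches the shape used later in the results section:

```python
import numpy as np

# Sketch of the distributed-barrier modal force (49)-(50). The string state,
# grid resolution and stiffness are illustrative placeholders.
M, K = 40, 16
L, kb, xb, wb = 1.0, 1e8, 0.9, 0.02
xk = xb - wb / 2 + (np.arange(K) + 0.5) * wb / K      # barrier grid points
dx = wb / K
hb = 3.0 * (xk - xb)**2                               # parabolic bridge profile

i = np.arange(1, M + 1)
Vmat = np.sin(np.outer(xk, i * np.pi / L))            # K x M modal shapes v_{i,k}

rng = np.random.default_rng(2)
ybar = 1e-5 * rng.normal(size=M)                      # modal displacements
sbar = 1e-6 * rng.normal(size=M)                      # step vector

br = lambda x: np.maximum(x, 0.0)                     # floor bracket <x>

yk = Vmat @ ybar
sk = Vmat @ sbar
sk = np.where(np.abs(sk) < 1e-15, 1e-15, sk)          # guard the division in (49)
Fk = -kb / 2.0 * (br(hb - (sk + yk))**2 - br(hb - yk)**2) / sk   # eq. (49), alpha = 1
Fbar = dx * Vmat.T @ Fk                               # modal force vector, eq. (50)
assert np.all(Fk >= 0.0)                              # contact only ever pushes one way
```

The final assertion follows because $\lfloor h_{b,k} - (\cdot)\rfloor^2$ is non-increasing in the displacement, so each chord quotient in (49) yields a non-negative contact force density.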
2.8. Plucking signal

The plucking signal used is a single-point force, spatially distributed following the procedure outlined in Section 2.3. The input force used [14] is

$$F_e(t) = A_e\,\sin^2\!\left(\frac{\pi}{\tau_e}\,h_e(t - t_e)\right), \quad (57)$$

where

$$h_e(t) = \frac{1}{2}\left(t + \tau_e\,\frac{\sinh(\Omega t/\tau_e)}{\sinh(\Omega)}\right). \quad (58)$$

$F_e(t)$ is zero outside of the time interval $[t_e,\ t_e + \tau_e]$, where $t_e$ is the excitation time and $\tau_e$ is the length of the excitation. This gives a malleable curve based on $\sin^2$ whose central point and gradients on either side can be altered according to the choice of $\Omega$. This is to a certain extent an arbitrary choice, made to simulate the plucking typically used in the playing of the tanpura.

As $F_e$ is known at all time steps, it can simply be inserted into equation (21) by using the modal weight $\mathbf{g}_e$ and the discretisation operator $\mu$:

$$\bar{\mathbf{F}}_e^{n+\frac{1}{2}} = \frac{1}{2}\,\mathbf{g}_e\,\mu F_e. \quad (59)$$

3. MODEL PROPERTIES

3.1. Uniqueness

For the scheme to have a unique solution, the Jacobian must be positive definite. In this scheme the Jacobian is equation (22). For an $N\times N$ symmetric matrix $\mathbf{M}$ to be positive definite it is required that

$$\mathbf{z}^\intercal\mathbf{M}\mathbf{z} > 0, \quad (60)$$

for all real non-zero column vectors $\mathbf{z}$ of size $N$. It is useful in this instance to note that the addition of a positive semi-definite matrix to a positive definite matrix results in a matrix which is still positive definite. The definition of a positive semi-definite matrix is

$$\mathbf{z}^\intercal\mathbf{M}\mathbf{z} \ge 0. \quad (61)$$

$\mathbf{J}_s$ is positive definite, as every matrix in it is diagonal with only real, positive entries. For the thread term, $-\mathbf{J}_c$ can be proven to be positive semi-definite simply by considering that any negatives in either $\mathbf{z}$ or $\mathbf{g}_c$ are squared in equation (62), proving that

$$\mathbf{z}^\intercal\mathbf{g}_c\,\mathbf{g}_c^\intercal\,\mathbf{z} \ge 0. \quad (62)$$

The term $-\zeta$ is positive semi-definite as $-\zeta_{k,k}$ is a convex function [5] of $\bar{\mathbf{s}}$, which means that $-\mathbf{J}_b$ is also positive semi-definite due to the following relation:

$$-\mathbf{z}^\intercal\mathbf{J}_b\,\mathbf{z} = -\frac{\Delta_x k_b}{2}\,(\mathbf{V}\mathbf{z})^\intercal\,\zeta^n\,(\mathbf{V}\mathbf{z}) \ge 0. \quad (63)$$

To prove that $-\mathbf{J}_{tm}$ is positive semi-definite, it is easiest to consider it in its three parts as seen in equation (37). $\alpha_2$, $\alpha_1$ and $\Gamma$ are all positive, and $\beta^2$ is a diagonal matrix with only positive entries. It follows that, for the first and third matrices on the right-hand side of equation (37),

$$\mathbf{z}^\intercal\alpha_2\Gamma\beta^2\,\mathbf{z} \ge 0 \quad\text{and}\quad \mathbf{z}^\intercal\alpha_1\Gamma\beta^2\,\mathbf{z} \ge 0. \quad (64)$$

Expanding the remaining matrix, using $(\beta^2)^\intercal = \beta^2$ where appropriate, and writing $\lambda = \bar{\mathbf{s}} + 2\bar{\mathbf{y}}^n$,

$$\mathbf{U}(\bar{\mathbf{s}})\,2\lambda^\intercal\beta^2 = 2\beta^2\lambda\lambda^\intercal\beta^2. \quad (65)$$

After rearranging it can be seen that the following must be true:

$$2\Gamma\,\mathbf{z}^\intercal\beta^2\lambda\lambda^\intercal\beta^2\,\mathbf{z} \ge 0. \quad (66)$$

Therefore $-\mathbf{J}_{tm}$ is positive semi-definite. As all of the component matrices of $\mathbf{J}$ are positive definite or positive semi-definite, the uniqueness of the solution is proven.

3.2. Stability

The total numerical energy of the system at time step $n$ can be expressed as the sum of the total energies of the modes of the stiff string and the potentials due to the tension modulation, thread and barrier. The potential due to the barrier at time step $n$ can easily be found using equation (47):

$$V_b^n = \Delta_x\,(\mathbf{1}_K)^\intercal\,\mathbf{V}_b^n, \quad (67)$$

where $\mathbf{1}_K$ is a vector of size $K$ with 1 at every entry. The potential due to the tension modulation is

$$V_{tm}^n = \frac{\Gamma}{4}\left((\bar{\mathbf{y}}^n)^\intercal\beta^2\bar{\mathbf{y}}^n\right)^2, \quad (68)$$

and the potential due to the thread is

$$V_c^n = \frac{K_c}{2}\left(\mathbf{g}_c^\intercal\bar{\mathbf{y}}^n\right)^2, \quad (69)$$

which, in conjunction with the energy

$$H_s^n = \xi^{-1}\left((\bar{\mathbf{y}}^n)^\intercal\mathbf{A}\,\bar{\mathbf{y}}^n + (\bar{\mathbf{q}}^n)^\intercal\bar{\mathbf{q}}^n\right), \quad (70)$$

gives the total numerical energy at time step $n$:

$$H^n = V_b^n + V_{tm}^n + V_c^n + H_s^n. \quad (71)$$

To obtain a power balance, the right-hand side of equation (19) is multiplied by $(\delta\bar{\mathbf{y}}^n)^\intercal$ and the left-hand side by $(\mu\bar{\mathbf{q}}^n)^\intercal$. Following various rearrangements, the equation for the energy balance is found to be

$$\Delta_t^{-1}\,\delta H^{n+\frac{1}{2}} = P^{n+\frac{1}{2}} - Q^{n+\frac{1}{2}}. \quad (72)$$

The input power $P^{n+\frac{1}{2}}$ is

$$P^{n+\frac{1}{2}} = \frac{1}{2\Delta_t}\,\mathbf{g}_e^\intercal\,\delta\bar{\mathbf{y}}^{n+\frac{1}{2}}\,\mu F_e^{n+\frac{1}{2}}, \quad (73)$$

and the power losses $Q^{n+\frac{1}{2}}$ are

$$Q^{n+\frac{1}{2}} = \frac{(\delta\bar{\mathbf{y}}^{n+\frac{1}{2}})^\intercal\,\mathbf{B}\,\delta\bar{\mathbf{y}}^{n+\frac{1}{2}}}{\xi\,\Delta_t} + \frac{R_c}{\Delta_t^2}\left(\delta y_c^{n+\frac{1}{2}}\right)^2. \quad (74)$$

As both $\mathbf{A}$ and $\mathbf{B}$ are positive definite, $H^{n+\frac{1}{2}}$ and $Q^{n+\frac{1}{2}}$ can only be greater than or equal to zero, so the numerical energy of the system can only be conserved or decline outwith periods where energy is being directed into the system via the controlled input force.
Table 1: Physical parameter values used for the C3 and C2 strings.

            | C3                   | C2                   | unit
$L$         | 1.0                  | 1.0                  | m
$\rho$      | 7850                 | 8000                 | kg/m$^3$
$A$         | $6.16\times10^{-8}$  | $2.83\times10^{-7}$  | m$^2$
$E$         | $2.0\times10^{11}$   | $1.0\times10^{11}$   | N/m$^2$
$I$         | $3.02\times10^{-16}$ | $6.36\times10^{-15}$ | m$^4$
$T_0$       | 33.1                 | 38.7                 | N
$\sigma_0$  | 0.6                  | 0.8                  | s$^{-1}$
$\sigma_1$  | $6.5\times10^{-3}$   | $6.5\times10^{-3}$   | m/s
$\sigma_3$  | $5\times10^{-6}$     | $5\times10^{-6}$     | m$^3$/s

Figure 2: Plots showing how the sampling frequency affects the output from the model. The top two graphs show the nut force $F_n$ [N] over time for two sampling frequencies. The middle plots display portions of the nut force signal over a short time frame; these can be compared with the black dashed line, which represents $F_s = 352.8$ kHz with the number of modes truncated to match the $F_s = 44.1$ kHz case. The final plot shows the radiated pressure $|P_n|$ [dB] due to the nut force over a range of frequencies, with the same colour scheme as the other graphs.

4. RESULTS

4.1. String and contact parameters

The string parameters used in the model are detailed in Table 1. Parameters chosen to represent a steel C3 string were the same as used in previous papers studying the tanpura [14]. For modelling a bronze C2 string, the values for $\sigma_1$ and $\sigma_3$ were set to be the same as for the steel string for convenience, whilst the value for $\sigma_0$ was altered to obtain an empirically more realistic response.

The bridge profile was taken to be

$$h_b(x) = 3(x - x_b)^2, \quad (75)$$
Figure 3: Plots showing two seconds of simulation after the start of the pluck. The top row shows the nut force envelope over time for the four cases indicated by the note on top of each graph. The second row shows the radiated pressure from the nut envelope over time. The third row shows spectrograms of the nut force.

signal to be generated from the model which, when analysed, will give information on how the changes being investigated affect the perceived sound from the modelled instrument.

Figure 3 shows the nut force, radiation pressure from the nut force, and spectrograms for the four different configurations of including tension modulation and having a point or distributed bridge. These were all set up with the same parameter values (barring the changes made to investigate) and identical plucking conditions. The simplest case presented here, with no tension modulation and a point bridge, is used as a point of comparison when studying the effects of adding tension modulation and a distributed barrier. Within the spectrograms, the jvari is the time-dependent formant of pronounced frequencies which exhibits a drop in spectral centroid, followed by a plateau, followed again by a drop. As can be seen from Figure 3, all of the spectrograms show this behaviour; however, there are differences in the characteristics of the jvari shapes.

In the radiated pressure time-domain graph, the tension modulation can be seen to significantly alter the envelope in the first half second of the simulation, and the loss of energy at the end of the signal also follows a different shape. From the spectrograms it can be observed that the addition of tension modulation causes the spectral centroid of the jvari to start at a higher frequency, and the plateau is also at a higher frequency relative to the situation without the tension modulation. The effect on the tail of the jvari is hard to see in the spectrograms. The tension modulation also has the expected effect of causing pitch glide in the partials. This can be seen by zooming in on the partials above 3 kHz in the second and fourth spectrograms.

Distribution of the bridge forces has the effect of shortening the plateau of the jvari when compared to the point barrier, and it also rounds off the top right corner of the plateau. This effect can be seen in both the spectrogram and the radiated pressure plot. The start of the tail can be seen to occur at roughly 1 second when the barrier is distributed, and roughly 0.1 s later with the point barrier. Although they start to decay at different times, they end up with a very similar level of energy left in the system.

When the tension modulation and distributed bridge are combined, the spectrogram is distinct from all three of the others, indicating that both additions make a difference to the output of the model. This spectrogram shows the higher starting-point frequency and plateau frequency for the jvari from the tension modulation, and the more rounded, shorter plateau from the distributed barrier. The extra effect of having both together is difficult to see in plots, but when listening to sound examples¹ the combination appears to give a "livelier" sound than any of the signals generated by the other model options.

The effect due to the tension modulation can be seen more clearly in Figure 4. The plots show spectrograms of the nut force when the model is run with C2 string parameters and a greater excitation amplitude of $A_e = -0.8$ N. It can be observed by comparing (a) and (b) to (c) and (d) that the tension modulation is noticeable when the barrier is absent, but has a larger effect when the barrier is present. With these parameters the change in the shape of the jvari is more significant than in the C3 case, and is heard more clearly in sound examples.

5. CONCLUSIONS

This paper shows how tension modulation, a distributed barrier and the thread can be included in a model of the tanpura. The model is proven to be unconditionally stable and to have a unique solution to the non-linear vector equation that has to be solved at

¹ http://www.socasites.qub.ac.uk/mvanwalstijn/dafx17c/
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
when the shear layer re-attaches around the trailing edge, as shown in Fig. 1. A closed cavity is a longer shallow cavity in which the shear layer attaches to the bottom of the cavity prior to the trailing edge, and it does not produce the same dipole output. The shear layer should always reattach to the trailing edge in a deep cavity.

A condition that can indicate whether a cavity is closed or open is the ratio between the length and depth. [15] found that cavities become closed when the ratio L/d > 8, whilst quoting a separate study which found L/d > 11.

4.2. Frequency Calculations

In 1878 the Czech physicist Vincenc Strouhal defined one of the fundamental relationships of fluid dynamics, which links a physical dimension L, airspeed u and frequency f of the tone produced to a dimensionless variable known as the Strouhal number St. This is given in Eqn. 1:

    f = St u / L    (1)

where, for the case of the cavity tone, L represents the cavity length. The first main research that predicted the cavity tone frequencies was done by Rossiter [10], defining the following formula to determine the Strouhal number:

    St_λ = (λ − α) / (1/K + M)    (2)

where α is a constant representing the difference in phase between the acoustic wave arriving at the leading edge and the vortex being shed. It was fitted to the measured experimental data and a value of 0.25 was found. The constant K represents the ratio of the convection velocity of vortices to the free stream flow speed; Rossiter gave K the value of 0.57, again to fit the observed data. λ is an integer representing the mode.

Rossiter's equation, Eqn. 2, was extended by Heller et al. [16] to include the correction for the speed of sound within the cavity. This is shown in Eqn. 3:

    St_λ = (λ − α) / (1/K + M / [1 + ((γ−1)/2) M²]^{1/2})    (3)

where γ is the ratio of specific heats; [16] gives γ = 1.4. Rewriting Eqn. 1 to include the Strouhal number calculated by Eqn. 3, we obtain the frequency of each Rossiter mode f_λ:

    f_λ = (λ − α) / (1/K + M / [1 + ((γ−1)/2) M²]^{1/2}) · u/L    (4)

4.3. Acoustic Intensity Calculations

Pioneering research into aeroacoustics was carried out by Lighthill [17, 18], who took the fundamental fluid dynamics equations, the Navier-Stokes equations, and defined them for the wave equation to predict the acoustic sound. In [14] Howe uses a Green's function to solve Lighthill's acoustic analogy in relation to the acoustic intensity of cavity tones. This solution provides a relationship defining the acoustic intensity based on the tone frequency, airspeed, cavity dimensions and propagation angle. The Green's function has components relating to a dipole field generated by the feedback system G_D and a monopole field created by the cavity resonance G_M.

Using the Green's function, Howe [14] derived the non-dimensional far field acoustic pressure frequency spectrum,

    Φ(ω, x) ≈ [M² (ωδ*/U)^{5/3} / ((1 + M cos φ)² {(ωδ*/U)² + τ_p²}^{3/2})] · | C₂ sin(κd) / cos{κ(d + ζ + i(κA/2π))} + i(cos φ − M) |²    (5)

where x is the distance along the x axis of the cavity. The first and second terms in the magnitude brackets correspond to the monopole and dipole radiation respectively. A = bL is the area of the cavity mouth, τ_p is a constant set to 0.12, and ζ is an end correction set to √(πA/4). The end correction is the effective length to which the cavity must be extended to account for the inertia of fluid above the mouth of the cavity also set into reciprocal motion by the cavity resonance (see [14] for a more detailed explanation).

A fixed value for δ* is set in [14], making the second Rossiter frequency dominant. The value of δ* has a large influence over the dominant mode [19]. Dipole sources model Rossiter modes, with frequencies obtained from Eqn. 4. The monopole source frequency, relating to the lowest order mode, is calculated when the complex frequency satisfies:

    κ(d + ζ + i(κA/2π)) = π/2    (6)

4.4. Tone Bandwidth

To date no research has been found that describes the relationship between the peak frequency and the bandwidth of the tone. The Reynolds number is a dimensionless measure of the turbulence in a flow, given by:

    Re = ρ_air d u / μ_air    (7)

It is known that the higher the Reynolds number, the smaller scale the vortices will be. This would imply that the higher the Reynolds number, the wider the tone bandwidth.

5. IMPLEMENTATION

The equations in this section are discrete compared to the continuous formulas previously stated. The discrete time operator [n] has been omitted until the final output (Section 5.6) to reduce complexity. It should be noted that the airspeed u is sampled at 44.1 kHz to give u[n] and hence M[n], ω[n], κ[n], St[n], f[n], Re[n], Φ[n], δ*[n], Q[n] and a[n].

Our synthesis model was implemented using the software Pure Data. This was chosen due to the open source nature of the code and ease of repeatability rather than high performance computation. Airspeed, cavity dimensions and observer position are adjustable in real time. Properties like air density, speed of sound, etc., are set as constants for design purposes but can be adjusted with minimal effort if a real-time evolving environment is required. All intermediate calculations are performed in real time and no preprocessing is required.
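The Rossiter-mode calculation of Eqns. 1–4 can be sketched as a few lines of Python. This is a minimal sketch using the constants quoted above (K = 0.57, α = 0.25, γ = 1.4) and an assumed speed of sound c = 343 m/s; it is not the authors' Pure Data implementation, and the function name is illustrative.

```python
import math

def rossiter_frequencies(u, L, n_modes=4, K=0.57, alpha=0.25,
                         gamma=1.4, c=343.0):
    """Rossiter mode frequencies f_lambda via Heller's corrected
    Strouhal number (Eqns. 3 and 4)."""
    M = u / c  # free-stream Mach number
    # denominator of Eqn. 3: 1/K + M / [1 + (gamma-1)/2 * M^2]^(1/2)
    denom = 1.0 / K + M / math.sqrt(1.0 + 0.5 * (gamma - 1.0) * M * M)
    # Eqn. 4: f_lambda = St_lambda * u / L, for modes lambda = 1..n_modes
    return [(lam - alpha) / denom * u / L for lam in range(1, n_modes + 1)]
```

For u = 3.43 m/s and L = 0.03 m this gives roughly 49, 113, 178 and 243 Hz for the first four modes, consistent with the model results reported for that configuration later in Table 3.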
Table 1: Values of the ratio L/θ0 for different L/d values.

    Reference   L/d   L/θ0
    [19]        2     52.8
                4     60.24
                4     86.06
                6     90.36
    [20]        4     82

5.1. Conditions of Activation

The change from open to closed cavity occurs when the length to depth ratio passes a value around L/d > 8 → 11 (Section 4.1). This is implemented by a sigmoid function, indicating whether the cavity is open or closed when L/d passes a value between 8 and 11. If the cavity is flagged as closed then the Rossiter mode dipoles do not produce any sound.

5.2. Frequency Calculations

A discrete implementation of Eqn. 4 is used with λ = [1, 4] to give the first 4 Rossiter frequencies f_Dλ. These relate to the dipole sources associated with the feedback loop.

To calculate the monopole resonant mode frequency, a solution to the real part of Eqn. 6 is found, shown in Eqn. 8:

    κ(d + ζ) = π/2    (8)

with κ = ω/c and ω = 2πf_M. Rearranging reveals the monopole frequency f_M as:

    f_M = c / (4(d + ζ))    (9)

5.3. Shear Layer Thickness

[19] and [20] state values for the ratio of L/θ0 and d/θ0 for a number of different ratios of L/d. These are shown in Table 1. A linear regression line is calculated for this data, giving the relationship:

    L/θ0 = 9.39 (L/d) + 36.732    (10)

Since the parameters L and d are set, we can calculate a predicted value for θ0. The shear layer effective thickness at separation δ0* and θ0 are related by Eqn. 11 [21]:

    H = δ0*/θ0    (11)

where H is a shape factor given in [20] as 1.29 for turbulent flow and 2.69 for laminar. Using Eqn. 10 and the relationship with the shape factor we are able to calculate δ0*.

The shear layer thickness over the cavity δc is stated in [21] and given as:

    δc = (xL/Re_L)^{1/2}    (12)

for a laminar flow, where Re_L is the Reynolds number with respect to the cavity length. For turbulent flow,

    δc = x / (σ√8)    (13)

where σ is the Görtler parameter, calculated from [19] to lie between 5 and 7. We chose σ = 6 for this model. x is the distance along the cavity length; we need to select a value of x to obtain the shear layer thickness for the calculations. Since the dipole is positioned near the trailing edge [19], we set x = 0.75L.

The relationship between δ* and δ is given in [21] as:

    δ* = δ / (1 + n)    (14)

where n = 7. From Eqn. 14 we can calculate δc* using δc as found from either Eqn. 12 or 13. The total shear layer effective thickness is obtained from Eqn. 15:

    δ* = δc* + δ0*    (15)

A critical value for Re_L of 25000 is set [22], and implemented with a sigmoid function providing a transition from laminar to turbulent flow.

5.4. Acoustic Intensity Calculations

The acoustic intensity is calculated from a real-time discrete version of Eqn. 5 for the previously calculated monopole and dipole frequencies, Eqn. 9 and a discrete version of Eqn. 4. To achieve this, the real and imaginary parts of the equation within the magnitude brackets are separated out. The denominator of the first component is:

    cos{κ(d + ζ + i(κA/2π))}    (16)

which, multiplying κ into the brackets, becomes:

    cos{κ(d + ζ) + i(κ²A/2π)}    (17)

Using the identity:

    cos(a + ib) = cos a · (e^b + e^{−b})/2 − i sin a · (e^b − e^{−b})/2

let

    X = cos(κ(d + ζ)) · (e^{κ²A/2π} + e^{−κ²A/2π})/2
    Y = sin(κ(d + ζ)) · (e^{κ²A/2π} − e^{−κ²A/2π})/2

Expanding the magnitude brackets, the discrete implementation of Eqn. 5 becomes:

    Φ(ω, x) ≈ [M² (ωδ*/U)^{5/3} / ((1 + M cos φ)² {(ωδ*/U)² + τ_p²}^{3/2})] · [ (X C₂ sin(κd) / (X² + Y²))² + (Y C₂ sin(κd) / (X² + Y²) + (cos φ − M))² ]    (18)

Howe [14] gives a value of C₂ = 1.02 for a cavity with d/L ratio = 0.5, and τ_p is given as 0.12. To enable us to identify gains for the dipole sounds and monopole sound, the second element within the large brackets is expanded, giving:
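The frequency and shear-layer steps of Sections 5.2 and 5.3 can be sketched as follows. This is a minimal sketch, with two assumptions not fixed by the text: the speed of sound c = 343 m/s, and a kinematic viscosity ν ≈ 1.5 × 10⁻⁵ m²/s standing in for ρ_air/μ_air in the Reynolds number; a hard laminar/turbulent switch at Re_L = 25000 replaces the paper's sigmoid blend.

```python
import math

def monopole_frequency(L, b, d, c=343.0):
    """Depth-mode (monopole) frequency, Eqn. 9, with end correction
    zeta = sqrt(pi * A / 4) for mouth area A = b * L."""
    zeta = math.sqrt(math.pi * b * L / 4.0)
    return c / (4.0 * (d + zeta))

def shear_layer_thickness(L, d, u, x_frac=0.75, sigma=6.0,
                          H=1.29, n=7, nu=1.5e-5):
    """Total effective shear layer thickness delta* (Eqns. 10-15).
    nu is an assumed kinematic viscosity; the laminar/turbulent
    choice is a hard threshold here rather than a sigmoid."""
    Re_L = u * L / nu                           # Reynolds number w.r.t. L
    theta0 = L / (9.39 * (L / d) + 36.732)      # Eqn. 10 rearranged
    delta0_star = H * theta0                    # Eqn. 11 (turbulent H)
    x = x_frac * L                              # dipole near trailing edge
    if Re_L < 25000:
        delta_c = math.sqrt(x * L / Re_L)       # Eqn. 12 (laminar)
    else:
        delta_c = x / (sigma * math.sqrt(8.0))  # Eqn. 13 (turbulent)
    delta_c_star = delta_c / (1 + n)            # Eqn. 14
    return delta_c_star + delta0_star           # Eqn. 15
```

As a check on the end correction, for the L = b = 0.03 m, d = 0.015 m cavity, `monopole_frequency` returns about 2062 Hz, matching the f_M ≈ 2061 Hz reported for [14] in Table 3.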
    (Y C₂ sin(κd) / (X² + Y²))² + 2 (Y C₂ sin(κd) / (X² + Y²)) (cos φ − M) + (cos φ − M)²    (19)

It is seen that the middle term in Eqn. 19 contains elements from both monopole and dipole sources. Since the monopole is the more efficient compact source [23], this term is included in the monopole equation to minimise error. The monopole gain is shown in Eqn. 20:

    G_M(ω_M, r, φ) ∼ [M² (ω_M δ*/U)^{5/3} / (r (1 + M cos φ)² {(ω_M δ*/U)² + τ_p²}^{3/2})] · [ (X C₂ sin(κ_M d) / (X² + Y²))² + (Y C₂ sin(κ_M d) / (X² + Y²))² + 2 (Y C₂ sin(κd) / (X² + Y²)) (cos φ − M) ]    (20)

where G_M(ω_M, r, φ) is the discrete far field spectrum gain in relation to the monopole. For the dipole:

    G_D(ω_Dλ, r, φ) ∼ [M² (ω_Dλ δ*/U)^{5/3} / (r (1 + M cos φ)² {(ω_Dλ δ*/U)² + τ_p²}^{3/2})] · (cos φ − M)²    (21)

where G_D(ω_Dλ, r, φ) is the discrete far field spectrum gain in relation to the dipole representing one of the Rossiter modes (λ). In Eqns. 20 and 21, the distance between the source and listener, r, represents how the sound amplitude diminishes with respect to distance in the far field. Figure 3 shows the output from Eqns. 5, 20 and 21, with r = 1.

Figure 3: Far field spectrum gains calculated by Eqn. 20 & Eqn. 21 as compared to Howe's derivation from [14]. L = b = 0.03 m, d = 0.015 m and u = 3.43 m/s.

5.5. Tone Bandwidth

As stated in Section 4.4, no exact relationship for the tone bandwidth has been found. It is known that the higher the Reynolds number, the more the vortices reduce in size, with increased complex interactions leading to a broader width around the peak frequency; laminar flow will have larger, simpler interactions and be closer to a pure tone.

In signal processing the relationship between the peak frequency and bandwidth is called the Q value (Q = f_Dλ/∆f, where ∆f is the tone bandwidth at −3 dB of the peak). To approximate the relationship between Q and Re_L, the peak frequencies and bandwidths from previously published plots are measured [13, 24–30]. Results are shown in Table 2.

Table 2: Q values and corresponding Reynolds numbers Re_L measured from previously published plots.

    Reference   Q       Re_L
    [13]        5       5.64 × 10⁵
                4.5     4.52 × 10⁵
                4.5     3.45 × 10⁵
    [24]        25      3.10 × 10⁵
    [25]        11.875  1.27 × 10⁵
                6       1.47 × 10⁵
    [26]        14      3.54 × 10⁶
                14      7.08 × 10⁶
    [27]        6.875   2.48 × 10⁶
                7.875   2.95 × 10⁶
    [28]        22.5    4.01 × 10⁶
    [29]        76      4.51 × 10⁴
    [30]        15      6.46 × 10⁵

A linear regression line, fitted to the natural logarithm of the Reynolds number, was obtained from the data in Table 2. The equation is given in Eqn. 22:

    Q = 87.715 − 5.296 log(Re_L)    (22)

To prevent Q reaching unrealistic values, a limit has been set so that 2 ≤ Q ≤ 90.

5.6. Total Output

The implementation described above determines the values used in the final output. The sound is generated from a noise source shaped by a number of bandpass filters. A flow diagram of the synthesis process for a dipole source is shown in Fig. 4. It can be seen from Fig. 4 how the parameters are interrelated through the different semi-empirical equations, ensuring accuracy throughout the sound effect characteristics.

The monopole source frequency due to the depth resonant mode f_M[n] is calculated from Eqn. 9. The noise source is filtered through a bandpass filter with the centre frequency set at f_M[n] and Q value set by Eqn. 22, giving the output from the filter, B_M[n]. The intensity is calculated from Eqn. 20 as G_M(ω_M[n], r, θ), where ω_M[n] = 2πf_M[n], giving the total monopole output M[n]:

    M[n] = G_M(ω_M[n], r, θ) B_M[n]    (23)

The dipole frequencies due to the Rossiter modes are set from a discrete implementation of Eqn. 3 with λ = [1, 4]; the corresponding frequency values f_Dλ[n] are calculated from Eqn. 4. The noise source is filtered through a bandpass filter with the centre frequency set at f_Dλ[n] and Q value set by Eqn. 22, giving the output from the filter, B_Dλ[n]. The intensity is calculated from Eqn. 21, giving a single dipole output D_λ[n]:

    D_λ[n] = G_D(ω_Dλ[n], r, θ) B_Dλ[n]    (24)
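The per-source chain of Eqns. 22–24 (Q from the regression, then noise through a bandpass at the source frequency, scaled by a gain) can be sketched as follows. This is a minimal sketch: the biquad coefficients follow the common RBJ cookbook bandpass, not necessarily the Pure Data filter the authors used, and the `gain` argument stands in for the G_M or G_D value of Eqns. 20–21.

```python
import math
import random

def q_factor(Re_L):
    """Tone Q from the regression of Eqn. 22 (natural log),
    clamped to the paper's limits 2 <= Q <= 90."""
    q = 87.715 - 5.296 * math.log(Re_L)
    return min(max(q, 2.0), 90.0)

def bandpass_noise(fc, Q, gain, fs=44100, n=1024, seed=0):
    """White noise shaped by a unity-peak biquad bandpass at fc with
    the given Q, then scaled by a gain, mirroring Eqns. 23-24."""
    rng = random.Random(seed)
    w0 = 2.0 * math.pi * fc / fs
    alpha = math.sin(w0) / (2.0 * Q)
    b0, b1, b2 = alpha, 0.0, -alpha           # constant 0 dB peak BPF
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for _ in range(n):
        x0 = rng.uniform(-1.0, 1.0)            # noise source sample
        y0 = (b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1, y2, y1 = x1, x0, y1, y0
        out.append(gain * y0)
    return out
```

One such filtered-noise branch per source (the monopole plus each of the four Rossiter dipoles) is then summed to form the total output.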
Figure 4: Flow diagram showing a typical dipole synthesis process.

Figure 5: Output from the cavity synthesis model. The grey line indicates the value from Eqn. 5 for comparison. L = b = 0.03 m, d = 0.015 m, u = 3.43 m/s, θ = 30°.
Table 3: Comparison of published measured, computed and theoretical results and our synthesis model. Bold indicates dominant frequency. Ref. = Reference. (* = read from a graph, † = theoretical answer, ‡ = computational answer, ? = unknown)

    Ref.  u (m/s)  l (m)    w (m)    d (m)    Published results (Hz)                                              Physical model results (Hz): fD1, fD2, fD3, fD4, fM
    [14]  3.43     0.03     0.03     0.015    114*†                                                               49, 113, 178, 243, 2061
    [13]  89.2     0.0191   0.1016   0.0127   4530, 4063†                                                         1739, 4057, 6376, 8695, 1656
          137.2                               5854†, 7339                                                         2506, 5848, 9190, 12532
          181.8                               8062†                                                               3141, 7330, 11519, 15707
          230.5                               8809, 9401†                                                         3773, 8804, 13835, 18865
          274.4                               10027, 10372†                                                       4294, 10019, 15745, 21470
          308.7                               10925, 10941†                                                       4680, 10919, 17159, 23398
    [31]  291.6    0.4572   0.1016   0.1016   195 / 181† / 193‡; 419 / 422† / 460‡; 693 / 663† / 667‡; 947 / 904† / 940‡   188, 438, 689, 938, 293
    [32]  514.5    0.0045   ?        0.0015   31899 / 28469† / 32242‡; 71344 / 66542† / 66542‡; 110789 / 104615† / 126567‡   28567, 66657, 104707, 142837
    [24]  40       0.06     0.06     0.35     225                                                                 267, 623, 980, 1336, 213
    [33]  12       0.03     0.6      0.015    454, 454‡; 908, 908‡                                                168, 391, 615, 838, 640
          15                                  496, 496‡; 992, 992‡                                                209, 487, 766, 1043
    [34]  31       0.15     0.15     0.15     125*, 245*, 375*                                                    84, 196, 308, 420, 303
Figure 6: Directional output from synthesis model. L and b = 0.03 m, d = 0.015 m, airspeed = 34.3 m/s. St = Strouhal number.

The directional output is shown in Fig. 6. For different Strouhal numbers there are different propagation angles. The propagation becomes more circular as the frequency increases, indicating the influence of the monopole. Comparing Fig. 6 with published results by Howe [14] indicates that for our model the circular monopole occurs at St ≈ 2 whereas it occurs at St ≈ 2.5 in [14].

7. DISCUSSION

The formula given in [16], implemented by our model, performs well for frequency prediction compared to the previously published results in Table 3 obtained through wind tunnel measurements, theoretical predictions and computational fluid dynamics. The papers with more than one result for the same conditions highlight that it is difficult to predict an exact frequency value from theory, indicating that the theoretical and computational models may not capture all the underlying fluid dynamic processes ongoing when a cavity tone is produced. The wind tunnel recordings may also have been affected by noise and interference from equipment and the tunnel itself.

The values of C₂ and τ_p introduced by Howe [14] in Eqn. 5 may change depending on the cavity dimensions. If so, this would introduce discrepancies between our model and measured outputs.

The acoustic intensity calculated by the model matches well with the theory introduced by Howe [14]. The sound effect is given an independent gain to allow sound designers to achieve the loudness they desire while maintaining the relationships over the frequency range and when combining multiple compact sources.

The directional output from our model is like that produced by Howe [14] but is found to differ at higher Strouhal numbers. The cause of this is unknown, as there is no difference between the theoretical equation identified by Howe for directional propagation and the equation implemented in the physical model.

The equation from Howe [14], Eqn. 5, takes into account the elevation of the observer to the tone. It does not take into consideration the azimuth angle, making the radiated output inherently 2-dimensional rather than 3D. The calculation of the acoustic resonance frequency of the cavity is still based on three dimensions.

This research focused on the sound generated from a cavity as the air passes over it parallel to the length dimension (Fig. 2). The vast majority of research by the fluid dynamics community
has this condition due to the need to minimise noise generated by wheel wells or bomb bay doors in aircraft. This physical model can be used to calculate the sound produced by cavities under such circumstances. The physics change when there is an angle of attack on the incoming air, which is beyond the scope of this paper.

Other sound effects that can be developed in the future using our model as a component part could range from a pipe organ to grooves in the cross section of a sword. The model may also be used to calculate the sound generated by a car travelling with a sunroof or window open, or wind passing doorways and other exposed cavities.

A demo of our cavity tone synthesis model is available at https://code.soundsoftware.ac.uk/hg/cavity-tone.

8. CONCLUSIONS

We have developed a compact sound source representing the cavity tone based on empirical formulas from fundamental fluid dynamics equations. This has produced a sound synthesis model which closely represents what would be generated in nature. The model operates in real-time and, unlike other models, our approach accounts for the angle and distance between source and receiver. Although the solution presented is less detailed than solutions obtained from CFD techniques, results show that the model predicts frequencies close to measured and computational results while, unlike other methods, running in real time.

Acoustic intensity and propagation angles derived from fundamental fluid dynamics principles are implemented, which overall have good agreement with published results.

9. ACKNOWLEDGMENTS

Supported by EPSRC, grant EP/G03723X/1. Thanks to Will Wilkinson, Brecht De Man, Stewart Black and Dave Moffat for their comments.

10. REFERENCES

[1] Y Dobashi, T Yamamoto, and T Nishita. Real-time rendering of aerodynamic sound using sound textures based on computational fluid dynamics. In ACM Transactions on Graphics, California, USA, 2003.
[2] D Berckmans et al. Model-based synthesis of aircraft noise to quantify human perception of sound quality and annoyance. Journal of Sound and Vibration, Vol. 331, 2008.
[3] A Farnell. Designing Sound. MIT Press, Cambridge, 2010.
[4] S Bilbao and CJ Webb. Physical modeling of timpani drums in 3D on GPGPUs. Journal of the Audio Engineering Society, Vol. 61, 2013.
[5] F Pfeifle and R Bader. Real-time finite difference physical models of musical instruments on a field programmable gate array. In Proc. of the 15th Int. Conference on Digital Audio Effects, York, UK, 2012.
[6] R Selfridge et al. Physically derived synthesis model of an Aeolian tone. In Audio Engineering Society Convention 141, Los Angeles, USA, 2016. Best Student Paper Award.
[7] R Selfridge, DJ Moffat, and JD Reiss. Real-time physical model of an Aeolian harp. In 24th International Congress on Sound and Vibration (accepted), London, UK, 2017.
[8] JW Coltman. Jet drive mechanisms in edge tones and organ pipes. The Journal of the Acoustical Society of America, Vol. 60, 1976.
[9] MS Howe. Edge, cavity and aperture tones at very low Mach numbers. Journal of Fluid Mechanics, Vol. 330, 1997.
[10] JE Rossiter. Wind tunnel experiments on the flow over rectangular cavities at subsonic and transonic speeds. Technical report, Ministry of Aviation; Royal Aircraft Establishment; RAE Farnborough, 1964.
[11] CKW Tam and PJW Block. On the tones and pressure oscillations induced by flow over rectangular cavities. Journal of Fluid Mechanics, Vol. 89, 1978.
[12] T Colonius, AJ Basu, and CW Rowley. Numerical investigation of flow past a cavity. In AIAA/CEAS Aeroacoustics Conference, Seattle, USA, 1999.
[13] KK Ahuja and J Mendoza. Effects of cavity dimensions, boundary layer, and temperature on cavity noise with emphasis on benchmark data to validate computational aeroacoustic codes. Technical report, NASA 4653, 1995.
[14] MS Howe. Mechanism of sound generation by low Mach number flow over a wall cavity. Journal of Sound and Vibration, Vol. 273, 2004.
[15] V Sarohia. Experimental investigation of oscillations in flows over shallow cavities. AIAA Journal, Vol. 15, 1977.
[16] HH Heller, DG Holmes, and EE Covert. Flow-induced pressure oscillations in shallow cavities. Journal of Sound and Vibration, Vol. 18, 1971.
[17] MJ Lighthill. On sound generated aerodynamically. I. General theory. Proc. of the Royal Society of London, Series A, Mathematical and Physical Sciences, 1952.
[18] MJ Lighthill. On sound generated aerodynamically. II. Turbulence as a source of sound. In Proc. of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 1954.
[19] CW Rowley, T Colonius, and AJ Basu. On self-sustained oscillations in two-dimensional compressible flow over rectangular cavities. Journal of Fluid Mechanics, Vol. 455, 2002.
[20] V Suponitsky, E Avital, and M Gaster. On three-dimensionality and control of incompressible cavity flow. Physics of Fluids, Vol. 17, 2005.
[21] T Cebeci and P Bradshaw. Physical and Computational Aspects of Convective Heat Transfer. Springer Science & Business Media, 2012.
[22] H Schlichting et al. Boundary-Layer Theory. Springer, 1960.
[23] DG Crighton et al. Modern Methods in Analytical Acoustics: Lecture Notes. Springer Science & Business Media, 2012.
[24] DD Erickson and WW Durgin. Tone generation by flow past deep wall cavities. In 25th AIAA Aerospace Sciences Meeting, Reno, USA, 1987.
[25] YH Yu. Measurements of sound radiation from cavities at subsonic speeds. Journal of Aircraft, Vol. 14, 1977.
[26] HE Plumblee, JS Gibson, and LW Lassiter. A theoretical and experimental investigation of the acoustic response of cavities in an aerodynamic flow. Technical report, DTIC Document, 1962.
[27] J Malone et al. Analysis of the spectral relationships of cavity tones in subsonic resonant cavity flows. Physics of Fluids, Vol. 21, 2009.
[28] CW Rowley et al. Model-based control of cavity oscillations, part II: system identification and analysis. In 40th AIAA Aerospace Sciences Meeting & Exhibit, Reno, USA, 2002.
[29] K Krishnamurty. Acoustic radiation from two-dimensional rectangular cutouts in aerodynamic surfaces. NACA Technical Note 3487, 1955.
[30] N Murray, E Sällström, and L Ukeiley. Properties of subsonic open cavity flow fields. Physics of Fluids, Vol. 21, 2009.
[31] D Fuglsang and A Cain. Evaluation of shear layer cavity resonance mechanisms by numerical simulation. In 30th Aerospace Sciences Meeting and Exhibit, Reno, USA, 1992.
[32] X Zhang, A Rona, and JA Edwards. An observation of pressure waves around a shallow cavity. Journal of Sound and Vibration, Vol. 214, 1998.
[33] V Koschatzky et al. High speed PIV applied to aerodynamic noise investigation. Experiments in Fluids, Vol. 50, 2011.
[34] L Chatellier, J Laumonier, and Y Gervais. Theoretical and experimental investigations of low Mach number turbulent cavity flows. Experiments in Fluids, Vol. 36, 2004.
fe(x, t) exciting the system. The matrix C contains weighting parameters for the time derivatives, and the matrix A contains loss parameters and is included in the spatial differential operator L. The vector of variables Y(x, s) contains the unknown physical quantities. The matrix I is the identity matrix.

The boundary conditions of the system prescribe the existence of a boundary value Φ for one variable or a linear combination of variables in the vector Y(x, s) at x = 0, ℓ. The conditions are formulated in terms of the vector of boundary values and a matrix of boundary conditions F_b^H acting on the vector of variables

    F_b^H(x, s) Y(x, s) = Φ(x, s),  x = 0, ℓ,    (2)

where the superscript H denotes the Hermitian matrix (conjugate transpose).

2.2. Sturm-Liouville Transformation

For application to the space dependent quantities in (1), a Sturm-Liouville Transformation (SLT) is defined in terms of an integral transformation [4]

    Ȳ(µ, s) = T{Y(x, s)} = ∫₀^ℓ K̃^H(x, µ) C Y(x, s) dx,    (3)

where the vector K̃ is the kernel of the transformation and Ȳ(µ, s) is the representation of the vector Y(x, s) in the complex temporal and spatial transform domain. Defining a suitable kernel K, the inverse transformation is expressed in terms of a generalized Fourier series expansion

    Y(x, s) = T⁻¹{Ȳ(µ, s)} = Σ_{µ=−∞}^{∞} (1/Nµ) Ȳ(µ, s) K(x, µ),    (4)

with the scaling factor Nµ. The kernel functions are unknown in general; they can be determined for each problem (1) as the solution of an arising eigenvalue problem [4, 12]. The integer index µ is the index of a discrete spatial frequency variable sµ for which the PDE (1) has nontrivial solutions [4].

2.3. Properties of the Transformation

The SL-transform described above has to fulfill different properties to be applicable to a PDE (1), e.g. the discrete nature of the eigenvalues sµ, the adjoint nature of the kernel functions K and K̃, and the bi-orthogonality of the kernel functions. The properties of the SLT and the FTM are described in detail in [4, 12]. The mathematical derivations regarding the SLT and the FTM are omitted here for brevity, as this contribution is focussed on the input and output behaviour at the system boundaries.

2.4. Transform Domain Representation

The application of the transformation (3) turns a PDE in the form of Eq. (1) into an algebraic equation of the form (see [7, Eq. (16)])

    s Ȳ(µ, s) − sµ Ȳ(µ, s) = F̄e(µ, s) + Φ̄(µ, s),    (5)

with the transformed vector of boundary values and the transformed excitation function

    Φ̄(µ, s) = −[K̃^H(x, µ) Φ(x, s)]₀^ℓ,    (6)

    F̄e(µ, s) = ∫₀^ℓ K̃^H(x, µ) Fe(x, s) dx.    (7)

Solving (5) for the transformed vector of variables leads to the representation

    Ȳ(µ, s) = H̄(µ, s) [F̄e(µ, s) + Φ̄(µ, s)],    (8)

with the multidimensional transfer function

    H̄(µ, s) = 1/(s − sµ),  Re{sµ} < 0.    (9)

3. STATE SPACE MODEL

The description by a multidimensional transfer function from Sec. 2 can be formulated in terms of a state-space description, basically shown in [7]. This formulation provides many computational benefits (e.g. to avoid delay free loops) and allows the incorporation of boundary conditions of different kinds.

3.1. State Equation

The transform domain representation (5) is reformulated to a form which resembles a state equation for each single value of µ

    s Ȳ(µ, s) = sµ Ȳ(µ, s) + F̄e(µ, s) + Φ̄(µ, s).    (10)

Combining all µ-dependent variables into matrices and vectors expands the state equation to any number of eigenvalues sµ

    s Ȳ(s) = A Ȳ(s) + Φ̄(s) + F̄e(s),    (11)

with the vectors

    Ȳ(s) = [⋯, Ȳ(µ, s), ⋯]^T,    (12)
    Φ̄(s) = [⋯, Φ̄(µ, s), ⋯]^T,    (13)
    F̄e(s) = [⋯, F̄e(µ, s), ⋯]^T,    (14)

and the state matrix is given as

    A = diag(…, sµ, …).    (15)

3.2. Output Equation

The inverse transformation from Eq. (4) can be reformulated in vector form as an output equation for the state-space model

    Y(x, s) = C(x) Ȳ(s),    (16)

with the matrix

    C(x) = […, (1/Nµ) K(x, µ), …].    (17)

3.3. Transformed Boundary Term

The transformed boundary term Φ̄ in Eq. (6) can be reformulated in terms of a vector notation to fit into the state equation (11)

    Φ̄(s) = B(0) Φ(0, s) − B(ℓ) Φ(ℓ, s),    (18)

with the matrix

    B(x) = [⋮; K̃^H(x, µ); ⋮],  x = 0, ℓ.    (19)

The state space description with matrices of possibly infinite size is only used to show the parallelism to other state-space descriptions. The matrices A, B(x) and C(x) act as transformation operators rather than as matrices in the sense of linear algebra.
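The diagonal state equation (11) and the output equation (16) lend themselves to a simple per-mode simulation once the eigenvalues sµ are known. The following sketch discretizes each mode with the impulse-invariant map zµ = exp(sµ/fs), a standard step in FTM practice but not derived in this section; the function name and the choice of feeding the raw excitation equally into every mode (standing in for the transformed input F̄e) are illustrative assumptions.

```python
import cmath

def simulate_modes(s_mu, weights, excitation, fs=44100):
    """Diagonal state update (cf. Eqn. 11) discretized with
    z_mu = exp(s_mu / fs); the output is the weighted modal sum
    (cf. Eqn. 16), taking the real part of the complex modes."""
    z = [cmath.exp(s / fs) for s in s_mu]  # per-mode discrete poles
    state = [0j] * len(s_mu)
    out = []
    for fe in excitation:
        # each mode evolves independently: diagonal state matrix A
        state = [zk * yk + fe for zk, yk in zip(z, state)]
        out.append(sum(w * yk for w, yk in zip(weights, state)).real)
    return out
```

For a single stable mode, e.g. sµ = −100 + 2πi·440, an impulse excitation yields an exponentially decaying 440 Hz partial, illustrating why Re{sµ} < 0 in Eqn. (9) guarantees a stable response.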
4. INPUT/OUTPUT BEHAVIOUR

When dealing with boundary conditions of physical systems, it is important to select which quantities are defined as inputs and which are outputs of the system (see [10, p. 55]). The number of boundary conditions corresponds to the order of the spatial differential operator, respectively the number of variables in the vector Y(x, s). In many practical cases, boundary conditions are assigned to half the variables at x = 0 and to half of the variables at x = ℓ. The other variables at either side depend on the system dynamics and can be observed; they are called boundary observations here.

With this distribution of boundary conditions and boundary observations, the vector of variables Y can be sorted such that the boundary conditions appear at the top of the vector Y(x, s) as Y1(x, s), and those variables which are assigned to boundary observations appear at the bottom as Y2(x, s). To construct these vectors, the variable vector is multiplied by suitable permutation matrices to extract the relevant entries, which leads to a separated representation of the vector of variables

    Y(x, s) = [Y1(x, s); Y2(x, s)].   (20)

This partitioning carries over to the whole state-space description. The eigenfunctions K and K̃ from (3), (4) have the same size as the vector Y(x, s); they can also be divided by applying the same permutation

    K(x, µ) = [K1(x, µ); K2(x, µ)],   K̃(x, µ) = [K̃1(x, µ); K̃2(x, µ)].   (21)

With this partitioning, the matrices in the output equation (16) and the transformed boundary vector (18) are also divided

    C(x) = [C1(x); C2(x)],   B(x) = [B1(x)  B2(x)].   (22)

4.1. Boundary Conditions

After selecting the input and output variables, the following sections discuss different kinds of boundary conditions and their connection to each other. Sec. 6 presents an application of these general concepts to a physical model of a vibrating string.

4.1.1. Simple Boundary Conditions

This section defines simple boundary conditions at x = 0, ℓ of the physical system. Simple boundary conditions are conditions acting on individual entries of Y1(x, s) at the outputs of the system. At first, the boundary matrices and vectors according to (2) are specialized to

    Fb(x, s) = Fs(x),   Φ(x, s) = Φs(x, s),   (23)

with

    Fs(x) = [I 0; 0 0],   Φs(x, s) = [Φs1(x, s); 0].   (24)

Additionally, a matrix of boundary observations is defined

    Gs^H(x) = I − Fs^H(x) = [0 0; 0 I],   Fs^H(x) + Gs^H(x) = I.   (25)

With these two matrices, the input and output behaviour of the system is described completely. Applying the matrices of boundary conditions (23) and boundary observations (25) to the vector of variables (22) leads to (some arguments (x, s) omitted)

    Fs^H Y = [Φs1; 0],   Gs^H Y = [0; Yso],   (26)

    Y1(x, s) = Φs1(x, s),   Y2(x, s) = Yso(x, s).   (27)

4.1.2. Complex Boundary Conditions

Complex boundary conditions are defined here as linear combinations of physical quantities at the system boundaries x = 0, ℓ, including time derivatives or complex parameters and functions. Like simple boundary conditions, complex boundary conditions can also be defined in a more general form. Again the matrix of boundary conditions is specialized

    Fb(x, s) = Fc(x, s),   Φ(x, s) = Φc(x, s),   (28)

with

    Fc(x, s) = [Fc1(x, s) 0; Fc2(x, s) 0],   Φc = [Φc1(x, s); 0].   (29)

The matrix Fc1(x, s) has to be chosen as non-singular, except for isolated values of s. Additionally, as in the case of simple boundary conditions, a matrix of boundary observations is defined

    Gc(x, s) = [0 Gc1(x, s); 0 Gc2(x, s)],   (30)

which transforms the vector of variables into a vector of complex boundary observations

    Gc^H(x, s) Y(x, s) = [0; Yco(x, s)].   (31)

Also with the complex boundary conditions, the input and output behaviour of the system can be described in closed form by applying both matrices (29), (30) to the vector of variables, which leads to

    Fc1^H(x, s) Y1(x, s) + Fc2^H(x, s) Y2(x, s) = Φc1(x, s),   (32)

    Gc1^H(x, s) Y1(x, s) + Gc2^H(x, s) Y2(x, s) = Yco(x, s).   (33)

4.2. Connection between Simple and Complex Conditions

With the definitions of simple and complex boundary conditions from Secs. 4.1.1 and 4.1.2, their connections can be explored to express complex boundary conditions in terms of the simple ones. In general, the physical system is described by a bi-orthogonal basis for simple boundary conditions. Then the external connections are formulated in terms of relations between the input and output variables of the complex boundary conditions.

4.2.1. Simple and Complex Boundary Conditions

Starting with the application of complex boundary conditions to the vector of variables Y and exploiting Eq. (25) leads to

    Fc^H(x, s) (Fs^H(x) + Gs^H(x)) Y(x, s) = Φc(x, s).   (34)
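The partitioning of Eqs. (20)–(27) is pure bookkeeping with permutation and selection matrices. A small sketch with a hypothetical four-component variable vector (the numeric values are placeholders; only the block structure is illustrated):

```python
import numpy as np

# Toy 4-component variable vector; values are placeholders.
Y = np.array([1.0, 2.0, 3.0, 4.0])

# Permutation matrices extracting the top (boundary-condition) and
# bottom (boundary-observation) halves of the sorted vector, cf. Eq. (20).
P1 = np.array([[1, 0, 0, 0],
               [0, 1, 0, 0]])
P2 = np.array([[0, 0, 1, 0],
               [0, 0, 0, 1]])

Y1 = P1 @ Y   # entries assigned to boundary conditions
Y2 = P2 @ Y   # entries assigned to boundary observations

# The simple selection matrices of Eqs. (24)-(25):
Fs = np.block([[np.eye(2), np.zeros((2, 2))],
               [np.zeros((2, 2)), np.zeros((2, 2))]])
Gs = np.eye(4) - Fs
assert np.allclose(Fs + Gs, np.eye(4))   # Fs^H + Gs^H = I, Eq. (25)

print(Fs @ Y)   # keeps the top half: structure of Eq. (26)
print(Gs @ Y)   # keeps the bottom half
```

Applying Fs keeps only the entries tied to boundary conditions and zeroes the rest, which is exactly the block pattern of Eq. (26).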
4.2.2. Simple and Complex Boundary Observations

Similar to the boundary conditions, the boundary observations of the simple and complex case can also be connected. The goal is to express the complex boundary observations Yco in terms of the simple boundary values Φs1 and the simple boundary observations Yso. Starting with Eq. (33) and exploiting the block structure of matrices and vectors, together with (27), leads to

    Yco(x, s) = M21(x, s) Φc1(x, s) + M22 Yso(x, s),   (38)

    Φ̄(s) = −B̂ K̂ Ȳ(s) + Bc Φ̂c(s),   (44)

with the block matrices

    K̂ = −[M12(0, s) C2(0); M12(ℓ, s) C2(ℓ)],   Φ̂c(s) = [Φc1(0, s); Φc1(ℓ, s)],   (45)

    Bc = [B1(0) M11(0, s)   −B1(ℓ) M11(ℓ, s)].   (46)
5.3. Relation to Control Theory

The modification of the open-loop behaviour by feedback as introduced in Sec. 5.2 is well known in the literature on control theory [8, 10]. The fact that system interconnection constitutes a feedback control loop is highlighted e.g. in [10, Fig. 15]. The notation in (48) has been chosen to reflect the description of closed-loop systems in e.g. [9, Eq. (19)].

6. STRING MODEL

This section presents a multidimensional transfer function model of a guitar string based on [7]. The derivations are omitted for brevity; instead, the results and especially the boundary conditions from [7] are rewritten to fit the input/output model from Sec. 4.

6.1. Physical Description

A single vibrating string can be described by a PDE [1, Eq. (6)] in terms of the deflection y = y(x, t) depending on time t and space x on the string

    ρA ÿ + EI y'''' − Ts y'' + d1 ẏ − d3 ẏ'' = fe,   (49)

where ẏ represents the time derivative and y' the space derivative, and with the cross-section area A, moment of inertia I and the length ℓ. The material is characterized by the density ρ and Young's modulus E. Ts describes the tension, and d1 and d3 are the frequency-independent and frequency-dependent damping [1]. The space- and time-dependent excitation function is defined as fe = fe(x, t).

The PDE in (49) can be rearranged in the vector form (1) as shown in [7]. To fit the vector of variables into the input/output model (20), the vector is rearranged with suitable permutation matrices. Therefore the vector of variables defined in [7] can be partitioned

    Y1(x, s) = [sY(x, s); Y''(x, s)],   Y2(x, s) = [Y'(x, s); Y'''(x, s)].   (50)

6.2. Simple Boundary Conditions

A set of simple boundary conditions for a string fixed at both ends x = 0, ℓ is formulated as a set of equations for the deflection and bending moment

    sY(x, s) = Φ1(x, s),   Y''(x, s) = Φ2(x, s),   x = 0, ℓ.   (51)

This set of linear equations can be formulated in terms of a matrix of boundary conditions and a vector of boundary excitations according to (2) as shown in [7]. From here on, boundary conditions of the first kind are called simple boundary conditions, and they are formulated as described in Sec. 4.1.1 in Eq. (24) with

    Φs1(x, s) = [Φ1(x, s); Φ2(x, s)],   x = 0, ℓ.   (52)

6.3. Complex Boundary Conditions

As a set of complex boundary conditions, impedance boundary conditions are used at x = 0 according to [7]. They are formulated as a set of linear equations, combining the string deflection Y with a force F by a frequency-dependent admittance Yb(s)

    sY(0, s) − Yb(s) F(0, s) = Φc1,   Y''(0, s) = Φc2.   (53)

The force can be expressed as a combination of two forces reformulated in terms of the derivatives of the string deflection [13]

    F(0, s) = Ts Y'(0, s) − EI Y'''(0, s).   (54)

The boundary conditions are rewritten as shown in Sec. 4.1.2 to fit into the input/output model, with the matrices of boundary conditions

    Fc1^H(0, s) = I,   Fc2^H(0, s) = Yb(s) [−Ts  EI; 0  0],   (55)

and the vector of boundary excitations

    Φc1(0, s) = [Φc1(0, s); Φc2(0, s)].   (56)

For the position x = ℓ, the simple boundary conditions from Sec. 6.2 are applied. It then follows for the matrix of boundary conditions at x = ℓ

    Fc1^H(ℓ, s) = I,   Fc2^H(ℓ, s) = 0,   (57)

and for the boundary observations

    Φc1(ℓ, s) = Φs1(ℓ, s).   (58)

6.4. Kernel Functions

To fit into the input/output model from Sec. 4, the kernel functions from [7] also have to be rearranged. For the primal kernel function, it follows according to (21) with suitable permutations

    K1(x, µ) = [(sµ/γµ) sin(γµ x); −γµ sin(γµ x)],   K2(x, µ) = [cos(γµ x); −γµ² cos(γµ x)],   (59)

where K1 is assigned to the boundary conditions and K2 to the boundary observations. The same partitioning is applied to the adjoint kernel function

    K̃1(x, µ) = [q1* cos(γµ x); −γµ² cos(γµ x)],   K̃2(x, µ) = [−(sµ* q1*/γµ) sin(γµ x); γµ sin(γµ x)],   (60)

with the wave numbers γµ and the coefficient q1 defined in [7].

6.5. Modified State-Space Model

For the synthesis of the string model (49), normally Eq. (4) can be used as a superposition of first-order systems. Alternatively, a state-space model can be set up for the synthesis (see Sec. 3). Here, the modified state-space model from Sec. 5 is applied for the incorporation of complex boundary conditions (see Sec. 6.3).

6.5.1. Output Equation

The output equation is designed according to Eq. (16) with the output matrix C(x) from Eq. (17), which is divided according to (22). For the design of the matrices C1 and C2, the kernel functions (59), (60) and the scaling factor Nµ (see [7, Eq. (15)]) are necessary.
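For illustration, the eigenvalues sµ entering the diagonal state matrix can be sketched for the string PDE (49) using the separable ansatz y ~ e^{st} sin(γµ x) with γµ = µπ/ℓ for a string fixed at both ends. The parameter values below are rough assumed values for a nylon string, not the measured parameters from [6, 15, 16]:

```python
import numpy as np

# Rough, assumed physical parameters of a nylon guitar string.
rho_A = 5.0e-3    # mass per unit length rho*A in kg/m
E_I   = 1.0e-4    # bending stiffness E*I in N*m^2
T_s   = 60.0      # tension in N
d1    = 1.0e-2    # frequency-independent damping
d3    = 1.0e-4    # frequency-dependent damping
ell   = 0.65      # string length in m

def eigenvalues(mu):
    """Eigenvalues s_mu of PDE (49) for mode mu, from inserting
    y ~ exp(s t) sin(gamma_mu x), gamma_mu = mu*pi/ell, into (49):
    rho_A s^2 + (d1 + d3 gamma^2) s + (E_I gamma^4 + T_s gamma^2) = 0."""
    gamma = mu * np.pi / ell
    return np.roots([rho_A,
                     d1 + d3 * gamma**2,
                     E_I * gamma**4 + T_s * gamma**2])

for mu in (1, 2, 3):
    s = eigenvalues(mu)[0]
    print(f"mode {mu}: f = {abs(s.imag) / (2 * np.pi):7.1f} Hz, decay {s.real:8.2f} 1/s")
```

With these assumed values the fundamental lands in the vicinity of a low E string; the negative real parts are the modal decay rates produced by the two damping terms.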
The state equation for the synthesis of the string model is designed according to Eq. (47). The modified state feedback matrix (48) is constructed via the state matrix A using the eigenfrequencies as … With a state equation designed in this form, the complex boundary conditions are incorporated into a string model designed with simple boundary conditions, as shown in Fig. 1.

7. BRIDGE MODEL

As a realistic boundary circuit for the impedance boundary conditions (53), a frequency-dependent bridge model is chosen as discussed in [6]. A bridge model resembles the influence of a guitar bridge, where the string is attached to the bridge at the position x = 0.

For simplicity, the bridge admittance is designed as the superposition of N second-order bandpass filters

    Yb(s) = Σ_{n=1}^{N} Yn (ω0n/Qn) s / (s² + (ω0n/Qn) s + ω0n²),   (62)

with the quality factor Qn and the resonance frequency ω0n = 2πfn for each mode. An additional gain factor Yn is added to each bandpass. In general, the admittance functions for the bridge are position- and direction-dependent, so the admittance differs for different strings [6].

The physical parameters for the bridge resonances can be taken from measurements as e.g. shown in [14]. There, the lowest six most significant eigenmodes of the bridge are realized with damped sinusoids.

Figure 2 shows the first six prominent modes of the bridge admittance Yb(s) in the frequency domain. The resonance frequencies ω0n and the quality factors Qn are taken from [14]. The gain factors are related to the effective masses, Yn = 1/mn, as presented in [6], with the physical values for the effective masses from [14].

For simulations in the continuous frequency domain, the admittance function from Eq. (62) is directly used. The function is just inserted into the boundary conditions (55) and incorporated into the modified state-space model described in Sec. 6.5.

8. SIMULATION RESULTS

The following section shows the simulation results for a guitar string including a boundary circuit. All simulation results are computed using the modified state-space model as shown in Sec. 6.5, with the state equation (47) and the output equation (16).

All calculations are performed in the continuous frequency domain, and the results are presented in terms of the amplitude spectra of the bending force Fbend(x, s) in y-direction in the lower frequency range. The bending force is calculated according to

    Fbend(x, s) = EI Y'''(x, s),   (63)

where several valid simplifications were applied [3, 13].

8.1. Boundary Conditions and Excitation Function

The complex boundary conditions are chosen for the frequency-domain simulations of the string according to Sec. 6.3. The boundary values Φc1, Φc2 of the complex boundary conditions are set to zero for brevity. The admittance Yb(s) is interpreted as the admittance of the guitar bridge on which the string is placed. The admittance is varied for the following simulations.

The function fe(x, t) exciting the string is a single impulse at the position xe = 0.3 m on the string, as shown in [7]. In the frequency domain, this excitation leads to a uniform excitation of all frequency components. The simulation of output variables in response to the excitation function can be seen as an impedance analysis of the string.

For all simulations, a nylon guitar low E-string is used. The values for the physical parameters in Eq. (49) are taken from [6, 15, 16].

8.2. Frequency-Independent Bridge Impedance

In a first step, the simulation is performed for a frequency-independent constant bridge admittance Yb(s) = const. Figure 3 shows the normalized amplitude spectra of the bending force |Fbend(0, s)| at the bridge position x = 0 for the lower frequency range. The bridge admittance is varied: Yb(s) = 0 s/kg, 2 s/kg, 5 s/kg.
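The bandpass superposition of Eq. (62) is straightforward to evaluate on the imaginary axis s = j2πf. The resonance frequencies, quality factors, and effective masses below are placeholder values, not the measured bridge data of [14]:

```python
import numpy as np

# Placeholder bridge-mode parameters (assumed, not from [14]).
f_n = np.array([95.0, 185.0, 300.0, 420.0, 520.0, 640.0])        # resonance freqs in Hz
Q_n = np.array([25.0, 30.0, 20.0, 25.0, 15.0, 20.0])             # quality factors
Y_n = np.array([1/0.1, 1/0.15, 1/0.2, 1/0.3, 1/0.25, 1/0.4])     # gains Y_n = 1/m_n

def Y_b(s):
    """Superposition of N second-order bandpass filters, Eq. (62)."""
    w_n = 2 * np.pi * f_n
    s = np.atleast_1d(s)[:, None]                 # broadcast over the N modes
    H = (w_n / Q_n) * s / (s**2 + (w_n / Q_n) * s + w_n**2)
    return (Y_n * H).sum(axis=1)

f = np.linspace(20, 800, 1000)
Y = Y_b(1j * 2 * np.pi * f)
print(f[np.argmax(np.abs(Y))])   # peak near the strongest bridge resonance
```

At each resonance the corresponding bandpass contributes its full gain Yn, so the mode with the smallest effective mass dominates the admittance magnitude.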
For zero admittance (solid curve in Fig. 3), the modes of the guitar string can be seen clearly at the fundamental frequency and the higher modes. The behaviour complies with that of a string model with simple boundary conditions, as the feedback path in the modified state matrix (48) is zero. The results for a string with simple boundary conditions are confirmed in [5].

For increasing bridge admittance (dotted and dashed curves in Fig. 3), all modes of the guitar string are damped equally. These results are not realistic, as a real bridge impedance is strongly frequency-dependent. Nevertheless, the results for a frequency-independent admittance show the validity of the presented feedback-structure concepts for a generic example, and the results of [7] are verified.

Figure 3: Absolute value of the spectrum of bending force Fbend(0, s) at the bridge position x = 0, for three different frequency-independent bridge admittances Yb = const.

Fig. 5 shows the influence of the bridge admittance Yb(s) on the bending force Fbend(xe, s) according to Eq. (63) at the excitation position x = xe. It can be seen that in the frequency-independent case (dotted line) all string resonances are damped in the same way. For a frequency-dependent bridge admittance (dashed line), the modes are influenced according to the shape of the bridge admittance.

Figure 4: Absolute value of the spectrum of bending force Fbend(0, s) at the bridge position x = 0 for zero bridge admittance (solid line) and for a frequency-dependent bridge admittance according to Eq. (62) in Sec. 7 (dotted line). The dashed black line is an amplitude-shifted version of the bridge admittance Yb(s), plotted for illustration.
ABSTRACT
This paper presents recent developments of the EVERTims project,
an auralization framework for virtual acoustics and real-time room
acoustic simulation. The EVERTims framework relies on three
independent components: a scene graph editor, a room acoustic
modeler, and a spatial audio renderer for auralization. The frame-
work was first published and detailed in [1, 2]. Recent develop-
ments presented here concern the complete re-design of the scene
graph editor unit, and the C++ implementation of a new spatial
renderer based on the JUCE framework. EVERTims now func-
tions as a Blender add-on to support real-time auralization of any
3D room model, both for its creation in Blender and its exploration
in the Blender Game Engine. The EVERTims framework is published as open source software: http://evertims.ircam.fr.
1. INTRODUCTION
Auralization is “the process of rendering audible, by physical or mathematical modeling, the sound field of a source in space [. . . ] at a given position in the modeled space” [3]. Room acoustic auralization generally relies on Room Impulse Responses (RIR), either measured or synthesized. An RIR represents the spatio-temporal distribution of reflections in a given room upon propagation from an emitter (source) to a receiver (listener). Fig. 1 details several conceptual components of an RIR.

Figure 1: Theoretical plot (not generated in EVERTims) illustrating the conceptual components of a RIR: direct path, early reflections, and late reverberation. EVERTims relies on an image source model to simulate the early reflections, and an FDN to generate the late reverberation.
Auralization has some history of being used in architectural design, e.g. for the subjective evaluation of an acoustic space during the early stage of its conception [4]. It also serves for perceptually evaluating the impact of potential renovation designs on existing spaces. Auralization is often used as a supplement to objective parameter metrics [5, 6, 7]. While previously limited to industry and laboratory settings, auralization is now receiving particular attention due to the emergence of Virtual Reality (VR) technologies [8, 9]. Lately, most research has been driven by the need for (1) real-time optimization and (2) accuracy and realism of the simulated acoustics [10, 11, 12], where these two goals drive developments in opposing directions. The phenomenon is similar to the race for real-time lighting engines and the advent of programmable shaders [13].

An overview of the many available techniques for simulating the acoustics of a given geometry has been presented in [4, 10, 14], detailing both ray-based [15, 16] and wave-based techniques [17, 18]. Ray-based techniques are based on a high-frequency approximation of acoustic propagation as geometric rays, ignoring diffraction and other wave effects. They usually outperform wave-based techniques regarding calculation speed.

A further distinction is made here between ray-based techniques, referring to the family of methods based on the geometrical acoustics hypothesis, and ray-tracing techniques [19], one of its subfamilies. As detailed in [2], the EVERTims room acoustic modeler relies on an image source technique [16] (ray-based). The implementation is based on a beam tracing algorithm [2], focused on real-time estimation of specular early reflections.

A number of geometrical acoustic simulation programs with auralization capabilities have been developed, either tailored for architectural acoustics [20, 21, 22] or acoustic game design [23, 24], e.g.: 3DCeption Spatial Workstation¹, DearVR², and VisiSonics RealSpace 3D³ (as most of these programs are closed-source, the authors cannot guarantee that they are strictly based on geometrical acoustic models). The level of realistic detail achieved is typically inversely proportional to the real-time capabilities. The ambition in developing EVERTims has been to combine the accuracy of the former and the performance of the latter, much like the approach of the RAVEN [25] based plugin presented in [22].

¹ Two Big Ears website: www.twobigears.com/spatworks
² DearVR website: www.dearvr.com
³ VisiSonics website: www.visisonics.com
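The image source principle underlying the EVERTims modeler can be illustrated with a minimal first-order sketch: mirror the source across a wall plane, and the reflection path length equals the distance from the image to the listener. (The actual implementation is a beam tracer with visibility tests; the geometry below is made up.)

```python
import numpy as np

def image_source(source, plane_point, plane_normal):
    """Mirror a source position across a wall plane."""
    n = plane_normal / np.linalg.norm(plane_normal)
    d = np.dot(source - plane_point, n)
    return source - 2 * d * n

source   = np.array([1.0, 2.0, 1.5])
listener = np.array([4.0, 1.0, 1.5])

# Wall: the floor plane z = 0 with upward normal
img = image_source(source, np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))

# The specular reflection path length equals the image-listener distance;
# dividing by the speed of sound gives that early reflection's delay.
path_len = np.linalg.norm(img - listener)
delay_s = path_len / 343.0
print(img, path_len, delay_s)
```

Higher reflection orders mirror each image source recursively across the remaining walls, which is exactly the beam-tree that EVERTims refines iteratively.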
It does not claim to outperform any of the above-mentioned software, but rather to provide a reliable, simple, and extensible tool for acousticians, researchers, and sound designers.

The main contribution of this paper is to present the evolution of the EVERTims project since its first publication in [2]. Recent developments include the implementation of a standalone auralization rendering unit and a Blender add-on for controlling the auralization process. Sec. 3 introduces the Blender add-on, which replaces the previously implemented VirChor scene graph editor. Only minor revisions have been made to the room acoustic modeler, which is briefly described in Sec. 4. The implementation of the new auralization and spatial audio rendering unit is described in Sec. 5. Sec. 6 summarizes the current state of the framework and outlines future developments.

2. FRAMEWORK OVERVIEW

Fig. 2 shows the functional units of the EVERTims framework. The scene graph editor handles the room geometry and the acoustic properties of the wall surface materials, referred to as the room's Geometrical Acoustic (GA) model, as well as the user interaction. This editor, implemented as a Blender add-on, provides a Graphical User Interface (GUI) from which the room acoustic modeler, the auralization unit, and the real-time auralization process (i.e. initiating the data exchange between the functional units) can be started. The room acoustic modeler runs as a background subprocess, while the auralization unit runs as a standalone application.

In order to support the interchangeability of algorithms, the main functional units are fully encapsulated and self-contained. The communication and data exchange is based on the Open Sound Control (OSC) protocol [26]. This allows the distribution of subtasks to different computers, or running the entire environment on a single computer. During the auralization, the Blender add-on streams GA model information, along with listener and source positions, to the acoustic modeler. Based on these data, the acoustic modeler builds a list of image sources, corresponding to a minimum reflection order defined at start-up (see Sec. 4). These image sources, along with an estimated reverberation time for the current GA model, are then sent to the auralization unit. The Blender add-on also streams source and listener transformation matrices to the auralization unit. Based on the data issued from both the add-on and the modeler, the auralization unit generates and spatialises image sources as early reflections, and constructs the late reverberation. An additional callback can be initiated by the add-on to display the results of the acoustic modeler simulation as reflection paths drawn in the 3D scene for monitoring.

EVERTims uses an iterative refinement procedure and computes higher reflection orders progressively. Whenever there is a change in the geometry or a sound source position, an approximate beam-tree up to the minimum reflection order is computed. The visible paths are then sent to the auralization unit. When there is no change, it continues to compute the solution up to the next reflection order until the chosen maximum order is reached. When a listener moves, visibility tests are computed for all the paths in the beam-trees for that listener, and changes are sent to the auralization unit. Changes in source/listener orientation do not affect the beam-tree and the visibility of paths; there is no need to perform any recalculation. A change to the geometry or the source position is computationally expensive, as it requires a new computation of the underlying beam tree, reconstructing the beam-tree up to the minimum order, and beginning the iterative order refinement of the solution once again.

For architects and acousticians, the core feature of the framework is to provide both a reactive auralization during room design and exploration, with an interactive rendering based on the actual geometry of the space, and an accurate auralization for final room acoustics assessment. The same framework allows for integration into game engines for real-time applications. At any time, intermediate rendering results can be exported as an RIR in different audio formats (e.g., binaural, Ambisonic), suitable for use in any convolution-based audio renderer.

3. SCENE GRAPH EDITOR AND 3D VISUALIZATION

The EVERTims scene graph editor and 3D visualizations are handled in Blender. The add-on is written in Python and integrated in the Blender 3D View Toolbox. The approach is similar to the one proposed in [27]. The add-on itself handles EVERTims-specific data (GA, assets import, OSC setup, etc.) and processes. To start the auralization, the add-on requires four specific EVERTims elements: room, source, listener, and logic. These elements can either be defined from existing objects in the scene, or imported from an assets library packaged with the add-on.

The source and listener can be attached to any object in the 3D scene. The room can be defined from any mesh geometry; the add-on does not apply any checks on the room geometry (e.g., convexity, closed form), allowing EVERTims to work for open or sparse scenes as well as closed spaces. An acoustic material must be assigned to each room mesh face. The wall surface materials are defined in the EVERTims materials library, imported from the assets pack, and shared with the room acoustic modeler unit (see Sec. 4). Each acoustic material is defined by a set of absorption coefficients (10 octave bands: 32 Hz–16 kHz). The materials library consists of a human-readable text file which is easy to modify and extend.

The logic object handles the parameters of the Blender add-on when EVERTims is running in game engine mode for real-time auralization. This mode uses the Blender Game Engine for building virtual walkthroughs. Its counterpart, the EVERTims edit mode, auralizes the room model in real-time during design and creation.
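Since all inter-unit communication uses OSC, each exchanged message is simply an address pattern, a type-tag string, and big-endian binary arguments. A minimal stdlib-only sketch of the wire format follows; the address shown is hypothetical, not an actual EVERTims message, and a real client would typically use a library such as python-osc instead:

```python
import struct

def osc_message(address, *floats):
    """Encode a minimal OSC message carrying float32 arguments.

    OSC strings are null-terminated and zero-padded to a multiple of
    4 bytes; float arguments are big-endian IEEE 754. This sketches the
    wire format only -- the actual EVERTims addresses and payload layouts
    are defined by the framework's exchange protocol.
    """
    def pad(b):
        return b + b"\x00" * (4 - len(b) % 4)

    msg = pad(address.encode("ascii"))
    msg += pad(("," + "f" * len(floats)).encode("ascii"))
    for x in floats:
        msg += struct.pack(">f", x)
    return msg

# Hypothetical listener-position update, as streamed by the Blender add-on
packet = osc_message("/listener/position", 1.0, 0.5, 1.7)
print(len(packet))
```

Such datagrams can be sent over UDP, which is what makes distributing the functional units across machines straightforward.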
scenarios, the audio tap of the direct path is handled by a dedicated binaural encoder rather than by the Ambisonic processing unit. The objective of this direct encoding, already applied in the previous version of the auralization unit [2], is to further increase the accuracy of the perceived sound source location. Both decoding schemes are based on filters dynamically loaded from a SimpleFreeFieldHRIR SOFA file. The interpolation of incomplete HRIR measurement grids is not yet supported by the auralization unit, being handled via Matlab routines at the moment.

While the acoustic modeler efficiently computes high reflection orders for static geometries, it uses a “low reflection order mode” for dynamic geometries. The auralization unit then simulates the late reverberation using a Feedback Delay Network (FDN) [33]. The current FDN implementation uses 16 feedback channels with mutually prime delay times, and thus assures a high modal density in all frequency bands to avoid any “ringing tone” effect. The FDN input, output, and feedback matrices are tailored to limit inter-channel correlation and optimize network density [34]. The delays and gains are defined following the approach taken in [34], to match the frequency-specific room response time estimated by the room acoustic modeler. For spatial audio playback, the FDN outputs are encoded into 1st-order Ambisonics. The 16 decorrelated output signals are mapped to the four 1st-order Ambisonics channels (e.g., the outputs 0, 4, 8, and 12 are summed to Ambisonic channel 0) to simulate diffuse late reverberation. Higher-order Ambisonics encoding for high-resolution binaural playback is planned for future updates.

The simulated RIRs can be exported as an audio file in binaural or Ambisonic formats. In addition, all auralization parameters and results (source, listener, and image source positions, image source absorption coefficients, etc.) can be exported as a text file for use in other programs.

6. CONCLUSION

This paper presented the latest developments of the EVERTims framework. The aim of the underlying project is to design an accurate auralization tool to support research and room acoustic design. Besides accuracy, the framework is focused on real-time rendering for the auralization of dynamic environments. Its integration as a Blender add-on allows for real-time assessment of room acoustics during its creation.

Each unit of the EVERTims framework is available as an open source project. These units can be used separately for integration in any other framework. Sources, tutorials, recordings and exchange protocol information are available on the EVERTims website: http://evertims.ircam.fr.

Latest developments mainly concerned the re-design of the EVERTims framework interface and the integration of its different components. The novel Blender add-on simplifies the control of the overall auralization simulation, providing end-users with a state-of-the-art mesh editing tool in the process. The implementation of the auralization and spatial audio rendering unit as a JUCE C++ graphical application, rather than a PureData patch, should simplify future maintenance and cross-platform compatibility. The new auralization unit now supports the SOFA format to import either HRIR or source directivity patterns.

Foreseen short-term developments include multi-source and multi-listener integration along with cross-platform support. The acoustic modeler unit already supports multiple sources and listeners; the feature only lacks its counterpart in the auralization unit. The auralization engine is available as a cross-platform application, thanks to the versatility of the JUCE framework. Only minor modifications to the Blender add-on will be required to support both features. Once the multi-source/listener support is implemented, a performance comparison of the framework against existing auralization solutions will be conducted. This comparison will concern both auralization accuracy and efficiency, following that presented in [2]. The assessment will be conducted on the ESPRO model (available from the EVERTims website), a 3D reproduction of the IRCAM projection space [35].

The Sabine-based statistical model used for estimation of the room response decay time will be replaced by an Eyring-based model [36], or by other more modern methods taking into account some simple geometrical room properties [37]. The estimation will focus on early decay times, rather than on the current RT60 room reverberation time, as these are more relevant from a perceptual point of view. Spatial Impulse Response Rendering [38] or DirAC [39] methods will be considered for the encoding of the diffuse field generated from FDN outputs.

A graphical editor to handle acoustic material creation and modification will be added to the add-on, along with an editor for

…a calibrated model of Notre-Dame Cathedral,” in Euroregio, 2016, pp. 6:1–10.

[9] Brian FG Katz, David Poirier-Quinot, and Jean-Marc Lyzwa, “Interactive production of the ‘virtual concert in Notre-Dame’,” in 19th Forum Intl du Son Multicanal, 2016.

[10] Michael Vorländer, Auralization: Fundamentals of Acoustics, Modelling, Simulation, Algorithms and Acoustic Virtual Reality, Springer, 2008.

[11] Lauri Savioja, “Real-time 3D finite-difference time-domain simulation of low- and mid-frequency room acoustics,” in Intl Conf on Digital Audio Effects, 2010, vol. 1, pp. 77–84.

[12] Dirk Schröder, Physically Based Real-Time Auralization of Interactive Virtual Environments, vol. 11, Logos Verlag Berlin GmbH, 2011.
source directivity. [13] Tomas Akenine-Möller, Eric Haines, and Naty Hoffman,
The impact of room temperature, humidity, etc. on frequency- Real-time rendering, CRC Press, 2008.
specific air absorption is not implemented at the moment (propa- [14] Lauri Savioja and U Peter Svensson, “Overview of geomet-
gation damping in Fig. 4 is based on the inverse propagation law rical room acoustic modeling techniques,” J Acous Soc Am,
only). This feature will soon be added to the auralization unit, vol. 138, no. 2, pp. 708–730, 2015.
applied to both image sources and FDN absorption gains.
The impact of surface scattering on image sources propagation [15] Asbjørn Krokstad, Staffan Strom, and Svein Sørsdal, “Cal-
will be integrated into both modeler and auralization units, based culating the acoustical room response by the use of a ray trac-
on an hybrid approach similar to that suggested in [40]. The Bi- ing technique,” J Sound and Vib, vol. 8, no. 1, pp. 118–125,
nary Space Partitioning used in the room acoustic modeler unit to 1968.
handle the room geometry [2] will be replaced by a Spatial Hash- [16] Jont B Allen and David A Berkley, “Image method for effi-
ing implementation [41], optimized for dynamic geometry updates ciently simulating small-room acoustics,” J Acous Soc Am,
(see [42] and Fig. 3 of [43]). vol. 65, no. 4, pp. 943–950, 1979.
[17] Brian Hamilton, Finite Difference and Finite Volume Meth-
7. REFERENCES ods for Wave-based Modelling of Room Acoustics, Ph.D.
thesis, Univ of Edinburgh, 2016.
[1] Markus Noisternig, Lauri Savioja, and Brian FG Katz, [18] Stefan Bilbao and Brian Hamilton, “Wave-based room
“Real-time auralization system based on beam-tracing and acoustics simulation: Explicit/implicit finite volume model-
mixed-order ambisonics,” J Acous Soc Am, vol. 123, no. 5, ing of viscothermal losses and frequency-dependent bound-
pp. 3935–3935, 2008. aries,” J Audio Eng Soc, vol. 65, no. 1/2, pp. 78–89, 2017.
[2] Markus Noisternig, Brian FG Katz, Samuel Siltanen, and [19] K Heinrich Kuttruff, “Auralization of impulse responses
Lauri Savioja, “Framework for real-time auralization in ar- modeled on the basis of ray-tracing results,” J Audio Eng
chitectural acoustics,” Acta Acustica United with Acustica, Soc, vol. 41, no. 11, pp. 876–880, 1993.
vol. 94, no. 6, pp. 1000–1015, 2008.
[20] Bengt-Inge L Dalenbäck, “Room acoustic prediction based
[3] Mendel Kleiner, Bengt-Inge Dalenbäck, and Peter Svensson, on a unified treatment of diffuse and specular reflection,” J
“Auralization-an overview,” J Audio Eng Soc, vol. 41, no. Acous Soc Am, vol. 100, no. 2, pp. 899–909, 1996.
11, pp. 861–875, 1993.
[21] Graham M Naylor, “Odeon, another hybrid room acoustical
[4] Peter Svensson and Ulf R Kristiansen, “Computational mod- model,” Applied Acoustics, vol. 38, no. 2-4, pp. 131–143,
elling and simulation of acoustic spaces,” in Audio Eng Soc 1993.
Conf 22, 2002, pp. 11–30.
[22] Lukas Aspöck, Sönke Pelzer, Frank Wefers, and Michael
[5] John S Bradley, “Review of objective room acoustics mea- Vorländer, “A real-time auralization plugin for architectural
sures and future needs,” Applied Acoustics, vol. 72, no. 10, design and education,” in Proc of the EAA Joint Symp on
pp. 713–720, 2011. Auralization and Ambisonics, 2014, pp. 156–161.
[6] Brian FG Katz and Eckhard Kahle, “Design of the new opera [23] Regis Faria and Joao Zuffo, “An auralization engine adapting
house of the Suzhou Science & Arts Cultural Center,” in a 3D image source acoustic model to an Ambisonics coder
Western Pacific Acoustics Conf, 2006, pp. 1–8. for immersive virtual reality,” in Audio Eng Soc Conf: Intl
[7] Brian FG Katz and Eckhard Kahle, “Auditorium of the Mor- Conf 28, 2006, pp. 1–4.
gan library, computer aided design and post-construction re- [24] Wolfgang Ahnert and Rainer Feistel, “Ears auralization soft-
sults,” in Intl Conf on Auditorium Acoustics, 2008, vol. 30, ware,” in Audio Eng Soc Conf 12, 1992, pp. 894–904.
pp. 123–130.
[25] Dirk Schröder and Michael Vorländer, “RAVEN: A real-time
[8] Barteld NJ Postma, David Poirier-Quinot, Julie Meyer, and framework for the auralization of interactive virtual environ-
Brian FG Katz, “Virtual reality performance auralization in ments,” in Forum Acusticum, 2011, pp. 1541–1546.
[26] Matthew Wright, Adrian Freed, et al., “Open SoundControl: A new protocol for communicating with sound synthesizers,” in Intl Computer Music Conf, 1997, pp. 1–4.

[27] Jelle van Mourik and Damian Murphy, “Geometric and wave-based acoustic modelling using Blender,” in Audio Eng Soc Conf: Intl Conf 49, 2013, pp. 2–9.

[28] Wallace Clement Sabine and M David Egan, “Collected papers on acoustics,” J Acous Soc Am, vol. 95, no. 6, pp. 3679–3680, 1994.

[29] Yukio Iwaya, Kanji Watanabe, Piotr Majdak, Markus Noisternig, Yoiti Suzuki, Shuichi Sakamoto, and Shouichi Takane, “Spatially oriented format for acoustics (SOFA) for exchange data of head-related transfer functions,” in Proc Inst of Elec, Info, and Comm Eng of Japan, 2014, pp. 19–23.

[30] Piotr Majdak, Yukio Iwaya, Thibaut Carpentier, Rozenn Nicol, Matthieu Parmentier, Agnieska Roginska, Yôiti Suzuki, Kanji Watanabe, Hagen Wierstorf, Harald Ziegelwanger, and Markus Noisternig, “Spatially Oriented Format for Acoustics: A Data Exchange Format Representing Head-Related Transfer Functions,” in Audio Eng Soc Conv 134, 2013, pp. 1–11.

[31] Piotr Majdak and Markus Noisternig, “AES69-2015: AES standard for file exchange - Spatial acoustic data file format,” Audio Eng Soc, 2015.

[32] Markus Noisternig, Thomas Musil, Alois Sontacchi, and Robert Holdrich, “3D binaural sound reproduction using a virtual Ambisonic approach,” in Intl Symp on Virtual Environments, Human-Computer Interfaces and Measurement Systems, IEEE, 2003, pp. 174–178.

[33] Manfred R Schroeder, “Natural sounding artificial reverberation,” J Audio Eng Soc, vol. 10, no. 3, pp. 219–223, 1962.

[34] Jean-Marc Jot and Antoine Chaigne, “Digital delay networks for designing artificial reverberators,” in Audio Eng Soc Conv 90, 1991, vol. 39, pp. 383–399.

[35] Markus Noisternig, Thibaut Carpentier, and Olivier Warusfel, “ESPRO 2.0 – Implementation of a surrounding 350-loudspeaker array for 3D sound field reproduction,” in Audio Eng Soc Conf 25, 2012, pp. 1–6.

[36] Carl F Eyring, “Reverberation time in “dead” rooms,” J Acous Soc Am, vol. 1, no. 2A, pp. 217–241, 1930.

[37] J Kang and RO Neubauer, “Predicting reverberation time: Comparison between analytic formulae and computer simulation,” in Proc of the 17th Intl Conf on Acoustics, 2001, vol. 7, pp. 1–2.

[38] Juha Merimaa and Ville Pulkki, “Spatial impulse response rendering I: Analysis and synthesis,” J Audio Eng Soc, vol. 53, no. 12, pp. 1115–1127, 2005.

[39] Ville Pulkki, “Spatial sound reproduction with directional audio coding,” J Audio Eng Soc, vol. 55, no. 6, pp. 503–516, 2007.

[40] Jens Holger Rindel and Claus Lynge Christensen, “Room acoustic simulation and auralization – how close can we get to the real room,” in Proc 8th Western Pacific Acoustics Conf, 2003, pp. 1–8.

[41] Matthias Teschner, Bruno Heidelberger, Matthias Müller, Danat Pomerantes, and Markus H Gross, “Optimized spatial hashing for collision detection of deformable objects,” in Vision Modeling and Visualization, 2003, vol. 3, pp. 47–54.

[42] Dirk Schröder, Alexander Ryba, and Michael Vorländer, “Spatial data structures for dynamic acoustic virtual reality,” in Proc of the 20th Intl Conf on Acoustics, 2010, pp. 1–6.

[43] Sönke Pelzer, Lukas Aspöck, Dirk Schröder, and Michael Vorländer, “Interactive real-time simulation and auralization for modifiable rooms,” Building Acoustics, vol. 21, no. 1, pp. 65–73, 2014.
ABSTRACT

This paper presents an experiment comparing player performance in a gamified localisation task between three loudspeaker configurations: stereo, 7.1 surround-sound and an equidistantly spaced octagonal array. The test was designed as a step towards determining whether spatialised game audio can improve player performance in a video game, thus influencing their overall experience. The game required players to find as many sound sources as possible, by using only sonic cues, in a 3D virtual game environment. Results suggest that the task was significantly easier when listening over a 7.1 surround-sound system, based on feedback from 24 participants. 7.1 was also the most preferred of the three listening conditions. The result was not entirely expected in that the octagonal array did not outperform 7.1. It is thought that, for the given stimuli, this may be a repercussion of the octagonal array sacrificing an optimal front stereo pair for more consistent imaging all around the listening space.

1. INTRODUCTION

As an entertainment medium, interactive video games are well suited to the benefits of spatial audio. Spatialised sound cues can be used to fully envelop the player in audio, creating immersive and dynamic soundscapes that contribute to more engaging gameplay experiences. In addition to this, an international survey carried out by Goodwin [1] in 2009 suggests that video game players consider spatialised sound to be an important factor of a game experience. At the time of writing, the majority of video game content is able to output multichannel audio conforming to home theater loudspeaker configurations, such as 5.1 and 7.1 surround-sound. More recently, Dolby Atmos has been employed in a handful of titles such as Star Wars Battlefront (2015) [2] and Blizzard's Overwatch (2016) [3].

A potential benefit arises when considering the influence a multichannel listening system may have on a player's performance in the game world. If a player is able to better determine the location of an in-game event, such as the position of a narrative-progressing item or a potential threat, will this inform their gameplay decisions? From this concept, a novel idea for a comparative listening test was derived, in an attempt to assess the influence a spatial rendering system may have on a player's ability to localise a sound source in an interactive game environment. The test presented in this paper was based on the gamification of a simple localisation task, designed according to three core principles:

1. The player's objective is to locate as many sound sources as possible in the given time limit.

2. The player does not receive any visual feedback regarding the position of a sound source.

3. The player receives a final score determined by how many sound sources are found.

Three loudspeaker configurations commonly used for spatial audio rendering were compared in this study: stereo, 7.1 surround-sound and an octagonal array with a center channel placed at 0° relative to a front-facing listening position. Stereo and 7.1 were chosen as they represent popular, commercially available rendering solutions used by video game players. The octagonal array was chosen for its relatively stable sound source imaging all around the listener, although it is not a standard configuration currently used for game audio. Player performance was quantified based on the number of correct localisations in each condition, represented by their final score.

The paper is structured as follows: in Section 2, the phantom image stability of loudspeaker arrays is discussed with an emphasis on surround-sound standards. Game design and the implementation of the three listening conditions are described in Section 3. Results, analysis and a discussion of the results are presented in Section 4. The paper conclusion is given in Section 5.

2. BACKGROUND

A listener's ability to infer the location of a sound source, presented using a physical loudspeaker array, is reliant on stable phantom imaging between two adjacent loudspeakers. By manipulating the relative amplitude of two loudspeakers (g1 and g2 in Figure 1), the illusion of a phantom (or virtual) sound source emanating at some point between them can be achieved [4].

The ratio of gain values (g1 and g2) between the two loudspeakers shown in Figure 1 is given according to Bauer's sine panning law [6]:

    sin θ / sin θ0 = (g1 − g2) / (g1 + g2)    (1)

where θ is the perceived angle of the virtual source and θ0 is the angle of the loudspeakers relative to a listener facing forward at 0°. If the listener's head is to be more formally considered, then it is suggested that replacing the sin term with tan in (1) will provide more consistent imaging [5]. The actual gain values with a constant loudness can then be derived using:
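Equation (2) falls on a page not reproduced in this excerpt; a common choice for fixing the absolute gain levels is the constant-power normalisation g1² + g2² = 1, which is assumed in the following sketch of the tangent-law gain computation (function and variable names are illustrative, not taken from the paper):

```python
import math

# Sketch: pairwise amplitude panning gains from the tangent variant of (1),
# tan(theta)/tan(theta0) = (g1 - g2)/(g1 + g2), normalised so that
# g1^2 + g2^2 = 1 (assumed constant-power constraint; the paper's
# equation (2) is not reproduced in this excerpt).

def pair_gains(theta_deg, theta0_deg):
    """Gains (g1, g2) for a virtual source at theta, speakers at +/- theta0.

    Valid for |theta| < theta0; g1 drives the +theta0 loudspeaker.
    """
    t = math.tan(math.radians(theta_deg))
    t0 = math.tan(math.radians(theta0_deg))
    ratio = (t0 + t) / (t0 - t)          # g1/g2 implied by the tangent law
    g2 = 1.0 / math.sqrt(1.0 + ratio * ratio)
    g1 = ratio * g2                      # normalise: g1^2 + g2^2 = 1
    return g1, g2
```

For a centred source (θ = 0) this yields equal gains g1 = g2 ≈ 0.707 on both loudspeakers, i.e. a −3 dB contribution from each.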
7.1 surround-sound represents the current state-of-the-art in spatial game audio playback, hence its inclusion in the study. The format is an extension of 5.1 surround-sound through the addition of two loudspeakers either behind or to the sides of the listener [15], giving an improvement in spatialisation capability compared to stereo. The ±30° stereo arrangement is retained, with an additional center channel placed in between. For this study the subwoofer ('.1' channel) was not included, as it is intended for further defining low-frequency effects, which were not used.

As stated previously, an equidistant array of 8 loudspeakers arranged as an octagon gives stable phantom imaging all around the listener. This is an improvement on both 5.1 and 7.1 surround-sound, where the inconsistent placement of loudspeakers leads to instability at the sides and rear. However, unlike stereo and 7.1 surround, octagonal arrays are not used to render audio in consumer gaming. Also, the loudspeakers in front of the listener need to be spaced at a wider angle than those in stereo and 7.1 configurations if equidistant placement is to be achieved, with two loudspeakers placed at ±90° for reliable lateral imaging. Therefore the trade-off in ease of localisation between more consistent imaging all around the listener, and the potential for higher-resolution frontal imaging in 7.1, is of interest.

3. METHODS

This section outlines the localisation task participants were asked to complete and how it was implemented using a game-like virtual environment. The methods used to render game audio to the three listening conditions (stereo, 7.1 surround-sound and an octagonal array) are then covered. It was decided early in the design process that a custom-made game environment would be used. Previous experiments by the authors have used program material taken from commercially available video game content for current-generation gaming consoles. However, it is not possible to access the source [...]

Sound source: A game object placed at some position in the game world, from which sound is emitted.

3.1. Game Design

The virtual environment and underlying systems for the localisation task were designed and implemented using the Unity game engine [17]. Sound spatialisation and rendering was done separately in Max/MSP [18]. A single sound source was used in the game, the position of which changed as soon as it was successfully located by the player. The sound source was represented by a spherical Unity game object with a radius of 0.5 metres and its visual renderer turned off, ensuring that the source would be invisible to participants. The position of the sound source was always determined randomly within the boundaries of the game world, represented by a 20x20 metre square room. Random positioning was implemented so that players would not learn sound source positions after playing the game multiple times. The virtual room comprised four grey-coloured walls, a floor and a ceiling to serve as a visual reference regarding the player's position within the game world. According to Zielinski et al. [19], visuals can distract significantly from an audio-based task, therefore visuals were deliberately simplified.

[Figure 3: top-down view of the 20x20 metre game world, showing the x/y/z axes, the player avatar and a sound source position A.]
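The random source placement of Section 3.1, combined with the minimum respawn distance of 10 metres from the player described later in Section 3, can be sketched as follows (constants and names are illustrative, not taken from the paper's implementation):

```python
import random

# Sketch: rejection-sample a new source position inside the 20 m square
# room (Section 3.1) that is at least 10 m away from the player avatar.
# The sampling strategy is an illustrative assumption; the paper only
# states that positions were random and respected the 10 m constraint.

ROOM_SIZE = 20.0   # room side length in metres
MIN_DIST = 10.0    # minimum distance of a new source from the player

def new_source_position(player_xy, rng=random):
    """Return a random (x, y) in the room at least MIN_DIST from the player."""
    while True:
        x = rng.uniform(0.0, ROOM_SIZE)
        y = rng.uniform(0.0, ROOM_SIZE)
        dx, dy = x - player_xy[0], y - player_xy[1]
        if (dx * dx + dy * dy) ** 0.5 >= MIN_DIST:
            return (x, y)
```

Rejection sampling always terminates here, since every point in the room has some candidate positions more than 10 m away (the room diagonal is about 28.3 m).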
Figure 4: Outline of data flow from Unity to Max/MSP to the loudspeaker interface. Coordinates from Unity were sent to the Max/MSP
patch via UDP. UDP messages were then processed to pan a sound source at different positions for the three listening conditions.
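The Unity-to-Max/MSP link of Figure 4 can be sketched as below. The JSON message format and port number are illustrative assumptions; the paper only states that coordinates were packed and sent over UDP each frame, so a real patch would expect whatever format its UDP receiver was built for:

```python
import json
import socket

# Sketch: per-frame UDP transmission of the player position/rotation and
# the source position relative to the player, as in Figure 4. Message
# encoding (JSON) and the destination port are illustrative assumptions.

def send_state(sock, addr, player_pos, player_rot, rel_source_pos):
    """Serialise one frame of game state and send it as a UDP datagram."""
    msg = json.dumps({
        "player_pos": player_pos,     # (x, y, z) in the game world
        "player_rot": player_rot,     # heading in degrees
        "rel_source": rel_source_pos  # source position relative to player
    }).encode("utf-8")
    sock.sendto(msg, addr)
    return msg

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Example (hypothetical local receiver):
# send_state(sock, ("127.0.0.1", 9000), (1.0, 0.0, 2.0), 90.0, (3.0, 0.0, 4.0))
```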
[...] the player avatar was within the radius of the sphere representing the sound source's current location, the sphere would move to a random new location at least 10 metres away from the player, within the room's boundaries. Upon triggering this event, an on-screen value depicting the player's score increased by one. A top-down interpretation is illustrated in Figure 3, where position A represents the current position of the sound source and position B is the new position. If the 'x' button was pressed and the player was not within the radius of the sound source, then the current position was maintained with no increase in score. A count-down timer set to 2 minutes 30 seconds was also implemented. The timer was not displayed to players and once it reached 0, "Game Over" was displayed to the player, along with their final score, to signify the end of the game session. The game was played three times by each participant.

3.2. Game Audio Rendering

Game audio was rendered separately to the main game using the Spatialisateur (Spat~) object library for Max/MSP provided by IRCAM [20]. Communication between Unity and Max/MSP was achieved using the User Datagram Protocol (UDP). The player avatar's x, y, and z coordinates in the game world were packed and transmitted over UDP on every frame update of the game. This ensured the Max/MSP patch would be synced to the game systems and visuals. The x, y and z coordinates of the sound source relative to the player were sent to Max/MSP in the same way. A diagram of the data flow from Unity to Max/MSP is given in Figure 4. Players were asked to locate a sine tone at a frequency of 440 Hz repeating every half a second with an attack time of 5 milliseconds to give a hard onset. A short delay was also applied to the tone, giving a sonar-like effect. An ascending sequence of tones was played to the player upon every correct localisation to give some auditory feedback as to their success, in line with the increase in score. If the player was incorrect, a descending sequence was played. It was decided that other effects commonly found in video games, like music, ambiance and footsteps, would not be included for this test, so as to not confuse the listener.

3.2.1. Sound Spatialisation

The amplitude panning law given in (1) can be extended to more than two loudspeakers using pairwise amplitude panning algorithms [7]. In this work, pairwise panning was implemented for the 7.1 and octagonal loudspeaker configurations using a 'spat.pan~' Max/MSP object. The 'spat.pan~' object takes a sound source (in this case the 440 Hz repeating sine tone) as its input, and pans it according to x, y and z coordinates around the pre-defined loudspeaker layout. The x, y and z coordinates used for panning corresponded to the relative position of the sound source to the player, as transmitted from Unity via UDP. The number of loudspeakers and their placement around the listening area were defined for the panners as follows:

    7.1 surround-sound: 0°, 30°, 90°, 135°, −135°, −90°, −30°
    Octagon: 0°, 45°, 90°, 135°, 180°, −135°, −90°, −45°

Angles used in the 7.1 condition were defined such that they conformed to the ITU-R BS.775 standard for surround-sound listening [8]. The octagonal array was arranged in the same configuration used by Martin et al. [14], inclusive of a front centre loudspeaker at 0° relative to a forward-facing listener. Adjacent loudspeakers were defined equidistantly with an angle of 45° between them. The angles used for both conditions are reflected in Figure 5 by the 7.1 surround-sound and Octagon labelled loudspeakers.

3.2.2. Distance Attenuation

Since the player was able to move around the game world freely, it was necessary to include distance attenuation in the audio rendering. This made the sound appear louder as the player moved towards its source and quieter as they moved away. This was achieved by taking the inverse square of the relative distance (in metres) between the sound source and the player. This can be expressed in decibels (dB) using:

    10 log10(1/d²)    (3)

where d is the distance between the sound source and the listener. The same distance attenuation was used across the three conditions in order to keep changes in amplitude consistent. The amplitude of the sound remained constant while the player stayed within the radius of the sound source. This was implemented after informal testing, as it was found that otherwise the sound would only ever reach maximum amplitude if the player was standing directly in the centre of the sound source position.
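The inverse-square attenuation of equation (3), together with a constant level inside the source radius as described in Section 3.2.2, can be sketched as follows (clamping at the radius is one way to realise the constant level; the paper does not specify the exact mechanism):

```python
import math

# Sketch: distance attenuation in dB per equation (3), 10*log10(1/d^2),
# held constant once the listener is inside the source sphere's radius.
# The clamp-at-radius approach is an illustrative assumption.

SOURCE_RADIUS = 0.5  # metres, radius of the invisible source sphere

def attenuation_db(distance_m):
    """Level change in dB for a listener at the given distance in metres."""
    d = max(distance_m, SOURCE_RADIUS)   # constant level within the radius
    return 10.0 * math.log10(1.0 / (d * d))
```

At 1 m the attenuation is 0 dB; at 10 m it is −20 dB, consistent with the inverse-square law.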
Figure 5: Loudspeaker angles used for all three listening conditions (St = Stereo, Su = 7.0 surround-sound, Oct = Octagon). Angles are symmetrical to the left and right of a front-facing listener.

3.2.3. Stereo Down-mix

It is not uncommon for a stereo down-mix to be generated from a game's 7.1 audio output. Audio content intended for surround-sound listening can be presented over any regular two-channel stereo system by down-mixing non-stereo channels. Additional channels are combined with the front left and right channels to ensure sound cues mixed for the centre and surround channels are not lost. As illustrated in Figure 4, the down-mix used for this experiment was done using the same 7 audio channels (not including the low-frequency effects) used in the 7.1 surround-sound listening condition, according to ITU-R BS.775-3 [14]. The left/right surround (Ls/Rs) channels and left/right back surround (Lbs/Rbs) channels were attenuated by 3 dB and sent to the respective left and right front channels. The centre channel was also attenuated by 3 dB and sent to the front left and right channels. This can be expressed as:

    LD = L + 0.707C + 0.707Ls + 0.707Lbs
    RD = R + 0.707C + 0.707Rs + 0.707Rbs    (4)

where LD and RD are the down-mixed left and right channels, respectively.

4. EXPERIMENTAL PROCEDURE

Table 1: Counterbalanced participant groupings for the three listening conditions.

    Allocation    Condition
    Group 1       Stereo    7.1       Octagon
    Group 2       Stereo    Octagon   7.1
    Group 3       7.1       Stereo    Octagon
    Group 4       7.1       Octagon   Stereo
    Group 5       Octagon   Stereo    7.1
    Group 6       Octagon   7.1       Stereo

[...] made aware of any of the conditions prior to, or during, the test. Repeated-measures test designs such as this are susceptible to learning effects, in that participant results may be influenced through being exposed to the same program material multiple times. To reduce this risk, the order of listening conditions was counterbalanced as suggested in [21]. This design with 24 participants and 3 listening conditions gave 4 participants in each counterbalanced group. Furthermore, a training session was provided, as described below.

Each of the three game sessions lasted 2 minutes 30 seconds. A 'Game Over' message and the player's score (i.e. how many times the sound source was correctly located) were displayed on-screen at the end of each session. The number of correct localisations was output to a separate text file after each game session, giving each subject a final score for each of the three listening conditions. Once a subject had been exposed to all of the listening conditions, they were asked to state which of the three conditions they preferred, and to provide any comments regarding the experiment.

Before the formal test began, participants were asked to complete a training session based on a simplified version of the game, allowing them to become familiar with the control scheme. The training version of the game took place in the same virtual room, with the addition of 5 coloured rings placed in its center and each corner. During the training, the sound source would only ever appear at one of these pre-defined locations. Participants were asked to move the in-game avatar to each of these locations and press the gamepad's 'x' button if they believed that to be the origin of the sound source. Once each of the pre-defined sound sources had been found once, the coloured rings were removed, and participants were asked to find the sound sources again, without a visual cue. Training was done in mono to eliminate the possible learning effect due to playing the game in an experimental condition more than once. The distance attenuation was preserved, allowing participants to familiarise themselves with amplitude changes as they moved closer to and further away from the sound source. The training session was not timed and only finished once a subject had found each of the 5 sound sources twice.
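The ITU-R BS.775-3 stereo down-mix of equation (4) in Section 3.2.3 can be sketched as follows (channel names follow the text; per-sample scalar inputs are used for illustration):

```python
# Sketch: fold-down of the 7 full-range channels to stereo per equation (4).
# Centre and the four surround channels are attenuated by 3 dB (gain 0.707)
# and summed into the front left/right channels; the LFE channel is omitted,
# as in the experiment.

DOWNMIX_GAIN = 0.707  # -3 dB

def downmix_stereo(L, R, C, Ls, Rs, Lbs, Rbs):
    """Return the down-mixed (LD, RD) pair for one sample of each channel."""
    LD = L + DOWNMIX_GAIN * (C + Ls + Lbs)
    RD = R + DOWNMIX_GAIN * (C + Rs + Rbs)
    return LD, RD
```

Applied sample-by-sample (or vectorised over buffers), this reproduces the LD/RD definitions of equation (4) exactly.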
Table 2: Wilcoxon signed-rank output for the mean-adjusted player scores. T is the signed-rank statistic and p is the significance value. The z value is used to determine the significance value (p) and the effect size (r). A value of 1 in the h column signifies a rejection of the null hypothesis.
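The signed-rank statistic T reported in Table 2 can be sketched in pure Python as follows (the data in the test are illustrative, not the paper's scores; p, z and r would normally come from a statistics package):

```python
# Sketch: Wilcoxon signed-rank statistic T for two paired score lists,
# following the usual procedure: drop zero differences, rank the remaining
# differences by absolute value with average ranks for ties, and take the
# smaller of the positive and negative rank sums.

def wilcoxon_T(a, b):
    """Return the Wilcoxon signed-rank statistic for paired samples a, b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
    diffs.sort(key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        avg = (i + 1 + j) / 2.0        # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    w_pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_neg = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_pos, w_neg)           # T is the smaller rank sum
```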
Figure 7: Preference ratings given for each listening condition. 7.1 surround-sound was preferred by the majority of participants.

Table 3: The percentage of highest scores, alongside the percentage of preference ratings, for each condition.

    Condition             % highest score    % most preferred
    Stereo                12.5%              4.2%
    7.1 surround-sound    60.4%              70.8%
    Octagon               27.1%              25.0%

[...] more preferred. This may explain the minor discrepancy between the score and preference percentages for the stereo and surround-sound conditions.

5.3. Discussion

In general, players had greater success in the game when listening to audio over a 7.1 surround-sound loudspeaker array. The number of correct localisations was consistently higher for the majority of participants than when compared to the stereo and octagonal loudspeaker systems. 7.1 was also the most preferred of the three listening conditions, with participant comments suggesting this was due to it being the condition in which the highest scores were achieved. It was expected that 7.1 would outperform stereo due to the increased number of channels available in the system, and from these results it can be said that players did benefit from using a more spatial listening array. However, the same could not be said in regard to the octagonal array.

As stated in Section 2, it would be expected that the localisation of sound sources, especially those positioned laterally and to the rear of the listener, would be easiest when listening over an octagonal array of loudspeakers. However, the more consistent and stable phantom imaging that can be achieved using such a system seems to have had little impact on the results obtained in this experiment. Visuals were presented to the player using a stationary screen, therefore players were only ever required to look forwards. For this reason it may have been that those loudspeakers located directly in front of the listener's forward-facing position were of most [...] a vague sense of a sound source's general direction. It may therefore be the case that once a player had positioned their in-game avatar such that the sound was perceived to emanate at some point straight ahead, triangulating its specific location was then easiest using a more optimally spaced stereo pair. This observation is reflected in the comments given by participants, where it was stated on multiple occasions that it was easiest to triangulate/focus on the sound source in the 7.1 surround-sound condition.

It is important to note, however, that in comparison to commercial games, the game used in this experiment was a relatively simple example. Generally, modern games include more in-depth sound design and visual effects that work together in forming the entire game experience. It would therefore be of interest to determine whether the results obtained from this experiment could be replicated using a more complex game task, inclusive of more 'true-to-life' game systems. This would provide clarity as to whether the results from this study were dependent on the stimulus used, and is the proposed next step for this work. In comparison to previous work by the authors, the use of a custom game environment was found to allow for far more control over experimental variables, and is therefore recommended for such studies.

6. CONCLUSION

This paper presented an experiment designed to determine whether enhanced spatial audio feedback has an influence on how well a player performs at a video game. Player performance was quantified by how many correct localisations of a randomly positioned sound source were achieved within a time limit of 2 minutes 30 seconds. This was compared between three listening conditions: stereo, 7.1 surround-sound and an octagonal array of loudspeakers. Results suggest that by using a more spatial listening system, player performance was improved, in that significantly higher localisation scores were achieved when using a 7.1 system in comparison to stereo. 7.1 was also consistently the most preferred of the three conditions by participants. However, the same result was not observed for the octagonal array. A possible explanation is that the angles between the front three loudspeakers used in the octagonal array (−45°, 0°, +45°) were wider than those in the 7.1 surround-sound system (−30°, 0°, +30°). Therefore frontal sound source imaging may not have been as well defined in comparison to the 7.1 condition.

7. REFERENCES

[1] Simon N Goodwin, “How players listen,” in Audio Eng Soc Conf 35: Audio for Games, 2009.

[2] “Star Wars Battlefront in Dolby Atmos for PC,” Available at https://www.dolby.com/us/en/categories/games/star-wars-
use in the localisation task. The front left and right loudspeakers of battlefront-in-dolby-atmos.html, year=accessed March 28,
the octagonal array were spaced wider than the ±30◦ used in the 2017,” .
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
ABSTRACT

The reverberation time is one of the most prominent acoustical qualities of a physical room. Therefore, it is crucial that artificial reverberation algorithms match a specified target reverberation time accurately. In feedback delay networks, a popular framework for modeling room acoustics, the reverberation time is determined by combining delay and attenuation filters such that the frequency-dependent attenuation response is proportional to the delay length, thereby complying with a global attenuation-per-second. However, only few details are available on the attenuation filter design, as the approximation errors of the filter design are often regarded as negligible. In this work, we demonstrate that the error of the filter approximation propagates in a non-linear fashion to the resulting reverberation time, possibly causing large deviations from the specified target. For the special case of a proportional graphic equalizer, we propose a non-linear least-squares solution and demonstrate the improved accuracy with a Monte Carlo simulation.

Figure 1: Feedback comb filter with delay length of m samples and attenuation filter A(z) determining the resulting reverberation time. [Block diagram: input x(n), delay z^−m, attenuation filter A(z), output y(n).]

1. INTRODUCTION

Reverberation time is one of the most prominent acoustical qualities of a physical room, dominating the perceptual quality of a space depending on the intended purpose. The reverberation time, denoted by T60, is the most common decay rate measure and is defined as the time needed for the energy decay curve of an impulse response to drop by 60 dB [1]. The frequency-dependent reverberation time T60(ω) can be similarly derived from the energy decay relief [2]. The accuracy of perceiving the reverberation time has been studied from various application standpoints [3, 4]; however, specific just-noticeable differences may vary depending on the stimulus signal, early-to-late reverberation ratio and other properties. For a rough orientation, JNDs of 4% have been reported by participants listening to bands of noise [3], 5-12% for impulse responses and 3-9% for speech signals [5].

Consequently, generative algorithms for artificial reverberation strive to recreate the reverberation time of the desired virtual space accurately. Among the algorithms for artificial reverberation, feedback delay networks (FDNs) enjoy popularity for their versatility and efficient implementation [6, 7, 8]. The reverberation time of FDNs is commonly determined in two steps: Firstly, a lossless FDN is designed, i.e., the energy entering the FDN cycles unattenuated through the feedback loop such that there is no decay of energy. Secondly, attenuation filters are introduced in the feedback loop to control the decay rate of the energy and by this the resulting reverberation time of the system. Fig. 1 shows a single feedback comb filter with delay line z^−m and attenuation filter A(z).

There are commonly two paradigms for designing the attenuation filters: The first is motivated by physical room acoustics, where the reverberation time results from different absorptive boundary materials and the free path in-between [9, 10, 11]. The resulting reverberation time can be predicted by a modified version of Sabine's [12] and Eyring's [13] formulas. Whereas this attenuation filter design is useful for recreating real room acoustics, it is not optimal for achieving precise control of the reverberation time. The second paradigm is motivated from a system-theoretic perspective, where the attenuation filters directly influence the magnitude of the system poles of the FDN, which in turn have a close relation to the resulting reverberation time. The idea is to choose the strength of the attenuation filter proportional to the corresponding delay length, i.e., the longer the delay, the stronger the attenuation [6]. By this, it is possible to define a global attenuation-per-second no matter how the signal travels through the delay network.

Various filter types have been proposed to realize the attenuation filter A(z), depending on the control flexibility, computational complexity and required accuracy. In the early days, the most cheaply available filter was a one-pole lowpass filter [6, 14, 15, 16, 17, 18]. Biquadratic filters allow control of the decay time in three independent frequency bands, with adjustable crossover frequencies [19, 20]. More advanced studies try to emulate the frequency response of reflection coefficients in octave bands by applying high-order IIR filters [9, 10, 21]. Recently, Jot proposed a proportional graphic equalizer as a simple yet effective method to control an arbitrary number of logarithmic bands [22]. Because of its beneficial design, this graphic equalizer is central to the discussion in this work.

Despite the large number of proposals on the FDN structure and the attenuation filter, only few details are given on the design of the attenuation filter A(z), assuming that the approximation error made is negligible. However, the example given in Fig. 2 demonstrates that an attenuation filter with a maximum approximation error of a few dB may yield a poor approximation of the resulting reverberation time; in fact, it has an infinite T60 at 2 kHz. The reason for this inconsistency can be found in the reciprocal relation between the attenuation filter response and the

∗ The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer Institut für Integrierte Schaltungen IIS.
Figure 2: Target reverberation time and corresponding target attenuation filter magnitude response on the dB scale for a delay m of 100 ms; (a) target magnitude response of the attenuation filter and an approximated filter, (b) corresponding target T60 response and the resulting T60 of the approximated filter. Although the approximated filter exhibits relatively small attenuation errors, it exhibits large errors for the reverberation time.
reverberation time. The following paper is dedicated to improving the design of the attenuation filter to achieve a more accurate reverberation time. Section 2 introduces the background on FDNs and the error propagation between the magnitude response and the reverberation time. Further, it introduces the proportional graphic equalizer used for the implementation of the attenuation filter. Section 3 introduces different techniques to determine the parameters of the graphic equalizers and demonstrates their impact on a case example. Further, a Monte Carlo simulation with randomized frequency-dependent target reverberation times evaluates the quality of the different parameter estimation methods.

2. REVERBERATION TIME OF AN FDN

2.1. Combination of Comb Filters

An FDN consists of multiple delay lines interconnected by a feedback matrix, which is commonly chosen to be a unilossless matrix¹. The constituting lossless FDN is adaptable to produce a specified reverberation time by extending every delay line with an attenuation filter [6]. Because of the lossless prototype, each attenuation filter can be considered independently and only dependent on the corresponding delay line. Consequently, the remaining paper only considers single delay line FDNs, which are commonly referred to as absorptive feedback comb filters (see Fig. 1). The transfer function of the absorptive feedback comb filter in Fig. 1 is

    H(z) = 1 / (1 − A(z) z^−m),   (1)

where A(z) is the transfer function of the attenuation filter and m is the delay length in samples. Every time the signal traverses the feedback loop, it is affected by the attenuation filter, such that this recursive effect can be described as an attenuation per time interval. Particular care should be taken in the design of the attenuation filter A(z) to ensure the stability of (1). The global target attenuation-per-sample δ(ω) on the dB scale can be related to the frequency-dependent reverberation time T60(ω) by

    δ(ω) = −60 / (fs T60(ω)),   (2)

where fs is the sampling frequency. On a linear scale, the attenuation-per-sample is given by

    D(ω) = 10^{δ(ω)/20}.   (3)

The attenuation filter A(z) is commonly idealized as having zero phase, such that all system poles of (1) lie on the line specified by |A(ω)|^{1/m}, which in turn determines the decay rate of the FDN [6]. Although this cannot be satisfied strictly, in many designs the attenuation filter delay is small compared to the delay m and can be neglected. To achieve the target reverberation time, the attenuation filter is designed such that [6]

    |A(ω)| ≈ D^m(ω)   or   α(ω) ≈ m δ(ω),   (4)

where

    α(ω) = 20 log10 |A(ω)|.   (5)

In the following section, we explain the error propagation of the filter response as seen in the example given in Fig. 2.

2.2. Error Propagation of Attenuation Filter Approximation

The design of the attenuation filter may be performed by approximating the target magnitude response either on the dB or linear scale in (4), minimizing an appropriate error norm [24]. First, we focus on a least-squares design approach performed on the dB scale as the main objective, such that (4) is expressed as²

    ‖α − m δ‖₂².   (6)

¹ In contrast to lossless feedback matrices used for example in [7], unilossless feedback matrices are lossless for all possible delays. For more information, the reader is referred to [23].
² The Euclidean norm ‖·‖₂ of a continuous frequency response F is given by ‖F‖₂ = (∫₀^{2π} |F(ω)|² dω)^{1/2}, whereas for an n × 1 vector v the Euclidean norm is ‖v‖₂ = (Σᵢ₌₁ⁿ |vᵢ|²)^{1/2}.
with

    p0 = g^{1/2} Ω² + √2 g^{1/4} Ω + 1,   (10)
    p1 = 2 (g^{1/2} Ω² − 1),   (11)
    p2 = g^{1/2} Ω² − √2 g^{1/4} Ω + 1,   (12)
    q0 = g^{1/2} + √2 g^{1/4} Ω + Ω²,   (13)
    q1 = 2 (Ω² − g^{1/2}),   (14)
    q2 = g^{1/2} − √2 g^{1/4} Ω + Ω².   (15)
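The coefficients (10)-(15) above were reconstructed from a garbled extraction; in particular, the form of q1 was inferred from the symmetry of (10)-(13) and should be verified against the typeset paper. With this reconstruction the section behaves as a shelving filter with gain √g at DC and 1/√g at Nyquist, which can be checked numerically (the function name and the test values g = 4, Ω = 0.3 are ours):

```python
import math

def shelf_coeffs(g, Omega):
    """Second-order shelving-section coefficients, reconstructed eqs. (10)-(15)
    for gain g and normalized crossover Omega. Note: q1 was inferred from the
    symmetry of the other coefficients and is an assumption."""
    r2, g4, g2 = math.sqrt(2.0), g ** 0.25, g ** 0.5
    p = (g2 * Omega**2 + r2 * g4 * Omega + 1.0,
         2.0 * (g2 * Omega**2 - 1.0),
         g2 * Omega**2 - r2 * g4 * Omega + 1.0)
    q = (g2 + r2 * g4 * Omega + Omega**2,
         2.0 * (Omega**2 - g2),
         g2 - r2 * g4 * Omega + Omega**2)
    return p, q

g, Omega = 4.0, 0.3
p, q = shelf_coeffs(g, Omega)
H_dc = sum(p) / sum(q)                               # response at z = 1
H_ny = (p[0] - p[1] + p[2]) / (q[0] - q[1] + q[2])   # response at z = -1
```

For g = 4 this yields H_dc = √g = 2 and H_ny = 1/√g = 0.5, i.e. a symmetric shelf, consistent with a proportional equalizer band.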
Figure 4: (a) Single band proportional gain behavior of the magnitude response. (b) Prototype magnitude responses for a 9-band graphic equalizer with additional broadband gain as defined in [25].
ters and the remaining bands being peak filters. The vector of command gains is then

    g = [g0, g1, . . . , gL]^T,
    γ = 20 log10(g),   (25)

where the log10 is applied element-wise, and (·)^T denotes the transpose operation.

In the next section, we explain the impact of the different error norms on the filter design method for the graphic equalizer and the resulting reverberation time.

3. MULTIBAND GAIN ESTIMATION

3.1. Linear Least-Squares Solution

Because of the self-similarity in (23), the magnitude responses of the band filters can be used as basis functions to approximate the magnitude response of the overall graphic equalizer. The filters are sampled at a K × 1 vector of control frequencies ωP spaced logarithmically over the complete frequency range, where K ≥ L. The resulting interaction matrix is

    B = 20 log10 [g0, H1,g0(ωP), . . . , HL,g0(ωP)]   (26)

of size K × (L + 1), where g0 is a prototype gain and g0 = g0 1_{K×1}. The white color indicates 0 dB gain and the dark color indicates gains up to 1 dB in Fig. 4b. The interaction matrix B represents how much the response of each band filter leaks into the other bands. Similarly, the target vector is

    τ = m δ(ωP),   (27)

where δ is applied element-wise. The command gains γ for the individual filters yield the resulting magnitude response on the dB scale of the entire attenuation filter:

    αGE(ωP) = B γ.   (28)

The magnitude least-squares problem in (6) can then be restated as a linear least-squares problem:

    γMLS = arg min_γ ‖B γ − τ‖₂² = (B^T B)^{−1} B^T τ,   (29)

where (B^T B)^{−1} B^T is often called the Moore-Penrose pseudoinverse of B. The condition number of B^T B is between 10^5 and 2·10^5, which indicates the inverse operation to be numerically unstable. The pseudoinverse matrix can be computed in advance using the QR-decomposition approach, which is numerically more stable than the direct approach, and can then be stored.

3.2. Non-Linear Least-Squares Solution

The T60 least-squares problem given in (7) then yields the following non-linear least-squares problem:

    γTLS = arg min_γ ‖1/(Bγ) − 1/τ‖₂² = arg min_γ Σ_{k=1}^{K} ( 1/(Bγ)_k − 1/τ_k )²,   (30)

where (Bγ)_k = Σ_{l=0}^{L} B_{kl} γ_l. Unfortunately, there is no explicit solution for this non-linear problem, such that it has to be solved by a numerical optimization algorithm, e.g., a gradient descent algorithm [31]. For gradient-based approaches it is efficient and more stable to provide the first and second derivatives analytically. The gradient of (30) is given by

    ∂/∂γi ‖1/(Bγ) − 1/τ‖₂² = 2 Σ_{k=1}^{K} ( 1/(Bγ)_k − 1/τ_k ) · ( −B_{ki} / (Bγ)_k² )
Figure 5: Case study for the target reverberation time given in Section 4.1; (a) magnitude response and (b) reverberation time with the different filter design methods. The dashed lines indicate the approximation response resulting from (28), whereas the solid lines indicate the actual filter response, which may differ because of violations of the self-similarity property (23). Both unconstrained methods, MLSUnc and TLSUnc, have large deviations between the approximation and the actual filter response. TLSCon results in the least error in the resulting reverberation time.
and the Hessian matrix is given by

    ∂²/(∂γi ∂γj) ‖1/(Bγ) − 1/τ‖₂² = Σ_{k=1}^{K} 2 [ B_{ki} B_{kj} / (Bγ)_k⁴ + ( 1/(Bγ)_k − 1/τ_k ) · 2 B_{ki} B_{kj} / (Bγ)_k³ ],

where 0 ≤ i, j ≤ L.

3.3. Constrained LLS and NLLS

Additionally to the least-squares problem, we introduce linear constraints on the command gains γ to account for the deteriorating self-similarity for large gains in (23):

    −10 ≤ γl ≤ 10 for 1 ≤ l ≤ L,   (31)

where the value of 10 dB was found empirically. Please note that the first gain γ0, which is a broadband gain, is not limited by a constraint.

In the following, we employ the gradient descent implementations fmincon and fminunc in MATLAB with and without gain constraints, respectively, to approximate the optimal command gains γ. For non-linear problems, it is inherent to find a local minimum instead of the globally optimal solution, and therefore the choice of the initial value γinit can impact the quality of the solution considerably. Different initial values were tested informally, and for simplicity, we settled for the broadband average gain:

    γinit = [τ̄, 0, . . . , 0],   (32)

where τ̄ is the arithmetic mean of τ.

4. EVALUATION

Four filter design methods are evaluated in this section:

• MLSUnc: linear unconstrained LS - (29)
• MLSCon: linear constrained LS - (29) & (31)
• TLSUnc: nonlinear unconstrained LS - (30)
• TLSCon: nonlinear constrained LS - (30) & (31)

4.1. Case study

To illustrate some of the shortcomings of a magnitude least-squares method for the graphic equalizer, we choose an example target reverberation time and the corresponding target magnitude response values for a delay length of 100 ms. The centre frequencies, the reverberation time definitions and the corresponding target magnitude attenuation are given in Table 1. The large fluctuation in this example was chosen to demonstrate the potential issues more clearly and may be exaggerated in the context of many practical use cases. Table 1 also gives the solutions to the four filter design methods MLSUnc, MLSCon, TLSUnc and TLSCon for a 9-band graphic equalizer and an additional broadband gain. Fig. 5 shows the resulting attenuation and reverberation time responses for the four methods:

• MLSUnc: The approximation response in the magnitude domain is, as expected, the best fit to the target magnitude. The corresponding reverberation time deviates strongly, in fact being infinite at 2 kHz. The actual filter response deviates in turn from the approximation in both domains. This is because of a violation of the self-similarity property (23).

• MLSCon: The additional constraints on the command gains γ result in a close match between the approximation and the actual filter response. However, the overall filter design quality is poor.

• TLSUnc: The non-linear filter design yields an optimal fit to the target reverberation time, but with a similar deviation from the actual filter response as observed with MLSUnc.

• TLSCon: The additional constraints on the command gains lead to well corresponding approximation and actual filter responses and result in a good overall approximation of the target reverberation time response.
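As an illustrative stand-in for the MATLAB implementation described in Section 3.3, the sketch below sets up a synthetic interaction matrix B and target τ, computes the linear MLS solution (29) with a least-squares solver (rather than explicitly forming (B^T B)^{-1} B^T), and then minimizes the T60-domain objective (30) with its analytic gradient using a crude step-halving gradient descent under the ±10 dB clipping of (31). All shapes, basis functions and parameter values here are our own assumptions, not the paper's equalizer:

```python
import numpy as np

# Synthetic stand-ins: K control frequencies, L band filters plus broadband gain.
rng = np.random.default_rng(1)
K, L = 40, 3
w = np.linspace(0.0, 1.0, K)                     # normalized log-frequency grid
B = np.column_stack([np.ones(K)] +               # broadband gain column
                    [np.exp(-0.5 * ((w - c) / 0.15) ** 2)
                     for c in (0.2, 0.5, 0.8)])  # assumed band "leakage" shapes
tau = rng.uniform(-6.0, -1.0, size=K)            # target attenuations m*delta [dB]

def t60_objective(gamma):
    """Eq. (30): squared error between reciprocals, i.e. in the T60 domain."""
    r = 1.0 / (B @ gamma) - 1.0 / tau
    return float(r @ r)

def t60_gradient(gamma):
    """Analytic gradient of (30)."""
    Bg = B @ gamma
    return -2.0 * B.T @ ((1.0 / Bg - 1.0 / tau) / Bg ** 2)

# MLS, eq. (29), via an SVD-based least-squares solver.
gamma_mls, *_ = np.linalg.lstsq(B, tau, rcond=None)

# TLS, eq. (30) with constraint (31), initialized as in (32).
gamma = np.concatenate([[tau.mean()], np.zeros(L)])
step = 0.1
f = t60_objective(gamma)
for _ in range(2000):
    cand = gamma - step * t60_gradient(gamma)
    cand[1:] = np.clip(cand[1:], -10.0, 10.0)    # +/-10 dB box of (31)
    fc = t60_objective(cand)
    if fc < f:
        gamma, f = cand, fc                      # accept improving steps only
    else:
        step *= 0.5                              # crude backtracking
gamma_tls, f_tls = gamma, f
```

In this toy setup the TLS iteration starts from the broadband mean (32) and only ever accepts steps that reduce the T60-domain error, loosely mirroring the constrained optimization the paper performs with fmincon.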
Table 1: Case study corresponding to Fig. 5. The MSE value of the best performing method is indicated in bold.

To summarize, there are two reasons for poor filter designs: Firstly, the command gains are too large, causing the self-similarity property to deteriorate. The solutions listed in Table 1 also demonstrate that the command gains cannot simply be constrained after the approximation. Secondly, a good approximation in the attenuation domain may still result in a poor approximation of the reverberation time. Both shortcomings are overcome by the TLSCon solution, resulting in a considerable improvement. In the next section, we expand the result from this specific case study by a Monte Carlo simulation.

4.2. Monte Carlo Simulations

To evaluate the accuracy of the four filter design methods for a large number of target functions, we perform a Monte Carlo simulation. For this, we randomize the target reverberation time T60 at nine octave band points with uniform distribution between 0.1 and 5 s. We consider three delay lengths of 10, 100 and 1000 ms. For each condition, we computed 1000 randomized T60 responses. The approximation error is quantified by the mean squared error (MSE) between the target reverberation time and the approximated reverberation time. Solutions which created an unstable feedback loop were counted separately and are referred to as the probability of instability.

Fig. 6 depicts the probability distribution of the MSE for the different filter design methods. The quality of approximation can be observed consistently with the case study in increasing order: MLSUnc, MLSCon, TLSUnc and, as the best filter design method, TLSCon. The probability of instability is roughly in reversed order, with some exceptions. Regarding the three different delay lengths, we can observe:

• m = 10 ms: The MSE is relatively small for all four filter design methods. Constrained and unconstrained methods behave largely similar. The MLS methods have a 3-5% chance that the MSE is larger than 2 s, whereas this probability is zero for the TLS methods. The probability of instability is 20% for the MLS methods, whereas it is 0% for TLS methods.

• m = 100 ms: Overall, the MSE is slightly larger than for m = 10 ms. However, the probability of instability for MLS decreases by up to 7%.

• m = 1000 ms: All methods but TLSCon perform considerably worse, with the MSE being distributed almost equally up to 2 s. The probability to have an MSE over 2 s is between 15-22% for these methods. The probability of instability for the unconstrained methods is between 19-21%, whereas the constrained methods have a 0% chance to become unstable.

Regarding the impact of the delay length on the MSE, we can observe the following two effects. Firstly, for short delays the target magnitude response is relatively small, e.g., a reverberation time of 5 s corresponds to an attenuation of -0.12 dB per 10 ms, and 0.1 s corresponds to an attenuation of -6 dB per 10 ms. Because of the small target attenuation there is almost no effect from the approximation constraint on the command gains. Further, for MLS methods the target attenuation is close to 0 dB, such that the error margin before instability is very narrow, causing a high probability of instability. Secondly, long delays deteriorate the approximation quality as the attenuation filter has to attenuate more in one go: a reverberation time of 0.1 s corresponds to an attenuation of -600 dB per 1000 ms, -60 dB per 100 ms and -6 dB per 10 ms. The large attenuation of -600 dB is naturally more difficult to achieve in a controlled fashion than smaller attenuations. The unconstrained methods adopt large command gains, causing uncontrolled ripples in the frequency response and therefore unstable feedback. On
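The per-delay attenuation figures quoted in this section follow directly from m·δ(ω) in (2) and (4); a quick check over the three delay lengths and the two extreme T60 values of the simulation:

```python
# Attenuation per delay length m (in dB) for the extreme T60 values of the
# Monte Carlo simulation, i.e., m*delta from eqs. (2) and (4).
atts = {(m_ms, T60): -60.0 * (m_ms / 1000.0) / T60
        for m_ms in (10, 100, 1000)
        for T60 in (0.1, 5.0)}
```

This reproduces the −0.12 dB figure (10 ms delay, T60 = 5 s) and the −6, −60 and −600 dB figures (T60 = 0.1 s at 10, 100 and 1000 ms) used in the discussion above.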
[Figure 6: probability distributions of the reverberation time MSE [s] for MLSUnc, MLSCon, TLSUnc and TLSCon; panel (b): Monte Carlo simulation with delay length of 100 ms.]

References

[1] Manfred R Schroeder, "New Method of Measuring Reverberation Time," J. Acoust. Soc. Amer., vol. 37, no. 3, pp. 409–412, 1965.

[2] Jean Marc Jot, "An analysis/synthesis approach to real-time artificial reverberation," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, San Francisco, California, USA, 1992, pp. 221–224.

[3] Hans-Peter Seraphim, "Untersuchungen über die Unterschiedsschwelle exponentiellen Abklingens von Rauschbandimpulsen," Acta Acustica united with Acustica, vol. 8, no. 4, pp. 280–284, 1958.
[9] Huseyin Hacihabiboglu, Enzo De Sena, and Zoran Cvetkovic, "Frequency-Domain Scattering Delay Networks for Simulating Room Acoustics in Virtual Environments," in Proc. Int. Conf. Signal-Image Technology Internet-Based Systems, Dijon, France, 2011, pp. 180–187.

[10] Enzo De Sena, Huseyin Hacihabiboglu, and Zoran Cvetkovic, "Scattering Delay Network: An Interactive Reverberator for Computer Games," in Proc. Audio Eng. Soc. Conf., London, UK, Feb. 2011, pp. 1–11.

[11] Enzo De Sena, Huseyin Hacihabiboglu, Zoran Cvetkovic, and Julius O Smith III, "Efficient Synthesis of Room Acoustics via Scattering Delay Networks," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 9, pp. 1478–1492, 2015.

[12] Wallace Clement Sabine, Collected papers on acoustics, Reverberation. Harvard University Press, Cambridge, 1922.

[13] Carl F Eyring, "Reverberation Time in "Dead" Rooms," J. Acoust. Soc. Amer., vol. 1, no. 2A, pp. 168–168, Jan. 1930.

[14] James Anderson Moorer, "About This Reverberation Business," Comput. Music J., vol. 3, no. 2, pp. 13–17, June 1979.

[15] William G Gardner, "A real-time multichannel room simulator," J. Acoust. Soc. Amer., vol. 92, no. 4, pp. 2395, 1992.

[16] Davide Rocchesso, Stefano Baldan, and Stefano Delle Monache, "Reverberation Still In Business: Thickening And Propagating Micro-Textures In Physics-Based Sound Modeling," in Proc. Int. Conf. Digital Audio Effects, Trondheim, Norway, Nov. 2015, pp. 1–7.

[17] Riitta Väänänen, Vesa Välimäki, Jyri Huopaniemi, and Matti Karjalainen, "Efficient and Parametric Reverberator for Room Acoustics Modeling," in Proc. Int. Comput. Music Conf., Thessaloniki, Greece, 1997, pp. 200–203.

[18] Jon Dattorro, "Effect Design, Part 1: Reverberator and Other Filters," J. Audio Eng. Soc., vol. 45, no. 9, pp. 660–684, 1997.

[19] Jean Marc Jot, "Efficient models for reverberation and distance rendering in computer music and virtual audio reality," in Proc. Int. Comput. Music Conf., Thessaloniki, Greece, 1997, pp. 1–8.

[20] Thibaut Carpentier, Markus Noisternig, and Olivier Warusfel, "Hybrid Reverberation Processor with Perceptual Control," in Proc. Int. Conf. Digital Audio Effects, Erlangen, Germany, 2014, pp. 93–100.

[21] Torben Wendt, Steven van de Par, and Stephan D Ewert, "A Computationally-Efficient and Perceptually-Plausible Algorithm for Binaural Room Impulse Response Simulation," J. Audio Eng. Soc., vol. 62, no. 11, pp. 748–766, 2014.

[22] Jean Marc Jot, "Proportional Parametric Equalizers - Application to Digital Reverberation and Environmental Audio Processing," in Proc. Audio Eng. Soc. Conv., New York, NY, USA, Oct. 2015, pp. 1–8.

[23] Sebastian J Schlecht and Emanuel A P Habets, "On Lossless Feedback Delay Networks," IEEE Trans. Signal Process., vol. 65, no. 6, pp. 1554–1564, Mar. 2017.

[24] John G Proakis and Dimitris G Manolakis, Digital Signal Processing, Upper Saddle River, New Jersey, 4th edition, 2007.

[25] Vesa Välimäki and Joshua Reiss, "All About Audio Equalization: Solutions and Frontiers," Applied Sciences, vol. 6, no. 5, pp. 129, May 2016.

[26] Robert Bristow-Johnson, "The Equivalence of Various Methods of Computing Biquad Coefficients for Audio Parametric Equalizers," in Proc. Audio Eng. Soc. Conv., San Francisco, CA, USA, Nov. 1994, pp. 1–10.

[27] Jonathan S Abel and David P Berners, "Filter Design Using Second-Order Peaking and Shelving Sections," in Proc. Int. Comput. Music Conf., Miami, USA, 2004, pp. 1–4.

[28] Seyed-Ali Azizi, "A New Concept of Interference Compensation for Parametric and Graphic Equalizer Banks," in Proc. Audio Eng. Soc. Conv., Munich, Germany, Apr. 2002, pp. 1–8.

[29] Jussi Rämö, Vesa Välimäki, and Balázs Bank, "High-Precision Parallel Graphic Equalizer," IEEE Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1894–1904, Dec. 2014.

[30] Jose A Belloch and Vesa Välimäki, "Efficient target-response interpolation for a graphic equalizer," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Shanghai, China, 2016, pp. 564–568.

[31] Jorge Nocedal and Stephen J Wright, Numerical Optimization, Springer Series in Operations Research and Financial Engineering. Springer Science & Business Media, New York, Jan. 1999.
DAFX-8
DAFx-344
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
DAFX-1
DAFx-345
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
2. ALGORITHM DESIGN
erties. However, in [20] it has been shown that small perceptual differences still remain. Thus, recent studies (e.g. [21, 22]) proposed models applying an overlap between early reflections and the diffuse part instead of using a fixed mixing time. By this, some diffuse energy is embedded in the early part of the BRIR. A similar approach is used in the BinRIR algorithm: all parts of the RIR excluding the sections of the direct sound and the detected early reflections are assigned to the diffuse part.
To synthesize the binaural diffuse reverberation, we developed two different methods. In [14, 15], the RIR was split into 1/6-octave bands by applying a near-perfect-reconstruction filter bank [23], and the binaural diffuse part was synthesized for each frequency band. In [16], we proposed another method for the synthesis of the diffuse part, which is used in this publication. The diffuse reverberation is synthesized by convolving small time sections (0.67 ms) of the omnidirectional RIR with sections of 2.67 ms binaural noise (both sections raised-cosine windowed). The results of the convolutions of all time sections are summed up with overlap-add. Figure 2 explains the synthesis of the diffuse reverberation in greater detail. By this, both the binaural features (e.g. interaural coherence) of the binaural noise and the frequency-dependent envelope of the omnidirectional RIR are maintained. The lengths of the time sections were determined by informal listening tests during the development of the algorithm. This method requires less computational power than the one proposed in [14, 15]. Informal listening tests showed that both methods are perceptually comparable.
Figure 1: Basic principle of the listener position shifts (LPS) applying mirror images.
Figure 2: Synthesis of the binaural diffuse reverberation: sections of the diffuse omnidirectional RIR (0.67 ms; 32 taps at 48 kHz sampling rate) and the binaural noise (2.67 ms; 128 taps at 48 kHz sampling rate) are convolved. Both sections are raised-cosine windowed. The diffuse BRIR is synthesized by summing up the results of the convolutions applying overlap-add.
2.4. Listener position shifts
The algorithm includes a further enhancement: the synthesized BRIR can be adapted to listener position shifts (LPS), and thus freely chosen positions of the listener in the virtual room can be auralized. For this, a simple geometric model based on mirror-image sound sources is used. The distance between the listener and each of the mirror-image sound sources is determined from the delay of the corresponding reflection peak relative to the direct sound peak. In a next step, a shifted position of the listener is considered, and amplitudes (based on the 1/r law), distances, and directions of incidence are recalculated for each reflection (Fig. 3). Optimizing an earlier version of the BinRIR algorithm (e.g. [16]), we modified the low-frequency component below 200 Hz when applying LPS. If, for example, the listener approaches the sound source, the amplitude of the direct sound increases, and the low-frequency energy of the direct sound needs to be adapted accordingly. For this, the low-frequency part of the direct sound (the first 10 ms, followed by a 10 ms raised cosine) is adjusted according to the 1/r law.
2.5. Synthesis of Circular BRIR sets
The synthesis of the BRIRs is repeated for constant shifts of the azimuthal angle (e.g. 1°) for the direct and the reflected sound. Thus, a circular set of BRIRs is obtained, which can be applied by different rendering engines for dynamic binaural synthesis. The synthesized sets of BRIRs can be stored in various formats, e.g. the miro format [24], a multi-channel wave format used by the SoundScape Renderer [7], and can be converted to the SOFA format [25].
3. TECHNICAL EVALUATION
To analyze the performance of the algorithm, a series of listening experiments has already been conducted [14, 16]. These experiments mainly aimed at quantifying the perceptual differences between measured and synthesized BRIRs. In this paper, we focus on a technical evaluation of the algorithm and compare different properties of the synthesized BRIRs to those of measured BRIRs. To this end, we compared the detected reflections to measured data regarding their direction, their time of incidence, and their amplitude. Furthermore, we compared the reverberation tails and the reverberation times of the synthesized BRIRs to the ones of measured BRIRs. We investigated to what extent the clarity (C50) of the synthesized room matches the measured room's clarity.
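In terms of signal flow, the frame-based diffuse-reverberation synthesis described above (Figure 2) can be sketched as follows. This is a minimal single-ear sketch in Python/NumPy, not the authors' Matlab implementation; the 50% section overlap, the per-frame indexing of the noise sections, and the function name are assumptions that the paper does not specify.

```python
import numpy as np

def synthesize_diffuse_brir(rir_diffuse, noise_ear, seg_len=32, noise_len=128):
    """Sketch of the frame-based diffuse-BRIR synthesis (one ear):
    raised-cosine-windowed sections of the omnidirectional RIR
    (seg_len taps = 0.67 ms at 48 kHz) are convolved with windowed
    sections of binaural noise (noise_len taps = 2.67 ms), and the
    results are summed up by overlap-add."""
    hop = seg_len // 2                    # assumed 50% section overlap
    win_rir = np.hanning(seg_len)         # raised-cosine windows
    win_noise = np.hanning(noise_len)
    n_seg = (len(rir_diffuse) - seg_len) // hop + 1
    out = np.zeros(len(rir_diffuse) + noise_len)
    for k in range(n_seg):
        seg = rir_diffuse[k * hop : k * hop + seg_len] * win_rir
        noi = noise_ear[k * noise_len : (k + 1) * noise_len] * win_noise
        conv = np.convolve(seg, noi)      # imprint the RIR envelope on the noise
        out[k * hop : k * hop + conv.size] += conv   # overlap-add
    return out
```

Running the same loop with the second ear's noise channel yields the other side of the diffuse BRIR, so the interaural properties of the noise are preserved while the omnidirectional RIR sets the temporal envelope.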
Figure 4: Reverberation time (RT60) of the Large Broadcast Studio (a) and the Small Broadcast Studio (b). In each plot, the RT60 for the binaurally measured reference, for the synthesis with the BinRIR algorithm, and for the omnidirectional measurement is shown. The RT60 was calculated in 1/3-octave bands in the time domain.
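The paper states only that the RT60 values in Figure 4 were calculated per band in the time domain. A standard time-domain estimator for this, and plausibly what is meant here, is Schroeder backward integration with a line fit on the decay curve; the following sketch (the fit limits and function name are assumptions, and the 1/3-octave filter bank is omitted) illustrates it for one band-filtered impulse response:

```python
import numpy as np

def rt60_schroeder(ir, fs, fit_lo=-5.0, fit_hi=-35.0):
    """RT60 estimate from a (band-filtered) impulse response via
    Schroeder backward integration; a line is fitted to the decay
    between fit_lo and fit_hi dB and extrapolated to -60 dB."""
    edc = np.cumsum(ir[::-1] ** 2)[::-1]             # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0])
    idx = np.where((edc_db <= fit_lo) & (edc_db >= fit_hi))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)  # decay rate in dB/s
    return -60.0 / slope
```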
Finally, we looked briefly at the way the use of the LPS influences the early part of the BRIR.
3.1. Measured rooms
The performance of the algorithm was analyzed for two different rooms. Both rooms are located at the WDR broadcast studios in Cologne and are used for various recordings of concerts and performances. The "KVB-Saal" (Large Broadcast Studio, LBS) has a volume of 6100 m³, a base area of 579 m², and can seat up to 637 persons. We measured the impulse responses in the 6th row (DistanceSrcRec = 13.0 m). The "Kleiner Sendesaal" (Small Broadcast Studio, SBS) has a volume of 1247 m³, a base area of 220 m², and 160 seats. The DistanceSrcRec in this room was 7.0 m. In order to evaluate the algorithm, measured impulse responses from these rooms were used. In addition to the omnidirectional RIRs, which are required to feed the BinRIR algorithm, we measured circular BRIR datasets at the same position as a reference. This dataset was measured in steps of 1° on the horizontal plane with a Neumann KU100 artificial head. Finally, we used data from spherical microphone array measurements conducted with the VariSphear measurement system [26]. For this, we applied a rigid-sphere array configuration with 1202 sample points on a Lebedev grid at a diameter of 17.5 cm. The omnidirectional RIRs and the array data were measured with an Earthworks M30 microphone. As sound source, a PA stack comprising an AD Systems Stium Mid/High unit combined with 3 AD Systems Flex 15 subwoofers was used. The complete series of measurements is described in detail in [27].
Based on the microphone array measurements, we identified reflections in the room using sound field analysis techniques [28]. For this, the array impulse responses were temporally segmented (time resolution = 0.5 ms) and transformed into the frequency domain. Applying a spatial Fourier transformation, the impulse responses were transformed into the spherical wave spectrum domain [29]. Then the sound field was decomposed into multiple plane waves using the respective spatial Fourier coefficients. Data was extracted for a decomposition order of N = 5 on a spherical composite grid with 3074 Lebedev points at a frequency of f = 4500 Hz. For this frequency, quite robust results for the detection of the reflections were found. By this, a spatio-temporal intensity matrix of the sound field at the listener position was calculated. Each time slice of the matrix was analyzed with a specific algorithm in order to classify reflections, which are represented as local maxima in the matrix [30]. The detected reflections were stored with their attributes "time", "direction", and "level" in a reflection list and can be used for further comparison with the BinRIR algorithm.
3.2. Direct Sound and early reflections
In a first step, we looked at the early part of the synthetic BRIRs and compared the direct sound and the early reflections determined by the BinRIR algorithm to the reflections identified using sound field analysis techniques based on the microphone array measurements. To describe the directions of the reflections, we used a spherical coordinate system. The azimuthal angle denotes the orientation in the horizontal plane, with ϕ = 0° corresponding to the front direction. The elevation angle is δ = 0° for frontal orientation and δ = 90° for sound incidence from above.
Tables 1 and 2 show the reflections detected by BinRIR as well as the 14 strongest reflections identified from the array data. For each reflection, the time of arrival relative to the direct sound, the incidence direction, and the level are shown. The temporal structure and the energy of many of the reflections show similarities. For example, in the LBS for reflections #7, #10, and #13, and in the SBS for reflection #13, the level and the time of arrival match quite well. Furthermore, some of the reflections were detected in the array measurements as impinging from different directions nearly simultaneously. These reflections are typically merged into one reflection with an increased level by the BinRIR algorithm (e.g. LBS: #4 and #5; SBS: #6 and #7; #9 - #11). As the incidence directions of the synthetic reflections are chosen by BinRIR based on a lookup table, it is not surprising that there are significant differences compared to the directions of the measured reflections. However, if the proportions as well as the source and listener position of the synthesized room to some extent match the modelled shoebox room, some of the incidence directions are appropriate. Thus, at least for the first reflection, and for some other reflections detected by BinRIR as well, an acceptable congruence exists both for the LBS and the SBS. The azimuthal deviation of the first reflection is less than 40°, and in elevation no relevant deviation exists.
As already explained in 2.2, the BinRIR algorithm starts detecting reflections 10 ms after the direct sound. Thus, no reflections reaching the listener within the first 10 ms can be found. The time frame
Table 1: Properties of the direct sound and the early reflections for the Large Broadcast Studio (LBS). Left side: reflections determined from the analysis of the array data. Right side: reflections detected by the BinRIR algorithm.
Table 2: Properties of the direct sound and the early reflections for the Small Broadcast Studio (SBS). Left side: reflections determined from the analysis of the array data. Right side: reflections detected by the BinRIR algorithm.
determining the geometric reflections ends after 150 ms, and no reflections are detected by BinRIR after this period.
Furthermore, several reflections which can be extracted from the array data measurements are not detected by BinRIR (e.g. #3 and #9 in the LBS and #5 and #12 in the SBS). However, in total, more than 2/3 of the reflections in the section from 10 ms to 150 ms determined from the array data correspond to a reflection determined by BinRIR.
3.3. Energy Decay and Reverberation Time
In a next step, we compared the energy decay curves of the synthesis and of the binaurally measured reference BRIRs. As already explained, the BinRIR algorithm synthesizes diffuse reverberation by applying a frame-based convolution of the omnidirectional RIR with binaural noise. Thus, in addition to the synthesized and the measured BRIR (reference), we analyzed the energy decay of the measured omnidirectional RIR as well. The following analysis is based on the impulse responses for the frontal viewing direction (ϕ = 0°, δ = 0°). Analyzing the reverberation time RT60 (Figure 4), we observed that the general structure of the curves is similar, but variations between the three curves exist. The average unsigned deviation between the synthesis and the reference is 0.10 s for the LBS and 0.09 s for the SBS; the maxima are 0.26 s (LBS) and 0.23 s (SBS).
3.4. Interaural Coherence
Next, we compared the interaural coherence (IC) of the synthesized BRIRs and of the reference BRIRs (Figure 5). We calculated the IC according to [31], applying Hamming-windowed blocks with a length of 256 taps (5.33 ms) and an overlap of 128 taps (2.67 ms). In each plot, the IC calculated with three different starting points is shown. For reference and synthesis in both rooms, the IC is significantly different when the direct sound is included in the calculation (t > 0 ms). For the medium condition (LBS: t > 150 ms; SBS: t > 50 ms), significant differences between synthesis and reference can be observed. This is not surprising, as no shaping of the IC is performed in the BinRIR algorithm. However, this difference is smaller for the SBS, because the impulse response is probably nearly diffuse at 50 ms. For the condition with the maximal starting point (LBS: t > 300 ms; SBS: t > 150 ms), which mainly comprises the diffuse reverberation, the IC of the synthesized BRIR matches the reference quite well.
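The block-wise IC computation can be sketched as follows, using the block and hop sizes given above (256-tap Hamming windows, 128-tap advance). As a simplification, this sketch evaluates only the zero-lag normalised cross-correlation per block, whereas the measure in [31] maximises over interaural time lags; the function name is illustrative.

```python
import numpy as np

def interaural_coherence(h_left, h_right, start=0, block=256, hop=128):
    """Block-wise interaural coherence of a BRIR pair, computed on
    Hamming-windowed blocks of `block` taps advancing by `hop` taps
    (50% overlap), starting `start` samples into the responses.
    Simplification: zero-lag normalised cross-correlation per block."""
    w = np.hamming(block)
    ic = []
    for n in range(start, len(h_left) - block + 1, hop):
        l = h_left[n : n + block] * w
        r = h_right[n : n + block] * w
        el, er = np.dot(l, l), np.dot(r, r)
        if el > 0.0 and er > 0.0:
            ic.append(abs(np.dot(l, r)) / np.sqrt(el * er))
    return np.asarray(ic)
```

Passing `start` as the number of samples corresponding to t > 0 ms, t > 150 ms, or t > 300 ms reproduces the three starting-point conditions discussed for the LBS.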
Figure 5: Interaural coherence (IC) of the Large Broadcast Studio (LBS) and the Small Broadcast Studio (SBS). In plots (a) and (b), the data for the binaural reference is shown; in (c) and (d), the data for the BinRIR synthesis. For the LBS, the IC is plotted for t > 0 ms, t > 150 ms, and t > 300 ms; for the SBS, for t > 0 ms, t > 50 ms, and t > 150 ms.
Figure 6: Clarity (C50) over frequency for the Large Broadcast Studio (LBS) and the Small Broadcast Studio (SBS). In each plot, the C50 for the binaurally measured reference, for the synthesis with the BinRIR algorithm, and for the omnidirectional measurement is shown.
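The clarity metric evaluated in Figure 6 is the standard C50: the logarithmic ratio of the energy arriving within the first 50 ms after the direct sound to the remaining late energy. A broadband sketch follows; the 1/3-octave band filtering used for the plot is omitted, and the onset detection by peak picking is an assumption.

```python
import numpy as np

def clarity_c50(ir, fs):
    """Broadband clarity C50 in dB: energy in the first 50 ms after
    the direct-sound peak versus the remaining late energy. For a
    plot like Figure 6, apply this to band-filtered impulse
    responses (filter bank omitted in this sketch)."""
    onset = int(np.argmax(np.abs(ir)))   # direct-sound peak
    split = onset + int(0.050 * fs)      # 50 ms boundary
    early = np.sum(ir[onset:split] ** 2)
    late = np.sum(ir[split:] ** 2)
    return 10.0 * np.log10(early / late)
```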
Figure 7: Influence of the listener position shift (LPS) on the early part of the time response for the Small Broadcast Studio (SBS). The time responses for distance factors (DFs) from 0.125 to 2 are shown in different colors.
3.5. Clarity
Next, we examined the clarity C50 over frequency for each of the conditions (Figure 6). The differences between the omnidirectional RIR, the synthesis, and the reference are minor. The average unsigned deviation between the synthesis and the reference is 1.4 dB for the LBS and 1.1 dB for the SBS. The maxima are 5.2 dB (LBS) and 3.2 dB (SBS). Thus, the ratio of the energy of the early part of the BRIR to the late diffuse part of the BRIR can be regarded as appropriate.
3.6. Listener position shifts
Finally, we analyzed for the SBS in which way the early part of the synthesized BRIR is changed when listener position shifts (LPS) are performed. The results are shown in Figure 7 for synthesized DistanceSrcRec between 0.875 m and 14 m (distance factors of 1/8 to 2 of the original distance). It can be observed that the level and the time of arrival of the direct sound are changed significantly (according to the 1/r distance law) when performing LPS. The influence of the LPS on the reflections is hard to observe from the plot, but changes in amplitude and time of arrival according to the geometric room model can be found here as well. The diffuse part, and thus the complete late reverberation, remains unchanged when performing LPS (not shown in Figure 7).
4. CONCLUSION
In this paper, the BinRIR algorithm was presented, which aims for a plausible dynamic binaural synthesis based on one measured omnidirectional RIR. In two different rooms, RIRs were measured and binauralized applying the presented BinRIR algorithm, so that synthetic BRIR datasets were generated. The presented method treats direct sound, reflections, and diffuse reverberation separately. The early parts of the impulse responses are convolved with HRIRs of arbitrarily chosen directions, while the reverberation tail is rebuilt from an appropriately shaped binaural noise sequence. In an extension, the algorithm allows modifying the sound source distance of a measured RIR by changing several parameters of an underlying simple room acoustic model.
The synthetic BRIRs were compared to reference BRIRs measured with an artificial head. Due to missing information on spatial aspects, a perfect reconstruction of the sound field is generally not possible. An analysis of the early reflections showed that neither are all reflections detected by the BinRIR algorithm, nor do their directions match the physical ones of the room. However, the reflections which were identified by the BinRIR algorithm correlate quite well with the times of incidence, and partly with the directions of incidence, of the physical reflections in the room. For the diffuse part, small differences in the reverberation time and the interaural coherence were observed. However, in general, the synthesis can be regarded as appropriate. An evaluation of the reverberation time RT60 and of the clarity C50 showed only minor differences between reference and synthesis. Analyzing the perceptual influence of the determined differences is not covered in the study presented here; please refer to [14, 16] for an analysis of these topics.
The approach presented in this paper can be combined with other modifications of measured RIRs. In [32, 33], we discussed a predictive auralization of room modifications by an appropriate adaptation of the BRIRs. Thus, the measurement of one single RIR is sufficient to obtain a plausible representation of the modified room. Furthermore, the opportunity to shift the listener position freely in the room can be employed when perceptual aspects of listener movements are to be investigated, as proposed e.g. in [34].
5. ACKNOWLEDGEMENTS
The research presented here has been carried out in the research project MoNRa, which is funded by the Federal Ministry of Education and Research in Germany (support code: 03FH005I3-MoNRa). We appreciate the support.
The code of the Matlab-based implementation and a GUI-based version of the BinRIR algorithm, which is available under the GNU GPL License 4.1, can be accessed via the following webpage: http://www.audiogroup.web.th-koeln.de/DAFX2017.html
6. REFERENCES
[1] Mel Slater, “Place illusion and plausibility can lead to realistic behaviour in immersive virtual environments,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1535, pp. 3549–3557, 2009.
[2] Jens Blauert, Spatial Hearing - Revised Edition: The Psychoacoustics of Human Sound Source Localisation, MIT Press, Cambridge, MA, 1997.
[3] Etienne Hendrickx, Peter Stitt, Jean-Christophe Messonnier, Jean-Marc Lyzwa, Brian FG Katz, and Catherine de Boishéraud, “Influence of head tracking on the externalization of speech stimuli for non-individualized binaural synthesis,” The Journal of the Acoustical Society of America, vol. 141, no. 3, pp. 2011–2023, 2017.
[4] W. Owen Brimijoin, Alan W. Boyd, and Michael A. Akeroyd, “The contribution of head movement to the externalization and internalization of sounds,” PLoS ONE, vol. 8, no. 12, pp. 1–12, 2013.
[5] Ulrich Horbach, Attila Karamustafaoglu, Renato S. Pellegrini, and Philip Mackensen, “Design and Applications of a Data-based Auralization System for Surround Sound,” in Proceedings of the 106th AES Convention, Convention Paper 4976, 1999.
[6] Jens Blauert, Hilmar Lehnert, Jörg Sahrhage, and Holger Strauss, “An Interactive Virtual-Environment Generator for Psychoacoustic Research I: Architecture and Implementation,” Acta Acustica united with Acustica, pp. 94–102, 2000.
[7] Matthias Geier, Jens Ahrens, and Sascha Spors, “The SoundScape Renderer: A unified spatial audio reproduction framework for arbitrary rendering methods,” in Proceedings of the 124th Audio Engineering Society Convention, 2008, pp. 179–184.
[8] Dirk Schröder and Michael Vorländer, “RAVEN: A Real-Time Framework for the Auralization of Interactive Virtual Environments,” Forum Acusticum, pp. 1541–1546, 2011.
[9] Juha Merimaa and Ville Pulkki, “Spatial impulse response rendering I: Analysis and synthesis,” Journal of the Audio Engineering Society, vol. 53, no. 12, pp. 1115–1127, 2005.
[10] Ville Pulkki and Juha Merimaa, “Spatial impulse response rendering II: Reproduction of diffuse sound and listening tests,” Journal of the Audio Engineering Society, vol. 54, no. 1-2, pp. 3–20, 2006.
[11] Ville Pulkki, Mikko-Ville Laitinen, Juha Vilkamo, Jukka Ahonen, Tapio Lokki, and Tapani Pihlajamäki, “Directional audio coding - perception-based reproduction of spatial sound,” International Workshop on the Principles and Applications of Spatial Hearing (IWPASH 2009), 2009.
[12] Fritz Menzer, Binaural Audio Signal Processing Using Interaural Coherence Matching, Dissertation, École polytechnique fédérale de Lausanne, 2010.
[13] Fritz Menzer, Christof Faller, and Hervé Lissek, “Obtaining binaural room impulse responses from B-format impulse responses using frequency-dependent coherence matching,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 2, pp. 396–405, 2011.
[14] Christoph Pörschmann and Stephan Wiefling, “Perceptual Aspects of Dynamic Binaural Synthesis based on Measured Omnidirectional Room Impulse Responses,” in International Conference on Spatial Audio, 2015.
[15] Christoph Pörschmann and Stephan Wiefling, “Dynamische Binauralsynthese auf Basis gemessener einkanaliger Raumimpulsantworten,” in Proceedings of the DAGA 2015, 2015, pp. 1595–1598.
[16] Christoph Pörschmann and Philipp Stade, “Auralizing Listener Position Shifts of Measured Room Impulse Responses,” Proceedings of the DAGA 2016, pp. 1308–1311, 2016.
[17] Benjamin Bernschütz, “A Spherical Far Field HRIR / HRTF Compilation of the Neumann KU 100,” Proceedings of the DAGA 2013, pp. 592–595, 2013.
[18] Benjamin Bernschütz, Microphone Arrays and Sound Field Decomposition for Dynamic Binaural Recording, Dissertation, TU Berlin, 2016.
[19] Alexander Lindau, Linda Kosanke, and Stefan Weinzierl, “Perceptual evaluation of physical predictors of the mixing time in binaural room impulse responses,” Journal of the Audio Engineering Society, vol. 60, no. 11, pp. 887–898, 2012.
[20] Philipp Stade, “Perzeptive Untersuchung zur Mixing Time und deren Einfluss auf die Auralisation,” Proceedings of the DAGA 2015, pp. 1103–1106, 2015.
[21] Philipp Stade and Johannes M. Arend, “Perceptual Evaluation of Synthetic Late Binaural Reverberation Based on a Parametric Model,” in AES Conference on Headphone Technology, 2016.
[22] Philip Coleman, Andreas Franck, Philip J.B. Jackson, Luca Remaggi, and Frank Melchior, “Object-Based Reverberation for Spatial Audio,” Journal of the Audio Engineering Society, vol. 65, no. 1/2, pp. 66–76, 2017.
[23] Wessel Lubberhuizen, “Near perfect reconstruction polyphase filterbank,” MATLAB Central, www.mathworks.com/matlabcentral/fileexchange/15813, accessed 04/04/2017, 2007.
[24] Benjamin Bernschütz, “MIRO - measured impulse response object: data type description,” 2013.
[25] Piotr Majdak, Yukio Iwaya, Thibaut Carpentier, Rozenn Nicol, Matthieu Parmentier, Agnieszka Roginska, Yôiti Suzuki, Kanji Watanabe, Hagen Wierstorf, Harald Ziegelwanger, and Markus Noisternig, “Spatially oriented format for acoustics: A data exchange format representing head-related transfer functions,” in Proceedings of the 134th Audio Engineering Society Convention, 2013, pp. 262–272.
[26] Benjamin Bernschütz, Christoph Pörschmann, Sascha Spors, and Stefan Weinzierl, “SOFiA - Sound Field Analysis Toolbox,” in Proceedings of the International Conference on Spatial Audio - ICSA, 2011.
[27] Philipp Stade, Benjamin Bernschütz, and Maximilian Rühl, “A Spatial Audio Impulse Response Compilation Captured at the WDR Broadcast Studios,” in 27th Tonmeistertagung - VDT International Convention, 2012, pp. 551–567.
[28] Benjamin Bernschütz, Philipp Stade, and Maximilian Rühl, “Sound Field Analysis in Room Acoustics,” 27th Tonmeistertagung - VDT International Convention, pp. 568–589, 2012.
[29] Earl G. Williams, Fourier Acoustics - Sound Radiation and Nearfield Acoustical Holography, Academic Press, London, UK, 1999.
[30] Philipp Stade, Johannes M. Arend, and Christoph Pörschmann, “Perceptual Evaluation of Synthetic Early Binaural Room Impulse Responses Based on a Parametric Model,” in Proceedings of the 142nd AES Convention, Convention Paper 9688, Berlin, Germany, 2017, pp. 1–10.
[31] Fritz Menzer and Christof Faller, “Stereo-to-Binaural Conversion Using Interaural Coherence Matching,” in Proceedings of the 128th AES Convention, London, UK, 2010.
[32] Christoph Pörschmann, Sebastian Schmitter, and Aline Jaritz, “Predictive Auralization of Room Modifications,” Proceedings of the DAGA 2013, pp. 1653–1656, 2013.
[33] Christoph Pörschmann, Philipp Stade, and Johannes M. Arend, “Binaural auralization of proposed room modifications based on measured omnidirectional room impulse responses,” in Proceedings of the 173rd Meeting of the Acoustical Society of America, 2017.
[34] Annika Neidhardt and Niklas Knoop, “Binaural walk-through scenarios with actual self-walking using an HTC Vive,” in Proceedings of the DAGA 2017, 2017, pp. 283–286.
ABSTRACT
Head-Related Transfer Functions (HRTFs) are one of the main aspects of binaural rendering. By definition, these functions express the deep linkage that exists between hearing and morphology, especially of the torso, head, and ears. Although the perceptive effect of HRTFs is undeniable, the exact influence of the human morphology is still unclear. Its reduction to a few anthropometric measurements has led to numerous studies aiming at establishing a ranking of these parameters. However, no consensus has yet been reached. In this paper, we study the influence of the anthropometric measurements of the ear, as defined by the CIPIC database, on the HRTFs. This is done through the computation of HRTFs by the Fast Multipole Boundary Element Method (FM-BEM) from a parametric model of torso, head, and ears. Their variations are measured with 4 different spectral metrics over 4 frequency bands spanning from 0 to 16 kHz. Our contribution is the establishment of a ranking of the selected parameters and a comparison to what has already been obtained by the community. Additionally, a discussion of the relevance of each approach is conducted, especially when it relies on the CIPIC data, as well as a discussion of the CIPIC database limitations.
1. INTRODUCTION
The HRTFs of a listener are intimately related to his morphology. Thus, a good knowledge of his shape should be a sufficient condition for inferring his HRTFs. Following that idea, many efforts have been made to personalise HRTF sets using anthropometric data, and the literature is rich in articles exploring this path.
Inoue et al. [1] measured the HRTFs and nine physical features of the head and ears of 86 Japanese subjects. Then, they studied their relationship through multiple regression analysis and used it as an estimation method.
For their part, Zotkin et al. [2] proposed an HRTF personalisation algorithm based on digital images of the ear taken by a video camera. They perform 7 measurements on the ear and compute from them a distance between subjects of a given database. The closest match is selected, and his HRTFs are used as raw material for the individualisation experiment.
Bilinski et al. [3] propose a method for the synthesis of the magnitude of HRTFs using a sparse representation of anthropometric features. They use a super-set of the features defined by the CIPIC database and learn the sparse representation of subjects of a training database. Then an l1-minimisation problem is solved to find the best sparse representation of a new subject. This representation is then used for the synthesis of his HRTF set.
However, in these studies, not all the parameters are necessarily independent or even decisive. Hence, several researchers proposed new means for refining their selection.
Among them, Hu et al. introduced a two-step correlation analysis [4, 5] prior to the personalisation process. They used the CIPIC database and highlighted significant correlations leading to the selection of only 8 parameters out of the available 27; 5 of them were related to the ear.
Xu et al. [6, 7] retained ten measurements after performing a correlation analysis between the CIPIC HRTFs and anthropometric parameters. It is worth noting that the analysis was restricted to 7 directions and 4 frequencies and that only two of the retained measurements were related to the ear.
Hugeng et al. [8] also performed a correlation analysis over the CIPIC data but ended up retaining 8 measurements; 4 of them were ear measurements.
Grijalva et al. [9] applied a customisation process of HRTFs using Isomap and artificial neural networks on the CIPIC database. Prior to this, they used the results of [8] as the appropriate morphological parameters to focus on.
While the previous studies embedded the measurement selection in a wider personalisation process, others focused exclusively on the relative influence of each parameter. This is what Zhang et al. [10] did, who concluded after a correlation analysis that 7 ear measurements were among the 8 most significant ones, and Fels et al. [11], who used a parametric model of head, torso, and ear for generating new HRTF sets by the Boundary Element Method (BEM).
The latter study separately compared the influence of 6 parameters describing the head and 6 others describing the pinna, based on the modifications introduced in the HRTFs. The evaluation took into account the spectral distance, the interaural time difference (ITD), and the interaural level difference (ILD) variations. One parameter at a time was modified, within limits derived from anthropometric statistics of adults and children.
Although this provided insights on the relative weights of the parameters in each group, a clear ranking was not established. Moreover, the simulations were limited to frequencies below 8 kHz. This is a major limitation, as human hearing ranges up to 20
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
kHz and that localisation information due to the pinna is classically comprised between 3-4 kHz and 14-15 kHz or more [12, 13].
This multiplicity of works and conclusions reveals an absence of clear consensus about the way each anthropometric parameter modifies (or not) the HRTFs. In the present paper, we take an approach similar to that of Fels et al., but extended to frequencies up to 16 kHz, and establish a categorisation of the pinna parameters.
In more detail, section 2 goes through the process of selection of the parameters, the generation of the meshes, the computation of the HRTFs and the choice and definition of the metrics. Section 3 presents the results themselves, discusses the impact of the chosen metric and establishes a ranking between the retained pinna parameters based on their relative influence over the DTFs. In section 4, we compare our results to the conclusions proposed up to now by the community and discuss the convergences and points of disagreement. Finally, section 5 sums up the results and conclusions of the present paper and gathers the open questions and perspectives for future work.

2. PROTOCOL

2.1. Parameters

2.1.1. CIPIC database

As it is widely used in the community, we have chosen to consider the morphological parameters defined by CIPIC [14]. This database consists of sets of HRTFs and morphological parameters measured on 45 subjects. These parameters - 27 in total - are intended to describe the torso, head and ears of the human body with a focus on what could likely impact the HRTF. In particular, 12 parameters describe the position and shape of the ear, while the remaining ones describe the head and the body.
It is worth noting that only 35 subjects have been fully measured, meaning that each parameter comes with a set of between 35 and 45 measured values. The database also comes with their mean values µ and standard deviations σ.

Table 1: Anthropometric statistics for parameters in P. Distances are in cm and angles in degrees.

  Var     Measurement             µ       σ
  d1      cavum concha height     1.91    0.18
  d2      cymba concha height     0.68    0.12
  d3      cavum concha width      1.58    0.28
  d4      fossa height            1.51    0.33
  d5      pinna height            6.41    0.51
  d6      pinna width             2.92    0.27
  θ1      pinna rotation angle    24.01   6.59
  θ2      pinna flare angle       28.53   6.70

2.2. Morphological model

The model used in this paper is a parametric one, the result of a merge between a schematic representation of head and torso - to which we will refer as the snowman, although it does not strictly match the one introduced by Algazi & Duda [15] - and an ear model realised with Blender and deformable through multiple blend shapes. The snowman is used in order to add more realism to the generated HRTFs and get closer to what could actually be measured on a real person. The ear model is the true source of interest. It is designed to be as close as possible to real ears, again for realism. Figure 1 represents the mean ear before and after the merge on the snowman - referred to as the mean shape.
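The one-parameter-at-a-time deformation protocol, with each parameter offset by multiples of its CIPIC standard deviation, can be sketched as follows. This is an illustrative sketch, not the authors' code: the dictionary simply transcribes the statistics of Table 1, and the helper `deformed_values` is an assumed name.

```python
# Illustrative sketch (not the authors' code): enumerate the deformed
# values explored for one CIPIC parameter, mu + k*sigma for
# k in {-2, -1, +1, +2}, using the statistics of Table 1.

CIPIC_STATS = {               # parameter: (mean mu, std sigma); cm or degrees
    "d1": (1.91, 0.18),       # cavum concha height
    "d2": (0.68, 0.12),       # cymba concha height
    "d3": (1.58, 0.28),       # cavum concha width
    "d4": (1.51, 0.33),       # fossa height
    "d5": (6.41, 0.51),       # pinna height
    "d6": (2.92, 0.27),       # pinna width
    "theta1": (24.01, 6.59),  # pinna rotation angle
    "theta2": (28.53, 6.70),  # pinna flare angle
}

def deformed_values(param, deviations=(-2, -1, 1, 2)):
    """Return the deformed values mu + k*sigma explored for one parameter."""
    mu, sigma = CIPIC_STATS[param]
    return {k: round(mu + k * sigma, 2) for k in deviations}

print(deformed_values("d3"))  # d3 spans 1.02 cm (-2 sigma) to 2.14 cm (+2 sigma)
```

Each deformed value would then drive the corresponding blend shape of the ear model before meshing and FM-BEM simulation.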
2.4. Metrics

In order to compare the variations introduced by the ear deformations in the DTF sets, the following metrics are used:

• The widely-used [1, 8, 9] Spectral Distortion (SD) - also sometimes referred to as Log Spectral Distortion - defined as:

    SD = \sqrt{ \frac{1}{N} \sum_{k=0}^{N-1} \left( 20 \log_{10} \left| \frac{H_1(f_k)}{H_2(f_k)} \right| \right)^2 }  [dB]   (2)
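As an illustration (not the authors' implementation), the SD of equation (2) can be computed directly from two sampled magnitude responses with NumPy:

```python
import numpy as np

def spectral_distortion(H1, H2):
    """Spectral Distortion of eq. (2): the RMS over the N frequency bins
    of the log-magnitude ratio 20*log10|H1(f_k)/H2(f_k)|, in dB."""
    ratio_db = 20.0 * np.log10(np.abs(H1) / np.abs(H2))
    return np.sqrt(np.mean(ratio_db ** 2))

# Sanity checks: identical responses give 0 dB; a uniform gain
# mismatch by a factor of 2 gives 20*log10(2), about 6.02 dB.
H = np.ones(512)
print(spectral_distortion(H, H))       # 0.0
print(spectral_distortion(2 * H, H))   # ~6.02
```

Because the metric only involves magnitude ratios, any common gain applied to both responses cancels out.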
Figure 6: Parameter influence for deviations −2σ (top left), −σ (top right), +σ (bottom left) and +2σ (bottom right).
the one presented by Xu et al., they are also subject to the same remarks.
Zhang et al. [10] have exhibited 8 parameters, among which 6 describe the pinna shape, namely d3, d4, d5, d6, θ1 and θ2. The first 2 are the ones we have found to be the most important here, and d1 and d2 are not in their selected set. Nevertheless, d6 is here again presented as a prominent parameter, while it completely fails to present this characteristic in our simulations.
Last, Fels et al. [11] retained d3 as the most important factor, as well as d8 (out of the scope of this study), and rejected d5, while it was retained by the 2 previous studies. Here, d5 appears to be a good example of non-linearity, as its +2σ deformation proves to have a strong effect whereas the +σ one does not.
As can be observed, the only clear consensus that can be reached is for d3. As it represents the cavum concha width, this conclusion is also coherent with the prior intuition one could have about it.
That being said, another point worthy of interest is the case of parameter d6, twice retained as an important parameter, in total disagreement with our observations. In order to investigate it, the original ear meshes are presented in figure 7 hereafter.

Figure 7: From left to right, the −2σ deformation of d6, the mean ear and the +2σ deformation of d6.

As we can see, the introduced distortions are not visible from the ear pit, where the virtual sound source lies. Hence, the concha operates as a mask, considerably reducing the potential effect of d6. An immediate consequence is that another value of θ2 could have yielded a totally different outcome. This reveals in particular that not only the parameters' values are important but so are their combinations, making any statistical analysis more challenging.

5. CONCLUSIONS

In the present work, we have built a database of meshes and computed HRTFs fully dedicated to the study of pinna influence. The anthropometric data were carefully selected to be as relevant as possible. This starts with the use of the CIPIC parameter definitions and statistics, known as a reference in the community. Moreover, the choice of the parameters themselves emerges from previous works published in the literature. Furthermore, the anthropometric values as well as the HRTF generation parameters were set so as to correspond as much as possible to real-life problems and data. In particular, we have covered the whole bandwidth usually said to contain spectral cues. Finally, 4 different metrics have been used to perform the study and to compare the results to previous studies.
Regarding the metrics, it has been shown that they all led to the same conclusions. Thus, one can simply pick the metric that best fits one's use case. In our case, retaining the ISSD, parameters d3 and d4 showed a stronger effect on the HRTFs than the other ones, while d1, d2 and d6 had, comparatively, much less importance. These conclusions have been compared with results from the community, unveiling a consensus about d3.
In addition, it has been observed that non-linearities exist between the CIPIC parameters and the HRTFs. The specific study of d6 has underscored the need for numerous different ear shapes, i.e. for a bigger database, especially when performing statistical analyses. It also raises the question of the relevance of the parameter choices introduced by CIPIC, perhaps not perfectly suited for HRTF analyses, and of their definitions, not easily adaptable to 3D data.
Finally, the lack of reachable consensus between the studies aiming at defining a clear set of major parameters also questions in general the validity of the studies that include a selection step prior to other treatments (such as HRTF individualisation).
However, the current set of data has not delivered all of its information yet. More specifically, future works will investigate the directionality of the impact of each pinna parameter over the HRTFs.
6. REFERENCES

[1] Naoya Inoue, Toshiyuki Kimura, Takanori Nishino, Katsunobu Itou, and Kazuya Takeda, "Evaluation of HRTFs estimated using physical features," Acoustical Science and Technology, vol. 26, no. 5, pp. 453–455, 2005.
[2] D. Y. N. Zotkin, Jane Hwang, R. Duraiswami, and Larry S. Davis, "HRTF personalization using anthropometric measurements," in Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on. IEEE, 2003, pp. 157–160.
[3] Piotr Bilinski, Jens Ahrens, Mark R. P. Thomas, Ivan J. Tashev, and John C. Platt, "HRTF magnitude synthesis via sparse representation of anthropometric features," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4468–4472.
[4] Hongmei Hu, Lin Zhou, Jie Zhang, Hao Ma, and Zhenyang Wu, "Head related transfer function personalization based on multiple regression analysis," in 2006 International Conference on Computational Intelligence and Security. IEEE, 2006, vol. 2, pp. 1829–1832.
[5] Hongmei Hu, Lin Zhou, Hao Ma, and Zhenyang Wu, "HRTF personalization based on artificial neural network in individual virtual auditory space," Applied Acoustics, vol. 69, no. 2, pp. 163–172, 2008.
[6] S. Xu, Z. Z. Li, L. Zeng, and G. Salvendy, "A study of morphological influence on head-related transfer functions," in 2007 IEEE International Conference on Industrial Engineering and Engineering Management. IEEE, 2007, pp. 472–476.
[7] Song Xu, Zhizhong Li, and Gavriel Salvendy, "Improved method to individualize head-related transfer function using anthropometric measurements," Acoustical Science and Technology, vol. 29, no. 6, pp. 388–390, 2008.
[8] W. Wahab Hugeng and Dadang Gunawan, "Improved method for individualization of head-related transfer functions on horizontal plane using reduced number of anthropometric measurements," arXiv preprint arXiv:1005.5137, 2010.
[9] Felipe Grijalva, Luiz Martini, Siome Goldenstein, and Dinei Florencio, "Anthropometric-based customization of head-related transfer functions using Isomap in the horizontal plane," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4473–4477.
[10] M. Zhang, R. A. Kennedy, T. D. Abhayapala, and Wen Zhang, "Statistical method to identify key anthropometric parameters in HRTF individualization," in Hands-free Speech Communication and Microphone Arrays (HSCMA), 2011 Joint Workshop on. IEEE, 2011, pp. 213–218.
[11] Janina Fels and Michael Vorländer, "Anthropometric parameters influencing head-related transfer functions," Acta Acustica united with Acustica, vol. 95, no. 2, pp. 331–342, 2009.
[12] Simone Spagnol, Michele Geronazzo, and Federico Avanzini, "Fitting pinna-related transfer functions to anthropometry for binaural sound rendering," in Multimedia Signal Processing (MMSP), 2010 IEEE International Workshop on. IEEE, 2010, pp. 194–199.
[13] Robert B. King and Simon R. Oldfield, "The impact of signal bandwidth on auditory localization: Implications for the design of three-dimensional audio displays," Human Factors: The Journal of the Human Factors and Ergonomics Society, vol. 39, no. 2, pp. 287–295, 1997.
[14] V. Ralph Algazi, Richard O. Duda, Dennis M. Thompson, and Carlos Avendano, "The CIPIC HRTF database," in Applications of Signal Processing to Audio and Acoustics, 2001 IEEE Workshop on the. IEEE, 2001, pp. 99–102.
[15] V. Ralph Algazi, Richard O. Duda, Ramani Duraiswami, Nail A. Gumerov, and Zhihui Tang, "Approximating the head-related transfer function using simple geometric models of the head and torso," The Journal of the Acoustical Society of America, vol. 112, no. 5, pp. 2053–2064, 2002.
[16] Harald Ziegelwanger, Wolfgang Kreuzer, and Piotr Majdak, "Mesh2HRTF: Open-source software package for the numerical calculation of head-related transfer functions," in 22nd International Congress on Sound and Vibration, 2015.
[17] Harald Ziegelwanger, Piotr Majdak, and Wolfgang Kreuzer, "Numerical calculation of listener-specific head-related transfer functions and sound localization: Microphone model and mesh discretization," The Journal of the Acoustical Society of America, vol. 138, no. 1, pp. 208–222, 2015.
[18] Jyri Huopaniemi and Matti Karjalainen, "Review of digital filter design and implementation methods for 3-D sound," in Audio Engineering Society Convention 102. Audio Engineering Society, 1997.
[19] John C. Middlebrooks, "Individual differences in external-ear transfer functions reduced by scaling in frequency," The Journal of the Acoustical Society of America, vol. 106, no. 3, pp. 1480–1492, 1999.
[20] F. Rugeles, M. Emerit, and B. Katz, "Évaluation objective et subjective de différentes méthodes de lissage des HRTF," in Congrès Français d'Acoustique (CFA). CFA, 2014, pp. 2213–2219.
Ian Rodrigues, Ricardo Teixeira, Sofia Cavaco∗
NOVA LINCS, Department of Computer Science
Faculty of Science and Technology, Universidade Nova de Lisboa
2829-516 Caparica, Portugal
in.rodrigues@campus.fct.unl.pt, scavaco@fct.unl.pt

Vasco D. B. Bonifácio
Centro de Química-Física Molecular and Institute of Nanoscience and Nanotechnology
Instituto Superior Técnico, Universidade de Lisboa
Av. Rovisco Pais, 1049-001 Lisboa, Portugal
vasco.bonifacio@tecnico.ulisboa.pt

Daniela Peixoto, Yuri Binev, Florbela Pereira, Ana M. Lobo, João Aires-de-Sousa∗
LAQV-REQUIMTE, Department of Chemistry
Faculty of Science and Technology, Universidade Nova de Lisboa
2829-516 Caparica, Portugal
{d.peixoto, y.binev, mf.pereira, aml, jas}@fct.unl.pt
ABSTRACT

In order to contribute to the access of blind and visually impaired (BVI) people to the study of chemistry, we are developing Navmol, an application that helps BVI chemistry students to interpret molecular structures. This application uses sound to transmit information about the structure of the molecules. Navmol uses voice synthesis and describes the molecules using clock-type polar coordinates. In order to help users mentally conceptualize the molecular structure representations more easily, we propose to use 2D spatial audio. This way, the audio signal generated by the application gives the user the perception of sound originating from the directions of the bonds between the atoms in the molecules. The sound spatialization is obtained with head related transfer functions. The results of a usability study show that the combination of spatial audio with the description of the molecules using the clock reference system helps BVI users to understand the molecules' structure.

1. INTRODUCTION

BVI people have difficulty in achieving a higher education degree due to the lack of material adapted to their special needs. Moreover, the few students who reach higher education are influenced to pursue a degree in humanities or music and to avoid science and engineering degrees. According to the 2011 census, there were 16 500 BVI people in Portugal, of which 42 000 were school-age people (5-29 years old) [1]. However, only 10 000 are referenced as students. In 2001 the percentage of BVI people in Portugal who completed a higher education degree was 0.9%, while in the general population this number increases to 10.8% [2, 3]. In the European Union in 2002, the percentage of people without disabilities aged 25-64 who completed a university degree was 22%, but this number decreases to 10% when we consider people with disabilities [4]. The same study concluded that in Portugal 17% of the people without disabilities completed a university degree, while only 1% of the people with disabilities did so.
The representation of molecular structures is a major challenge for the accessibility of blind students and professionals to chemistry and other science degrees that include chemistry courses. This is particularly true in organic chemistry, which critically requires the interpretation and transmission of molecular structures - and these are commonly processed visually, by means of chemical drawings.
Only a few applications that tackle this issue have been proposed in the literature. Existing solutions resort to tactile information: the AsteriX-BVI web server uses 3D printing techniques to construct tactile representations of molecules with Braille annotations [5], and another tool functions as a molecular editor that uses 2D sketches and static Braille reports [6]. While these solutions may help blind users to construct a mental representation of the molecules, they may not be accessible to everyone.
In order to contribute to the access of BVI people to higher education, we are developing tools adapted to their special needs that can be used to study chemistry courses. These consist of chemoinformatics strategies to teach chemistry to BVI users [7]. In particular, here we focus on the structure of the molecules. These strategies rely on computer representations of molecular structures not only as drawings, but as graphs, specifying the atoms in a molecule, their bond connections, and their 3D configuration [8]. Here we propose a solution based on spatial audio implemented with head related transfer functions (HRTFs) to transmit information about the structure of molecules to BVI students and chemists. This solution has the advantage of being able to reach more BVI users than previously proposed applications.
With the intention of facilitating the integration of blind students in the study of chemistry, we are developing Navmol, a navigator and editor of molecular structures designed for BVI students and chemists that uses voice synthesis and other accessibility tools familiar to BVI users (text-to-Braille refreshable displays and the keyboard) [9]. Furthermore, it contains a graphical interface that replicates the same information that is heard, so that sighted users can better communicate with BVI users, helping with their integration, for example, in a classroom (figure 1). Along with these features, we propose to use spatial audio to help BVI users to more easily understand the structure of the molecules.
The main novelty of this work is the use of spatial audio to transmit information about the structure of the molecules to BVI users. This work can have a real impact on the lives of people in a special needs group that is typically under-served, contributing to their access to higher education in science and engineering degrees.
Below we discuss the sound spatialization technique used in Navmol, and the results from a usability test performed with a group of chemists, a group of non-chemistry university students and a group of BVI volunteers. The results from all groups show that the users can successfully reconstruct the molecular structure of the molecules presented to them when Navmol describes the
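In the time domain, the HRTF-based spatialization described above amounts to filtering the mono synthesized speech with the pair of head-related impulse responses (HRIRs) measured for the desired direction. A minimal sketch, with placeholder arrays standing in for a real measured HRIR set (this is an illustration, not Navmol's implementation):

```python
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    """Convolve a mono signal with left/right HRIRs to obtain a
    two-channel signal perceived as arriving from the HRIRs' direction."""
    n = len(mono) + max(len(hrir_left), len(hrir_right)) - 1
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    # Zero-pad both channels to a common length before stacking.
    left = np.pad(left, (0, n - len(left)))
    right = np.pad(right, (0, n - len(right)))
    return np.stack([left, right])

# Placeholder HRIRs (not measured data): the left ear receives the
# source later and quieter than the right ear, mimicking the ITD/ILD
# cues of a source located on the listener's right.
mono = np.random.randn(1000)
hrir_l = np.array([0.0, 0.0, 0.0, 0.5])
hrir_r = np.array([1.0])
stereo = binaural_render(mono, hrir_l, hrir_r)  # shape (2, 1003)
```

With a measured HRTF set, the two placeholder arrays would be replaced by the HRIR pair stored for the azimuth of the bond being announced.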
2. NAVMOL
and would then proceed for each of the distinct directions, 1 o'clock, 2 o'clock, etc., all the way to 12 o'clock again. The volunteers would then rate the sound spatialization from 1 to 4, with 1 being "I could not determine the sound's directions at all" and 4 being "I could hear all the directions perfectly".
Afterwards, the volunteers heard the sentence again, but coming from 10 random directions, and we asked them to identify the direction of the sound (in terms of hours on an analog clock). The correct answers were counted for each of the 53 sets of HRTFs.
As expected, even though all HRTF sets had positive scores, different volunteers gave different scores to each HRTF set. The five best distinct HRTF sets across all volunteers were chosen: CIAIR, KEMAR, IRC05, IRC25 and IRC44.
Only five volunteers participated in this procedure because the tasks described took many hours for each person. Afterwards we performed another test with 10 volunteers, but using only the five HRTF sets mentioned above. Again, the order of the HRTF sets was random.
For this second test, we used the sentence "atom at X o'clock" (without replacing X by 1, 2, ..., 12). This test started with a training phase during which the volunteers heard that sentence in clockwise order (for two complete turns), with the direction of the sound varying accordingly.
The main task of the test was performed after the training period. The volunteers heard the sentence 10 times (from random directions) and were asked to identify the direction of the sound. Since human localization accuracy can diverge by up to a few degrees from the correct direction, we considered correct all answers that were displaced by less than 30° to either side of the correct direction (for instance, if the correct direction was 3 o'clock, we considered 2, 3 and 4 o'clock correct and all other answers incorrect) [19].
Figure 4 shows the number of correct answers for each of the five HRTF sets. As can be observed, while for each subject there is at least one set of HRTFs with more than eight correct answers, as expected the set with the highest number of correct answers varies from subject to subject.
Following these results, Navmol's installation package includes many sets of HRTFs so that users can choose the set they feel most comfortable with, without needing to search for and compile the measurements themselves.

5. RESULTS

In order to assess if users can understand the molecules' structure when Navmol uses sound spatialization, we ran a usability test in which the volunteers had to interpret two molecules using Navmol. Due to the difficulty in recruiting a high number of BVI volunteers to test Navmol, before running the test with BVI people we ran it with sighted volunteers. This way we could guarantee that the expected outcome was obtained and that no corrections or modifications to the software had to be made before running the test with the BVI volunteers.
Before the actual test started, the participants tried the five HRTF sets to determine which worked best for them. They heard the sentence "atom at X o'clock" from 10 random directions (where X was not replaced by the time) and were asked to estimate the sound's direction. The HRTF set with the minimal number of wrong answers (using the same tolerance as above) was chosen and the rest of the test was performed with that set.
After the HRTF set was chosen, there was a training phase. During training, the participants heard the same sentence again, but where X o'clock was substituted by the matching direction (1 to 12 o'clock). They heard the sentence for all 12 directions in clockwise order and for two full circles. Then they heard the same sentence again from 10 random directions, but without information on the actual direction (i.e., X was not replaced by the corresponding time in hours). They were asked to estimate the sound's direction and received feedback on their answers, that is, the training application spoke back the correct direction of the spatialized sound.
After the training period, the main task of the test started. This consisted of exploring two molecules with Navmol (figure 5) and reproducing their structure, which for sighted users consisted of drawing them on paper. Before starting the task, the participants were shown a molecule in Navmol's window and a demonstration of how to explore it. They tried Navmol by themselves so that they could familiarize themselves with the controls, interface and the synthesized
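The ±1 clock-hour (30°) tolerance used to score the localization answers can be sketched as follows; `is_correct` and `score` are illustrative helper names, not the authors' scoring code.

```python
def is_correct(answer_hour, target_hour):
    """An answer counts as correct when it is within one clock hour
    (30 degrees) of the target direction, wrapping around 12 o'clock."""
    diff = abs(answer_hour - target_hour) % 12
    return min(diff, 12 - diff) <= 1

def score(answers, targets):
    """Number of correct answers over a list of trials."""
    return sum(is_correct(a, t) for a, t in zip(answers, targets))

# 2 o'clock is accepted for a 3 o'clock target; 12 o'clock is accepted
# for a 1 o'clock target (wrap-around); 5 o'clock is not accepted for 3.
print(is_correct(2, 3), is_correct(12, 1), is_correct(5, 3))  # True True False
```

The modular difference handles the wrap-around at 12 o'clock, so an answer of 12 for a target of 1 (or vice versa) is still counted as correct.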
Figure 6: Answers from the usability test. (a)-(c) Answers from three sighted participants. (d) Answer from a blind participant.
voice. After they felt confident about how to use Navmol, the task started. For this, the participants had no access to the screen. They could only use the keyboard to explore the molecule and hear its information.
The point of this task was to see if the participants could understand and reproduce the complete molecules' structure from just the spatialized sound cues. All the sentences with the molecule name or constitution were omitted, and only the atom bonds were left intact, but with the direction of the bond omitted. For instance, if the sentence in Navmol's normal version was "atom O 2 at 6 o'clock through a double bond", the sentence in the task would only say "atom O 2 at X o'clock through a double bond". This task had two distinct molecules that were given to the participants in a random order.

5.1. Results from sighted volunteers

We ran the test with 10 chemist volunteers and 15 non-chemist volunteers. The chemist volunteers were aged between 22 and 56 (8 women and 2 men; 6 PhD, 3 MSc and 1 BSc). Two of them reported having some hearing problems. The non-chemist volunteers were university students aged between 18 and 27, all with normal hearing. There were 10 men and 5 women. They all wore headphones to perform the test and used an external keyboard (so that they did not face the laptop screen while Navmol was running).
It was observed that, using only the sound spatialization cues, the participants were able to navigate and understand the molecules of this study. All participants were able to draw the two molecules with only minor errors: on average there were 0.5 errors out of 12 possibilities (12 atoms), with the chemist participants having an average of 0.9 errors and the other group an average of 0.3 errors out of the possible 12. The typical errors were a missing bond in a molecule or the representation of a bond not being in the exact correct direction. Figure 6.a-c shows three examples of typical answers from this test (all representing the same molecule). The leftmost and center answers are completely correct. The rightmost answer is also correct but has some small orientation errors: the top oxygen atom should have been drawn right on top of the C 2 atom, like in the other two answers, and the C 4 atom also has a small orientation error.
After having drawn the complete molecules or most of the atoms and bonds, most of the chemist volunteers recognized the molecules in question, and some of them reported that they used this knowledge to finalize the drawings. While this knowledge could have helped them to perform the task, the answers from the non-chemist volunteers show that people who do not have this type of knowledge can also draw the molecules. Whilst the non-chemist volunteers did not recognize the specific molecules of the test (as expected, since they did not have the same chemistry background as the first group), they could grasp their structures correctly. This shows that Navmol's sound spatialization can successfully transmit the molecular structure information to users.
As reported above, the chemist volunteers had a slightly higher error average than the non-chemists. Some of the people from the chemist group recognized the molecules and used this knowledge to help them perform the task. It appears that this may have led them to make more "mistakes" than the people in the other group, because reproducing a molecular structure from a chemist's point of view is not necessarily reproducing a drawing. Such a molecular structure is chemically defined by the elements of the atoms, their connectivity and the bond orders, independently of the orientation of the molecule.

5.2. Results from BVI volunteers

In spite of our efforts, we only managed to have four BVI volunteers for this test: two blind male participants and two low vision female participants. The blind participants were 54 and 67 years old. The youngest was blind since he was 2.5 years old, and the oldest was blind from birth. The low vision participants were 22 and 46 years old. The youngest was a university student, who had been visually impaired since she was 7 years old, and the oldest had low vision from birth. The three older participants had a bachelor's degree.
The same protocol as described above was used but, due to the vision impairments of the volunteers, some adjustments were made. The instructions were printed both in Braille (for blind volunteers) and with a size 16 font (for low vision volunteers). We also read the instructions to them if they preferred it that way.
While the low vision participants were able to draw the molecules, the blind participants used another technique. We provided the
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
blind volunteers with a set of magnets that they could arrange on a metal board to replicate the molecules' structure (Figure 7). There was a set of circular magnets (representing atoms) and rectangular magnets (representing connections between atoms).
All four BVI volunteers correctly represented the two molecules without any errors. Figure 6.d shows the answer from one of the blind volunteers to the same molecule as that of Figure 6.a-c. This answer is correct and, as can be observed, represents the same structure as the drawings in Figure 6.a-c. This shows that Navmol's sound spatialization is appropriate to transmit the molecular structure information to BVI users.
Having said that, there is still another question about the sound spatialization that we have not mentioned: is Navmol better suited to transmit the molecular structure information to BVI users with or without spatial audio? While users can choose to use Navmol with or without spatial audio, according to the youngest low vision volunteer it is easier to interpret the molecules' structure with spatial audio. This volunteer had participated in a previous Navmol test with no spatial audio [9]. She said that the sound spatialization helped very much with the mental conceptualization of the molecules. She noted that, with the previous Navmol version, users had to wait until the end of the sentence that described the atom ("atom ... at ... o'clock...") to know where the atom was positioned. Now, with the sound spatialization, she could identify the direction in which the atom was located right from the beginning of the sentence. She mentioned that this helped her immensely to build the correct mental representation of the structure of the molecules.

6. CONCLUSIONS

While spatialized sound has been used before in computer games, here we propose to use it in an application that can make a difference in the lives of BVI students. In order to help these special needs users to understand molecular structures, we are developing Navmol, a molecular navigator/editor for BVI users that describes molecular structures with synthesized voice. As a contribution to help these users more easily conceive a mental representation of the structures, we propose to use spatial audio. Thus, we modified the speech synthesizer's signal with HRTFs such that, when users hear the molecule description, they are immersed in a virtual scene where they stand on the atoms of the molecule and hear the information coming from different angles of the azimuth, [...]

7. ACKNOWLEDGMENTS

We thank all the volunteers who participated in the tests. Special thanks to the team from ACAPO and from Biblioteca Nacional de Portugal, namely to Carlos Ferreira and Maria Aldegundes, for all the help with the tests with BVI subjects.
This work was funded by Fundação Calouste Gulbenkian under project "Ciência para todos: STEREO+", and Fundação para a Ciência e a Tecnologia through LAQV, REQUIMTE: UID/QUI/50006/2013, and NOVA-LINCS: PEest/UID/CEC/04516/2013.

8. REFERENCES

[1] INE, Resultados Definitivos Censos 2011, Instituto Nacional de Estatística, Lisboa, Portugal, 2012.
[2] SNR, Inquérito Nacional às Incapacidades, Deficiências e Desvantagens - Resultados Globais, Secretariado Nacional de Reabilitação, Caderno n.8, retrieved Sep/2008 from http://www.inr.pt/content/1/117/informacao-estatistica.
[3] INE, Resultados Definitivos Censos 2001, Instituto Nacional de Estatística, Lisboa, Portugal, 2002.
[4] APPLICA & CESEP & ALPHAMETRICS, "Men and women with disabilities in the EU: Statistical analysis of the LFS ad hoc module and the EU-SILC, final report," 2007.
[5] V. Lounnas, H. Wedler, T. Newman, G. Schaftenaar, J. Harrison, G. Nepomuceno, R. Pemberton, D. Tantillo, and G. Vriend, "Visually impaired researchers get their hands on quantum chemistry: application to a computational study on the isomerization of a sterol," J. Comput. Aid. Mol. Des., vol. 28, no. 11, pp. 1057–1067, 2014.
[6] V. Lounnas, H. Wedler, T. Newman, J. Black, and G. Vriend, "Blind and Visually Impaired Students Can Perform Computer-Aided Molecular Design with an Assistive Molecular Fabricator," Lecture Notes in Computer Science, vol. 9043, pp. 18–29, 2015.
[7] "MOLinsight web portal," retrieved Apr/2017 from http://molinsight.net.
[8] F. Pereira, J. Aires-de-Sousa, V.D.B. Bonifácio, P. Mata, and A. M. Lobo, "MOLinsight: A Web Portal for the Processing of Molecular Structures by Blind Students," Journal of Chemical Education, vol. 88, pp. 361–362, 2011.
ABSTRACT

Numerical modelling of the 3-D wave equation can result in very accurate virtual auralisation, at the expense of computational cost. Implementations targeting modern highly-parallel processors such as NVIDIA GPUs (Graphics Processing Units) are known to be very effective, but are tied to the specific hardware for which they are developed. In this paper, we investigate extending the portability of these models to a wider range of architectures without the loss of performance. We show that, through development of portable frameworks, we can achieve acoustic simulation software that can target other devices in addition to NVIDIA GPUs, such as AMD GPUs, Intel Xeon Phi many-core CPUs and traditional Intel multi-core CPUs. The memory bandwidth offered by each architecture is key to achievable performance, and as such we observe high performance on AMD as well as NVIDIA GPUs (where high performance is achievable even on consumer-class variants despite their lower floating point capability), whilst retaining portability to the other less-performant architectures.

1. INTRODUCTION

The finite difference time domain method (FDTD) is a well-known numerical approach for acoustic modelling of the 3D wave equation [1]. Space is discretised into a three-dimensional grid of points, with data values resident at each point representing the acoustic field at that point. The state of the system evolves through time-stepping: the value at each point is repeatedly updated using finite differences of that point in time and space. The so-called stencil of points involved in each update is determined by the choice of discretisation scheme for the partial differential operators in the wave equation [2]. This numerical approach is relatively computationally expensive, but amenable to parallelisation. In recent years, there has been good progress in the development of techniques to exploit modern parallel hardware. In particular, NVIDIA GPUs have proven to be a very powerful platform for acoustic modelling [3][4]. Most work in this area has involved writing code using the NVIDIA-specific CUDA language, which is tied to this platform. Ideally, however, any software should be able to run in a portable manner across different architectures, such that the performance of alternatives can be explored, and different resources can be exploited both as and when they become available. It is non-trivial, however, to develop portable application source code that can perform well across different architectures: this issue of performance portability is currently of great interest in the general field of High Performance Computing (HPC) [5].
In this paper, we explore the performance portability issue for FDTD numerical modelling of the 3D wave equation. To enable this study, we have developed different implementations of the simulation, each performing the same task, including a CUDA implementation that acts as a "baseline" for use on NVIDIA GPUs, plus other more portable alternatives. This enables us to assess performance across multiple different hardware solutions: NVIDIA GPUs, AMD GPUs, Intel Xeon Phi many-core CPUs and traditional multi-core CPUs. The work also includes the development of a simple, adjustable abstraction framework, where the flexibility comes through the use of templates and macros (for outlining and substituting code fragments) to facilitate different implementation and optimisation choices for a room acoustics simulation. Both basic and advanced versions of an FDTD algorithm that simulates sound propagation in a room (i.e. a cuboid) are explored.
This paper is structured as follows: in Section 2, we give necessary background information on the computational scheme, the hardware architectures under study and the associated programming models; in Section 3, we describe the development of a performant, portable and productive room acoustics simulation framework; in Section 4 we outline the experimental setup for investigating different programming approaches and assessing performance; in Section 5 we present and analyse performance results; and finally we summarise and discuss future work in Section 6.

2. BACKGROUND

This paper is focused on assessing different software implementations of a room acoustics simulation across different types of computing hardware. While this project focuses strictly on single-node development, the ideas in this work could easily be extended for use across multi-node platforms by coupling with a message-passing framework. In this section we give some necessary background details. We first describe the FDTD scheme used in this study. We then give details on the hardware architectures that we wish to assess and the native programming methods for these architectures. Finally we describe some pre-existing parallel programming frameworks that provide portability.
2.1. FDTD Room Acoustics Scheme

Wave-based numerical simulation techniques, such as FDTD, are concerned with deriving algorithms for the numerical simulation of the 3D wave equation:

    ∂²Ψ/∂t² = c²∇²Ψ    (1)

Here, Ψ(x, t) is the dependent variable to be solved for (representing an acoustic velocity potential, from which pressure and velocity may be derived), as a function of spatial coordinate x ∈ D ⊂ R³ and time t ∈ R⁺.
In standard finite difference time domain (FDTD) constructions, the solution Ψ is approximated by a grid function Ψ^n_{l,m,p}, representing an approximation to Ψ(x = (lh, mh, ph), t = nk), where l, m, p, n are integers, h is a grid spacing in metres, and k is a time step in seconds. The FDTD scheme used for this work is given in [6], and is the standard scheme with a seven-point Laplacian stencil. According to this scheme, updates are calculated as follows:

    Ψ^{n+1}_{l,m,p} = (2 − 6λ²) Ψ^n_{l,m,p} + λ²S − Ψ^{n−1}_{l,m,p}    (2)

where

    S = Ψ^n_{l+1,m,p} + Ψ^n_{l−1,m,p} + Ψ^n_{l,m+1,p} + Ψ^n_{l,m−1,p} + Ψ^n_{l,m,p+1} + Ψ^n_{l,m,p−1}.    (3)

The constant λ = ck/h is referred to as the Courant number and must satisfy the stability condition λ ≤ 1/√3. It can be seen that each grid value is updated based on a combination of two previous values at the same location, and contributions from each of the six neighbouring points in three dimensions (giving the six terms in S; see Figure 1).

Figure 1: A 7-point stencil on a three-dimensional grid.

The benchmarks used in this study are run over 13–105 million grid points at a sample rate of 44.1 kHz to produce 100 ms of sound. The boundary conditions used are frequency-independent impedance boundary conditions. We also include, towards the end of the paper, some results for comparison where a larger stencil is used, as described in [3][7].

2.2. Hardware and Native Programming Methods

The hardware architectures used in this work are multi-core CPU, GPU, and many-core CPU (Intel Xeon Phi) platforms. CPUs have traditionally been the main workhorses for scientific computing and are still targeted by the majority of scientific applications. GPUs have been gaining traction in the scientific computing landscape over the last decade, and can offer significant performance improvements over traditional CPUs for certain algorithms, including wave-based ones like acoustic simulation. Xeon Phi chips are essentially multi-core CPUs, but with a large number of cores, each with lower clock speeds and wider vector units. In this section, we describe these architectures in more detail and discuss how they are typically programmed.

2.2.1. Traditional Multi-Core CPUs

Computational simulations (like acoustic models) have historically been run on CPU chips. However, these architectures were originally optimised for sequential codes, whereas scientific applications are typically highly parallel in nature. Since the early 2000s, clock speeds of CPUs have stopped increasing due to physical constraints. Instead, multi-core chips have dominated the market, meaning CPUs are now parallel architectures.
OpenMP [8] is one framework designed for running on shared-memory platforms like CPUs. It is a multi-threading parallel model which uses compiler directives to determine how an algorithm is split and assigned to different threads, where each thread can utilise one of the multiple available cores. OpenMP emphasises ease of use over control; however, there are settings to determine how data or algorithmic splits occur.

2.2.2. GPUs

GPUs were originally designed for accelerating computations for graphics, but have increasingly been re-purposed to run other types of codes [9]. As such, the architecture of GPUs has evolved to be markedly different from CPUs. Instead of having a few cores, they have hundreds or thousands of lightweight cores that operate in a data-parallel manner and can thus do many more operations per second than a traditional CPU. However, most scientific applications, including those in this paper, are more sensitive to memory bandwidth: the rate at which data can be loaded and stored from memory. GPUs offer significantly higher memory bandwidths than traditional CPUs because they use graphics memory, though they are not used in isolation but as "accelerators" in conjunction with "host" CPUs. Programming a GPU is more complicated than a CPU, as the programmer is responsible for offloading computation to the GPU with a specific parallel decomposition, as well as managing the distinct memory spaces.
In this paper we investigate acoustic simulation algorithms using NVIDIA and AMD GPUs. NVIDIA GPUs are most commonly programmed using the vendor-specific CUDA model [10], which extends C, C++ or Fortran. CUDA provides functionality to decompose a problem into multiple "blocks", each with multiple "CUDA threads", with this hierarchical abstraction designed to map efficiently onto the hardware, which correspondingly comprises multiple "Streaming Multiprocessors" each containing multiple "CUDA cores". CUDA also provides a comprehensive API to allow memory management on the distinct CPU and GPU memory spaces. CUDA is very powerful, but low-level and non-portable. AMD GPUs are similar in nature to NVIDIA GPUs, but since there is no equivalent vendor-specific AMD programming model, the most common programming method is to use the cross-platform OpenCL, which we discuss in Section 2.3.1. For simplicity, we will use the same terminology to describe the OpenCL framework as for CUDA.
2.3.1. OpenCL

OpenCL [12] is a cross-platform API designed for programming heterogeneous systems. It is similar in nature to CUDA, albeit with differing syntax. Whereas CUDA acts as a language extension as well as an API, OpenCL acts only as the latter, resulting in the need for more boilerplate code and a more low-level programming experience. OpenCL can, however, be used as a portable alternative to CUDA, as it can be executed on NVIDIA, AMD and other types of GPUs, as well as many-core CPUs such as the Intel Xeon Phi. OpenCL is compatible with the C and C++ programming languages.

2.3.2. targetDP

The targetDP programming model [13] is designed to target data-parallel hardware in a platform-agnostic manner, by abstracting the hierarchy of hardware parallelism and memory systems in a way which can map onto either GPUs or multi/many-core CPUs (including the Intel Xeon Phi). At the application level, targetDP syntax augments the base language (currently C/C++), and this is mapped to either CUDA or OpenMP threads (plus vectorisation in the latter case), depending on which implementation is used to build the code. The mechanism used is a combination of C-preprocessor macros and libraries. As described in [13], the model was originally developed in tandem with the complex fluid simulation package Ludwig, where it exploits parallelism resulting from the structured grid-based approach. The lightweight design facilitates integration into complex legacy applications, but the resulting code remains somewhat low-level, so it lacks productivity and programmability.

3. PORTABLE AND PRODUCTIVE FRAMEWORK DEVELOPMENT

The room acoustics simulation codes used in this work were previously tied to a specific platform (NVIDIA GPUs) through their CUDA implementation. In this study we compare the CUDA (pre-existing), OpenCL, and targetDP frameworks (all with C as a base language). In addition, we introduce the newly developed abstraction framework titled abstractCL (with C++ as a base language). In this section we describe our approach to enhancing performance portability and productivity through the development of this new framework for investigating room acoustics simulation codes. First, details about the implementation are discussed, followed by the benefits of creating such a framework for the field of acoustics modelling.

3.1. Overview

abstractCL was created to generate room simulation kernels on the fly, depending on the type of run a user wants to do. The variations can be between different data layouts of the grid passed in to represent the room, hardware-specific optimisations, or both. This is done by swapping in and out relevant files that include overloaded functions and definitions used in the main algorithm itself. The data abstractions and optimisations investigated for this project include: thread configuration settings, memory layouts and memory optimisations. Algorithmic changes can be introduced by adding new classes to the current template for more complicated codes.
The platforms used for this study are specified in Table 1. Included are two NVIDIA GPUs, two AMD GPUs, an Intel Xeon Phi many-core CPU and a traditional Intel Xeon CPU. Of the two NVIDIA GPUs, the consumer-class GTX 780 has much reduced double-precision floating point capability compared to the high-end K20 variant, but offers higher memory bandwidth. Of the AMD GPUs, the R9 295X2 has higher specifications than the R280 and produces the best results overall.

4.1.2. Memory Bandwidth Reference Benchmark

As described in Section 2, memory bandwidth can be critical to obtaining good performance (and this will be confirmed in Section 5). It is therefore important to assess our results relative to what [...]

5.1. Overall Performance Comparison Across Hardware

In this section we give an overview of performance achieved across the different hardware platforms, to assess the capability of the hardware for this type of problem. Figure 2 shows the time taken for the test case where we choose the best-performing existing software on each architecture. It can be seen that the GPUs show similar times, with the AMD R9 295X2 performing fastest. This result translates to 8466 megavoxel updates per second. Another point of interest is that the consumer-class NVIDIA GTX 780 is significantly faster for the CUDA implementation than the K20, even though it has many times lower double-precision floating point capability. This is because, as discussed further in Section 5.3, memory bandwidth is more important than compute for this type of problem.
Table 1: Specification of Different Hardware Architectures Used. Note that the "Ridge Point" is the ratio of Peak GFlops to Peak Bandwidth (in the terminology of the roofline model).

Platform            Cores / Stream Processors   Peak Bandwidth (GB/s)   Peak GFlops (Double Precision)   Ridge Point (Flops/Byte)   Memory (MB)
AMD R9 295X2        2816                        320                     716.67                           2.24                       4096
AMD R280            2048                        288                     870                              3.02                       3072
NVIDIA GTX 780      2304                        288.4                   165.7                            0.57                       3072
NVIDIA K20          2496                        208                     1175                             5.65                       5120
Xeon Phi 5110P      60                          320                     1011                             3.16                       8000
Intel Xeon E5-2620  24                          42.6                    96                               2.25                       16000
Figure 2: Original fastest (optimised) versions across platforms for simple room acoustics simulation of room size 512×512×404. The timings shown produce 100ms of sound.

Figure 3: Original timings of the simple room acoustics simulation for room size 512×512×404 on all architectures tested. The timings shown produce 100ms of sound.
The Xeon Phi is seen to offer lower performance than the GPUs, but remains faster than the traditional CPU.

5.2. Performance Comparison of Software Versions

In this section we analyse the differences in performance resulting from running different software frameworks on a particular platform. In Figure 3, it can be seen how the performance depends on the software version. The main result is that for each architecture the timings are comparable, in particular in comparison to the abstractCL version. On the NVIDIA GPU, there is a small overhead for the portable frameworks (OpenCL, abstractCL and targetDP) relative to the use of CUDA, but we see this as a small price to pay for portability to the other platforms. In particular, the newly developed framework abstractCL shows comparable performance to the original benchmarks and those written in OpenCL, indicating that it is possible to build performant, portable and productive room acoustics simulations across different hardware.

5.3. Hardware Capability Discussion

In this section, we further analyse the observed performance in terms of the characteristics of the underlying hardware. The roofline model, developed by Williams et al. [21], can be used to determine for a given application the relative importance of floating point computation and memory bandwidth capabilities. It uses the concept of "Operational Intensity" (OI): the ratio of operations (in this case double-precision floating point operations) to bytes accessed from main memory. The OI, in Flops/Byte, can be calculated for each computational kernel. A similar measure (also given in Flops/Byte and known as the "ridge point") exists for each processor: the ratio of peak operations per second to the memory bandwidth of the processor. Any kernel which has an OI lower than the ridge point is limited by the memory bandwidth of the processor, and any which has an OI higher than the ridge point is limited by the processor's floating point capability.
The OI for the application studied in this paper is 0.54, which is lower than the ridge points of any of the architectures (given in Table 1), indicating that this simplified version of the application is memory bandwidth bound across the board (and thus not sensitive to floating point capability). This explains why the NVIDIA GTX 780 performs so well despite the fact that it has very low floating point capability: its ridge point of 0.57 is still (just) higher than the application OI.
The blue columns in Figure 4 give observed data throughput (volume of data loaded/stored by the application divided by runtime, assuming perfect caching) for each of the architectures. The black lines give the peak bandwidth capability of the hardware (as reported in Table 1). It can be seen that, for all but the Xeon Phi [...]
[...] measure of what is achievable. It can be seen that our results are, in general, achieving a reasonable percentage of STREAM, but we still have room for improvement. In addition to running STREAM, profiling was also done on a selection of the runs. These results showed higher values than our measured results, indicating that our assumption of perfect caching (see above) is not strictly true, and there may be scope to reorganise our memory access patterns to improve the caching. The Xeon Phi is seen to achieve a noticeably lower percentage of peak bandwidth relative to the other architectures, and this warrants further investigation.

[Figure 4: observed bandwidth against the peak and STREAM benchmark bandwidth limits for each architecture.]

5.4. Optimisation

Two optimisation methods were explored for the GPU platforms: shared and texture memory. Texture memory is a read-only or write-only memory that is distinct from global memory and uses separate caching, allowing quicker fetching for specific types of data (in particular, ones that take advantage of locality). The use of texture memory is relatively straightforward in CUDA, requiring only an additional keyword for input parameters. OpenCL is restricted to an earlier version on NVIDIA GPUs which does not support double precision in texture memory, so these results are not included. Shared memory is specific to one of the lightweight "cores" of a GPU (a streaming multiprocessor on NVIDIA) and can only be shared between threads utilising that core, so it can be useful for data re-use. A 2.5D tiling method was used to take advantage of shared memory [6].
Figure 5 shows the results for this experiment, where the optimised versions are the fastest versions found in this project per version per platform. Three different types of codes were run: sharedtex uses shared and texture memory (for the CUDA version), shared uses only shared memory, and none uses no memory optimisations. As before, different platforms are indicated by separate segments of the graphs. The reason for comparing to an optimised thread configuration version instead of the original run was to isolate what effect the memory optimisations had. Both room sizes (large and small) were run, but the large rooms had more significant differences in performance, so they are the only results shown. Overall, the abstractCL version showed the most consistent improvement; however, in this graph it is more clear that it is not [...]

Figure 5: Memory optimisations for CUDA, OpenCL and abstractCL versions of the room acoustics benchmarks run for room size of 512×512×404 on an NVIDIA and AMD GPU. The timings shown produce 100ms of sound.

5.5. Advanced Simulations

Results in this paper have thus far only been presented for a very simplified problem: wave propagation is assumed lossless and the simplest FDTD scheme (employing a seven-point stencil) is used. It is thus of interest to explore more complex models of wave propagation, as well as improved simulation designs. Two new features were investigated for these so-called "advanced" codes: air viscosity (see [6]) and larger stencil schemes of leggy type (see [3]). The main algorithmic difference in adding viscosity is that another grid is passed into the main kernel and more computations are performed. Schemes operating over leggy stencils require access to grid values beyond the nearest neighbours. For this investigation, 19-point leggy stencils were introduced (three points in each of the six Cartesian directions as well as the central point; see Figure 1 in [3]). The comparison was done on a smaller scale, however, with the intention only of giving a general idea of whether or not the codes performed similarly. Thus, only CUDA and OpenCL versions were tested. The variations were run on the two NVIDIA GPUs, the two AMD GPUs and the Xeon Phi for both small and large room sizes.
Results of the performance of these advanced codes are displayed in Figure 6, which shows the performance timings and the memory bandwidth of the advanced codes for the various implementations for the larger room size. In these graphs, the following versions are included: cuda (the original version), cuda_adv (cuda with viscosity), cuda_leggy and cuda_leggyadv, together with the corresponding OpenCL versions (opencl, opencl_adv, opencl_leggy, opencl_leggyadv). The graphs are set up in a similar way as those found in Figures 3 and 4, where the versions of the code run along the x-axis, the performance is on the y-axis (in seconds), and the separated parts of the graph indicate the platform the code was run on. Here, the top graph shows performance and the bottom shows bandwidth.

Generally, the performance profile of the advanced codes looks similar to what can be seen for the original codes in Section 5.1: the codes still run fastest on the AMD R9 295X2 and slowest on the Xeon Phi. The versions run on the AMD R9 295X2, in comparison to the same versions on the K20, hover between being 43%-54% faster. For both large and small rooms, the leggy codes are slower than the viscosity codes for both OpenCL and CUDA on everything except the K20. The combination advanced codes (*_leggyadv) are significantly slower across the board, particularly on the Xeon Phi.
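Because these codes are bandwidth-bound, a useful sanity check when porting them is to compare measured throughput against a simple estimate: minimum bytes moved divided by elapsed time. The sketch below is our own illustrative CPU-side NumPy version of such a measurement, not the paper's CUDA/OpenCL benchmark; the grid size, step count, and 7-point update are assumptions.

```python
import time
import numpy as np

def effective_bandwidth(n=256, steps=10):
    """Time a 7-point Laplacian-style stencil sweep and report the
    achieved memory bandwidth in GB/s, counting only the compulsory
    traffic (one read and one write of the grid per step)."""
    u = np.zeros((n, n, n))
    u1 = np.random.rand(n, n, n)
    t0 = time.perf_counter()
    for _ in range(steps):
        u[1:-1, 1:-1, 1:-1] = (
            u1[:-2, 1:-1, 1:-1] + u1[2:, 1:-1, 1:-1]
            + u1[1:-1, :-2, 1:-1] + u1[1:-1, 2:, 1:-1]
            + u1[1:-1, 1:-1, :-2] + u1[1:-1, 1:-1, 2:]
            - 6.0 * u1[1:-1, 1:-1, 1:-1]
        )
        u, u1 = u1, u  # ping-pong the two grids
    elapsed = time.perf_counter() - t0
    bytes_moved = steps * 2 * u.size * 8  # read + write of 8-byte doubles
    return bytes_moved / elapsed / 1e9
```

Comparing this number against the platform's peak (or STREAM) bandwidth gives the kind of calculated-versus-limit picture plotted in the bottom panel of Figure 6.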
The purpose of this analysis is to see what effect algorithmic changes (i.e. number of inputs or floating point operations) in the same benchmark have when run on the same platform. How the performance changes with differences to the models of the rooms can vary quite a bit across platforms, which echoes the results seen when comparing local memory optimisations. When comparing original versions to the leggy, viscosity or combination versions, OpenCL codes are 1.4-6.4x slower for the combination versions. When this is limited to AMD GPUs, the difference is only 1.4x slower; for NVIDIA GPUs, it is 3-3.6x slower. In comparison, the CUDA version on the NVIDIA GPUs varies from 1.6-2.4x slower for the combination version. For the stand-alone leggy and viscosity versions, this difference is much less pronounced; however, the same trend remains: OpenCL versions retain better performance with changes on AMD platforms and significantly worse than CUDA versions on NVIDIA GPUs. These differences cannot wholly be attributed to specification differences, given that the difference exists for all these versions between the AMD R280 and NVIDIA GTX 780, which share some similar specifications.

Figure 6: Timing (top) and Bandwidth (bottom) for different versions of the advanced room acoustics benchmarks across platforms for room size 512×512×404. Smaller room results are not included as they show a similar pattern.

6. SUMMARY AND FUTURE WORK

In this paper we have shown that it is possible to implement room acoustics simulations in a way that allows the same source code to execute with good performance across a range of parallel architectures. Prior to this work, such simulations were predominantly tied to NVIDIA GPUs, and we have now extended applicability to other platforms that were previously inaccessible. We have found that the main indicator of how the application will perform on a given architecture is the memory bandwidth offered by that architecture, due to the fact that the algorithm has low operational intensity. The best performing platforms are AMD and NVIDIA GPUs, due to their high memory bandwidth capabilities. The AMD R9 295X2 has the highest peak bandwidth of the GPUs tested, and was correspondingly found to be the best performing platform for the room acoustics simulation codes. In addition, we found the consumer-class NVIDIA GTX780 outperforms the HPC-specific NVIDIA K20 variant despite it having many times lower floating point capability, due to the insensitivity of the application to computational capability, as well as the higher memory bandwidth of the former. Traditional CPUs have much lower memory bandwidth than GPUs, and measured performance was correspondingly low. The Intel Xeon Phi platform offers a high theoretical memory bandwidth, but we were unable to achieve a reasonable percentage of this in practice.

Performance-portable frameworks, including the abstractCL framework designed to allow flexibility in implementation options, were able to achieve similar performance to native methods, with only relatively small overheads (that we consider a small price to pay for the benefits offered by portability). However, relatively low-level programming is still required for these frameworks (including explicit parallelisation and data management). An ideal framework would offer performance portability whilst allowing an intuitive definition of the scientific algorithm.
Future work will adapt a higher level framework named LIFT [22], currently under research and development at the University of Edinburgh, to enable it for 3D wave-based stencil computations. This framework aims, through automatic code generation, to allow execution of an application across different hardware architectures in a performance-portable and productive manner. Future work will also look into extending abstract frameworks such as LIFT to modelling these codes across multiple parallel devices.

7. ACKNOWLEDGEMENTS

We thank Brian Hamilton and Craig Webb for providing the benchmarks used in this project and for answering questions about acoustics. We would also like to thank Michel Steuwer for providing invaluable advice about GPUs and OpenCL. This work was supported in part by the EPSRC Centre for Doctoral Training in Pervasive Parallelism, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L01503X/1) and the University of Edinburgh.

8. REFERENCES

[1] Dick Botteldooren, "Finite-Difference Time-Domain Simulation Of Low-Frequency Room Acoustic Problems," The Journal of the Acoustical Society of America, vol. 98, no. 6, pp. 3302-3308, 1995.
[2] Stefan Bilbao, Brian Hamilton, Alberto Torin, et al., "Large Scale Physical Modeling Sound Synthesis," in Stockholm Musical Acoustics Conference (SMAC), 2013, pp. 593-600.
[3] Brian Hamilton, Craig J. Webb, Alan Gray, and Stefan Bilbao, "Large Stencil Operations For GPU-Based 3-D Acoustics Simulations," in Proc. Digital Audio Effects (DAFx), Trondheim, Norway, 2015.
[4] Niklas Röber, Martin Spindler, and Maic Masuch, "Waveguide-Based Room Acoustics Through Graphics Hardware," in ICMC, 2006.
[5] "Compilers and More: What Makes Performance Portable?," https://www.hpcwire.com/2016/04/19/compilers-makes-performance-portable/.
[6] Craig Webb, Parallel Computation Techniques for Virtual Acoustics and Physical Modelling Synthesis, Ph.D. thesis, University of Edinburgh, 2014.
[7] Jelle Van Mourik and Damian Murphy, "Explicit Higher-Order FDTD Schemes For 3D Room Acoustic Simulation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 2003-2011, 2014.
[8] OpenMP Architecture Review Board, "OpenMP Application Program Interface Version 4.5," November 2015, http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf. Accessed: 2016-08-12.
[9] David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-On Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2nd edition, 2013.
[10] NVIDIA, "Programming Guide: CUDA Toolkit Documentation," http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Accessed: 2016-08-12.
[11] James Reinders, "An Overview of Programming for Intel Xeon Processors and Intel Xeon Phi Coprocessors," https://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf. Accessed: 2016-08-05.
[12] Khronos OpenCL Working Group, "OpenCL Specification, Version 2.2," 2016, https://www.khronos.org/registry/OpenCL/specs/opencl-2.2.pdf. Accessed: 2016-08-12.
[13] Alan Gray and Kevin Stratford, "A Lightweight Approach To Performance Portability With TargetDP," The International Journal of High Performance Computing Applications, p. 1094342016682071, 2016.
[14] Murray Cole, Algorithmic Skeletons: Structured Management of Parallel Computation, Ph.D. thesis, University of Edinburgh, 1989.
[15] Yongpeng Zhang and Frank Mueller, "Autogeneration and Autotuning Of 3D Stencil Codes On Homogeneous And Heterogeneous GPU Clusters," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 3, pp. 417-427, 2013.
[16] Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin, and Markus Püschel, "Operator Language: A Program Generation Framework For Fast Kernels," in Domain-Specific Languages, Springer, 2009, pp. 385-409.
[17] Cédric Nugteren, Henk Corporaal, and Bart Mesman, "Skeleton-based Automatic Parallelization Of Image Processing Algorithms For GPUs," in Embedded Computer Systems (SAMOS), 2011 International Conference on, IEEE, 2011, pp. 25-32.
[18] Joshua Auerbach, David F. Bacon, Perry Cheng, et al., "Growing A Software Language For Hardware Design," 1st Summit on Advances in Programming Languages (SNAPL 2015), vol. 32, pp. 32-40, 2015.
[19] Christian Lengauer, Sven Apel, Matthias Bolten, et al., "ExaStencils: Advanced Stencil-Code Engineering," in European Conference on Parallel Processing, Springer, 2014, pp. 553-564.
[20] John D. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19-25, Dec 1995.
[21] Samuel Williams, Andrew Waterman, and David Patterson, "Roofline: An Insightful Visual Performance Model For Multicore Architectures," Communications of the ACM, vol. 52, no. 4, pp. 65-76, 2009.
[22] Michel Steuwer, Christian Fensch, Sam Lindley, and Christophe Dubach, "Generating Performance Portable Code Using Rewrite Rules: From High-Level Functional Expressions To High-Performance OpenCL Code," ACM SIGPLAN Notices, vol. 50, no. 9, pp. 205-217, 2015.
[23] George Teodoro, Tahsin Kurc, Jun Kong, et al., "Comparative Performance Analysis Of Intel Xeon Phi, GPU And CPU," arXiv preprint arXiv:1311.0378, 2013.
ABSTRACT

When measuring room impulse responses using swept sinusoids, it often occurs that the sine sweep room response recording is terminated soon after either the sine sweep ends or the long-lasting low-frequency modes fully decay. In the presence of typical acoustic background noise levels, perceivable artifacts can emerge from the process of converting such a prematurely truncated sweep response into an impulse response. In particular, a low-pass noise process with a time-varying cutoff frequency will appear in the measured room impulse response, a result of the frequency-dependent time shift applied to the sweep response to form the impulse response.

Here, we detail the artifact, describe methods for restoring the impulse response measurement, and present a case study using measurements from the Berkeley Art Museum shortly before its demolition. We show that while the difficulty may be avoided using circular convolution, nonlinearities typical of loudspeakers will corrupt the room impulse response. This problem can be alleviated by stitching synthesized noise onto the end of the sweep response before converting it into an impulse response. Two noise synthesis methods are described: the first uses a filter bank to estimate the frequency-dependent measurement noise power and then filter synthesized white Gaussian noise. The second uses a linear-phase filter formed by smoothing the recorded noise across perceptual bands to filter Gaussian noise. In both cases, we demonstrate that by time-extending the recording with noise similar to the recorded background noise, we can push the problem out in time such that it no longer interferes with the measured room impulse response.

1. INTRODUCTION

In order to study room acoustics, one must measure accurate impulse responses of the space. These measurements are often challenging and time consuming to acquire. Large spaces frequently exhibit poor background noise levels, so acousticians often employ methods to improve the signal-to-noise ratio (SNR). In particular, linear and logarithmic sine sweep measurements have been shown to be highly effective [1]. Researchers have identified some of the problems and their solutions for using sine sweep measurements to study room reverberation. Specifically, [2] and [3] address issues of time-smearing, clicks/plosives, and pre/post equalization. Further, [4] discusses issues related to clock drift.

This paper addresses another problem that is often encountered when measuring impulse responses with sine sweeps in noisy environments, one that has not been previously discussed: what happens when the sweep response recording is stopped too early. This is common, as it is challenging to maintain quiet after the room response has appeared to decay into the noise floor. Plosives and impulsive noises will be converted into descending sweeps, and any such noises in the silence following the recorded sweep will be detrimental to the conversion process. As it turns out, one needs to record a duration of background noise equivalent to the length of the sweep following the response's decay into the noise floor to ensure that the problem will not be encountered.

If one does not record enough silence (background noise) after the sine sweep response, the resulting impulse response will contain a time-varying low-pass filter characteristic imprinted on its noise floor. This paper addresses the cause of these artifacts and two methods for alleviating the problem in post-production. Related to this problem, [5] and [6] discuss methods for extending impulse responses through the noise floor; however, there are implications for how the noise floor is measured based on these new results.

The rest of the paper is organized as follows. In §2 we review the process of converting a sine sweep measurement into an impulse response and introduce the problem associated with stopping the recording prematurely. In §3 we introduce two methods for pre-processing the response sweep by extending the recording with synthesized noise that matches the background noise present in the recording. Next, §4 presents examples that demonstrate how our method for preprocessing the sweep response is desirable compared to zero padding or circular convolution. Finally, §5 presents concluding remarks.

2. CONVERTING SINE SWEEPS TO IMPULSE RESPONSES

Linear and logarithmic sine sweeps can be used to measure room impulse responses and work under the principle that the sweep can be viewed as an extremely energetic impulse smeared out in time. A linear sweep, in which the sinusoid frequency increases linearly over time, can be defined as

x(t) = sin( ω₀t + ((ω₁ − ω₀)/T) · t²/2 ),  t ∈ [0, T],  (1)

where ω₀ is the initial frequency in radians, ω₁ the final frequency, and T the total duration. A logarithmic sweep, in which the sinusoid frequency increases exponentially over time, can be defined using the same variables as

x(t) = sin( (ω₀T / ln(ω₁/ω₀)) · [ exp( (t/T) ln(ω₁/ω₀) ) − 1 ] ).  (2)
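Eqs. (1) and (2) translate directly into code. The sketch below is a minimal NumPy rendering; the function names and the discretization t = k/fs are ours, not the paper's.

```python
import numpy as np

def linear_sweep(f0, f1, T, fs):
    """Linear sweep, Eq. (1): x(t) = sin(w0 t + (w1 - w0)/T * t^2/2)."""
    t = np.arange(int(T * fs)) / fs
    w0, w1 = 2 * np.pi * f0, 2 * np.pi * f1
    return np.sin(w0 * t + (w1 - w0) / T * t**2 / 2)

def log_sweep(f0, f1, T, fs):
    """Logarithmic sweep, Eq. (2); the instantaneous frequency is
    w0 * exp((t/T) ln(w1/w0)), reaching w1 exactly at t = T."""
    t = np.arange(int(T * fs)) / fs
    w0, w1 = 2 * np.pi * f0, 2 * np.pi * f1
    L = np.log(w1 / w0)
    return np.sin(w0 * T / L * (np.exp(t / T * L) - 1))
```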
To convert sine sweep responses to impulse responses, one acyclically convolves an equalized, time-reversed version of the original sine sweep, x̃, with the recorded response such that

h(t) = y(t) ∗ x̃(t).  (3)

This works because the group delay is canceled by the time reversal, with the equalization compensating for the relative time spent in each frequency band,

δ(t) = x(t) ∗ x̃(t).  (4)

This deconvolution processing aligns the response to the original sine sweep in time, effectively forming the impulse response. As a linear sweep traverses any given bandwidth in the same length of time, irrespective of the starting frequency, the equalization is a constant, independent of frequency. For a logarithmic sweep, which spends the same length of time traversing any given octave, an exponential equalization is applied to compensate for the disproportionate amount of low-frequency energy applied to the system. Naturally, the calculation is efficiently computed in the frequency domain¹ as

h(t) = F⁻¹[ Y(ω) / X(ω) ].  (5)

One of the benefits of using sine sweep measurements over other integrated-impulse response measurement techniques, such as maximum length sequences, is that many nonlinearities produce only harmonics of an input sinusoid, and the deconvolution process will place the onset of unwanted harmonic component responses before the onset of the linear component of the system response. Using logarithmic sweeps, the time shift of the harmonic distortion components of the response is controlled via the length of the sweep, and the linear response is easily separated from the harmonic distortion response. While [2, 7, 8] and others have shown that useful information can be extracted from the higher order components, that is not the concern of this paper. For acquiring a linear room impulse response, one can simply window out the distortion artifacts that precede the linear response.

The fundamental problem this paper explores is caused by the desire to use acyclic convolution rather than circular convolution to convert the sine sweep response into separate linear "impulse response" and non-linear system responses. The difficulty results from the presence of additive measurement noise; Eq. (3) is actually

h(t) = [y(t) + n(t)] ∗ x̃(t).  (6)

While in spaces with good SNR the resulting artifact may be nearly inaudible, in spaces with poor SNR this unwanted effect becomes quite clear.

The naïve solution to this problem would be to use circular convolution rather than linear convolution. However, this is not ideal. Circular convolution will reconstruct the noise floor and solve the issue of the filtering effect, shifting the noise corresponding to times before the sweep to the end of the response, but any nonlinearities in the measurement will also be shifted to occur during the desired impulse response. Weak nonlinearities are all but guaranteed when measuring an acoustic space, as loudspeakers are inherently nonlinear at levels useful in impulse response measurement. The effect of circular convolution will corrupt the measurement.

Fig. 1 shows a contrived example that highlights the difference between linear and circular convolution using a linear sweep and noise signal. Under circular convolution, the noise statistics ought to stay constant (i.e., n(t) ∼ n(t) ⊛ x̃(t)). Because of zero padding, linear convolution causes the noise to occupy a longer period of time. Over the course of the beginning of the processed block, the noise starts from the high frequencies, and lower frequencies enter according to the slope of the frequency trajectory of the sweep. At the end of the file, the opposite is true: the high frequencies stop before the lower frequencies. If this noise were the background noise in a space, the effect of the frequencies stopping at different times would be heard as a filter with a time-varying cutoff frequency.

Fig. 2 demonstrates the difference between cyclic and acyclic convolution when there are nonlinearities present. In the acyclic convolution, the nonlinear components precede the desired linear response, while in the cyclical convolution they corrupt the linear response.

Ideally, a sufficiently long segment of background noise following the sine sweep is recorded so as to avoid additional processing. In real spaces, low frequencies typically take longer to decay than high frequencies. Ascending sweeps help hide issues caused by the premature ending of a recording, as the low frequency components are provided more time to decay while higher frequencies are being excited in the space. In order to guarantee that the noise roll-off problem will not be encountered during the impulse response, an additional amount of background noise of length T must be captured following the decay of the room response to x(t). The longer the sine sweep, the more patient one must be when recording the impulse response. (In addition, transients occurring during the time after the system response has decayed may corrupt the measured impulse response.)
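In discrete time, the frequency-domain deconvolution of Eq. (5) and the acyclic/circular distinction look as follows. The small regularizing floor eps is our addition, to keep out-of-band bins from dividing by nearly zero; it is not part of the paper's formulation.

```python
import numpy as np

def sweep_to_ir(y, x):
    """Acyclic deconvolution per Eq. (5): h = F^{-1}[Y(w)/X(w)].
    Padding to len(y)+len(x)-1 makes the implicit convolution
    linear rather than circular."""
    n = len(y) + len(x) - 1
    X = np.fft.rfft(x, n)
    eps = 1e-12 * np.max(np.abs(X))  # assumed regularization (see text)
    return np.fft.irfft(np.fft.rfft(y, n) / (X + eps), n)

def sweep_to_ir_circular(y, x):
    """Same division without padding: circular deconvolution, which
    wraps pre-sweep noise and nonlinear components into the response."""
    n = max(len(y), len(x))
    X = np.fft.rfft(x, n)
    eps = 1e-12 * np.max(np.abs(X))
    return np.fft.irfft(np.fft.rfft(y, n) / (X + eps), n)
```

Running sweep_to_ir on a response that is simply a delayed copy of the excitation recovers an impulse at the delay, which is the alignment property described above.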
Figure 1: Spectrograms of noise, inverse (linear) sweep, acyclic convolution, and circular convolution.

Figure 2: Non-linear sweep response, inverse (logarithmic) sweep, acyclic convolution, and circular convolution.

be spliced. Our goal is to capture both the overall characteristic of the background noise as well as any local features necessary to make an imperceptible transition. If there is a noticeable mismatch between the recorded noise and the synthesized noise, descending chirps will be introduced into the impulse response by the deconvolution process. If there is not enough isolated noise at the end of the file, it is possible to use a portion of background noise from somewhere else (i.e., preceding the sweep or another recording taken at the same time).

3.1. Band-pass Filtered Noise

In the first approach, we aim to match the spectrum of the background noise by analyzing the amplitude of the recorded noise in a set of frequency bands, and apply these amplitudes to synthesized white Gaussian noise to correctly color it.

We define the analysis noise n(t) as 500 ms of recorded noise from the recording we intend to match. Additionally, we generate an i.i.d. Gaussian noise sequence, denoted γ(t). In this method, we form a synthesis signal s(t) such that

|S(ω)| ∼ |N(ω)|.  (7)

We use a perfect reconstruction, zero-phase filter bank to split both n(t) and 500 ms of γ(t) into K composite frequency bands,

n(t) = Σₖ nₖ(t),  k ∈ [1, 2, …, K],  (8)

and

γ(t) = Σₖ γₖ(t).  (9)

Our filter bank consists of a cascade of squared 3rd-order Butterworth filters with center frequencies spaced 1/4 octave apart. The signals are filtered both in the forwards and backwards directions so that the phase is unaltered. The perfect reconstruction aspect of this filter bank is important because we use it both for analysis and synthesis.

Once separated into bands, we estimate the gain coefficient in each frequency band of both n(t) and γ(t) by computing the RMS level on the steady-state portion of the filtered signals. To compute the synthesis noise s(t), we color a sufficiently long² amount of γ(t) by scaling each frequency band by the ratio of measured analysis to synthesis gains and sum the result,

s(t) = Σₖ ( RMS[nₖ(t)] / RMS[γₖ(t)] ) γₖ(t).  (10)

At this point, the steady state portion of s(t) is a block of noise with the same magnitude frequency band response as the analysis signal n(t). An example impulse response and magnitude spectrum resulting from this filter bank can be seen in Fig. 3. We found that 1/4 octave bands were sufficient to match the synthesis and analysis noises' magnitude spectrum. We then use a 50 ms long equal-power cross-fade between the end of y(t) and the beginning of the steady state portion of s(t). After this, it is safe to convert the sweep response into an impulse response as done in Eq. (5).
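A simplified sketch of the band-pass method (Eq. (10)) and the 50 ms equal-power splice follows. SciPy's sosfiltfilt gives the forwards-and-backwards (squared-magnitude, zero-phase) Butterworth filtering described above, though this is not the authors' exact perfect-reconstruction bank; the band-edge range is an assumption.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def quarter_octave_centers(f_lo=100.0, f_hi=8000.0):
    """Quarter-octave-spaced center frequencies (range assumed)."""
    centers = []
    f = f_lo
    while f < f_hi:
        centers.append(f)
        f *= 2 ** 0.25
    return centers

def color_noise(n_rec, gamma, fs):
    """Eq. (10): scale each band of white noise gamma by the ratio of
    RMS levels measured on the recorded noise n_rec, then sum bands."""
    rms = lambda v: np.sqrt(np.mean(v ** 2))
    s = np.zeros_like(gamma)
    for fc in quarter_octave_centers():
        lo, hi = fc * 2 ** -0.125, fc * 2 ** 0.125  # quarter-octave edges
        sos = butter(3, [lo, hi], btype="bandpass", fs=fs, output="sos")
        nk = sosfiltfilt(sos, n_rec)  # zero-phase: forwards + backwards
        gk = sosfiltfilt(sos, gamma)
        s += (rms(nk) / rms(gk)) * gk
    return s

def equal_power_splice(y, s, fs, fade_ms=50.0):
    """Equal-power cross-fade (cos/sin, so the powers sum to one)
    from the end of the recording y into the synthesized noise s."""
    n = int(fs * fade_ms / 1000.0)
    t = np.linspace(0.0, np.pi / 2.0, n)
    cross = y[-n:] * np.cos(t) + s[:n] * np.sin(t)
    return np.concatenate([y[:-n], cross, s[n:]])
```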
3.2. ERB-Smoothed Noise

Our second method synthesizes noise that is perceptually similar to the analysis noise via a filter design technique. We define γ(t) and n(t) in the same way as above, in §3.1. We window both noise signals with a Hann window and take their Fourier transforms.

² Sufficient here depends on how prematurely the recording was halted.
Figure 3: Band-pass filterbank impulse response and magnitude spectrum.

Figure 4: ERB-smoothed noise filter impulse response and magnitude spectrum.

In the frequency domain, we smooth both signals on a critical band basis by averaging the power within the DFT bins of each critical band such that

N̂(ω) = Σ |N(ζ)|²,  ζ from f(b(ω) − 1/2) to f(b(ω) + 1/2),  (11)

and

Γ̂(ω) = Σ |Γ(ζ)|²,  ζ from f(b(ω) − 1/2) to f(b(ω) + 1/2),  (12)

where b(·) defines a critical bandwidth. This results in N̂(ω) and Γ̂(ω), the critical band smoothed versions of N(ω) and Γ(ω). This processing reduces the complexity and detail of the noise signals' spectra but should not be perceptually audible. We then impart the spectrum of the smoothed analysis noise upon the smoothed synthesis noise in the frequency domain to obtain the transfer function

G(ω) = N̂(ω) / Γ̂(ω).  (13)

We then find a linear-phase version of this transfer function,

G_lin(ω) = |G(ω)| e^(−jτω).  (14)

Returning G_lin(ω) to the time domain as seen in Fig. 4, we filter γ(t) with g_lin(t) such that

s(t) = γ(t) ∗ g_lin(t).  (15)

Last, we splice the synthesized noise onto y(t) with a 50 ms equal-power cross-fade as described above.
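The ERB-smoothing pipeline of Eqs. (11)-(15) can be sketched as below. We use the Glasberg-Moore rule ERB(f) = 24.7 + 0.108f as the critical bandwidth b(·), and take the filter magnitude as the square root of the smoothed power ratio; both are our assumptions where the text leaves details open.

```python
import numpy as np

def erb_smooth(power, freqs):
    """Eqs. (11)-(12): average the power spectrum over one critical
    band around each bin (ERB(f) = 24.7 + 0.108 f, an assumption)."""
    out = np.empty_like(power)
    for i, f in enumerate(freqs):
        bw = 24.7 + 0.108 * f
        sel = (freqs >= f - bw / 2) & (freqs <= f + bw / 2)
        out[i] = power[sel].mean()
    return out

def erb_filter_noise(n_rec, gamma, fs):
    """Eqs. (13)-(15): build a linear-phase filter whose magnitude is
    sqrt(N_hat/Gamma_hat) and apply it to the white noise gamma."""
    L = len(n_rec)
    win = np.hanning(L)
    freqs = np.fft.rfftfreq(L, 1 / fs)
    N_hat = erb_smooth(np.abs(np.fft.rfft(n_rec * win)) ** 2, freqs)
    G_hat = erb_smooth(np.abs(np.fft.rfft(gamma[:L] * win)) ** 2, freqs)
    mag = np.sqrt(N_hat / (G_hat + 1e-20))
    tau = (L // 2) / fs                      # linear-phase delay, Eq. (14)
    G_lin = mag * np.exp(-1j * 2 * np.pi * freqs * tau)
    g_lin = np.fft.irfft(G_lin, L)           # time-domain filter
    return np.convolve(gamma, g_lin)         # Eq. (15)
```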
4. EVALUATION

To preserve the acoustics of the Berkeley Art Museum (BAM) at 2626 Bancroft Way in Berkeley, CA, acoustic measurements were taken by Jacqueline Gordon and Zackery Belanger and their team shortly before it was demolished in 2015 [9]. This space, designed by Mario Ciampi, has a very long reverberation time due to its large, concrete structure. Since this space exhibited a high noise floor, a long sine sweep was employed in order to improve the SNR. Circumstances led to the recordings being prematurely ended, and to the discovery of the problem described in this paper. Fig. 5 shows an example recorded (short) response sweep as well as versions extended with noise synthesized with our two methods. Fig. 6 shows the impulse responses computed from these sweeps. Both visually and aurally, the noise-extended approaches achieve better results than the unprocessed version.

Both approaches for synthesizing noise produce reasonable results, and the power spectrum for the analysis noise and both varieties of synthesis noise can be seen in Fig. 7. On average, the synthesis noise tracks the nature of the room sound well, and both synthesis methods produce perceptually similar results. As it turns out, it is equally important to have a smooth cross-fade between the recorded and synthesized noise, as sudden changes will create perceptual artifacts.

5. CONCLUSIONS

In this paper, we discuss how converting a sine sweep response to an impulse response requires a sufficient amount of recording time beyond what is necessary to capture the room response after it has perceptually decayed. While it is naturally best to record this in the physical space, it is not always possible, as at the Berkeley Art Museum.
Figure 5: Spectrograms for signal sweep and sweep responses with no treatment, band-passed noise, and ERB smoothed noise.

Figure 7: Spectrum of synthesized noise (red) averaged over 500 simulations compared to analyzed noise (blue) for the band-pass filter method (top) and ERB-smoothed method (bottom).
We propose two methods for extending the recording after the fact; both depend on analyzing recorded background noise and using this information to color Gaussian noise. One technique measures the frequency-dependent amplitude levels with a bank of band-pass filters, while the other involves filtering Gaussian noise with a perceptual-based filter. Both methods work well and eliminate the undesired artifact from the resulting impulse response.

Naturally, an impulse response with a large noise floor is not ideal. In such cases, the techniques described above can be used to prepare an impulse response measurement for further processing, such as described in [6], to extend the measurement through the noise floor. The result of such processing applied to another of the BAM measurements appears in Fig. 8. Note that the extended impulse response shows none of the time-varying low-pass filtering artifacts present in the original measured impulse response.

6. ACKNOWLEDGMENTS

The authors would like to thank Jacqueline Gordon for bringing the BAM acoustics measurement project [9] to our attention, and for setting us onto this work.

7. REFERENCES

[1] Angelo Farina, "Simultaneous measurement of impulse response and distortion with a swept-sine technique," in Audio Engineering Society Convention 108, Feb 2000.
[2] Swen Müller and Paulo Massarani, "Transfer-function measurement with sweeps," J. Audio Eng. Soc., vol. 49, no. 6, pp. 443-471, 2001.
[3] Angelo Farina, "Advancements in impulse response measurements by sine sweeps," in Audio Engineering Society Convention 122, May 2007.
[4] Jonathan S. Abel, Nicholas J. Bryan, and Miriam A. Kolar, "Impulse Response Measurements in the Presence of Clock Drift," in Audio Engineering Society Convention 129, Nov 2010.
[5] Jean-Marc Jot, Laurent Cerveau, and Olivier Warusfel, "Analysis and synthesis of room reverberation based on a statistical time-frequency model," in Audio Engineering Society Convention 103, Sep 1997.
[6] Nicholas J. Bryan and Jonathan S. Abel, "Methods for extending room impulse responses beyond their noise floor," in Audio Engineering Society Convention 129, Nov 2010.
[7] Antonin Novak, Laurent Simon, Frantisek Kadlec, and Pierrick Lotton, "Nonlinear system identification using exponential swept-sine signal," IEEE Transactions on Instrumentation and Measurement, vol. 59, no. 8, pp. 2220-2229, 2010.
[8] Jonathan S. Abel and David P. Berners, "A technique for nonlinear system measurement," in Audio Engineering Society Convention 121, Oct 2006.
[9] Jacqueline Gordon and Zackery Belanger, "Acoustic deconstruction of 2626 Bancroft Way, Kickstarter project 632980650," 2015.
Section 5 presents some preliminary results, and Section 6 discusses potential improvements and applications.

2. MODAL REVERBERATOR DESIGN

Given a reverberation impulse response measurement h(t) and an input signal x(t), one can obtain the reverberated output signal y(t) from x(t) by y(t) = h(t) ∗ x(t), where '∗' denotes convolution. The modal synthesis approach [1] approximates h(t) by a sum of M parallel components h_m(t), each corresponding to a resonant mode of the reverberant system h(t). In the z-domain, each parallel term can be realized as a recursive second-order system H_m(z). This is expressed as

    Y(z) = \hat{H}(z) X(z) = \sum_{m=1}^{M} H_m(z) X(z),    (1)

where \hat{H}(z) is a 2M-order digital approximation of h(t), and each m-th parallel term H_m(z) is

    H_m(z) = (g_{0,m} + g_{1,m} z^{-1}) R_m(z)    (2)

with real numerator coefficients g_{0,m} and g_{1,m}, and

    R_m(z) = \frac{1}{1 + a_{1,m} z^{-1} + a_{2,m} z^{-2}}    (3)

is a resonator with real coefficients defined by a pair of complex-conjugate poles p_m and p_m^*. The denominator coefficients are related to pole angle and radius by a_{1,m} = -2 |p_m| \cos \angle p_m and a_{2,m} = |p_m|^2, and define the m-th modal frequency f_m and bandwidth \beta_m via f_m = f_s \angle p_m / 2\pi and \beta_m = -f_s \log |p_m| / \pi respectively, where f_s is the sampling frequency in Hz. The numerator coefficients g_{0,m} and g_{1,m} are used to define the complex gain of the m-th mode [17].

2.1. Design problem

The problem of designing \hat{H}(z) from a given measurement h(t) involves a modal decomposition, and it can be posed as

    \minimize_{p, g_0, g_1} \; \varepsilon(\hat{H}, H),    (4)

where p is a set of M complex poles inside the upper half of the unit circle on the z-plane, g_0 and g_1 are two sets of M real coefficients, and \varepsilon(\hat{H}, H) is an error measure between the model and the measurement. To find good approximations of dense impulse responses, one needs to face decompositions on the order of hundreds or thousands of highly overlapping modes. In this work we solve this non-linear problem in two steps: first, we find a convenient set of M modes via constrained optimization of the complex poles p; second, we obtain the modal complex gains by solving a linear problem for the coefficients g_0, g_1, as described in Section 2.2.

2.2. Estimation of modal gains

Given a target frequency response H(e^{jω}) and M complex poles p_1 ⋯ p_m ⋯ p_M, it is straightforward to solve for the numerator coefficients by formulating a linear problem. Let vector h = [h_1 ⋯ h_k ⋯ h_K]^T contain K samples of H(e^{jω}) at normalized angular frequencies 0 ≤ ω_k < π, i.e., h_k = H(e^{jω_k}). Likewise, let vector r_m^0 = [r_{m,1}^0 ⋯ r_{m,k}^0 ⋯ r_{m,K}^0]^T sample the frequency response of R_m(z), with r_{m,k}^0 = R_m(e^{jω_k}), and let vector r_m^1 = [r_{m,1}^1 ⋯ r_{m,k}^1 ⋯ r_{m,K}^1]^T sample the frequency response of z^{-1} R_m(z), with r_{m,k}^1 = e^{-jω_k} R_m(e^{jω_k}). Next, let Q be the K × 2M matrix of basis vectors constructed as Q = [r_1^0 ⋯ r_M^0, r_1^1 ⋯ r_M^1]. Finally, let vector g contain the numerator coefficients arranged as g = [g_{0,1} ⋯ g_{0,M}, g_{1,1} ⋯ g_{1,M}]^T. Now we can solve the least-squares projection problem

    \minimize_{g} \; \| Q g - h \|^2.    (5)

2.3. Pole optimization

Recently we proposed methods for modal decomposition of string instrument bridge admittance measurements (e.g., [14, 16]) using constrained pole optimization via quadratic programming, as described in [15], with complexities in the order of dozens of modes. In such methods, a quadratic model is used at each step to numerically approximate the gradient of an error function that spans the full frequency band. Because in this work we deal with decompositions of much higher order, using this technique to simultaneously optimize hundreds or thousands of poles is impractical. Therefore, we propose here an extension of the method that allows several pole optimizations to be carried out separately, for different overlapping frequency bands. Optimized poles are then collected and merged into a single set from which numerator coefficients are estimated by least squares as described above, leading to (5).

3. INITIALIZATION PROCEDURE

A trusted initial set of pole positions is essential for nonlinear, non-convex optimization. Once the order of the system is imposed, pole initialization comprises two main steps: modal frequency estimation and modal bandwidth estimation. Modal frequency estimation is based on spectral peak picking from the frequency response measurement, while modal bandwidths are estimated by analyzing the energy decay profile as obtained from a time-frequency representation of the impulse-response measurement.

To impose a modal density given the model order M, we assume that the decreasing frequency resolution of human hearing at higher frequencies makes it unnecessary to implement the ever-increasing modal density of reverberant systems. To that end, we devised a peak-picking procedure to favor a uniform modal density over a warped frequency axis approximating a Bark scale [18].

3.1. Estimation of modal frequencies

Estimation of modal frequencies is based on analysis of the log-magnitude spectrum Υ(f) of the impulse response. From Υ(f), all N peaks in f_min ≤ f < f_max, with 0 ≤ f_min < f_max < f_s/2, are found by collecting all N local maxima after smoothing of Υ(f). For each peak, a corresponding peak salience descriptor is computed by frequency-warping and integration of Υ(f) around the peak, as described in [16].

Next, a total of B adjacent bands are defined between f_min and f_max, each spanning delimiting frequencies f_l^b to f_r^b and following a Bark scale. This is carried out by uniformly dividing a Bark-warped frequency axis ϖ into B portions spread between warped frequencies ϖ_min and ϖ_max, and then mapping the delimiting warped frequencies ϖ_l^b and ϖ_r^b to their linear frequency counterparts. To map between linear and warped frequencies, we use the arctangent approximation of the Bark scale as described in [18].
From all N initial peaks and corresponding annotated frequencies and saliences, picking of M modal frequencies is carried out via an iterative procedure that starts from B lists of candidate peaks, each containing the initial N_b peaks that lie inside band b. First, we define the following variables: P as the total number of picked peaks, N as the total number of available peaks, and N_b as the number of available peaks in the b-th band. The procedure for peak picking is detailed next:

 1: if N ≤ M then
 2:   pick all N peaks
 3: else
 4:   P ← 0
 5:   while N − (M − P) > 0 and M > P do
 6:     A ← ⌊(N − P)/B⌋
 7:     C ← (N − P) mod B
 8:     b ← 1
 9:     while b ≤ B do
10:       D ← A + C mod b
11:       E ← min(D, N_b)
12:       pick the highest-salience E peaks in list b; remove those E peaks from list b
13:       N_b ← N_b − E
14:       N ← N − E
15:       P ← P + E
16:     end while
17:   end while
18: end if

Once the M peaks have been selected, parabolic interpolation of Υ(f) around each m-th peak is used to define the m-th modal frequency f_m.

3.2. Estimation of modal bandwidths

Modal bandwidth estimation starts from computing a spectrogram of the impulse response, leading to a time-frequency log-magnitude representation Υ(t, f) from which we want to estimate the decay rate of all L frequency bins within f_min ≤ f < f_max. Several methods have been proposed for decay rate estimation in the context of modal reverberation (see [3] and references therein). Since in our case we only aim at providing a trusted initial estimate of the bandwidth of each mode, we employ a simple method based on linear interpolation of a decaying segment of the magnitude envelope Υ(t, f_l) of each l-th frequency bin, as follows.

The decaying segment for the l-th bin is delimited by a start time t_s^l and an end time t_e^l. First, t_s^l is defined as the time at which Υ(t, f_l) has fallen 10 dB below the envelope maximum. Second, t_e^l is defined as the time at which Υ(t, f_l) has reached a noise floor threshold Υ_th. To set Υ_th, we first estimate the mean μ_n and standard deviation σ_n of the noise floor magnitude, and then define Υ_th = μ_n + 2σ_n.

By linear interpolation of the decay segment of each l-th frequency bin, the decay rate τ_l is obtained from the estimated slope, and the bandwidth β_l is computed as β_l = 1/(π τ_l), leading to the bandwidth profile β(f_l). Finally, each modal bandwidth β_m is obtained by interpolation of β(f_l) at its corresponding modal frequency f_m.

3.3. Pole positioning

Since optimization is carried out in the z-plane, we use the estimated frequencies and bandwidths to define M initial pole positions inside the unit circle by computing their angle ω_m and radius |p_m| as ω_m = 2π f_m / f_s and |p_m| = e^{−π β_m / f_s}. We use p^x to denote the vector of initial pole positions.

4. POLE OPTIMIZATION ALGORITHM

Pole optimization is carried out by dividing the problem into many optimization subproblems, each focused on one of B frequency bands and dealing with a subset of the initial poles. In each b-th subproblem, we optimize only those poles whose corresponding modal frequencies lie inside the b-th frequency band. To help with mode interaction around the band edges, we configure the bands to be partially overlapping. After optimization, we collect optimized poles only from the non-overlapping regions of the bands.

Prior to segmenting into bands or performing any optimization, we use the set p^x of initial M poles to solve problem (5). This leads to initial gain coefficient vectors g_0^x and g_1^x. These, together with the initial poles p^x, are used to support band pre-processing and optimization as detailed below.

4.1. Band preprocessing

We first uniformly divide the Bark-warped frequency axis ϖ into B adjacent bands between ϖ_min and ϖ_max. The band edges of the b-th band are ϖ_l^b and ϖ_r^b. Next, an overlapping region is added to each side of each band, extending the edges from ϖ_l^b and ϖ_r^b to ϖ_L^b and ϖ_R^b respectively. The outer edges are defined as ϖ_L^b = ϖ_l^b − δ^b (ϖ_r^b − ϖ_l^b) and ϖ_R^b = ϖ_r^b + δ^b (ϖ_r^b − ϖ_l^b), with δ^b a positive real number. The outer edges define the b-th (extended) optimization frequency band, on which the b-th optimization problem is focused. This is illustrated in Figure 1, where each optimization frequency band comprises three subbands: a center subband matching the original, non-overlapping band, and two side subbands.

[Figure 1: adjacent Bark-warped frequency bands, each extended by overlapping side regions to form its optimization band.]

Once the optimization bands have been configured on the warped frequency axis ϖ, all band edges are mapped back to the linear frequency axis f. Then, for each b-th band, two sets of poles p^b and p^¬b are created from the initial set p^x as follows. Set p^b includes all U poles whose modal frequencies f_u are in f_L^b ≤ f_u < f_R^b, while set p^¬b includes the remaining O poles (with O = M − U), i.e., those poles whose modal frequency f_o is not in f_L^b ≤ f_o < f_R^b. Moreover, from the initial gain vectors g_0^x and g_1^x we retrieve the gains corresponding to poles in p^b and to poles in p^¬b, leading to two pairs of gain vectors g_0^b, g_1^b and g_0^¬b, g_1^¬b. Finally, with the arranged sets of poles and gains, we construct two models H^b(z) and H^¬b(z) of the form of \hat{H}(z) in (1). The first model,

    H^b(z) = \sum_{u=1}^{U} (g_{0,u}^b + g_{1,u}^b z^{-1}) R_u^b(z),    (6)

with resonators R_u^b(z) constructed from the poles p^b, will be optimized to solve the b-th band subproblem (see Section 4.2). The second model,

    H^{¬b}(z) = \sum_{o=1}^{O} (g_{0,o}^{¬b} + g_{1,o}^{¬b} z^{-1}) R_o^{¬b}(z),    (7)

with resonators R_o^{¬b}(z) constructed from the poles p^¬b, has fixed coefficients and is used to pre-synthesize a frequency response that serves as a constant offset during optimization of model (6) above (see Section 4.3). We include this offset response to account for how the (fixed) off-band modes contribute to the frequency response of model (6) during optimization.

The optimized pole parameters are constrained by the inequalities w^− < w_u^b < w^+ and s^− < s_u^b < s^+, where the '−' and '+' superscripts indicate lower and upper bounds, respectively. We solve this problem by means of sequential quadratic programming [19]. At each step, the error surface is quadratically approximated by successive evaluations of the band approximation error function described in Section 4.3.
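The role of the offset model can be made concrete with a small sketch of the band approximation error evaluated at each optimization step: the in-band model (6) is synthesized from the candidate poles and gains, added to the pre-synthesized off-band response (7), and compared against the target on the band's frequency grid. The squared-magnitude error norm used here is an assumption; the exact error function of Section 4.3 is not reproduced in this excerpt.

```python
import numpy as np

def band_error(poles_b, g0_b, g1_b, offset, h_target, omega):
    """Approximation error for the b-th subproblem.
    poles_b, g0_b, g1_b: in-band poles and gains of model (6)
    offset:   pre-synthesized off-band response H^{-b} (Eq. 7)
    h_target: target frequency response on the band grid
    omega:    band's normalized angular frequencies."""
    z1 = np.exp(-1j * omega)
    Hb = np.zeros_like(omega, dtype=complex)
    for p, g0, g1 in zip(poles_b, g0_b, g1_b):
        a1 = -2.0 * np.abs(p) * np.cos(np.angle(p))   # Eq. (3) coefficients
        a2 = np.abs(p) ** 2
        Hb += (g0 + g1 * z1) / (1.0 + a1 * z1 + a2 * z1**2)
    # Fixed off-band modes enter only as a constant additive offset.
    return np.sum(np.abs(Hb + offset - h_target) ** 2)
```

An SQP routine would call this repeatedly while perturbing the in-band pole parameters, leaving `offset` untouched.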
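For illustration, the band-wise peak allocation of Section 3.1 can be read as follows: each pass spreads the remaining quota of picks evenly over the B bands (with the first C bands taking one extra), takes the highest-salience candidates from each band, and finally sweeps any shortfall by salience. This is one plausible reading of the published pseudocode, whose exact quota arithmetic may differ:

```python
def pick_modal_peaks(band_peaks, M):
    """Distribute M peak picks across B bands to favor uniform modal
    density over the warped frequency axis.
    band_peaks: list of B lists of (salience, frequency) tuples.
    Returns the picked (salience, frequency) tuples."""
    # Sort each band's candidates by descending salience (copies the input).
    bands = [sorted(b, key=lambda p: p[0], reverse=True) for b in band_peaks]
    N = sum(len(b) for b in bands)
    if N <= M:
        return [p for b in bands for p in b]   # fewer candidates than picks
    picked, P, B = [], 0, len(bands)
    while P < M and N > 0:
        A, C = divmod(M - P, B)                # even share + remainder
        progress = 0
        for i, b in enumerate(bands):
            D = A + (1 if i < C else 0)        # first C bands take one extra
            E = min(D, len(b))
            picked.extend(b[:E])
            del b[:E]
            N -= E
            P += E
            progress += E
        if progress == 0:
            # Quota bands exhausted: sweep remaining peaks by salience.
            rest = sorted((p for b in bands for p in b),
                          key=lambda p: p[0], reverse=True)
            picked.extend(rest[:M - P])
            break
    return picked
```

Bands with too few candidates surrender their quota to later passes, which matches the procedure's intent of redistributing unused picks.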
Figure 2: Schematic representation of pole optimization inside a frequency band, which is mapped to a sector of the unit circle on the z-plane. (The optimization sector comprises a center subsector flanked by two side subsectors.)
As can be perceived from listening to the modeled responses, this leads to a metallic character in the sound.

To get an idea of how the modeling error is reduced during optimization, we compare the error H(ω) − \hat{H}(ω) obtained before and after optimization. This is featured in Figure 4 for M = 1600, where it is possible to observe the coarse envelope of the error decreasing by 5 to 10 dB in all frequency regions.

For a detailed view of how optimization improves the modeling accuracy, Figure 5 shows the magnitude responses of the target measurement (middle), the initial model (bottom) and the optimized model (top) for M = 1600, in three different frequency regions. With regard to the time domain, the initial and optimized impulse response models are compared to the target measurement in Figure 6, where it is possible to observe how optimization helps to significantly reduce the pre-onset ripple of the model.

6. CONCLUSION

We have presented a frequency-domain pole-zero optimization technique to fit the coefficients of a modal reverberator, applied to model a mid-sized room impulse-response measurement. Our method, which includes a pole initialization procedure to favor a constant density of modes over a Bark-warped frequency axis, is based on constrained optimization of pole positions within a number of overlapping frequency bands. Once the pole locations are estimated, the zeros are fit to the measured impulse response using linear least squares. Our initial explorations with example models of a medium-sized room display good agreement between the measured and modeled room responses, demonstrating how our pole optimization technique can be of practical use in modeling problems that require thousands of modes to accurately simulate dense impulse responses.

Our initial results show a promising path for improving the accuracy of efficient modal reverberators. At the same time, optimization could lead to a reduction of the computational cost for a required accuracy. Besides carrying out a more exhaustive exploration of model orders and parameter values (e.g., a frequency-dependent overlapping factor), gathering data from subjective tests could provide a good compromise between perceptual quality and computational cost. In terms of constraint definition, upper and lower bounds are still set by hand; further exploration could lead to methods for designing bounds for pole radii based on a statistical analysis of observed modal decays, thereby avoiding excessive pole selectivity. A potential development aims at making the whole process iterative: use the optimized solution as an initial point, perform the optimization again, and repeat until no improvement is obtained; this could perhaps have a positive effect in the vicinity of band edges, though its significance would need to be tested.

In conclusion, we note that modal synthesis supports a dynamic representation of a space at a fixed cost, so that one may simulate walking through the space, or listening to a moving source, by simply changing the coefficients of the fixed modes being summed. The modal approach may thus be considered competitive with three-dimensional physical models such as Scattering Delay Networks [20, 21, 22] in terms of psychoacoustic accuracy per unit of computational expense. In that direction, an imminent extension of our method deals with simultaneously using N impulse response measurements of the same space (a set of measurements taken for different combinations of source and receiver locations, as previously proposed in [8] for low-order equalization models of the low-frequency region) to optimize a common set of poles: at each step, poles in each frequency band are optimized via minimization of a global error function that simultaneously accounts for the approximation error of N models (one per impulse response) constructed from the same set of poles. A common set of poles suffices for each fixed spatial geometry, or coupled geometries.

7. REFERENCES

[1] J. M. Adrien, "The missing link: Modal synthesis," in Representations of Musical Signals, G. De Poli, A. Piccialli, and C. Roads, Eds., pp. 269–298. MIT Press, 1991.

[2] O. Darrigol, "The acoustic origins of harmonic analysis," Archive for History of Exact Sciences, vol. 61, no. 4, 2007.
Figure 3: Spectrogram of the target impulse response plus three example models with M = 800, 1200, 1800.
[3] M. Karjalainen, P. Antsalo, A. Mäkivirta, T. Peltonen, and V. Välimäki, "Estimation of modal decay parameters from noisy response measurements," Journal of the Audio Engineering Society, vol. 50, no. 11, pp. 867–878, 2002.

[4] M. Karjalainen and H. Järveläinen, "More about this reverberation science: Perceptually good late reverberation," in Audio Engineering Society Convention 111, New York, USA, 2001.

[5] J. S. Abel, S. A. Coffin, and K. S. Spratt, "A modal architecture for artificial reverberation," The Journal of the Acoustical Society of America, vol. 134, no. 5, p. 4220, 2013.

[6] J. S. Abel, S. A. Coffin, and K. S. Spratt, "A modal architecture for artificial reverberation with application to room acoustics modeling," in Proc. of the Audio Engineering Society 137th Convention, 2014.

[7] B. Bank, "Perceptually motivated audio equalization using fixed-pole parallel second-order filters," IEEE Signal Processing Letters, vol. 15, pp. 477–480, 2008.

[8] Y. Haneda, S. Makino, and Y. Kaneda, "Common acoustical pole and zero modeling of room transfer functions," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 320–328, 1994.

[9] Y. Haneda, Y. Kaneda, and N. Kitawaki, "Common-acoustical-pole and residue model and its application to spatial interpolation and extrapolation of a room transfer function," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 6, pp. 709–717, 1999.

[10] A. Mäkivirta, P. Antsalo, M. Karjalainen, and V. Välimäki, "Low-frequency modal equalization of loudspeaker-room responses," in Audio Engineering Society 111th Convention, 2001.

[11] J. S. Abel and K. J. Werner, "Distortion and pitch processing using a modal reverberator," in Proc. of the International Conference on Digital Audio Effects, 2015.

[12] K. J. Werner and J. S. Abel, "Modal processor effects inspired by Hammond tonewheel organs," Applied Sciences, vol. 6, no. 7, p. 185, 2016.

[13] E. Maestre, G. P. Scavone, and J. O. Smith, "Modeling of a violin input admittance by direct positioning of second-order resonators," The Journal of the Acoustical Society of America, vol. 130, p. 2364, 2011.

[14] E. Maestre, G. P. Scavone, and J. O. Smith, "Digital modeling of bridge driving-point admittances from measurements on violin-family instruments," in Proc. of the Stockholm Music Acoustics Conference, 2013.

[15] E. Maestre, G. P. Scavone, and J. O. Smith, "Design of recursive digital filters in parallel form by linearly constrained pole optimization," IEEE Signal Processing Letters, vol. 23, no. 11, pp. 1547–1550, 2016.

[16] E. Maestre, G. P. Scavone, and J. O. Smith, "Joint modeling of bridge admittance and body radiativity for efficient synthesis of string instrument sound by digital waveguides," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 5, pp. 1128–1139, 2017.
[Figure 4: magnitude (dB) versus frequency (Hz) of the target response (top), the initial solution error (middle), and the optimized solution error (bottom).]
Figure 5: Magnitude response of an example model with M = 1600, displayed for three different frequency regions. Each panel shows the target, the optimized model (scaled by −10 dB) and the initial model (scaled by +10 dB).
Figure 6: Initial (top) and optimized (bottom) impulse response example models, for M = 1600, compared to the target measurement (amplitude in dB versus time in seconds).
its position. A regular arrangement of loudspeakers in a sphere can produce an accurate representation (at low frequencies) of the original sound field at the centre, also known as the 'sweet spot' [10] of the sphere. Increasing the Ambisonic order expands the number of channels, introducing greater spatial discrimination of sound sources. Higher-order Ambisonics requires more loudspeakers for playback, but the sound field reproduction around the head is accurate up to a higher frequency [12].

A major issue with Ambisonics is timbre. Depending on the Ambisonic order used, reproduction above the spatial aliasing frequency, relative to the head size, is inaccurate, and noticeable timbral inconsistencies exist when listening. This inconsistency is caused by spectral colouration above the spatial aliasing frequency due to comb filtering, inherent in the summation of coherent loudspeaker signals with multiple delay paths to the ears. Timbre also varies substantially between different loudspeaker layouts, even without changing the Ambisonic order. The unnatural timbre of Ambisonics therefore makes the produced spatial audio easily distinguishable from natural sound fields. This poses significant issues for content creators who desire a consistent timbre between different playback scenarios.

2.2. Diffuse-Field Equalisation

A diffuse-field response refers to the direction-independent frequency response, or the common features of a frequency response at all angles of incidence. This can be obtained from the root-mean-square (RMS) of the frequency responses for infinite directions on a sphere [13]. A diffuse-field response of a loudspeaker array requires an averaging of the frequency responses produced by the loudspeaker array when playing sound arriving from every point on a sphere (including between loudspeaker positions); however, a sufficient approximate diffuse-field response can be calculated from a finite number of measurements. It is important to sample the sound field evenly in all directions so as not to introduce bias in any direction.

Diffuse-field equalisation, also referred to as diffuse-field compensation, can be employed to remove the direction-independent aspects of a frequency response introduced by the recording and reproduction signal chain. A diffuse-field equalisation filter is calculated from the inverse of the diffuse-field frequency response. Diffuse-field compensation of a loudspeaker array is achieved by convolving the output of the array with its inverse filter.

2.3. Individualisation

Though individualised HRTFs produce more accurate localisation cues and more natural timbre than non-individualised HRTFs [14–17], the measurement process is lengthy and requires specific hardware, a stringent set-up and an anechoic environment. Individualised HRTFs are therefore not practical for wide use, and generic HRTFs produced from dummy heads are utilised.

2.4. Headphone Equalisation

The transfer function between a headphone and the eardrum (HpTF) is highly individual [18, 19]. It also varies depending on the position of the headphones on the head: even small displacements of the headphone on the ear can produce large changes in the HpTF. Headphone equalisation has been shown to improve the plausibility of binaural simulations when correctly implemented [1]; however, equalisation based on just one measurement can produce worse results than no equalisation at all [20]. Therefore, headphone equalisation should always be calculated from an average of multiple measurements. This will smooth out the deep notches and reduce the sharp peaks in the inverse filter, which are more noticeable than troughs [21, 22].

Non-individual headphone equalisation can also be detrimental to timbre, unless it is performed using the same dummy head as that used for the binaural measurements, which has been shown to produce even greater naturalness than individual headphone compensation [23].

3. METHOD

This section presents a method for simulating an approximate diffuse-field response and equalisation of three Ambisonic virtual loudspeaker configurations (see Table 1). As this study used virtual loudspeakers, calculations were made using Ambisonic gains applied to head related impulse responses (HRIRs): the time-domain equivalent of HRTFs. No additional real-world measurements were necessary.

The three loudspeaker configurations utilised in this study were the octahedron, cube and bi-rectangle (see Table 1). In this paper, all sound incidence angles are referred to in spherical coordinates of azimuth (denoted by θ) for angles on the horizontal plane, and elevation (denoted by φ) for angles on the vertical plane, with (0°, 0°) representing a direction straight in front of the listener at the height of the ears. Changes in angle are positive moving anticlockwise in azimuth and upwards in elevation.

Table 1: First-Order Ambisonic Loudspeaker Layouts.

  Loudspeaker      Loudspeaker location (θ°, φ°)
  number        Octahedron    Cube         Bi-rectangle
  1             0, 90         45, 35       90, 45
  2             45, 0         135, 35      270, 45
  3             135, 0        225, 35      45, 0
  4             225, 0        315, 35      135, 0
  5             315, 0        45, −35      225, 0
  6             0, −90        135, −35     315, 0
  7                           225, −35     90, −45
  8                           315, −35     270, −45

3.1. Even distribution of points on a sphere

The approximate diffuse-field responses were calculated using 492 regularly spaced points on a sphere. The points were calculated by dividing each of the 20 triangular faces of an icosahedron into 72 sub-triangles, resulting in a polygon with 1470 equally sized faces and 492 vertices. The vertices were then projected onto a sphere, producing 492 spherical coordinates (see Figure 1) [24].

3.2. Ambisonic encoding and decoding

The 492 spherical coordinates were encoded into first-order Ambisonic format, with N3D normalisation and ACN channel ordering. The Ambisonic encode and decode MATLAB functions used in this study were written by Archontis Politis [25]. The MATLAB build used in this study was version 9.0.0 (R2016a). For each of the 492 directions, gains for the W, Y, Z and X channels were produced.
The decode matrices for the three loudspeaker configurations were then calculated, again using N3D normalisation and ACN channel ordering. For each loudspeaker in each configuration, this produced the gain of the four Ambisonic channels (W, Y, Z and X) as determined by the loudspeaker's position.

A dual-band decode method [26] was utilised in this study to optimise the accuracy of localisation produced by the virtual loudspeaker configurations. As ITD is the most important azimuthal localisation factor at low frequencies and no elevation cues exist below 700 Hz [27, 28], a mode-matching decode, which is optimised for the velocity vector [6], was used for this frequency range. Above 700 Hz, the wavelength of sounds becomes comparable to or smaller than the size of the human head (which on average has a diameter of 17.5 cm [29]), and ILD is the most important localisation factor in this frequency range. Above 700 Hz, therefore, mode-matching with a maximised energy vector (MaxRe) [11] was used, which is optimised for energy [30].

The decode matrices for each loudspeaker layout were therefore calculated twice: with and without MaxRe weighting, for the high and low frequencies respectively. The dual-band decode was achieved through a crossover between the two decode methods at 700 Hz. The filters used were 32nd-order linear-phase high-pass and low-pass filters with Chebyshev windows.

The encoded Ambisonic channel gains were then applied to the loudspeaker decode matrices for each loudspeaker configuration, resulting in a single value of gain for each loudspeaker for each of the 492 directions. As this study used virtual loudspeakers for binaural reproduction of Ambisonics, the gain for each loudspeaker was then applied to a HRIR pair measured at the corresponding loudspeaker's location. The HRIRs used in this study were of the Neumann KU 100 dummy head at 44.1 kHz and 16-bit resolution, from the SADIE HRTF database [31].

The resulting HRIRs for each virtual loudspeaker in the configuration were then summed, leaving a single HRIR pair for each of the 492 directions. The HRIR pairs represent the transfer function of the Ambisonic virtual loudspeaker configuration for each direction of sound incidence. This was repeated for the three Ambisonic loudspeaker configurations.

Figure 2: Comparison of the diffuse-field responses of three first-order Ambisonic loudspeaker configurations (left and right ears of the octahedron, cube and bi-rectangle layouts; amplitude in dB versus frequency in Hz).

Inverse filters were calculated from the diffuse-field responses in the frequency range of 20 Hz – 20 kHz, using Kirkeby's least-mean-square (LMS) regularisation method [32], which is perceptually preferred to other regularisation methods [1]. To avoid sharp notches and peaks in the inverse filter, 1/4-octave smoothing was used. By convolving the Ambisonic loudspeaker configurations with their inverse filters, the diffuse-field responses are equalised to within ±1.5 dB of unity in the range of 20 Hz – 15 kHz. The diffuse-field responses, inverse filters and resultant diffuse-field compensated (DFC) frequency responses of the three Ambisonic loudspeaker configurations are presented in Figure 3.

4. EVALUATION

To measure the effectiveness of diffuse-field equalisation of first-order Ambisonics, two listening tests were conducted. The first assessed whether diffuse-field equalisation improves timbral consistency between Ambisonic binaural rendering and diffuse-field equalised HRTF convolution, and the second assessed whether the use of diffuse-field equalisation improves timbral consistency between different first-order Ambisonic virtual loudspeaker configurations.
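The diffuse-field averaging and inverse-filter design described above can be sketched as follows. `regularised_inverse` is a simplified, hypothetical stand-in for Kirkeby's LMS regularisation with an assumed constant `beta`; the study's 1/4-octave smoothing is omitted:

```python
import numpy as np

def diffuse_field_response(hrir_pairs, nfft=4096):
    """RMS magnitude response over all measurement directions and both
    ears, approximating the diffuse-field response of a virtual
    loudspeaker configuration.
    hrir_pairs: array of shape (directions, 2, taps)."""
    H = np.abs(np.fft.rfft(hrir_pairs, n=nfft, axis=-1))
    return np.sqrt(np.mean(H ** 2, axis=(0, 1)))  # RMS over dirs and ears

def regularised_inverse(mag, beta=0.005):
    """Magnitude of a regularised inverse filter: approximately 1/mag
    where energy is high, damped where mag is small. beta is an assumed
    regularisation constant, not a value from the study."""
    return mag / (mag ** 2 + beta)
```

Convolving each configuration's output with a filter built from this inverse magnitude would flatten the diffuse-field response, mirroring the ±1.5 dB result reported above.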
DAFX-3
DAFx-391
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Figure 3: Diffuse-field responses, inverse filters and resultant DFC responses of three first-order Ambisonic loudspeaker configurations.
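The virtual loudspeaker rendering described above — Ambisonic decode gains applied to a measured HRIR pair per loudspeaker, then summed into a single HRIR pair per direction — can be sketched as follows (a minimal illustration with toy arrays, not the authors' code; the 8-loudspeaker count and HRIR length are hypothetical):

```python
import numpy as np

def render_direction(gains, hrirs):
    """Sum gain-weighted HRIR pairs into a single HRIR pair.

    gains : (L,) decode gain per virtual loudspeaker for one source direction
    hrirs : (L, 2, N) measured HRIR pair (left/right, N taps) per loudspeaker
    Returns a (2, N) HRIR pair: the transfer function of the Ambisonic
    virtual loudspeaker configuration for that direction.
    """
    return np.einsum('l,lcn->cn', gains, hrirs)

# Toy example: 8 virtual loudspeakers (e.g. a cube layout), 256-tap HRIRs.
rng = np.random.default_rng(0)
gains = rng.standard_normal(8)
hrirs = rng.standard_normal((8, 2, 256))
hrir_pair = render_direction(gains, hrirs)
assert hrir_pair.shape == (2, 256)
```

Repeating this over all tested directions yields one HRIR pair per direction, as in the study's 492-direction grid.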
The stimulus used in the tests was one second of monophonic pink noise at a sample rate of 44.1 kHz, windowed by an onset and offset Hanning ramp of 5 ms [35] to avoid unwanted audible artefacts. The Ambisonic sound samples were created by encoding the pink noise into first-order Ambisonics (N3D normalisation and ACN channel numbering) and decoding to the three virtual loudspeaker layouts using the same dual-band decode technique and HRTFs as subsection 3.2. The direct HRTF renders were created by convolving the pink noise stimulus with a diffuse-field compensated HRTF pair of the corresponding sound source direction. The HRTFs used for the direct convolutions were from the same HRTF database as used in subsection 3.2. Overall, there were 7 different test configurations: virtual loudspeakers in octahedron, cube and bi-rectangle arrangements, all with and without diffuse-field equalisation, and direct HRTF convolution. To minimise the duration of the experiment, sound source directions were limited to the horizontal plane. For each test configuration, test sounds were generated for four sound source directions (θ = 0◦, 90◦, 180◦ and 270◦).

4.2. Level Normalisation

Each test sound was normalised relative to a frontal incident sound (0◦, 0◦) at a level of −23 dBFS. The RMS amplitude X_RMS of signal X was calculated as

    X_RMS = sqrt( (1/N) * Σ_n X[n]^2 ),

where the sum runs over the N samples of X.

4.3. Headphone Level Calibration

The listening tests were run at an amplitude of 60 dBA, chosen in accordance with Hartmann and Rakerd [36], who found that high sound pressure levels increase error in localisation.

Headphone level was calibrated using the following method: a Genelec 8040B loudspeaker was placed inside an anechoic chamber, emitting pink noise at an amplitude that was adjusted until a sound level meter at a distance of 1.5 m and incidence of (0◦, 0◦) displayed a loudness of 60 dBA. The sound level meter was then replaced with a Neumann KU 100 dummy head at the same 1.5 m distance facing the loudspeaker at (0◦, 0◦), and the input level of the KU 100's microphones was measured.

The loudspeaker was then removed, and the Sennheiser HD 650 headphones to be used in the listening tests were placed on the dummy head. Pink noise convolved with KU 100 HRTFs at (0◦, 0◦) from the SADIE database, which were recorded at a distance of 1.5 m [31], was played through the headphones. The convolved pink noise was normalised to a level of −23 dBFS RMS: the same loudness as the test sounds (subsection 4.2). The output level of the headphones was then adjusted on the audio interface until the input from the KU 100 matched the same loudness as from the loudspeaker. This headphone level was kept constant for all listening tests.
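The RMS-based level normalisation of subsection 4.2 can be sketched as follows (a minimal illustration, not the authors' code; white noise stands in for the pink-noise stimulus). The −23 dBFS target corresponds to a linear RMS amplitude of 10^(−23/20):

```python
import numpy as np

def normalise_rms(x, target_dbfs=-23.0):
    """Scale signal x so its RMS amplitude sits at target_dbfs (dB re full scale)."""
    rms = np.sqrt(np.mean(x ** 2))          # X_RMS = sqrt((1/N) * sum of X[n]^2)
    target = 10.0 ** (target_dbfs / 20.0)   # linear RMS amplitude for -23 dBFS
    return x * (target / rms)

# One second of noise at 44.1 kHz as a stand-in stimulus.
rng = np.random.default_rng(0)
x = rng.standard_normal(44100)
y = normalise_rms(x)
level_db = 20 * np.log10(np.sqrt(np.mean(y ** 2)))  # ~ -23 dBFS
```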
played through the headphones with the output of the KU 100 microphones being recorded. The HpTF was then calculated by deconvolving the recording with an inverse of the original sweep [37]. The HpTF measurement process was repeated 20 times, with the headphones being removed and repositioned between each measurement (see Figure 4).

Figure 4: 20 measurements of the headphone transfer function of Sennheiser HD 650 headphones on the Neumann KU 100 dummy head (left ear). Average response in red.

An average of the 20 HpTF measurements was then calculated for the left and right ears by taking the power average of the 20 frequency responses. An inverse filter was computed using Kirkeby regularization [38], with the range of inversion from 200 Hz - 16 kHz and 1/2 octave smoothing. The average HpTFs, inverse filters and resultant frequency responses (produced by convolving the averages with their inverse filters) are presented in Figure 5.

[Figure 5: average HpTFs (L/R), inverse filters (L/R) and resultant responses (L/R), plotted as Amplitude (dB) against Frequency (Hz).]

4.5. Listening Test 1 - ABX

The first listening test followed the ABX paradigm [39], whereby three stimuli (A, B and X) were presented consecutively to the participants, who were instructed to answer which of A or B was closest to X in timbre. This differs from the modern definition of the ABX test, where X would be one of A or B: here, X was employed as a reference sound (HRTF convolution), and A and B were the Ambisonic binaural renders, one of which was diffuse-field equalised. The test was forced choice: the participant had to answer either A or B. The null hypothesis is that diffuse-field equalisation has no effect on the similarity of timbre between Ambisonic binaural rendering and HRTF convolution.

There were 12 test conditions in total: four sound source directions (as in subsection 4.1), across the three Ambisonic loudspeaker configurations. Every test condition was repeated with the order of DFC and non-DFC Ambisonic rendering reversed, to avoid bias towards any particular arrangement, resulting in a total of 24 test files. The presentation of test files was double blinded and the order randomised separately for each participant.

4.5.1. Results

The data from the first listening test is non-parametric, and results are binomial distributions with 44 trials across subjects of each condition. For results to be statistically significant at less than 5% probability of chance, the cumulative binomial distribution must be greater than or equal to 61.36%: therefore the DFC Ambisonics needs to have been chosen a minimum of 27 times out of the 44 trials of that condition.

An average of results across all conditions shows that diffuse-field equalised Ambisonic rendering was perceived as closer in timbre to HRTF convolution for 65.5% of tests. Results for the separate conditions of the first listening test are presented in Figure 6 as the percentage that the timbre of DFC Ambisonic rendering was perceived as closer to HRTF convolution than non-DFC Ambisonics for the 44 trials across all participants. A higher percentage demonstrates a clearer indication that the DFC Ambisonics was closer to the HRTF reference, with values at or above 61.36% statistically significant.

The results are statistically significant (p < 0.05) for 9 conditions; therefore the null hypothesis can be rejected for 9 of the 12 tested conditions: diffuse-field equalisation does in fact have an effect on the perceived timbre of Ambisonic binaural rendering. The three conditions that were below statistical significance are for rear incidence (θ = 180◦). This suggests that diffuse-field equalisation has a different effect on the timbre of Ambisonics depending on the angle of sound incidence. Friedman's analysis of variance (ANOVA) tests were conducted to test whether this effect is statistically significant (see Table 2).

The Friedman's ANOVA tests showed that for the cube (Chi-sq = 17.36; p = 0.0006), the effect that diffuse-field equalisation has on timbre varies significantly depending on the sound source direction, but not for the octahedron and bi-rectangle (p > 0.05). Post-hoc analysis to determine which direction produced the statistical significance in the cube's results was conducted using a Wilcoxon signed rank test, which showed the outlying results were from (θ = 180◦).
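The binomial significance criterion used in these results — the probability of observing k or more "DFC chosen" trials out of 44 under chance — can be computed exactly with a short stdlib sketch (an illustration, not the authors' analysis code):

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """Exact probability of k or more successes in n Bernoulli(p) trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Tail probability for the reported count of 27 or more choices out of 44.
p_27 = binom_tail(27, 44)
```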
Figure 6: Results across subjects for which DFC Ambisonic rendering was considered closer in timbre to HRTF convolution than non-DFC Ambisonics, for each condition of the ABX test (DFC choice (%) against azimuth (°), for the octahedron, cube and bi-rectangle). Dashed line at 61.36% shows the boundary for statistical significance. Small horizontal offsets applied to results for visual clarity.

Table 2: Friedman's ANOVA tests to determine whether direction of sound incidence had a significant effect on results for the 3 tested loudspeaker configurations.

    Loudspeaker Configuration   Chi-sq   p
    Octahedron                  4.05     0.2566
    Cube                        17.36    0.0006
    Bi-rectangle                4.32     0.2293

Further Friedman's ANOVA tests were conducted to determine whether the virtual loudspeaker configurations produced significantly different results from each other (see Table 3). These tests showed that, for all directions of sound incidence, this difference in results is not statistically significant (p > 0.05).

Table 3: Friedman's ANOVA tests to determine whether the virtual loudspeaker configurations produced significantly different results from each other for the 4 tested directions of sound incidence.

    Azimuth (◦)   Chi-sq   p
    0             0.67     0.7165
    90            2.67     0.2636
    180           1.61     0.4464
    270           0.36     0.8338

4.6. Listening Test 2 - AB

The second listening test was an AB comparison. Two sets of three stimuli were presented consecutively to the participants, who were instructed to answer which of set A or set B had the most consistent timbre. Each set consisted of the three Ambisonic binaural renders (octahedron, cube and bi-rectangle): one set with diffuse-field equalisation and the other without. The test was forced choice: the participant had to answer either A or B. The null hypothesis is that diffuse-field equalisation has no effect on the consistency of timbre between different Ambisonic virtual loudspeaker configurations.

There were 4 conditions in total: one for each of the four sound source directions (as in subsection 4.1). Every test file was repeated …

4.6.1. Results

As for the first listening test, data from the second listening test is non-parametric and results are binomial distributions, with 44 trials across subjects of each condition. For results to be statistically significant at less than 5% probability of chance, the cumulative binomial distribution must be greater than or equal to 61.36%: therefore the DFC Ambisonics needs to have been chosen a minimum of 27 times out of the 44 trials of that condition.

An average of results across all directions shows that Ambisonic virtual loudspeakers were perceived as more consistent in timbre when diffuse-field equalised for 74.4% of tests. Results for the separate conditions of the second listening test are presented in Figure 7 as the percentages that the three DFC loudspeaker configurations were perceived as more consistent in timbre than non-DFC configurations. A higher percentage demonstrates a clearer indication that diffuse-field compensated Ambisonics was more consistent in timbre across different loudspeaker configurations, with values at or above 61.36% statistically significant.

Figure 7: Results across subjects for which DFC Ambisonic rendering was considered more consistent in timbre across different virtual loudspeaker configurations, for each condition of the AB test (results (%) against azimuth (°)). Dashed line at 61.36% shows the boundary for statistical significance from chance at a confidence level of p < 0.05.

Results for all conditions are statistically significant, as the DFC Ambisonic loudspeaker configurations were considered more consistent for more than 61.36% of results. Therefore the null hypothesis can be rejected: diffuse-field equalisation does have an effect on the consistency of timbre between different Ambisonic
virtual loudspeaker configurations. Frontal incidence (θ = 0◦) produced the most notable result: the three virtual loudspeaker configurations were considered more consistent when diffuse-field equalised for 91% of tests in this condition. The other sound source directions (θ = 90◦, 180◦ and 270◦) favoured DFC Ambisonics as more consistent for 70%, 70% and 66% of the tests, respectively.

As in the first listening test, direction of sound incidence appears to have had an effect on the results. To determine the significance of this effect, a Friedman's ANOVA test was conducted, which showed statistical significance (Chi-sq = 9.19; p = 0.0269). Post-hoc analysis to determine which condition produced the statistical significance was conducted using a Wilcoxon signed rank test, which showed the outlying results were from (θ = 0◦).

4.7. Discussion

The results from the two listening tests clearly indicate the applicability of diffuse-field equalisation for reducing the timbral differences between Ambisonic binaural rendering and HRTF convolution, as well as for improving timbral consistency between different Ambisonic virtual loudspeaker configurations.

One observation from the results is that direction of incidence had a substantial effect on participants' judgements, with DFC causing minimal change to timbre for rear-incident (θ = 180◦) sounds in the first test but a clear trend for other directions. This change is statistically significant for the results of the cube in the first test. Direction of incidence also had a significant effect on results for the second test: the three virtual loudspeaker configurations were perceived to be more consistent when diffuse-field equalised for all directions of sound incidence, but the trend was significantly more pronounced for frontal incidence (θ = 0◦).

A possible explanation for the variation in results depending on direction of incidence is that, in general, the diffuse-field equalisation filters calculated in this study boosted high frequencies and attenuated low frequencies (see Figure 3). As the direction of incidence changes the frequency content of a sound, with rear-incident sounds typically subject to attenuation in the high frequencies due to pinna shadowing, this may explain the variation in the significance of results.

In this study, when calculating the diffuse-field responses of the loudspeaker configurations, every direction was weighted evenly. However, to address the direction-dependent influence of diffuse-field equalisation on timbre, the weighting could be changed based on the solid angle, increasing the weighting for rear-incident sounds when calculating the diffuse-field responses. This could produce a more even effect on timbre for different directions of sound incidence. This theory requires further investigation, which is currently in progress.

5. CONCLUSIONS

This paper has demonstrated the timbral variation that exists between different Ambisonic loudspeaker configurations above the spatial aliasing frequency, due to comb filtering produced by the summing of multiple sounds at the ears. A method to address this timbral disparity through diffuse-field equalisation has been presented, and the effectiveness of the method in improving the consistency of timbre between different spatial audio rendering methods and between different first-order Ambisonic loudspeaker configurations has been evaluated.

The conducted subjective listening tests show that diffuse-field equalisation of Ambisonics is successful in improving timbral consistency between Ambisonic binaural rendering and HRTF convolution, as well as between different first-order Ambisonic loudspeaker configurations. However, results differ depending on sound incidence, and a theory to address this by changing the directional weighting in the diffuse-field calculation has been proposed and is currently being investigated.

Future work will look at diffuse-field equalisation of higher-order Ambisonics, as well as assessing the effect that diffuse-field equalisation has on localisation accuracy in Ambisonic rendering. If it can be shown that diffuse-field equalisation either improves or has no effect on localisation accuracy, then diffuse-field equalisation will be a clear recommendation for achieving more natural sounding Ambisonics.

6. ACKNOWLEDGEMENT

Thomas McKenzie was supported by a Google Faculty Research Award.

7. REFERENCES

[1] Zora Schärer and Alexander Lindau, "Evaluation of equalization methods for binaural signals," in Proc. AES 126th Convention, 2009, pp. 1–17.
[2] Alexander Lindau, Torben Hohn, and Stefan Weinzierl, "Binaural resynthesis for comparative studies of acoustical environments," in Proc. AES 122nd Convention, Vienna, 2007, pp. 1–10.
[3] A. J. Berkhout, D. de Vries, and P. Vogel, "Acoustic control by wave field synthesis," J. Acoust. Soc. Am., vol. 93, no. 5, pp. 2764–2778, 1993.
[4] Ville Pulkki, "Virtual sound source positioning using vector base amplitude panning," J. Audio Eng. Soc., vol. 45, no. 6, pp. 456–466, 1997.
[5] Michael A. Gerzon, "Periphony: with-height sound reproduction," J. Audio Eng. Soc., vol. 21, no. 1, pp. 2–10, 1973.
[6] Michael A. Gerzon, "Criteria for evaluating surround-sound systems," J. Audio Eng. Soc., vol. 25, no. 6, pp. 400–408, 1977.
[7] Michael A. Gerzon, "Ambisonics in multichannel broadcasting and video," J. Audio Eng. Soc., vol. 33, no. 11, pp. 859–871, 1985.
[8] Pierre Lecomte, Philippe Aubert Gauthier, Christophe Langrenne, Alexandre Garcia, and Alain Berry, "On the use of a Lebedev grid for Ambisonics," in Proc. AES 139th Convention, 2015, pp. 1–12.
[9] Adam McKeag and David McGrath, "Sound field format to binaural decoder with head-tracking," in Proc. AES 6th Australian Regional Convention, 1996, pp. 1–9.
[10] Markus Noisternig, A. Sontacchi, T. Musil, and R. Höldrich, "A 3D Ambisonic based binaural sound reproduction system," in Proc. AES 24th International Conference on Multichannel Audio, 2003, pp. 1–5.
[11] Gavin Kearney and Tony Doyle, "Height perception in Ambisonic based binaural decoding," in Proc. AES 139th Convention, 2015, pp. 1–10.
[12] Jerôme Daniel, Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia, Ph.D. thesis, l'Université Paris, 2001.
[13] Aaron J. Heller and Eric M. Benjamin, "Calibration of soundfield microphones using the diffuse-field response," in Proc. AES 133rd Convention, San Francisco, 2012, pp. 1–9.
[14] Elizabeth Wenzel, Marianne Arruda, Doris Kistler, and Frederic L. Wightman, "Localization using nonindividualized head-related transfer functions," J. Acoust. Soc. Am., vol. 94, no. 1, pp. 111–123, 1993.
[15] Henrik Møller, Michael F. Sørensen, Clemen B. Jensen, and Dorte Hammershøi, "Binaural technique: do we need individual recordings?," J. Audio Eng. Soc., vol. 44, no. 6, pp. 451–469, 1996.
[16] Adelbert W. Bronkhorst, "Localization of real and virtual sound sources," J. Acoust. Soc. Am., vol. 98, no. 5, pp. 2542–2553, 1995.
[17] C. J. Tan and Woon Seng Gan, "Direct concha excitation for the introduction of individualized hearing cues," J. Audio Eng. Soc., vol. 48, no. 7, pp. 642–653, 2000.
[18] Henrik Møller, Dorte Hammershøi, Clemen B. Jensen, and Michael F. Sørensen, "Transfer characteristics of headphones measured on human ears," J. Audio Eng. Soc., vol. 43, no. 4, pp. 203–217, 1995.
[19] David Griesinger, "Accurate timbre and frontal localization without head tracking through individual eardrum equalization of headphones," in Proc. AES 141st Convention, 2016, pp. 1–8.
[20] Abhijit Kulkarni and H. Steven Colburn, "Variability in the characterization of the headphone transfer-function," J. Acoust. Soc. Am., vol. 107, no. 2, pp. 1071–1074, 2000.
[21] Roland Bücklein, "The audibility of frequency response irregularities," J. Audio Eng. Soc., vol. 29, no. 3, pp. 126–131, 1981.
[22] Bruno Masiero and Janina Fels, "Perceptually robust headphone equalization for binaural reproduction," in Proc. AES 130th Convention, 2011, pp. 1–7.
[23] Alexander Lindau and Fabian Brinkmann, "Perceptual evaluation of headphone compensation in binaural synthesis based on non-individual recordings," J. Audio Eng. Soc., vol. 60, no. 1-2, pp. 54–62, 2012.
[24] John Burkardt, "SPHERE_GRID - points, lines, faces on a sphere," Available at http://people.sc.fsu.edu/~jburkardt/datasets/sphere_grid/sphere_grid.html, accessed February 09, 2017.
[25] Archontis Politis, Microphone array processing for parametric spatial audio techniques, Ph.D. thesis, Aalto University, 2016.
[26] Aaron J. Heller, Richard Lee, and Eric M. Benjamin, "Is my decoder ambisonic?," in Proc. AES 125th Convention, San Francisco, 2008, pp. 1–22.
[27] Mark B. Gardner, "Some monaural and binaural facets of median plane localization," J. Acoust. Soc. Am., vol. 54, no. 6, pp. 1489–1495, 1973.
[28] V. Ralph Algazi, Carlos Avendano, and Richard O. Duda, "Elevation localization and head-related transfer function analysis at low frequencies," J. Acoust. Soc. Am., vol. 109, no. 3, pp. 1110–1122, 2001.
[29] George F. Kuhn, "Model for the interaural time differences in the azimuthal plane," J. Acoust. Soc. Am., vol. 62, no. 1, pp. 157–167, 1977.
[30] Michael A. Gerzon and Geoffrey J. Barton, "Ambisonic decoders for HDTV," in Proc. AES 92nd Convention, 1992, number 3345, pp. 1–42.
[31] Gavin Kearney and Tony Doyle, "A HRTF database for virtual loudspeaker rendering," in Proc. AES 139th Convention, 2015, pp. 1–10.
[32] Ole Kirkeby and Philip A. Nelson, "Digital filter design for inversion problems in sound reproduction," J. Audio Eng. Soc., vol. 47, no. 7/8, pp. 583–595, 1999.
[33] Henrik Møller, "Fundamentals of binaural technology," Applied Acoustics, vol. 36, no. 3-4, pp. 171–218, 1992.
[34] Stéphane Pigeon, "Hearing test and audiogram," Available at https://hearingtest.online/, accessed March 15, 2017.
[35] David Schonstein, Laurent Ferré, and Brian Katz, "Comparison of headphones and equalization for virtual auditory source localization," J. Acoust. Soc. Am., vol. 123, no. 5, pp. 3724–3729, 2008.
[36] W. M. Hartmann and B. Rakerd, "Auditory spectral discrimination and the localization of clicks in the sagittal plane," J. Acoust. Soc. Am., vol. 94, no. 4, pp. 2083–2092, 1993.
[37] Angelo Farina, "Simultaneous measurement of impulse response and distortion with a swept-sine technique," in Proc. AES 108th Convention, Paris, 2000, pp. 1–23.
[38] Ole Kirkeby, Philip A. Nelson, Hareo Hamada, and Felipe Orduna-Bustamante, "Fast deconvolution of multichannel systems using regularization," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 189–194, 1998.
[39] W. A. Munson and Mark B. Gardner, "Standardizing auditory tests," J. Acoust. Soc. Am., vol. 22, no. 5, p. 675, 1950.
dividual HRTFs that capture all of the physical effects creating a personal perception of immersive audio. The measurement of a listener's own HRTFs in all directions requires special measuring apparatus and a long measurement time, often too heavy a task to perform for each subject involved in an everyday application. That is the main reason why alternative ways are preferred to provide listeners with a personalized, but not individually measured, HRTF set: a trade-off between quality and cost of the acoustic data for audio rendering [8, 9].

2.1. Individual / Own HRTFs

The standard setup for individual HRTF measurement is an anechoic chamber with a set of loudspeakers mounted on a geodesic sphere (with a radius of at least one meter in order to avoid near-field effects) at fixed intervals in azimuth and elevation. The listener, seated in the center of the sphere, has microphones in his/her ears. After subject preparation, HRIRs are measured by playing analytic signals and recording the responses collected at the ears for each loudspeaker position in space (see [9] for a systematic review on this topic).

The main goal is to extract the set of HRTFs for every listener, thus providing him/her with the individual/own transfer function. In addition to the above-mentioned demanding requirements (time and equipment), there are some other critical aspects in HRTF measurement: the listener's pose is usually limited to a few positions (standing or sitting) and to relatively few specific locations around his/her body, and the measurement is an intrinsic characterization that does not account for the fact that the pinna is one of the human body parts that keeps growing throughout life [10]. Moreover, the repeatability of HRTF measurements is still a delicate issue [11].

2.2. Personalized / Generalized HRTFs

Personalized HRTFs are chosen among the available HRTFs of a dataset instead of being individually measured. This procedure is based on a match between external subjects (those without individual HRTFs) and internal subjects, i.e. those belonging to a database with already stored information (both acoustics and anthropometry). The most interesting and important part is the method by which a specific set of HRTFs is selected for an external subject. Researchers are finding different ways to deal with this issue, and there are a variety of alternatives using common hardware and/or software tools. The main benefit of this approach is that users can be guided to a self-selection of their best HRTF set without needing special equipment or knowledge. It has to be noted that personalized HRTFs cannot guarantee the same performance as one's own HRTFs, but they usually provide better performance than generic dummy-head HRTFs such as those of the Knowles Electronic Manikin for Acoustic Research (KEMAR) [12].

In the following, we summarize three main approaches to HRTF selection.

• DOMISO² [13]
  In this technique, subjects can choose their most suitable HRTFs from among many, taken from a database, following tournament-style listening tests. The database (corpus) is built using different subjects, storing 120 sets of HRTFs, one set per listener. The performance of this technique was evaluated by Yukio Iwaya, who showed that results with the personalized DOMISO HRTFs were similar to those with individualized HRTFs, but very different from the "away" condition (a totally random HRTF, which could not win the tournament).

• Two-step selection [14, 15]
  This is a technique based on two different steps. Usually the first step selects one subset from a complete initial pool of HRTF sets, removing the perceptually worse HRTFs. The second step refines the selection in order to obtain the best match among generic HRTFs of a dataset which is reduced in size compared to the complete database.

• Matching anthropometric ear parameters [16, 17]
  This method is based on finding the best-matching HRTF in the anthropometric domain, matching the external ear shape of a subject using anthropometric measurements available in the database.³

3. IMAGE-GUIDED HRTF SELECTION

Another approach to the HRTF selection problem consists in mapping anthropometric features into the HRTF domain, following a ray-tracing model of pinna acoustics [18, 19]. The main idea is to draw pinna contours on an image. Distances from the ear canal entrance define reflections on the pinna borders, generating spectral notches in the HRTF. Accordingly, one can use such anthropometric distances and the corresponding notch parameters to choose the best match among a pool of available HRTFs [5].

3.1. Notch distance metrics

The extraction of HRTFs using reflections and contours is based on an approximate description of the acoustical effects of the pinna on incoming sounds. In particular, the distance dc between a reflection point on the pinna and the entrance of the ear canal (the "focus point" hereafter) is given by:

    dc(φ) = c td(φ) / 2,    (1)

where td(φ) is the elevation-dependent temporal delay between the direct and the reflected wave, and c is the speed of sound.

The corresponding notch frequency depends on the sign of the reflection. Assuming the reflection coefficient to be positive, a notch is created at all frequencies such that the phase difference between the reflected and the direct wave is equal to π:

    fn(φ) = (2n + 1) / (2 td(φ)) = c (2n + 1) / (4 dc(φ)),    (2)

where n ∈ N. Thus, the first notch frequency is found when n = 0, giving the following result:

    f0(φ) = c / (4 dc(φ)).    (3)

In fact, a previous study [19] on the CIPIC database [6] proved that almost 80% of the subjects in the database exhibit a clear negative reflection in their HRIRs. Under this assumption, notches are

² DOMISO: Determination method of OptimuM Impulse-response by Sound Orientation.
³ See section 3.2 for further details.
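Equations (1)–(3) can be sketched numerically. For example, a direct-to-reflected delay of td = 80 µs (a hypothetical value for illustration) places the first positive-reflection notch at f0 = 1/(2 td) = 6250 Hz:

```python
C = 343.0  # speed of sound in m/s (assumed room-temperature value)

def reflection_distance(td):
    """Eq. (1): distance d_c from the reflection point to the ear canal entrance."""
    return C * td / 2.0

def notch_frequency(td, n=0):
    """Eq. (2)/(3): n-th notch frequency for a positive reflection with delay td.

    Equals (2n + 1) / (2 td); n = 0 gives the first notch f_0 = c / (4 d_c).
    """
    dc = reflection_distance(td)
    return C * (2 * n + 1) / (4.0 * dc)

td = 80e-6                # hypothetical 80 microsecond delay
f0 = notch_frequency(td)  # first notch, n = 0: 6250.0 Hz
```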
where f0^(k,n)(ϕ) = c/[2 dc^(k,n)(ϕ)] are the frequencies extracted from the image and contours of the subject, and F0^(k,n) are the notch frequencies extracted from the HRTF with ad-hoc algorithms such as those developed in [18, 20, 17]; (k, n), with 0 ≤ k < K and 0 ≤ n < N, refers to one particular pair of traced C1 contour and focus point; ϕ spans all the [−45◦, +45◦] elevation angles for which the notch is present in the corresponding HRTF; Nϕ is the number of elevation angles over which the summation is performed. Extracted notches need to be grouped into a single track evolving consistently through elevation; a labeling algorithm (e.g. [19]) performed this computation along subsequent HRTFs.

If the notches extracted from the subject's pinna image are to be compared with a set of HRTFs taken from a database, various notch distance metrics can be defined based on this mismatch function, to rank database HRTFs in order of similarity. In particular, we define three metrics:

• Mismatch: each HRTF is assigned a similarity score that corresponds exactly to increasing values of the mismatch function calculated with Eq. (6) (for a single (k, n) pair).

• Ranked position: each HRTF is assigned a similarity score that is an integer corresponding to its ranked position taken from the previous mismatch values (for a single (k, n) pair).

• Top-M appearance: for a given integer M, each HRTF is assigned a similarity score according to the number of times (over all the (k, n) pairs) in which that HRTF ranks in the first M positions.

3.2. A HRTF selection tool

Based on the concepts outlined above, we propose a tool for selecting from a database the HRTF set that best fits the image of a subject's pinna. The C1 contour and the focus point are traced manually on the pinna image by an operator, and then the HRTF sets in the database are automatically ranked in order of similarity with the subject. The tool is implemented in Matlab.

Graphical user interface. Figure 1 provides a screenshot of the main GUI, which is responsible for managing subjects. The list of subjects can be managed efficiently using the three buttons, "Add Subject", "Update Subject" and "Delete Subject", as well as some text-fields used to assign to each subject their own information. For each subject stored in the list, an image of the left pinna can be assigned with the button "Choose ear image": the image will be shown in the middle of the GUI when a name from the list is clicked.

After loading the pinna image of a subject, the main pinna contour C1 and the focus point can be traced manually by clicking on the "Trace Ear" button. Two parameters, N and K, can be specified, which are the number of estimates that the operator will trace for the C1 contour and the focus point, respectively.

Two checkboxes under the "Trace Ear" button aid the usability of the tracing task: the first one, "Stored traced contours", shows the contours already drawn in previous drawing sessions; the second one, "Current traced contours", visualizes on the pinna image the contours drawn in the current session.⁴

One last parameter to be set, M, refers to the top-M appearance metric discussed above. By clicking on "Process Contours", the application returns the ranked positions of the database HRTFs according to the three metrics.

Database of generic HRTFs and extracted F0. The public database used for our purpose is CIPIC [6]. The first release provided HRTFs for 43 subjects (plus two dummy-head KEMAR HRTF sets with small and large pinnae, respectively) at 25 different azimuths and 50 different elevations, for a total of 1250 directions. In addition, this database includes a set of pictures of external ears and anthropometric measures for 37 subjects. Information on the first prominent notch in each HRTF was extracted with the structural decomposition algorithm [20, 9], and F0 tracks were labeled with the algorithm in [19] and then stored in a custom data structure.

Guidelines for contour tracing. In the trace-ear GUI, the user has to draw by hand N estimates of the C1 contours on top of a pinna image. After that, the user has to point K positions of the

⁴ The default tracing procedure allows drawing a single contour/focus point at a time, which visually disappears once traced; for every estimate, our
nizing them in a list (on the left of the screen). The list can be tool shows pinna images clean from traced information.
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
distribution with zero-mean and standard deviation, called uncertainty, U. The lower the U value, the higher the sensitivity of the virtual listener in discriminating different spectral profiles, resulting in a measure of probability rather than a deterministic value. The virtual experiment was conducted simulating listeners with all analyzed CIPIC HRTFs, using an uncertainty value U = 1.8, which is similar to average human sensitivity [7]. We predicted elevation performance for every virtual subject when listening with his/her own individual HRTFs, with those of CIPIC subject 165 (the KEMAR), and with the best m / best r / best top3 selected HRTFs.

The precision of the j-th elevation response close to the target position is defined by the local polar RMS error (PE):⁵

PE_j = sqrt( Σ_{i∈L} (φ_i − ϕ_j)² p_j[φ_i] / Σ_{i∈L} p_j[φ_i] ),

where L = {i ∈ N : 1 ≤ i ≤ Nφ, |φ_i − ϕ_j| mod 180° < 90°} defines local elevation responses within ±90° w.r.t. the local response φ_i and the target position ϕ_j, and p_j[φ_i] denotes the prediction, i.e., the probability mass vector.

[...] from mismatches between f0^(k,n)(ϕ) and individual HRTF's F0(ϕ) [...] our criteria and for which no firm conclusions can be drawn. After [...] rank positions.

Figure 3 depicts the three typical tracing scenarios: (a) a consistent trace-notch correspondence, (b) a systematic lowering in notch frequency of traces, and (c) an irregular notch detection. In the first case, traced contours and individual HRTF notches are in the same range, resulting in the ideal condition of applicability for the proposed metric. The latter situation occasionally occurred due to irregularities of HRTF measurements or erroneous track label assignment of F0(ϕ) evolving through elevation (in 2 of the 5 subjects which were previously removed).⁶ On the other hand, the case where a systematic lowering in notch frequency of traces occurred (in 3 of the 5 subjects previously removed) deserves more careful consideration: in one of our previous studies [19], we identified 20% of CIPIC subjects for whom a positive reflection coefficient better models the acoustic contribution of the pinna. Accordingly, it is reasonable to think that those three ex-

Figure 3: (a,c,e) Examples of traced C1/focus points for three CIPIC subjects; (b,d,f) corresponding f0^(k,n)(ϕ) (light gray lines) with F0(ϕ) values of individual HRTFs (black solid line), best selection according to the mismatch/rank metric (black dotted line), and best selection according to the Top-3 metric (black dash-dotted line). In these examples, the best HRTF selections according to the mismatch and rank metrics do not differ significantly.

⁵ We focused on local polar error in the frontal median plane, where individual elevation-dependent HRTF spectral features perceptually dominate; on the contrary, the front-back confusion rate (similar to the quadrant error rate QE in [7]) derives from several concurrent factors, such as dynamic auditory cues, visual information, familiarity with sound sources, and training [23]; thus it was not considered in this study.

⁶ Repeatability of HRTF measurements is still a delicate issue, suggesting a high variability in spectral details [11].
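The local polar RMS error defined above can be sketched numerically as follows (an illustrative reimplementation, with angles in degrees and p_j a probability mass vector over the response angles φ_i; the function name is ours):

```python
import numpy as np

def local_polar_rmse(phi, p_j, target):
    """Local polar RMS error PE_j for one target elevation.

    phi: response elevation angles (degrees); p_j: the auditory
    model's probability mass over those angles; target: the target
    elevation. Only "local" responses with |phi_i - target| mod 180
    below 90 degrees enter the weighted RMS, per the set L above.
    """
    phi = np.asarray(phi, dtype=float)
    p_j = np.asarray(p_j, dtype=float)
    local = np.abs(phi - target) % 180.0 < 90.0   # the set L
    num = np.sum((phi[local] - target) ** 2 * p_j[local])
    den = np.sum(p_j[local])
    return np.sqrt(num / den)
```

A perfectly certain response at the target yields PE = 0; probability mass spread to neighboring elevations raises PE accordingly.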
cluded subjects can be assigned to this special group.⁷

Our metrics based on notch distance clearly distinguish the three sets of HRTFs, i.e., individual, KEMAR, and best selected, in terms of mismatch and rank (Fig. 4 clearly shows this aspect); Kruskal-Wallis nonparametric one-way ANOVAs with three levels of feedback condition (individual HRTF, KEMAR, best m) provided a statistically significant result for mismatch [χ²(2) = 1141.8, p ≪ 0.001]; pairwise post-hoc Wilcoxon tests for paired samples with Holm-Bonferroni correction revealed statistical differences among all pairs of conditions (p ≪ 0.001). The same anal- [...] these results appear to be counter-intuitive because we always selected a generic HRTF which differed from the individual HRTF in terms of both mismatch and rank. However, this evidence should not be misleading, because we already know from our previous study [5] that the notch associated with the pinna helix border is not enough to describe elevation cues for all listeners. Moreover, biometric recognition studies [24] show that the pinna concha wall is also relevant in order to uniquely identify a person. Finally, multiple contour tracing contributes strongly to the uncertainty of notch frequency matches, still providing a good average rank (11), though preventing the individual HRTFs from being chosen as the best selection.

Figure 4: Global statistics (average + standard deviation) for metric assessment on (a) mismatch and (b) rank, grouped by HRTF condition. Asterisks and bars indicate, where present, a significant difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001 at post-hoc test).

Surprisingly, localization predictions from auditory model simulations provided an average local polar RMS error (average PE) with a statistically significant difference between the best m and best top3 metrics, t(16) = 2.134, p < 0.05 (see Fig. 5.a for a graphical representation). This result suggests that best top3 yields better localization performance than best m, and performance similar to best r (not proven to be statistically significant in this study). Intuitively, the best top3 metric is more robust to contour uncertainty because of the M-constraint in its definition, which allowed us to filter out variability due to HRTF sets with sparse appearances in our rankings.

Figure 5: Global statistics (average + standard deviation) for localization prediction in elevation on average PE for (a) metrics based on notch distance mismatch, (b) individual vs. generic (KEMAR) vs. personalized (best top3). Asterisks and bars indicate, where present, a significant difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001 at post-hoc test).

Finally, localization predictions were also computed for individual HRTFs and KEMAR virtual listening. Pairwise t-tests reveal significant differences in average PEs between the individual listening condition and KEMAR (t(16) = −7.79, p ≪ 0.001), and between the individual listening condition and the best top3 HRTF (t(16) = −4.13, p < 0.01), reporting a better performance with individual HRTFs. Moreover, a pairwise t-test reports significant differences in average PEs between the best top3 HRTF and KEMAR (t(16) = 5.590, p ≪ 0.001), with a better performance of the selected HRTF compared to the dummy-head listening condition. This final result further confirms the gap between individual HRTF listening and the proposed HRTF selection based on a pinna image; on the other hand, the best top3 criterion selected generic HRTFs that outperformed the KEMAR listening condition.

6. CONCLUSIONS

The proposed distance metrics considering the C1 contour provide insufficient features to unambiguously identify an individual HRTF from the corresponding side picture of the listener. Moreover, multiple tracing of C1 and of the focus point adds further variability to the procedure, resulting in extra uncertainty for the validation. On the other hand, our final result confirms that our image-guided HRTF selection procedure provides a useful tool in terms of:

• personalized dataset reduction: since the individual HRTF rank is on average the 12th position, one can compute a personalized short list of ≈ 12 best candidate HRTFs for a given pinna picture, in which a generic HRTF reasonably close to the individual one can be found with high probability. Accordingly, a subsequent refinement of the HRTF selection procedure might be required through subjective selection procedures or additional data analysis on the reduced

⁷ Unfortunately, we were not able to directly compare our current study with [19] because different CIPIC subjects were considered.
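The pairwise comparisons above rely on paired t-tests over the 17 virtual subjects (hence the 16 degrees of freedom in t(16)). A minimal, self-contained sketch of the statistic (our own illustration, not the authors' analysis code):

```python
import math

def paired_t(x, y):
    """Paired t-test statistic and degrees of freedom for two
    matched samples (e.g. average PE per virtual subject under
    two listening conditions)."""
    d = [a - b for a, b in zip(x, y)]            # per-subject differences
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1
```

With 17 paired observations the second return value is 16, matching the reported t(16) figures; the p-values then follow from the t distribution with that many degrees of freedom.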
VELVET-NOISE DECORRELATOR
[16], [17], [18], [19], [20], [21]. Whereas previous work on velvet noise has focused on its perceptual qualities and suitability for artificial reverberation, we investigate how to design a velvet-noise decorrelator (VND).

This paper is organized in the following way. In Section 2, we discuss velvet noise and how signals can be efficiently convolved with it. Section 3 introduces various decorrelation methods using velvet noise. Section 4 presents our results and compares the proposed method with a white-noise-based decorrelation method. Section 5 concludes this paper.

2.1. Velvet-Noise Sequence

The main goal of using velvet-noise sequences (VNS) is to create spectrally neutral signals, comparable to white noise, while using as few non-zero values as possible. By taking advantage of the sparsity of the signal, we can efficiently convolve it in the time domain with another signal without any latency [19], [21]. The first step in the generation of velvet noise is to create a sequence of evenly spaced impulses at the desired density [12], as shown in Fig. 1(a). The sign of the impulses is then randomized (see Fig. 1(b)). Finally, the spacing between each impulse is also randomized within the space available to satisfy the density parameter, which is illustrated in Fig. 1(c).

Figure 1: Steps to create a velvet-noise sequence: (a) a sequence of evenly spaced impulses, (b) the values of the impulses are randomized to either 1 or −1, and (c) the spacing between the impulses is randomized.

For a given density ρ and sampling rate fs, the average spacing between two impulses is

Td = fs/ρ, (1)

which is called the grid size. The total number of impulses is

M = Ls/Td, (2)

where Ls is the total length in samples. The sign of each impulse is

s(m) = 2 round(r1[m]) − 1, (3)

where m is the impulse index, round is the rounding operation, and r1[m] is a random number between 0 and 1. The location of each impulse is calculated from

k(m) = round[mTd + r2[m](Td − 1)], (4)

where r2[m] is also a random number between 0 and 1.

2.2. Velvet-Noise Convolution

To convolve a VNS with another signal x, we take advantage of the sparsity of the sequence. Indeed, by storing the velvet-noise sequence as a series of non-zero elements, all mathematical operations involving zero can be skipped. For further optimization, the locations of the positive and negative impulses can be stored in separate arrays, k+ and k−, which removes the need for multipliers in the convolution [19], [21]:

x∗k = Σ_{m=0}^{M+} x[n − k+[m]] − Σ_{m=0}^{M−} x[n − k−[m]]. (5)

For a sequence with a density of 1000 impulses per second, which has been found sufficient for decorrelation, and a sample rate of 44.1 kHz, the zero elements represent 97.7% of the sequence. Therefore, given a sufficiently sparse sequence, time-domain convolution can be more efficient than a spectral-domain convolution using an FFT for an equivalent white-noise sequence. Furthermore, this sparse time-domain convolution offers the benefit of being latency-free.

3. VELVET-NOISE DECORRELATION

3.1. Exponential Decay

When using a sequence of white noise for decorrelation, adding an exponentially decaying amplitude envelope to the signal is recommended to minimize the potentially audible smearing of transients [6]. The decay characteristics of the envelope are designed to achieve a desired attenuation and target length. The envelope is given by

D(m) = e^{−αm} (6)

based on a decay constant α given by

α = −ln(10^{−LdB/20}) / M, (7)

where LdB is the target total decay, m represents the index of a specific impulse, and M is the total number of non-zero elements in the sequence. The attenuation of an impulse is set by its index m. For a given set of parameters, every VNS generated has the same decay attenuation D(m) for the same index m; only the sign varies. This distribution method ensures the energy is consistently distributed amongst the impulses regardless of their positions. The energy could also be distributed over time, depending on impulse location. However, since the impulse locations vary based on a random value, a variation of distributed power would be obtained, which would lead to unbalanced decorrelation filters, unsuitable for multichannel purposes.

To generate an exponentially decaying velvet-noise filter (Figure 2), we combine the decay envelope of Eq. (6) with the random sign sequence:

se(m) = e^{−αm} s(m). (8)
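As an illustration, the sequence generation of Eqs. (1)–(4) and the decay envelope of Eqs. (6)–(8) can be sketched as follows (function names and default parameter values are our own, not from the paper):

```python
import numpy as np

def velvet_noise(density=1000, fs=44100, length_s=0.03, rng=None):
    """Generate a velvet-noise sequence following Eqs. (1)-(4)."""
    rng = np.random.default_rng() if rng is None else rng
    Ls = int(round(length_s * fs))        # total length in samples
    Td = fs / density                     # Eq. (1): grid size
    M = int(Ls / Td)                      # Eq. (2): number of impulses
    m = np.arange(M)
    s = 2 * np.round(rng.random(M)) - 1   # Eq. (3): random +/-1 signs
    # Eq. (4): jitter each impulse inside its grid slot
    k = np.round(m * Td + rng.random(M) * (Td - 1)).astype(int)
    vns = np.zeros(Ls)
    vns[k] = s
    return vns, k, s

def decaying_signs(s, L_dB=60.0):
    """Apply the exponential decay of Eqs. (6)-(8) to the signs."""
    M = len(s)
    alpha = -np.log(10.0 ** (-L_dB / 20.0)) / M   # Eq. (7)
    return np.exp(-alpha * np.arange(M)) * s      # Eq. (8): se(m)
```

For ρ = 1000 impulses/s at fs = 44.1 kHz and a 30 ms filter, this yields M = 30 impulses in 1323 samples, i.e., about 97.7% zeros, consistent with the sparsity figure quoted above.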
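The multiplier-free convolution of Eq. (5) can be sketched as sums and differences of delayed copies of the input (an illustrative version of the idea; `k_pos` and `k_neg` stand for the k+ and k− delay arrays):

```python
import numpy as np

def velvet_convolve(x, k_pos, k_neg):
    """Sparse convolution of Eq. (5): each impulse contributes one
    delayed copy of x, added for positive signs and subtracted for
    negative signs, so no multiplications are required."""
    x = np.asarray(x, dtype=float)
    Ls = max(list(k_pos) + list(k_neg)) + 1     # effective filter length
    y = np.zeros(len(x) + Ls - 1)               # full convolution length
    for d in k_pos:
        y[d:d + len(x)] += x                    # x[n - k+[m]] terms
    for d in k_neg:
        y[d:d + len(x)] -= x                    # x[n - k-[m]] terms
    return y
```

At the densities discussed above this costs roughly one addition per impulse per output sample, rather than a multiply-accumulate per tap.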
Figure 2: Velvet-noise sequence with an exponentially decaying temporal envelope.

Figure 3: Velvet noise with a segmented, staircase-like decay envelope.
The velvet-noise convolution equation also needs to be updated to include the attenuation factors:

x∗k = Σ_{m=0}^{M} x[n − k[m]] se(m). (9)

3.2. Segmented Decay

Unfortunately, the decaying amplitude envelope requires multiplications and thus leads to more operations for the convolution. However, since the benefits of a smoothly decaying envelope are not really perceivable in this application, the process can be simplified while still attenuating impulses based on their location. Indeed, audible decorrelation artifacts can be caused by velvet-noise sequences that contain large impulses near the end of the sequence. Therefore, to prevent these artifacts while minimizing the number of multiplications required by the sparse convolution, we can simply limit the attenuation factors to a fixed set of attenuation coefficients, as illustrated in Fig. 3. Storing each segment in a separate list allows for a segmented convolution, applying each coefficient to a partial sum:

x∗k = Σ_{i=0}^{I} si ( Σ_{m=0}^{Mi+} x[n − k+[m]] − Σ_{m=0}^{Mi−} x[n − k−[m]] ). (10)

The specific number of segments I and the coefficient values si can be set manually. While testing this approach, we have learned that a small number of segments can sound satisfying, e.g., I = 4.

3.3. Logarithmic Impulse Distribution

Since the loudness of later impulses needs to be minimized, the samples near the end of the sequence contribute very little to the convolved signal. Therefore, concentrating the impulses at the beginning of the sequence, where loud impulses do not cause as much smearing of the transients in the decorrelated signals, could be beneficial. For this purpose, we can concentrate the impulses at the beginning of the sequence by distributing their locations logarithmically. The grid size between each impulse is

Tdl(m) = (TT/100) · 10^{2Td m/M}, (11)

which distributes the spacing logarithmically based on a total length TT in samples and the density parameter Td. This also requires an updated location equation:

k(m) = round[ r2[m](Tdl(m) − 1) + Σ_{i=0}^{m} Tdl(i) ]. (12)

Figure 4: Velvet-noise sequence having a logarithmically widening grid size combined with a segmented decay envelope.

Figure 4 shows an example in which the impulse locations are distributed logarithmically and the segmented decay envelope is also applied. Since the impulses are not independent and identically distributed, this distribution may not preserve a flat power spectrum. However, given the low impulse density, this does not appear to have a significant impact on the spectral envelope in practice (see Section 4.2).

4. RESULTS

To evaluate the proposed methods, we first compared the computing cost of convolving velvet noise against the FFT-based white-noise method using different signal lengths and impulse density parameters. We analyzed the spectral envelope of the generated filters to ensure a spectral smoothness comparable to an exponentially decaying white-noise sequence, and we calculated the cross-correlation after convolving a sine-sweep signal with the generated filters. The signal coherence was calculated over the audible frequency range. Finally, to complement the numerical evaluation methods, a preliminary perceptual test was conducted to validate the potential of the proposed method to externalize stereo sounds over headphones.

4.1. Computing Resources

We compared the total number of arithmetic operations required by the white-noise decorrelation to those of the VND method. To estimate the cost of the FFT-based convolution required by the white noise, we chose the radix-2 algorithm. However, a low-latency version would require extra operations to segment the input signal into multiple parts. The radix-2 algorithm has a complexity of N log2(N) additions and (N/2) log2(N) complex multiplications. We only counted the operations required to compute the FFT of the input signal, since the white-noise filter itself only needs to be converted into the spectral domain once [17]. The cost per sample was calculated using an N/2 window size for the input signal.
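The segmented convolution of Eq. (10) can be sketched with per-segment delay lists and one shared coefficient si per segment (illustrative code, not from the paper; the per-segment list layout is our assumption):

```python
import numpy as np

def segmented_velvet_convolve(x, pos_segs, neg_segs, coeffs):
    """Segmented convolution of Eq. (10): impulses are grouped into
    I segments, each sharing one attenuation coefficient s_i, so only
    one multiplication per segment is needed for each output block."""
    x = np.asarray(x, dtype=float)
    all_k = [d for seg in pos_segs + neg_segs for d in seg]
    y = np.zeros(len(x) + max(all_k))
    for s_i, kp, kn in zip(coeffs, pos_segs, neg_segs):
        part = np.zeros_like(y)          # partial sum for segment i
        for d in kp:
            part[d:d + len(x)] += x      # positive-sign impulses
        for d in kn:
            part[d:d + len(x)] -= x      # negative-sign impulses
        y += s_i * part                  # one multiplication per segment
    return y
```

With I = 4, as suggested above, the staircase envelope keeps the artifact suppression of a decay while adding only four multiplications per partial sum.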
The chosen amplitude decay or impulse distribution method does not have any impact on the required convolution; only the use of segments does. Therefore, the computational cost of the proposed methods can be regrouped into two categories: regular VND and segmented VND. In Fig. 5, the proposed methods are shown to outperform FFT-based convolution well below the required length of a decorrelation filter. Furthermore, for a length of 1024 samples, approximately 23 ms at a 44.1 kHz sampling rate, the proposed methods remain more efficient when the impulse density is kept below 6000 impulses per second, or below 13000 impulses per second when using the segmented method with four segments.

In our sound examples, the decorrelation filters were chosen to be of length 30 ms, which leads to a signal length of 1323 samples for a sampling rate of 44.1 kHz. An impulse density of 1000 impulses per second was selected for the VND. The input signal was segmented into windows of 2048 samples. Since the convolution result has length 2M − 1, convolving two signals of length M requires a zero-padded FFT of 4096 samples to convolve with a filter of the same length. We can see from Table 1 that, for a length of 30 ms and a density of 1000 impulses per second, the VND with an exponential decay envelope requires 76% fewer mathematical operations than fast convolution, whereas a segmented version of the algorithm with four segments yields 87% fewer operations than fast convolution.

Figure 5: Comparison of computational cost between block FFT-based convolution (cross), VND (diamond), and segmented VND using four segments (circle): (a) the number of operations required to process each input sample for different filter lengths when the impulse density ρ is set to 1000 impulses per second, and (b) the cost for various impulse density settings when the filter length is set to Ls = 1024 samples.
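A rough sketch of this operation accounting (our own simplification of the counts described above; complex multiplications are counted as single operations here, so the absolute figures differ from the paper's):

```python
import math

def fft_ops_per_sample(N):
    """Radix-2 FFT cost per input sample: N*log2(N) additions and
    (N/2)*log2(N) complex multiplications, amortized over an N/2
    hop, counting only the forward FFT of the input as in the text."""
    ops = N * math.log2(N) + (N / 2) * math.log2(N)
    return ops / (N / 2)

def vnd_ops_per_sample(density, length_s, segments=0):
    """VND cost per output sample: one addition per impulse in the
    filter, plus one multiplication per segment when segmentation
    is used."""
    return density * length_s + segments
```

For the 30 ms, 1000 impulses/s filter above, the VND term is only 30 additions per output sample (34 with four segments), which is what makes the sparse approach competitive.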
where l is the correlation lag. For decorrelation purposes, the zero-lag case rab(0) is of practical interest, since it corresponds to the maximum similarity between the two sequences. The normalized zero-lag correlation is given by

ρab = Σ_n a(n)b(n) / sqrt( Σ_n a²(n) Σ_n b²(n) ). (14)

Apart from the time-domain correlation, another useful metric is the cross-correlation in different frequency bands, called coherence. Normally, a decorrelator is more effective at higher frequencies than at lower, which is a result of the effective length of a decorrelation filter. Indeed, a longer filter will exhibit stronger decorrelation for longer wavelengths, but will also create potentially perceivable artifacts when the input signal contains transients. To study the decorrelation behavior on a frequency-dependent scale, we can use an octave or third-octave filterbank. The signals for the
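Eq. (14) amounts to the normalized inner product of the two channels; a minimal sketch:

```python
import numpy as np

def normalized_zero_lag_correlation(a, b):
    """Normalized zero-lag cross-correlation of Eq. (14): the inner
    product of the two signals divided by the product of their
    Euclidean norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
```

Identical signals give 1, orthogonal signals 0, and inverted copies −1; well-decorrelated channel pairs should land near 0.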
Figure 6: Spectral envelope of (a) an exponentially decaying white-noise sequence and (b) an exponentially decaying VNS. A third-octave smoothing has been applied to these magnitude responses. The dashed lines represent the average standard deviation over 500 randomly generated sequences.

Figure 7: Spectral envelope of (a) a segmented VNS and (b) a segmented and logarithmically distributed VNS. Both use four segments of equal length with values 0.85, 0.55, 0.35, and 0.20, respectively. A third-octave smoothing has been applied to these magnitude responses. The dashed lines represent the average standard deviation over 500 randomly generated sequences.
Figure 8: (a) Auto-correlogram of the sine-sweep signal. (b) Cross-correlogram of both channels using the white-noise sequence for decorrelation. (c) Cross-correlogram of both channels using the segmented VND for decorrelation. The correlation values were normalized for visualization purposes.
Figure 9: Coherence, from 20 Hz to 20 kHz, between the left and right channels when using (a) white noise and (b) velvet noise. The dashed lines represent the average standard deviation of the data after running the algorithm 500 times.

Figure 10: Coherence, from 20 Hz to 20 kHz, between the left and right channels when using (a) a segmented VNS and (b) a segmented and logarithmically distributed VNS. The dashed lines represent the average standard deviation of the data after running the algorithm 500 times.
7. REFERENCES

[1] G. S. Kendall, "The decorrelation of audio signals and its impact on spatial imagery," Computer Music J., vol. 19, no. 4, pp. 71–87, 1995.

[2] G. Potard and I. Burnett, "Decorrelation techniques for the rendering of apparent sound source width in 3D audio displays," in Proc. Int. Conf. Digital Audio Effects (DAFx-04), Naples, Italy, Oct. 2004, pp. 280–284.
[3] V. Pulkki and J. Merimaa, "Spatial impulse response rendering II: Reproduction of diffuse sound and listening tests," J. Audio Eng. Soc., vol. 54, no. 1/2, pp. 3–20, Jan./Feb. 2006.

[4] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, MIT Press, 1997.

[5] C. Faller, "Parametric multichannel audio coding: Synthesis of coherence cues," IEEE Trans. Audio, Speech, and Lang. Processing, vol. 14, no. 1, pp. 299–310, Jan. 2006.

[6] M. J. Hawksford and N. Harris, "Diffuse signal processing and acoustic source characterization for applications in synthetic loudspeaker arrays," in Proc. Audio Eng. Soc. 112th Conv., Munich, Germany, May 2002.

[7] B. Xie, B. Shi, and N. Xiang, "Audio signal decorrelation based on reciprocal-maximal length sequence filters and its applications to spatial sound," in Proc. Audio Eng. Soc. 133rd Conv., San Francisco, CA, USA, Oct. 2012.

[8] E. Kermit-Canfield and J. Abel, "Signal decorrelation using perceptually informed allpass filters," in Proc. 19th Int. Conf. Digital Audio Effects (DAFx-16), Brno, Czech Republic, Sept. 2016, pp. 225–231.

[9] M. Bouéri and C. Kyriakakis, "Audio signal decorrelation based on a critical band approach," in Proc. Audio Eng. Soc. 117th Conv., San Francisco, CA, USA, Oct. 2004.

[10] A. Politis, J. Vilkamo, and V. Pulkki, "Sector-based parametric sound field reproduction in the spherical harmonic domain," IEEE J. Selected Topics in Signal Processing, vol. 9, no. 5, pp. 852–866, Aug. 2015.

[11] J. Vilkamo, B. Neugebauer, and J. Plogsties, "Sparse frequency-domain reverberator," J. Audio Eng. Soc., vol. 59, no. 12, pp. 936–943, Dec. 2011.

[12] M. Karjalainen and H. Järveläinen, "Reverberation modeling using velvet noise," in Proc. 30th Int. Conf. Audio Eng. Soc.: Intelligent Audio Environments, Saariselkä, Finland, Mar. 2007.

[13] V. Välimäki, H.-M. Lehtonen, and M. Takanen, "A perceptual study on velvet noise and its variants at different pulse densities," IEEE Trans. Audio, Speech, and Lang. Processing, vol. 21, no. 7, pp. 1481–1488, July 2013.

[14] P. Rubak and L. G. Johansen, "Artificial reverberation based on a pseudo-random impulse response," in Proc. Audio Eng. Soc. 104th Conv., Amsterdam, The Netherlands, May 1998.

[15] P. Rubak and L. G. Johansen, "Artificial reverberation based on a pseudo-random impulse response II," in Proc. Audio Eng. Soc. 106th Conv., Munich, Germany, May 1999.

[16] K.-S. Lee, J. S. Abel, V. Välimäki, T. Stilson, and D. P. Berners, "The switched convolution reverberator," J. Audio Eng. Soc., vol. 60, no. 4, pp. 227–236, Apr. 2012.

[17] V. Välimäki, J. D. Parker, L. Savioja, J. O. Smith, and J. S. Abel, "Fifty years of artificial reverberation," IEEE Trans. Audio, Speech, and Lang. Processing, vol. 20, no. 5, pp. 1421–1448, Jul. 2012.

[18] S. Oksanen, J. Parker, A. Politis, and V. Välimäki, "A directional diffuse reverberation model for excavated tunnels in rock," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013, pp. 644–648.

[19] B. Holm-Rasmussen, H.-M. Lehtonen, and V. Välimäki, "A new reverberator based on variable sparsity convolution," in Proc. 16th Int. Conf. Digital Audio Effects (DAFx-13), Maynooth, Ireland, Sept. 2013, pp. 344–350.

[20] V. Välimäki, J. D. Parker, L. Savioja, J. O. Smith, and J. S. Abel, "More than 50 years of artificial reverberation," in Proc. Audio Eng. Soc. 60th Int. Conf. Dereverberation and Reverberation of Audio, Music, and Speech, Leuven, Belgium, Feb. 2016.

[21] V. Välimäki, B. Holm-Rasmussen, B. Alary, and H.-M. Lehtonen, "Late reverberation synthesis using filtered velvet noise," Appl. Sci., vol. 7, no. 483, May 2017.
ABSTRACT

This paper details a software implementation of an acoustic camera, which utilises a spherical microphone array and a spherical camera. The software builds on the Cross-Pattern Coherence (CroPaC) spatial filter, which has been shown to be effective in reverberant and noisy sound field conditions, and which is based on determining the cross-spectrum between two coincident beamformers. The technique is exploited in this work to capture and analyse sound scenes by estimating a probability-like parameter of sounds appearing at specific locations. Current techniques that utilise conventional beamformers perform poorly in reverberant and noisy conditions, due to the side-lobes of the beams used for the power-map. In this work we propose an additional algorithm to suppress side-lobes, based on the product of multiple CroPaC beams. A Virtual Studio Technology (VST) plug-in has been developed both for the transformation of the time-domain microphone signals into the spherical harmonic domain and for the main acoustic camera software; both can be downloaded from the companion web-page.

∗ The research leading to these results has received funding from the Aalto ELEC school.

1. INTRODUCTION

Acoustic cameras are tools developed in the spatial audio community for the capture and analysis of sound fields. In principle, an acoustic camera imitates the audiovisual aspect of how humans perceive a sound scene by visualising the sound field as a power-map. Incorporating this visual information within a power-map can significantly improve the understanding of the surrounding environment, such as the location and spatial spread of sound sources and potential reflections arriving from different surfaces. Essentially, any system that incorporates a video of the sound scene in addition to a power-map overlay can be labelled as an acoustic camera, and several commercial systems are available today.

Capturing, analysing and tracking the position of sound sources is useful in a variety of fields, including reflection tracking in architectural acoustics [1, 2, 3], sonar navigation and object detection [4, 5], and espionage and military applications [6, 7]. An acoustic camera can also be used to identify sound insulation issues and faults in electrical and mechanical equipment, since in many of these scenarios the fault can be identified as the area that emits the most sound energy. Therefore, calculating the energy in multiple directions and subsequently generating a power-map that depicts their energies relative to each other is an effective method of identifying the problem area. There have also been instances which incorporate a spherical video. One example is in [8], where a parabolic mirror with a single camera sensor is utilised and the image is obtained after unwrapping. For a study using single and multiple cameras, the reader is directed to [9].

One approach to developing an acoustic camera is to utilise a rectangular or circular microphone array and then apply beamforming in several directions to generate the power-map. However, a more common approach is to use a spherical microphone array, as it conveniently allows for the decomposition of the sound scene into individual spatial components, referred to as spherical harmonics [10]. Using these spherical harmonics, it is possible to carry out beamforming with similar spatial resolution in all directions on the sphere; spherical arrays are therefore preferred for use cases in which a large field of view has to be analysed. The most common signal-independent beamformer in the spherical harmonic domain is based on the plane-wave decomposition (PWD) algorithm, which, as the name suggests, relies on the assumption that the sound sources are received as plane waves, making it suited only to far-field sound sources [11]. These beam patterns can be further manipulated using in-phase [12], Dolph-Chebyshev [10] or maximum-energy weightings [13].

Signal-dependent beamformers can also be utilised in acoustic cameras, at the penalty of higher computational cost. A common solution is the minimum-variance distortionless response (MVDR) algorithm [14]. This approach takes into account the inter-channel dependencies between the microphone array signals, in an attempt to enhance the beamformer's performance by placing nulls towards the interferers. However, the performance of such an algorithm is relatively sensitive in scenarios where high background noise and/or reverberation are present in the sound scene [15]. An alternative approach, proposed in [16], is to apply pre-processing in order to separate the direct components from the diffuse field. This subspace-based separation has been shown to dramatically improve the performance of existing super-resolution imaging algorithms [17]. Another popular subspace-based approach is the multiple signal classification (MUSIC) algorithm [18], which was adapted for multiple-speaker localisation in [19] by incorporating a direct-path dominance test.

A recent spatial filtering technique, which can potentially be applied to spherical microphone arrays, is the cross-pattern coherence (CroPaC) algorithm [20]. It is based on measuring the correlation between coincident beamformers, providing a post-filter that suppresses noise, interferers and reverberation. The advantage of CroPaC, when compared to other spatial filtering techniques, is that it does not require the direct estimation of the microphone noise. The algorithm has recently been extended in the spherical harmonic domain for arbitrary combinations of beamformers in [21].

The purpose of this work is to detail a scalable acoustic camera system that utilises a spherical microphone array and a spherical video camera, placed in a near-coincident fashion. Several different static and adaptive beamforming techniques have been implemented within the system, and care has been taken to ensure that the proposed system is accessible to a wide range of acoustic practitioners. Additionally, we investigate the use of a coherence-based parameter to generate the power-maps. The main contributions of this work can be summarised as:

• The capture and analysis of a sound scene using a microphone array, and the subsequent estimation of a parameter that determines sound source activity in specific directions.

• The development of a real-time VST plug-in for spatially encoding the microphone array signals into spherical harmonic signals.

• Devising a real-time acoustic camera, also implemented as a VST plug-in, utilising a commercially available spherical microphone array and spherical camera.

• The use of vector-base amplitude panning (VBAP) [22] to interpolate the power-map grid.

• Optimal side-lobe suppression of the CroPaC spatial filters for analysis purposes.

This paper is organised as follows. In Section 2 we provide the necessary background on spherical microphone array processing, which includes the encoding of the microphone signals into spherical harmonic signals, and common signal-dependent and signal-independent beamforming techniques. In Section 3 we elaborate the proposed theoretical background for generating the power-maps. In Section 4 the details of the hardware and software are described. Finally, in Section 5 we present our conclusions.

2. SPHERICAL MICROPHONE ARRAY PROCESSING

Spherical microphone arrays (SMAs) are commonly utilised for sound field analysis in three-dimensional (3-D) spaces, as they provide similar performance in all directions when the sensors are placed uniformly or nearly uniformly on the sphere. In this section we provide a brief overview of how to estimate the spherical harmonic signals from the microphone signals, and how to perform adaptive and non-adaptive beamforming in the spherical harmonic domain. Only the details required for the current implementation are included here; for a detailed overview of these methods, the reader is referred to [12, 23, 10, 24, 25].

Note that matrices, M, are denoted using bold upper-case letters, and vectors, v, with bold lower-case letters.

[...] signals L that can be estimated. Please note that the frequency and time indexes are omitted for brevity of notation. The spherical harmonic signals can be estimated as

    s = Wx,    (1)

where

    s = [s_{00}, s_{1(−1)}, s_{10}, ..., s_{L(L−1)}, s_{LL}]^T ∈ C^{(L+1)²×1}    (2)

are the spherical harmonic signals and W ∈ C^{(L+1)²×Q} is the frequency-dependent spatial encoding matrix. For uniform or nearly-uniform microphone arrangements, it can be calculated as

    W = α_q W_l Y†,    (3)

where α_q are the sampling weights, which depend on the microphone distribution on the sphere [10]; for Q microphones they can be calculated as α_q = 4π/Q. Furthermore, W_l ∈ C^{(L+1)²×(L+1)²} is an equalisation matrix that eliminates the effect of the sphere, defined as

    W_l = diag(w_0, w_1, w_1, w_1, ..., w_L),    (4)

where each order weight w_l is repeated for all 2l + 1 degrees, and

    w_l = (1/b_l) · |b_l|² / (|b_l|² + λ²),    (5)

where b_l are frequency- and order-dependent modal coefficients, which contain the information on the type of the array, open or rigid, and the type of sensors, omnidirectional or directional. Lastly, λ is a regularisation parameter that controls the microphone noise amplification. For details of some alternative options for calculating the equalisation matrix W_l, the reader is referred to [26, 27], or to [28] for a signal-dependent encoder. Y(Ω_q) ∈ R^{Q×(L+1)²} is a matrix containing the spherical harmonics

    Y(Ω_q) = [ Y_{00}(Ω_1)     Y_{00}(Ω_2)     ...  Y_{00}(Ω_Q)
               Y_{1(−1)}(Ω_1)  Y_{1(−1)}(Ω_2)  ...  Y_{1(−1)}(Ω_Q)
               Y_{10}(Ω_1)     Y_{10}(Ω_2)     ...  Y_{10}(Ω_Q)
               Y_{11}(Ω_1)     Y_{11}(Ω_2)     ...  Y_{11}(Ω_Q)
               ...
               Y_{LL}(Ω_1)     Y_{LL}(Ω_2)     ...  Y_{LL}(Ω_Q) ]^T,    (6)

where Y_{lm} are the individual spherical harmonics of order l ≥ 0 and degree m ∈ [−l, l].
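As a hedged numerical sketch of the encoding in Eqs. (1)-(5), the following assumes an open array of ideal omnidirectional sensors, for which the modal coefficients are b_l = 4π i^l j_l(kr). The helper names (`sh_matrix`, `encoding_matrix`) and all parameter values (array size, order, kr, regularisation) are our illustrative choices, not the paper's setup:

```python
# Sketch of Eqs. (1)-(5), assuming an OPEN array of omnidirectional sensors.
import numpy as np
from scipy.special import sph_harm, spherical_jn

def sh_matrix(L, azi, polar):
    """Q x (L+1)^2 matrix of complex spherical harmonics Y_lm(Omega_q), cf. Eq. (6)."""
    Y = np.zeros((len(azi), (L + 1) ** 2), dtype=complex)
    col = 0
    for l in range(L + 1):
        for m in range(-l, l + 1):
            Y[:, col] = sph_harm(m, l, azi, polar)  # scipy argument order: m, l
            col += 1
    return Y

def encoding_matrix(L, azi, polar, kr, lam=0.01):
    """W = alpha_q * W_l * Y^dagger for a nearly-uniform arrangement, Eq. (3)."""
    Q = len(azi)
    alpha_q = 4.0 * np.pi / Q                                      # sampling weights
    # Open-array modal coefficients: b_l = 4*pi * (1j**l) * j_l(kr)  [assumption]
    b = np.array([4.0 * np.pi * (1j ** l) * spherical_jn(l, kr) for l in range(L + 1)])
    w = (1.0 / b) * np.abs(b) ** 2 / (np.abs(b) ** 2 + lam ** 2)   # Eq. (5)
    # Repeat each order weight for its 2l+1 degrees -> diagonal W_l, Eq. (4)
    w_lm = np.concatenate([np.full(2 * l + 1, w[l]) for l in range(L + 1)])
    Y = sh_matrix(L, azi, polar)                                   # Q x (L+1)^2
    return alpha_q * (np.diag(w_lm) @ Y.conj().T)                  # Eq. (3)

# Toy usage: encode one frame of Q = 32 microphone signals at order L = 2.
rng = np.random.default_rng(0)
azi = rng.uniform(0.0, 2.0 * np.pi, 32)
polar = np.arccos(rng.uniform(-1.0, 1.0, 32))
W = encoding_matrix(2, azi, polar, kr=2.0)
x = rng.standard_normal(32)   # one frame of microphone signals
s = W @ x                     # spherical harmonic signals, Eq. (1)
```

Here `Y.conj().T` stands in for Y† under nearly-uniform sampling; a rigid-sphere array such as the Eigenmike would replace the open-array b_l with the corresponding rigid-array modal coefficients.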
(a) Side-lobe suppression for order L = 1. The two beams on the left show the rotating beam patterns and the beam on the far-right is the resulting beam pattern.
(b) Side-lobe suppression for order L = 2. The three beams on the left show the rotating beam patterns and the beam on the far-right is the resulting beam pattern.
(c) Side-lobe suppression for order L = 3. The four beams on the left show the rotating beam patterns and the beam on the far-right is the resulting beam pattern.
[...] then transformed into the time-frequency domain, and a parameter is estimated for each frequency index, k, and time index, n. The cross-spectrum is then calculated between two spherical harmonic signals of orders L and L + 1 and the same degree m,

    G(Ω_j, k, n) = λ ℜ[s_L(Ω_j, k, n)* s_{L+1}(Ω_j, k, n)] / Σ_{l=L}^{L+1} |s_l(Ω_j, k, n)|²,    (16)

where ℜ denotes the real operator, s_L and s_{L+1} are the spherical harmonic signals for a look direction Ω_j and the same degree m, * denotes the complex conjugate, and λ is an order-dependent normalisation factor that ensures G_MAP ∈ [0, 1]. The normalisation factor can be calculated as

    λ = ((L + 1)² − (L − 1)² + 1) / 2 = (4L + 1) / 2.    (17)

The power-map is then estimated for a grid of look directions Ω = (Ω_1, Ω_2, ..., Ω_J), averaged across frequencies, and subjected to a half-wave rectifier. The resulting power-map is given by

    G_MAP(Ω_j, n) = (1/K) Σ_{k=1}^{K} max(0, G(Ω_j, k, n)).    (18)

The half-wave rectification ensures that only sounds arriving from the look direction are analysed. An illustration of its effect on the directional selectivity of the CroPaC beams is depicted in Fig. 1.

3.2. Side-lobe suppression

The calculation of the cross-spectrum between different orders of beamformers results in unwanted side-lobes, which exhibit different shapes depending on the order. A visual depiction of these aberrations in Fig. 2 has been generated by multiplying the spherical harmonics Y_{LL} and Y_{(L+1)(L+1)} together, for L = 1 (Fig. 2(a), left), L = 2 (Fig. 2(b), left) and L = 3 (Fig. 2(c), left).

These side-lobes can potentially introduce biases in the power-map. Therefore, in this sub-section we propose a technique to suppress them by multiplying rotated versions of the estimated beams, where the number of estimated beams is determined by the order L. The side-lobe-suppressing parameter G_SUP is estimated as

    G_SUP(Ω_j, n) = G_MAP(Ω_j, n),                      if L = 1,
                  = Π_{i=1}^{L} G_MAP^{ρ_i}(Ω_j, n),    if L > 1,    (19)

where the superscript ρ_i denotes an axis-symmetric roll of π/(L + 1) radians about the look direction of the beam. Such a roll can successfully suppress the side-lobes that are generated by the multiplication of the different spherical harmonics. This concept is illustrated in Fig. 2, where each row illustrates the side-lobe suppression for a different order. In the top row, L = 1, which results in a single roll of π/2 in (19). For L = 2 (middle row) and L = 3 (bottom row), three and four rolls of π/3 and π/4 are applied, respectively. The resulting enhanced beam patterns G_SUP ∈ [0, 1], which are derived from the product of multiple G_MAP ∈ [0, 1], are shown on the right-hand side of the figures.

Figure 3: Block diagram of the proposed parametric acoustic camera system. The microphone array signals are processed in the time-frequency domain using the proposed parametric analysis. The output is then averaged across frequencies and the grid is interpolated with VBAP for visualisation. The resulting power-map is projected on top of the spherical video.

4. IMPLEMENTATION DETAILS

In order to make the software easily accessible, the acoustic camera was implemented as a Virtual Studio Technology (VST) audio plug-in¹ using the open-source JUCE framework. The motivation for selecting the VST architecture is the wide range of digital audio workstations (DAWs) that support it. This enables an acoustic practitioner to select their DAW of choice, which in turn acts as the bridge between real-time microphone signal capture and the subsequent visualisation of the energy distribution in the sound scene. The rationale behind the selection of JUCE is the many useful classes it offers, most notable of which is camera support, which allows real-time video to be placed behind the corresponding power-map. Additionally, the framework is developed with cross-platform support in mind and can also produce other audio plug-in formats, provided that the corresponding source development kits are linked to the project.

The algorithms within the acoustic camera have been generalised to support spherical harmonic signals up to the 7th order. These signals can optionally be generated using the accompanying Mic2SH VST, which accepts input signals from spherical microphone arrays such as A-format microphones (1st order) or the Eigenmike (up to 4th order). In the case of the Eigenmike, Mic2SH will also perform the necessary frequency-dependent equalisation, described in Section 2.1, in order to mitigate the radial dependency incurred when estimating the pressure on a rigid sphere. Different equalisation strategies that are common in the literature have been implemented, such as Tikhonov-based regularised inversion [23] and soft limiting [33].

In order to optimise the linear algebra operations, the code within the audio plug-in has been written to conform to the Basic Linear Algebra Subprograms (BLAS) standard. Other operations, such as lower-upper (LU) factorisation and singular value decomposition (SVD), are addressed by the Linear Algebra Package (LAPACK) standard, for which Apple's Accelerate framework and Intel's MKL are supported for the Mac OS X and Windows versions, respectively.

The overall block diagram of the proposed system is shown in Fig. 3. The time-domain microphone array signals are initially transformed into spherical harmonic signals using the Mic2SH audio plug-in, and are then transformed into the time-frequency domain by the acoustic camera. For computational efficiency, the spherical harmonic signals are rotated after the time-frequency transform towards the points defined by the pre-computed spherical grid. These signals are then fed into a beamformer unit, which forms the two beams that are required to compute the cross-spectrum-based parameter for each grid point. Note that when the side-lobe suppression mode is enabled, one parameter is estimated per roll and the resulting parameters are multiplied, as defined in (19). For visualisation, the parameter value at each of the grid points is interpolated using VBAP and projected on top of the spherical video.

The user interface for the acoustic camera consists of a view window and a parameter editor. The view window displays the camera feed and overlays the user-selected power-map in real-time. The field-of-view (FOV) and the aspect ratio are user-definable in the parameter editor, which allows the VST to accommodate a wide range of different web-cam devices. Additionally, the image frames from the camera can optionally be mirrored using an appropriate affine transformation (left-right, or up-down), in order to accommodate a variety of different camera orientations.

¹ The VST plug-ins are available for download on the companion web-page: http://research.spa.aalto.fi/publications/papers/acousticCamera/
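As a concrete, hedged sketch of the parameter estimation in Eqs. (16)-(18) for a single look direction: the two coincident beam spectra below are random placeholder data, and the function name `cropac_map` is our illustrative choice:

```python
# Sketch of Eqs. (16)-(18): cross-spectrum between two coincident beams of
# orders L and L+1, scaled by lambda = (4L+1)/2, half-wave rectified, and
# averaged over the K frequency bins.
import numpy as np

def cropac_map(s_L, s_L1, L):
    lam = (4 * L + 1) / 2.0                          # Eq. (17)
    denom = np.abs(s_L) ** 2 + np.abs(s_L1) ** 2     # sum over orders L and L+1
    G = lam * np.real(np.conj(s_L) * s_L1) / denom   # Eq. (16)
    return np.mean(np.maximum(0.0, G))               # Eq. (18): rectify, average

rng = np.random.default_rng(1)
K = 133                                              # analysis frequency bands
s_L = rng.standard_normal(K) + 1j * rng.standard_normal(K)
s_L1 = rng.standard_normal(K) + 1j * rng.standard_normal(K)
g_map = cropac_map(s_L, s_L1, L=1)
```

For uncorrelated beam outputs, as here, the rectified average stays close to zero; a source in the look direction drives both beams coherently and pushes the parameter towards its upper bound.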
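The per-roll multiplication of Eq. (19) can likewise be sketched. The rolled maps here are placeholder values in [0, 1] rather than the output of actual rolled beam pairs, and `suppress_sidelobes` is an illustrative name:

```python
# Sketch of the side-lobe suppression of Eq. (19): for L > 1 the final map
# is the product of the CroPaC maps obtained from rolled beam pairs.
import numpy as np

def suppress_sidelobes(G_rolls):
    """G_rolls: (num_rolls, J) array of G_MAP values, one row per rolled beam pair."""
    if G_rolls.shape[0] == 1:          # L = 1: a single map, nothing to multiply
        return G_rolls[0].copy()
    return np.prod(G_rolls, axis=0)    # L > 1: product over the rolls

rng = np.random.default_rng(2)
J = 252                                 # grid points of the spherical grid
G_rolls = rng.uniform(0.0, 1.0, (3, J))
G_sup = suppress_sidelobes(G_rolls)
```

Because every map lies in [0, 1], the product can only attenuate: a side-lobe that is high in one rolled map but low in another is suppressed, whereas the main lobe, which is high in every rolled map, survives.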
Figure 4: (a) Nearly-uniform spherical grid on a unit sphere. (b) Nearly-uniform spherical grid plotted on a 2-D equirectangular plane (black colour) and a 2-D VBAP interpolated grid overlay (cyan colour).
4.1. Time-frequency transform

The time-frequency transform utilised in this work is filter-bank-based and was originally implemented for the study in [21]². The filter-bank was configured to use a hop size of 128 samples and an FFT size of 1024. Additionally, the optional hybrid filtering mode offered by the filter-bank was enabled, which allows for more resolution in the low-frequency region by dividing the lowest four bands into eight, thus attaining 133 frequency bands in total. A sampling rate of 48 kHz was chosen, and the power-map analysis utilises frequency bands with centre frequencies between [140, 8000] Hz. The upper limit of 8000 Hz was selected due to the spatial aliasing of the microphone array used [34].

² https://github.com/jvilkamo/afSTFT

4.2. Power-map modes and sampling grids

The power-map is generated by sampling the sphere with a spherical grid. A precomputed, almost-uniform spherical grid was chosen that provides 252 nearly-uniformly distributed data points on the sphere. The grid is based on the 3LD library [35], where the points are generated by utilising geodesic spheres. This is performed by tessellating the facets of a polyhedron and extending them to the radius of the original polyhedron; the intersection points between them are the points of the spherical grid. Two different power-map modes and two pseudo-spectrum methods were implemented in the spherical harmonic domain: conventional signal-independent beamformers (PWD, minimum side-lobe, maximum energy and Dolph-Chebyshev); MVDR beamformers; multiple signal classification (MUSIC); and the proposed cross-spectrum-based method with the additional side-lobe suppression. The power-map/pseudo-spectrum values are then summed over the analysis frequency bands and averaged over time slots using a one-pole filter

    Ĝ_SUP(Ω_j, n) = α Ĝ_SUP(Ω_j, n − 1) + (1 − α) G_SUP(Ω_j, n),    (20)

where α ∈ [0, 1] is the smoothing parameter. The spherical power-map values are then interpolated to attain a two-dimensional (2-D) power-map, using pre-computed VBAP gains. The spherical and interpolated grids are shown in Fig. 4. These 2-D power-maps are then further interpolated using bi-cubic interpolation, depending on the display settings, and are normalised such that Ĝ_SUP ∈ [0, 1]. The pixels that correspond to the 2-D interpolated results are then coloured appropriately, such that red indicates high energy and blue indicates low energy. Additionally, the transparency factor is gradually increased for the lower-energy beams, to ensure that they do not unnecessarily detract from the video stream.

4.3. Example power-maps

Example power-maps are shown in Fig. 5 for four different modes: the basic PWD beamformer, the adaptive MVDR beamformer, the subspace MUSIC approach, and the proposed technique. The recordings were performed utilising the Eigenmike microphone array and a RICOH Theta S spherical camera. Fourth-order spherical harmonic signals were generated using the accompanying Mic2SH VST plug-in, which were then used by all four power-map modes. The video was unwrapped using the software provided by RICOH and then combined with the calculated power-map to complete the acoustic camera system. However, since the camera may not be facing the same look direction as the microphone array, a calibration process is required in order to align the power-map with the video stream. It should also be noted that, since the two devices do not share a common origin, sources that are very close to the array may not be correctly aligned. The resulting power-maps are shown for two different recording scenarios: a staircase with a high reverberation time of approximately 2 seconds (Fig. 5, bottom) and a corridor of approximately 1.5 seconds (Fig. 5, top).

It can be seen from Fig. 5 (top) that there is one direct source and at least one prominent early reflection. However, in the case of PWD, the distinction between the two paths is the least clear, and PWD also erroneously indicates that the sources are spatially larger
(a) PWD, corridor; (b) MVDR, corridor; (c) MUSIC, corridor; (d) proposed, corridor; (e) PWD, stairs; (f) MVDR, stairs; (g) MUSIC, stairs; (h) proposed, stairs.
Figure 5: Images of the acoustic camera VST display, using fourth-order spherical harmonic signals and the four processing modes in reverberant environments.
than they actually are. The distinction between the two paths is improved slightly when using MVDR beamformers, and improved further when utilising the MUSIC algorithm. However, in the case of the proposed technique, the two paths are completely isolated, and a second early reflection with lower energy becomes visible, which is not as evident in the other three methods. PWD also indicates a sound source that is likely the result of the side-lobes pointing towards the real sound source, thus highlighting the importance of side-lobe suppression for acoustic camera applications. Fig. 5 (bottom) indicates similar performance; however, in the case of MUSIC, the ceiling reflection is more difficult to distinguish as a separate entity.

5. CONCLUSIONS

This paper has presented an acoustic camera that is easily accessible as a VST plug-in. Among the power-map modes available is the proposed coherence-based parameter, which can be tuned to this particular use case via additional suppression of the side-lobes. This method presents an intuitive approach to attaining a power-map, and is potentially easier and computationally cheaper to implement than MVDR or MUSIC, as it does not rely on lower-upper decompositions, Gaussian elimination, or singular value decompositions. It is also demonstrated that, in the simple recording scenarios, the proposed method can be inherently tolerant to reverberation.

6. REFERENCES

[1] Adam O'Donovan, Ramani Duraiswami, and Dmitry Zotkin, "Imaging concert hall acoustics using visual and audio cameras," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 5284–5287.

[2] Angelo Farina, Alberto Amendola, Andrea Capra, and Christian Varani, "Spatial analysis of room impulse responses captured with a 32-capsule microphone array," in Audio Engineering Society Convention 130, 2011.

[3] Lucio Bianchi, Marco Verdi, Fabio Antonacci, Augusto Sarti, and Stefano Tubaro, "High resolution imaging of acoustic reflections with spherical microphone arrays," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2015, pp. 1–5.

[4] G. C. Carter, "Time delay estimation for passive sonar signal processing," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 3, pp. 463–470, 1981.

[5] H. Song, W. A. Kuperman, W. S. Hodgkiss, Peter Gerstoft, and Jea Soo Kim, "Null broadening with snapshot-deficient covariance matrices in passive sonar," IEEE Journal of Oceanic Engineering, vol. 28, no. 2, pp. 250–261, 2003.

[6] R. K. Hansen and P. A. Andersen, "A 3D underwater acoustic camera: properties and applications," in Acoustical Imaging, pp. 607–611, Springer, 1996.

[7] Russell A. Moursund, Thomas J. Carlson, and Rock D. Peters, "A fisheries application of a dual-frequency identification sonar acoustic camera," ICES Journal of Marine Science, vol. 60, no. 3, pp. 678–683, 2003.

[8] Leonardo Scopece, Angelo Farina, and Andrea Capra, "360 degrees video and audio recording and broadcasting employing a parabolic mirror camera and a spherical 32-capsules microphone array," IBC 2011, pp. 8–11, 2011.

[9] Adam O'Donovan, Ramani Duraiswami, and Jan Neumann, "Microphone arrays as generalized cameras for integrated audio visual processing," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.

[10] Boaz Rafaely, Fundamentals of Spherical Array Processing, vol. 8, Springer, 2015.

[11] Miljko M. Erić, "Some research challenges of acoustic camera," in Proc. 19th Telecommunications Forum (TELFOR), 2011, pp. 1036–1039.

[12] Jérôme Daniel, Représentation de champs acoustiques, application à la reproduction et à la transmission de scènes sonores complexes dans un contexte multimédia, Ph.D. thesis, University of Paris 6, Paris, France, 2000.

[13] Franz Zotter, Hannes Pomberger, and Markus Noisternig, "Energy-preserving ambisonic decoding," Acta Acustica united with Acustica, vol. 98, no. 1, pp. 37–47, 2012.
[14] Michael Brandstein and Darren Ward, Microphone Arrays: Signal Processing Techniques and Applications, Springer Science & Business Media, 2013.

[15] Michael D. Zoltowski, "On the performance analysis of the MVDR beamformer in the presence of correlated interference," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 6, pp. 945–947, 1988.

[16] Nicolas Epain and Craig T. Jin, "Super-resolution sound field imaging with sub-space pre-processing," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 350–354.

[17] Tahereh Noohi, Nicolas Epain, and Craig T. Jin, "Direction of arrival estimation for spherical microphone arrays by combination of independent component analysis and sparse recovery," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 346–349.

[18] Ralph Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986.

[19] O. Nadiri and B. Rafaely, "Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 10, pp. 1494–1505, 2014.

[20] Symeon Delikaris-Manias and Ville Pulkki, "Cross pattern coherence algorithm for spatial filtering applications utilizing microphone arrays," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 11, pp. 2356–2367, 2013.

[21] Symeon Delikaris-Manias, Despoina Pavlidi, Ville Pulkki, and Athanasios Mouchtaris, "3D localization of multiple audio sources utilizing 2D DOA histograms," in Proc. 24th European Signal Processing Conference (EUSIPCO), 2016, pp. 1473–1477.

[22] Ville Pulkki, "Virtual sound source positioning using vector base amplitude panning," Journal of the Audio Engineering Society, vol. 45, no. 6, pp. 456–466, 1997.

[23] Sébastien Moreau, Jérôme Daniel, and Stéphanie Bertet, "3D sound field recording with higher order ambisonics – objective measurements and validation of a 4th order spherical microphone," in 120th Convention of the AES, 2006, pp. 20–23.

[24] Thushara D. Abhayapala, "Generalized framework for spherical microphone arrays: Spatial and frequency decomposition," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 5268–5271.

[25] Heinz Teutsch, Modal Array Signal Processing: Principles and Applications of Acoustic Wavefield Decomposition, vol. 348, Springer, 2007.

[26] David L. Alon, Jonathan Sheaffer, and Boaz Rafaely, "Robust plane-wave decomposition of spherical microphone array recordings for binaural sound reproduction," The Journal of the Acoustical Society of America, vol. 138, no. 3, pp. 1925–1926, 2015.

[27] Stefan Lösler and Franz Zotter, "Comprehensive radial filter design for practical higher-order ambisonic recording," Fortschritte der Akustik, DAGA, pp. 452–455, 2015.

[28] Christian Schörkhuber, Franz Zotter, and Robert Höldrich, "Signal-dependent encoding for first-order ambisonic microphones," Fortschritte der Akustik, DAGA, 2017.

[29] J. Meyer and G. Elko, "A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), May 2002, vol. 2, pp. II-1781–II-1784.

[30] Boaz Rafaely and Maor Kleider, "Spherical microphone array beam steering using Wigner-D weighting," IEEE Signal Processing Letters, vol. 15, pp. 417–420, 2008.

[31] J. Atkins, "Robust beamforming and steering of arbitrary beam patterns using spherical arrays," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2011, pp. 237–240.

[32] Symeon Delikaris-Manias, Juha Vilkamo, and Ville Pulkki, "Signal-dependent spatial filtering based on weighted-orthogonal beamformers in the spherical harmonic domain," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 9, pp. 1507–1519, 2016.

[33] Benjamin Bernschütz, Christoph Pörschmann, Sascha Spors, and Stefan Weinzierl, "Soft-Limiting der modalen Amplitudenverstärkung bei sphärischen Mikrofonarrays im Plane-Wave-Decomposition-Verfahren," in Proc. 37. Deutsche Jahrestagung für Akustik (DAGA 2011), pp. 661–662, 2011.

[34] Boaz Rafaely, Barak Weiss, and Eitan Bachmat, "Spatial aliasing in spherical microphone arrays," IEEE Transactions on Signal Processing, vol. 55, no. 3, pp. 1003–1010, 2007.

[35] Florian Hollerweger, Periphonic Sound Spatialization in Multi-User Virtual Environments, Citeseer, 2006.
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
ABSTRACT

In this paper, we introduce an approach to estimate the pickup position and plucking point on an electric guitar for both single notes and chords recorded through an effects chain. We evaluate the accuracy of the method on direct input signals along with 7 different combinations of guitar amplifier, effects, loudspeaker cabinet and microphone. The autocorrelation of the spectral peaks of the electric guitar signal is calculated, and the two minima that correspond to the locations of the pickup and plucking event are detected. In order to model the frequency response of the effects chain, we flatten the spectrum using polynomial regression. The errors decrease after applying the spectral flattening method. The median absolute error for each preset ranges from 2.10 mm to 7.26 mm for pickup position and from 2.91 mm to 21.72 mm for plucking position estimates. For strummed chords, faster strums are more difficult to estimate but still yield accurate results, with median absolute errors for pickup position estimates of less than 10 mm.

* Zulfadhli Mohamad is supported by the Malaysian government agency, Majlis Amanah Rakyat (MARA).

1. INTRODUCTION

The popularity of the electric guitar began to rise in the mid 1950s, and it soon became, and has since remained, one of the most important instruments in Western popular music. Well-known guitar players can often be recognised by the distinctive electric guitar tone they create. Along with the player's finger technique, the unique tone they achieve is produced by their choice of electric guitar model, guitar effects and amplifiers. Some popular musicians prefer using one or two choices of guitar model for recordings and live performances. Some guitar enthusiasts are keen to know how their favourite guitar players produce their unique tones. Indeed, some will go as far as to purchase the same electric guitar model, effects and amplifiers to replicate the sounds of their heroes.

The tones produced by popular electric guitar models such as Fender and Gibson are clearly distinguishable from each other, with different pickup location, width and sensitivity, along with the circuit response of the model, leading to different timbres. The pickup location of an electric guitar contributes significantly to the sound, where it produces a comb-filtering effect on the spectrum. The tonal differences can be heard just by switching the pickup configuration of the guitar. Thus, if the type of electric guitar is known, the estimated pickup position can help distinguish which pickup configuration is selected. Equally, where the guitar model on a recording is not known, pickup position estimates could be useful in deducing which type of electric guitar could have produced a particular tone.

In 1990, research on synthesising electric guitar sound was proposed by Sullivan, whereby the Karplus-Strong algorithm is extended to include a pickup model with distortion and feedback effects [1]. Commercial firms such as Roland and Line 6 produce guitar synthesisers that use hexaphonic pickups, allowing them to process the sound from each string separately to mimic the sound of many other popular guitars [2, 3]. They model the pickups of popular electric guitars, including the effects of pickup position, height and magnetic aperture. Lindroos et al. [4] introduced a parametric electric guitar synthesis where conditions that affect the sound, such as the force of the pluck, plucking point and pickup position, can be changed. Further details of modelling the magnetic pickup are studied by Paiva et al. [5], including the effects of pickup width, nonlinearity and circuit response.

Since electric guitar synthesis requires the pickup and plucking positions to be known, a method to estimate these parameters could be useful when trying to replicate the sound of popular guitarists from a recording. Several papers propose techniques to estimate the plucking point on an acoustic guitar, using either a frequency domain approach [6, 7] or a time domain approach [8]. A technique to estimate the plucking point and pickup position of an electric guitar is proposed by Mohamad et al. [9], where direct input electric guitar signals are used in the experiments.

Our previous work has dealt with recordings of isolated guitar tones. The first major contribution of the current paper is to extend the previous work to estimate the pickup and plucking locations of electric guitar signals that are processed through a combination of emulated electric guitar effect, amplifier, loudspeaker and microphone. This can bring us closer to estimating pickup and plucking locations of real-world electric guitar recordings. We also introduce a technique to flatten the spectrum before calculating the autocorrelation of the spectral peaks and finding the two minima of the autocorrelation. The second major contribution is to investigate the performance of our method on strummed chords and propose modifications to mitigate the effects of overlapping tones. We perform experiments with chords strummed at different speeds and with different bass notes to determine optimal parameters for our method.

In Sec. 2, we explain the comb-filtering effects produced in electric guitar tones. Sec. 3 describes the method for estimating the pickup and plucking position of an electric guitar. Sec. 4 explains the datasets that are used in this paper: one is for testing the effects of different combinations of electric guitar effects, amplifier, loudspeaker and microphone, and the other is for testing the effects of various chords. We evaluate our method for various chains of effects in Sec. 5 and various chords in Sec. 6. The conclusion is presented in Sec. 7.
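Sec. 2 and its Eq. (1) fall outside this excerpt, but the comb-filtering effect described above can be illustrated with the standard ideal-string model, in which the k-th harmonic is weighted by one sinusoidal comb for the plucking point ρ and another for the pickup position d (both measured from the bridge, with string length L). The function below is a hypothetical sketch under that textbook model, not the paper's exact formulation:

```python
import numpy as np

def harmonic_weights(rho, d, L, num_harmonics=40):
    """Ideal-string comb weighting: plucking at rho and sensing at d each
    impose a |sin(pi*k*x/L)| envelope on the k-th harmonic."""
    k = np.arange(1, num_harmonics + 1)
    pluck_comb = np.sin(np.pi * k * rho / L)   # dips where a node sits at rho
    pickup_comb = np.sin(np.pi * k * d / L)    # dips where a node sits at d
    return np.abs(pluck_comb * pickup_comb) / k  # 1/k: ideal plucked-string rolloff

# Example: 648 mm scale length, plucked 90 mm and sensed 102 mm from the bridge
weights = harmonic_weights(rho=90.0, d=102.0, L=648.0)
print(weights[:5])
```

Because ρ ≈ L/7 here, the weights dip strongly around every 7th harmonic; it is these two interleaved dip patterns that the method of Sec. 3 recovers.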
3.3. Spectral flattening

In this paper, we introduce an approach to flatten the spectrum of the observed data Xk. We do this because the ideal model of Sec. 2 […]

Since the plucking point ρ and pickup position d produce two comb-filtering effects, as shown in Eq. (1), the delay of the comb filters can be estimated using the autocorrelation of the log magnitude of the spectral peaks. The log-correlation as described by Traube and Depalle [7] is calculated as:

    Γ(τ) = Σ_{k=1}^{K} log(X̄_k²) cos(2πkτ/T)        (4)

where T is the period of the signal. For an electric guitar, it is expected that we would see two minima in the log-correlation, where the time lag of one trough, τ_d, indicates the position of the pickup and the time lag of the other, τ_ρ, indicates the position of the plucking event. Therefore, the estimated pickup position d̂ and plucking point ρ̂ can be calculated by finding the time lags τ̂_d and τ̂_ρ in the log-correlation using trough detection, where d̂ = τ̂_d L/T and ρ̂ = τ̂_ρ L/T.

Figure 8: Seven combinations of emulated electric guitar effects, amplifier, loudspeaker and microphone.

[…] pickup position and the other as the estimated plucking point. In this example, the absolute errors for the pickup position and plucking point estimates are 3.62 mm and 1.93 mm respectively.

Without flattening the spectrum, the troughs are not as apparent and it could be difficult to detect the two minima. An example is given in Fig. 6, where the electric guitar is played on the open 5th string, plucked 90 mm from the bridge with the pickup situated at 102 mm from the bridge. It shows two log-correlations of the spectral peaks, one with spectral flattening (solid line) and the other without spectral flattening (dashed line). We can see that the troughs corresponding to the pickup and plucking positions are emphasised if the spectrum has been flattened. The plucking point estimate is also closer to the known plucking point.
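A minimal sketch of the flatten-then-correlate procedure above, assuming the harmonic peak magnitudes X̄_k have already been extracted. The polynomial order, search grid and synthetic test configuration are illustrative choices, not the paper's settings:

```python
import numpy as np

def log_correlation(peak_mags, T, taus):
    """Eq. (4): Gamma(tau) = sum_{k=1}^{K} log(Xbar_k^2) cos(2*pi*k*tau/T)."""
    k = np.arange(1, len(peak_mags) + 1)
    log_spec = np.log(peak_mags ** 2)
    return np.array([np.sum(log_spec * np.cos(2 * np.pi * k * tau / T))
                     for tau in taus])

def estimate_positions(peak_mags, T, L, poly_order=3):
    """Flatten the log spectrum with a polynomial fit (cf. PRSF), then map
    the two deepest log-correlation troughs to distances from the bridge."""
    k = np.arange(1, len(peak_mags) + 1)
    log_spec = np.log(peak_mags ** 2)
    trend = np.polyval(np.polyfit(k, log_spec, poly_order), k)
    flat_mags = np.exp((log_spec - trend) / 2.0)   # flattened magnitudes

    taus = np.linspace(0.01 * T, 0.49 * T, 2000)
    gamma = log_correlation(flat_mags, T, taus)
    interior = (gamma[1:-1] < gamma[:-2]) & (gamma[1:-1] < gamma[2:])
    idx = np.where(interior)[0] + 1                # indices of local minima
    idx = idx[np.argsort(gamma[idx])][:2]          # keep the two deepest troughs
    return sorted(taus[idx] * L / T)               # lag tau -> distance tau*L/T

# Hypothetical configuration: scale length 648 mm, plucked 150 mm and sensed
# 40 mm from the bridge, with harmonics following the ideal comb model
k = np.arange(1, 41)
mags = np.abs(np.sin(np.pi * k * 150 / 648) * np.sin(np.pi * k * 40 / 648)) / k
print(estimate_positions(mags, T=1.0, L=648.0))
```

On this synthetic input the two recovered distances land close to 40 mm and 150 mm; as the text notes, troughs that sit close together (such as the 90 mm / 102 mm example) are harder to resolve without flattening.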
Figure 9: Box plot of absolute error in pickup position estimates for each preset. Note that the y-axis is in log scale.

Figure 10: Box plot of absolute error in plucking point estimates for each preset. Note that the y-axis is in log scale.
Figure 11: Absolute errors for pickup position estimates of chords. The crosses represent E major chords, circles represent A major chords and triangles represent G major chords. Note that the y-axis is in log scale.

Figure 13: Box plot of absolute error in pickup position estimates for chords for each preset. Each preset has three box plots, where the boxes (from left to right) are for fast, slow and very slow strums respectively. Note that the y-axis is in log scale.
Table 1 shows the median absolute errors for pickup and plucking position estimates. The average median absolute error across all presets for pickup and plucking position estimates decreases by 0.14 mm and 1.58 mm respectively using PRSF compared to LRSF. Overall, the median absolute errors decrease when the spectral flattening methods are applied. This suggests that we improved the method by introducing a technique to flatten the spectrum.

6. RESULTS: CHORDS

In this section, we test the accuracy of our method on strummed chords. The chords played are E major, A major and G major, where the first string to be struck is the 6th string for all chords (downstrokes). Each chord is strummed at 3 different speeds and 3 positions. Since the pickup and pluck positions are unlikely to change during the strum, and our method only requires the first few pitch periods of the electric guitar tone, it should be possible to estimate the pickup and plucking positions where the second note is plucked after a few cycles of the first note. Furthermore, we manually measure the time between the first and second note, tc. For our method to be unaffected by the strum, the shortest time allowed between the first and second note would be 36.4 ms (3 cycles of a note at 82.41 Hz) for the worst case scenario of the first pitch being E2, the lowest pitch on the guitar. However, natural strumming of a guitar leads to values of tc of 80 ms for slow strums and 20 ms for fast strums.

The method described in Sec. 3 is used to estimate the pickup and plucking positions of each chord, where the spectrum is flattened using PRSF. The fundamental frequencies of the first note struck on each chord are known in advance: 82.41 Hz (E major and A major) and 98.00 Hz (G major). To present the worst case scenario, the A major chord is played in second inversion (i.e. with a low E in the bass). Multiple pitch estimation could be used to estimate the fundamental frequency of the first note [16]. The total number of harmonics K is set to 40 for the Direct Input, Jazz,
Blues 1, Funk and Pop presets and 30 for the Blues 2, Rock and Metal presets. The first 2 cycles are taken for the STFT analysis when tc is shorter than 40 ms and the first 3 cycles are taken when tc is longer than 40 ms. A shorter window is needed for faster strums so that less of the second note is included in the STFT analysis.

Fig. 11 and 12 show the absolute errors for pickup and plucking position estimates respectively for direct input signals. The absolute errors increase for faster strums; nevertheless, most of the errors are less than 20 mm for both pickup and plucking position estimates, even though the second note starts to bleed into the window. In Fig. 11 and 12, the shortest tc for the E major, G major and A major chords are 15 ms, 11 ms and 17 ms respectively. This means that the second note for each chord overlaps 38%, 45% and 10% of the analysed window respectively. Chords with later second notes (i.e. a smaller overlap) yield more accurate results; for example, Fig. 11 shows that the pickup estimates all have less than 2 mm error for the A major chord at tc = 17 ms. Furthermore, the accuracy of the pickup and plucking position estimates increases when tc is more than 20 ms.

Fig. 13 and 14 show the box plots of absolute errors in pickup and plucking position estimates respectively for each preset. Similar to single note guitar tones, the errors increase as the signal gets more harmonically distorted. Errors also increase for fast strums. Nevertheless, the median absolute errors for pickup position estimates are less than 10 mm.

7. CONCLUSIONS

We have presented a technique to estimate the pickup position and plucking point on an electric guitar recorded through a typical signal processing chain, i.e. electric guitar effects, amplifier, loudspeaker and microphone, with different settings for each piece of equipment. For each preset, the median absolute errors for pickup position and plucking point estimates are 2.10 mm – 7.26 mm and 2.91 mm – 21.72 mm respectively, where errors increase when signals are more distorted. The other aspects of the signal chain appear to have little effect on our results. Pickup position estimates can be used to distinguish which pickup is selected for a known electric guitar. For an unknown electric guitar, the estimates can be used to distinguish between typical electric guitar models.

The method can reliably estimate the pickup position of most clean, compressed and slightly overdriven tones, i.e. with the Jazz, Blues 1, Blues 2, Pop and Funk presets, where 89% – 99% of the errors are less than 10 mm. For the Rock and Metal presets, 57% and 63% of the errors are less than 10 mm respectively. Nevertheless, the median absolute errors for both presets are less than 8 mm. We also introduced a flattening method using linear and polynomial regression, where the errors decrease after applying the spectral flattening method. The median absolute errors for pickup and plucking position estimates decrease by 0.14 mm and 1.58 mm respectively across all presets compared to the linear regression flattening method.

Furthermore, we evaluated our method for various downstroke chords. Pickup and pluck positions for most chords are detected correctly, with errors similar to those observed for single notes. A small number of outliers are observed, mostly caused by overlapping tones disturbing our analysis method. The pickup position estimates are quite robust to downstroke chords, where the median absolute errors for each preset are less than 10 mm. The errors increase for faster strums (tc less than 30 ms). This may suggest that upstroke chords would pose less of a problem, where the first string struck has a higher pitch (or shorter period).

Further investigation could look into distinguishing between pickup position and plucking point estimates. Plucking positions vary constantly while the pickup position almost always remains fixed in one place; distinguishing the estimates from each other might therefore be achieved by determining which estimates deviate more frequently over a sequence of notes. An electric guitar commonly has an option to mix two pickups together, so in our further work, we are looking into estimating the pickup positions and plucking point of a mixed pickup signal.

8. REFERENCES

[1] C. R. Sullivan, "Extending the Karplus-Strong algorithm to synthesize electric guitar timbres with distortion and feedback," Computer Music Journal, vol. 14, no. 3, pp. 26–37, 1990.
[2] A. Hoshiai, "Musical tone signal forming device for a stringed musical instrument," 22 November 1994, US Patent 5,367,120.
[3] P. J. Celi, M. A. Doidic, D. W. Fruehling, and M. Ryle, "Stringed instrument with embedded DSP modeling," 7 September 2004, US Patent 6,787,690.
[4] N. Lindroos, H. Penttinen, and V. Välimäki, "Parametric electric guitar synthesis," Computer Music Journal, vol. 35, no. 3, pp. 18–27, 2011.
[5] R. C. D. Paiva, J. Pakarinen, and V. Välimäki, "Acoustics and modeling of pickups," Journal of the Audio Engineering Society, vol. 60, no. 10, pp. 768–782, 2012.
[6] C. Traube and J. O. Smith, "Estimating the plucking point on a guitar string," in Proceedings of the Conference on Digital Audio Effects (DAFx), 2000, pp. 153–158.
[7] C. Traube and P. Depalle, "Extraction of the excitation point location on a string using weighted least-square estimation of a comb filter delay," in Proceedings of the Conference on Digital Audio Effects (DAFx), 2003.
[8] H. Penttinen and V. Välimäki, "A time-domain approach to estimating the plucking point of guitar tones obtained with an under-saddle pickup," Applied Acoustics, vol. 65, no. 12, pp. 1207–1220, 2004.
[9] Z. Mohamad, S. Dixon, and C. Harte, "Pickup position and plucking point estimation on an electric guitar," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 651–655.
[10] P. Masri, Computer Modelling of Sound for Transformation and Synthesis of Musical Signals, Ph.D. thesis, University of Bristol, 1996.
[11] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[12] S. Dixon, M. Mauch, and D. Tidhar, "Estimation of harpsichord inharmonicity and temperament from musical recordings," The Journal of the Acoustical Society of America, vol. 131, no. 1, pp. 878–887, 2012.
[13] J. O. Smith, "Spectral audio signal processing," available at: https://ccrma.stanford.edu/~jos/sasp/, accessed April 8, 2017.
1 https://goo.gl/9aWhTX
2 https://goo.gl/TzQgsB
Class Name            Samples    Class Name          Samples
Ambience              92         Animals             173
Cartoon               261        Crashes             266
DC                    6          DTMF                26
Drones                75         Emergency Effects   158
Fire and Explosions   106        Foley               702
Foley Footsteps       56         Horror              221
Household             556        Human Elements      506
Impacts               575        Industry            378
Liquid-Water          254        Multichannel        98
Multimedia            1223       Noise               43
Production Elements   1308       Science Fiction     312
Sports                319        Technology          219
Tones                 33         Transportation      460
Underwater            73         Weapons             424
Weather               54

Table 2: Original label classification of the Adobe Sound Effects Dataset. DC are single DC offset component signals. DTMF is Dual Tone Multi Frequency, a set of old telephone tones.

3.2. Feature Extraction

The dataset described in Section 3.1 was used. We used the Essentia Freesound Extractor to extract audio features [24], as Essentia allows for extraction of a large number of audio features, is easy to use in a number of different systems and produces the data in a highly usable format [25]. 180 different audio features were extracted, and all frame based features were calculated using a frame size of 2048 samples with a hop size of 1024, with the exception of pitch based features, which used a frame size of 4096 and a hop size of 2048. The statistics of these audio features were then calculated, to summarise frame based features over the audio file. The statistics used are the mean, variance, skewness, kurtosis, median, mean of the derivative, mean of the second derivative, variance of the derivative, variance of the second derivative, and the maximum and minimum values. This produced a set of 1450 features, extracted from each file. Sets of features were removed if they provided no variance over the dataset, thus reducing the original feature set to 1364 features. All features were then normalised to the range [0, 1].

3.3. Feature Selection

We performed feature selection using a similar method to the one described in [26], where the authors used a Random Forest classifier to determine audio feature importance.

Random forests are a classification technique where a series of decision trees are created, each with a random subset of features. The out-of-bag (OOB) error was then calculated, as a measure of the random forest's classification accuracy. From this, it is possible to allocate each feature a Feature Importance Index (FII), which ranks all audio features in terms of importance by comparing the OOB error for each tree grown with a given feature to the overall OOB error [27].

In [26] the authors eliminated the audio features from a Random Forest that had an FII less than the average FII and then grew a new Random Forest with the reduced audio feature set. This elimination process would repeat until the OOB error for a newly grown Random Forest started to increase. Here, we opted to eliminate the 1% worst performing audio features on each step of growing a Random Forest, similar to but more conservative than the approach in [28]. In order to select the correct set of audio features that fit our dataset, we chose the feature set that provided us with the lowest mean OOB error over all the feature selection iterations.

On each step of the audio feature selection process, we cluster the data using a Gaussian Mixture Model (GMM). GMMs are an unsupervised method for clustering data, on the assumption that data points can be modelled by a Gaussian. In this method, we specify the number of clusters and get a measure of GMM quality using the Akaike Information Criterion (AIC). The AIC is a measure of the relative quality of a statistical model for a given dataset. We keep increasing the number of clusters used to create each GMM, while performing 10-fold cross-validation, until the AIC measure stops improving. This gives us the optimal number of clusters to fit the dataset.

3.4. Hierarchical Clustering

There are two main methods for hierarchical clustering. Agglomerative clustering is a bottom-up approach, where the algorithm starts with singular clusters and recursively merges two or more of the most similar clusters. Divisive clustering is a top-down approach, where the data is recursively separated out into a fixed number of smaller clusters.

Agglomerative clustering was used in this paper, as it is frequently applied to problems within this field [10, 11, 12, 13, 26, 17]. It also provides cophenetic distances between different clusters, so that the relative distances between nodes of the hierarchy are clear. Agglomerative clustering was performed, on the feature-reduced dataset, by assigning each individual sample in the dataset as a cluster. The distance was then calculated for every cluster pair based on Ward's method [29],

    d(ci, cj) = sqrt( 2·n_ci·n_cj / (n_ci + n_cj) ) · euc(x_ci, x_cj)        (1)

where, for clusters ci and cj, x_c is the centroid of a cluster c, n_c is the number of elements in a cluster c and euc(x_ci, x_cj) is the Euclidean distance between the centroids of clusters ci and cj. This introduces a penalty for clusters that are too large, which reduces the chance of a single cluster containing the majority of the dataset and ensures that an even distribution across the hierarchical structure is produced. The distance is calculated for all pairs of clusters, and the two clusters with the minimum distance d are merged into a single cluster. This is performed iteratively until we have a single cluster. This provides us with a full structure of our data, and we can visualise our data from the whole dataset down to each individual component sample.

3.5. Node Semantic Context

In order to interpret the dendrogram produced from the previous step, it is important to have an understanding of what is causing the separation at each of the node points within the dendrogram. Visualising the results of machine learning algorithms is a challenging task. According to [30], decision trees are the only classification method which provides a semantic explanation of the classification. This is because a decision tree facilitates inspection of individual features and threshold values, allowing interpretation of the separation of different clusters. This is not possible with any other classification methods. As such, we undertook feature
selection and then grew a decision tree to provide some semantic meaning to the results.

Each node point can be addressed as a binary classification problem. For each node point, every cluster that falls underneath one side is put into a single cluster, and everything that falls on the other side of the node is placed in another separate cluster. Everything that does not fall underneath the node is ignored. This produces two clusters, which represent the binary selection problem at that node point. From this node point, a random forest is grown to perform the binary classification between the two sets, and feature selection is then performed as described in Section 3.3. The main difference here is that only the five most relevant features, based […]

[…] where i is the class and p(i) is the fraction of objects within class i following the branch. The decision trees are grown using the CART algorithm [31]. To allow for a more meaningful visualisation of the proposed taxonomy, the audio features and values were translated to a semantically meaningful context based on the audio interpretation of each feature. The definitions of the particular audio features were investigated and the authors identified the perceptual context of these features, providing relevant semantic terms in order to describe the classification of sounds at each node point.

4. RESULTS AND EVALUATION

4.1. Feature Extraction Results

Figure 2 plots the mean OOB error for each Random Forest that is grown for each iteration of the audio feature selection process. In total there were 325 iterations of the feature selection process, where the lowest OOB error occurred at iteration 203 with a value of 0.3242. This reduced the number of audio features from 1450 to 193.

Figure 3 depicts the mean OOB error for each Random Forest feature selection iteration against the optimal number of clusters, where the optimal number of clusters was calculated using the AIC for each GMM created. We found the optimal number of clusters to be 9, as this coincides with the minimum mean OOB error in Figure 3.

4.2. Hierarchical Clustering Results

Having applied agglomerative hierarchical clustering to the reduced dataset, the resultant dendrogram can be seen in Figure 4. The dotted line represents the cut-off for depth analysis, chosen based on the result that the optimal choice of clusters is 9.

The results of the pruned decision trees are presented in Figure 5. Each node point identifies the specific audio feature that provides the best split in the data, to create the structure as presented in Figure 4.

4.3. Sound Effects Taxonomy Result

The audio features used for classification were related to their semantic meanings by manual inspection of the audio features used and the feature definitions. This is presented in Figure 6. As can be seen, the two key factors that make a difference to the clustering are periodicity and dynamic range.

Periodicity is calculated as the relative weight of the tallest peak in the beat histogram. Therefore strongly periodic signals have a much higher relative peak weight than random signals, which we expect to have near-flat beat histograms. Dynamic range is represented by the ratio of analysis frames under 60 dB to the number over 60 dB, as all audio samples were loudness normalised and all leading and trailing silence was removed, as discussed in Section 3.2. Further down the taxonomy, it is clear that periodicity stands out as a key factor in many different locations, along with the metric structure of periodicity, calculated as the weight of the second most prominent peak in the beat histogram. Structured music with beats and bars will have a high metrical structure, whereas single impulse beats or ticks will have a high beat histogram at one point but the rest of the histogram should look flat.

4.4. Evaluation

To evaluate the results of the produced sound effect taxonomy, as presented in Figure 6, the generated taxonomy was compared to the original sound effect library classification scheme, as presented in Section 3.1. The purpose of this is to produce a better understanding of the resulting classifications and how they compare to more traditional sound effects library classifications. It is not expected that our clusters will exactly reproduce the pre-existing data clusters, but the comparison may give us a better insight into the representation of the data.

Each of the 9 clusters identified in Figures 5 and 6 was evaluated by comparing the original classification labels found in Table 2 to the new classification structure. This is presented in Figure 7, where each cluster has a pie chart representing the distribution of original labels from the dataset. Only labels that make up more than 5% of the dataset were plotted.
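The agglomerative step of Sec. 3.4 and the dendrogram cut used for these results can be sketched with SciPy. The toy blobs and the three-cluster cut below are illustrative stand-ins for the real feature-reduced dataset and the nine-cluster cut:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy stand-in for the feature-reduced dataset: three separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(40, 2))
               for c in ([0.0, 0.0], [5.0, 0.0], [0.0, 5.0])])

# Ward linkage merges, at each step, the cluster pair whose fusion least
# increases within-cluster variance -- the penalised centroid distance of Eq. (1)
Z = linkage(X, method="ward")

# Cut the dendrogram at a fixed number of clusters; the paper cuts at 9
# (chosen via the AIC), 3 here for the toy data
labels = fcluster(Z, t=3, criterion="maxclust")
print(sorted(set(labels)))
```

`Z` holds the full merge history, so the same linkage can be re-cut at any depth or rendered as a dendrogram without re-clustering.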
Figure 3: Mean OOB Error for each Random Forest grown plotted against optimal number of clusters for each feature selection iteration
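The OOB (out-of-bag) error plotted in Figure 3 is the error a random forest measures on the samples each tree leaves out of its bootstrap. A minimal scikit-learn sketch on synthetic data (not the paper's features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic labelled data standing in for (audio features, cluster labels).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob_error = 1.0 - rf.oob_score_  # the quantity plotted in Figure 3
```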
DAFX-5
DAFx-432
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Figure 5: Machine learned structure of the sound effects library, where clusters are hierarchical clusters. The single audio feature contributing to the separation is used as the node point, with normalised audio feature values down each branch to understand the impact the audio feature has on the sound classification. The ∗ marks a feature separation where the classification accuracy is below 80% (but never below 75%).
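The pipeline behind Figures 4 and 5 (Ward agglomerative clustering, a dendrogram cut at a fixed cluster count, then a shallow decision tree exposing the feature behind each split) can be sketched as follows. The data and the choice of 3 clusters are illustrative, not the paper's 9.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic feature matrix with 3 planted clusters.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 4)) for c in (0.0, 3.0, 6.0)])

Z = linkage(X, method="ward")                     # agglomerative dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")   # cut dendrogram into 3 clusters

# Shallow tree on the cluster labels: each node names the audio feature
# giving the best split, mirroring the node points of Figure 5.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
root_feature = tree.tree_.feature[0]              # feature index used at the root
```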
is also caused by the analysis and evaluation of a single produced sound effect library. This artefact may be avoided with a large range of sound effects from different sources.

Section 4.4 shows that the current classification system for sound effects may not be ideal, especially since expert sound designers often know what sonic attributes they wish to obtain. This is one of the reasons that audio search tools have become so prominent, yet many audio search tools only work using tag metadata and not the sonic attributes of the audio files.

Our produced taxonomy is very different from current work. As presented in Section 2, most literature bases a taxonomy on either audio source, environmental context or subjective ratings.

6. CONCLUSION

Given a commercial sound effect library, a taxonomy of sound effects has been learned using unsupervised learning techniques. At the first level, a hierarchical structure of the data was extracted and presented in Figure 4. Following from this, a decision tree was created and pruned, to allow for visualisation of the data, as in Figure 5. Finally, a semantically relevant context was applied to the data, to produce a meaningful taxonomy of sound effects, which is presented in Figure 6. A semantic relationship between different sonic clusters was identified.

The hierarchical clusters of the data provide a deeper understanding of the separating attributes of sound effects, and give us an insight into relevant audio features for sound effect classification. We demonstrated the importance of the periodicity, dynamic range and spectral features for classification. It should be noted that although the entire classification was performed in an unsupervised manner, there is still a perceptual relevance to the results, and there is a level of intuition provided by the decision tree and our semantic descriptors. Furthermore, the clustering and structure will be heavily reliant on the sound effects library used.

We also demonstrated that current sound effect classifications and taxonomies may not be ideal for their purpose. They are both non-standard and often place sonically similar sounds in very different categories, potentially making it challenging for a sound designer to find an appropriate sound. We have proposed a direction for producing new sound effect taxonomies based purely on the sonic content of the samples, rather than source or context metadata.

In future work, validation of the results on larger sound effect datasets could aid in evaluation. By using the hierarchical clustering method, one can also produce a cophenetic distance between two samples. This would allow identification of how the distance correlates with perceived similarity and may provide some interesting and insightful results. Further development of the evaluation and validation of the results, perhaps through perceptual listening tests, would be beneficial to this field of research. It is also possible to look at the applications of hierarchical clustering to other types of musical sounds, such as musical instrument classification. Hierarchical clustering is able to provide more information and context than many other unsupervised clustering methods. Further evaluation of the clusters produced could be undertaken, as well as a deeper analysis into each of the identified clusters, to produce a deeper taxonomy.

Figure 6: Machine learned taxonomy, where each node separation point is determined by hierarchical clustering and the text within each node is a semantic interpretation of the most contributing audio feature for classification. Each final cluster is given a cluster number and a brief semantic description. The ∗ marks a feature separation where the classification accuracy is below 80% (but never below 75%).

Figure 7: Dataset labels per cluster, where all labels that make up more than 5% of the dataset were plotted

7. ACKNOWLEDGEMENTS

This work was supported by EPSRC. Many thanks to the anonymous internal reviewers.
ABSTRACT
Research on perception of music production practices is mainly
concerned with the emulation of sound engineering tasks through
lab-based experiments and custom software, sometimes with un-
skilled subjects. This can improve the level of control, but the va-
lidity, transferability, and relevance of the results may suffer from
this artificial context. This paper presents a dataset consisting of
mixes gathered in a real-life, ecologically valid setting, and per-
ceptual evaluation thereof, which can be used to expand knowl-
edge on the mixing process. With 180 mixes including parameter
settings, close to 5000 preference ratings and free-form descrip-
tions, and a diverse range of contributors from five different coun-
tries, the data offers many opportunities for music production anal-
ysis, some of which are explored here. In particular, more experi-
enced subjects were found to be more negative and more specific
in their assessments of mixes, and to increasingly agree with each
other.
1. INTRODUCTION

Many types of audio and music research rely on multitrack audio for analysis, training and testing of models, or demonstration of algorithms. For instance, music production analysis [1], automatic mixing [2], audio effect interface design [3], instrument grouping [4], and various types of music information retrieval [5] all require or could benefit from a large number of raw tracks, mixes, and processing parameters. This kind of data is also useful for budding mix engineers, audio educators, and developers, as well as creative professionals in need of accompanying music or other audio where some tracks can be disabled [6].

Despite this, multitrack audio is scarce. Existing online resources of multitrack audio content typically have a relatively low number of songs, offer little variation, are restricted due to copyright, provide little to no metadata, or lack mixed versions and corresponding parameter settings. An important obstacle to the widespread availability of multitrack audio and mixes is copyright, which restricts the free sharing of most music and their components. Furthermore, due to reluctance to expose the unpolished material, content owners are unlikely to share source content, parameter settings, or alternative versions of their music. While there is no shortage of mono and stereo recordings of single instruments and ensembles, any work concerned with the study or processing of multitrack audio therefore suffers from a severe lack of relevant material.

This impedes reproduction or improvement of previous studies where the data cannot be made public, and comparison of different works when there is no common dataset used across a wider community. It further limits the generality, relevance, and quality of the research and the designed systems. Even when some mixes are available, extracting data from mix sessions is laborious at best. For this reason, existing research typically employs lab-based mix simulations, which means that its relation to professional mixing practices is uncertain.

The dataset presented here is therefore based on a series of controlled experiments wherein realistic, ecologically valid mixes are produced — i.e. by experienced engineers, in their preferred environment and using professional tools — and evaluated. The sessions can be recreated so that any feature or parameter can be extracted for later analysis, and different mixes of the same songs are compared through listening tests to assess the importance and impact of their attributes. As such, both high-level information, including instrument labels and subjective assessments, and low-level measures can be taken into account. While some of the data presented here has been used in several previous studies, the dataset is now consolidated and opened up to the community, and can be browsed on c4dm.eecs.qmul.ac.uk/multitrack/MixEvaluation/, see Figure 1.

Figure 1: Online interface to browse the contents of the Mix Evaluation Dataset

With close to 5000 mix evaluations, the dataset is by far the largest study of evaluated mixes known to the authors. MedleyDB, another resource shared with researchers on request, consists of raw tracks including pitch annotations, instrument activations, and metadata [7]. [8] analyses audio features extracted from a total of 1501 unevaluated mixes from 10 different songs. The same authors examine audio features extracted from 101 mixes of the same song, evaluated by one person who classified the mixes in five preference categories [9]. In both cases, the mixes were created by anonymous visitors of the Mixing Secrets Free Multitrack Download Library [10], and principal component analysis preceded by outlier detection was employed to establish primary dimensions of variation. Parameter settings or individual processed stems were not available in these works.
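The analysis pipeline attributed to [8] and [9] above (outlier detection followed by PCA to find primary dimensions of variation) might look as follows; the z-score outlier rule and the synthetic per-mix feature matrix are assumptions for illustration, not those works' exact method.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
feats = rng.normal(size=(101, 8))   # stand-in for per-mix audio features
feats[0] += 25.0                    # one gross outlier mix

# Simple z-score outlier removal before PCA (assumed rule, threshold 4).
z = np.abs((feats - feats.mean(axis=0)) / feats.std(axis=0))
kept = feats[(z < 4).all(axis=1)]

# Primary dimensions of variation across the remaining mixes.
dims = PCA(n_components=2).fit(kept)
```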
This paper introduces the dataset, shows how it can further our understanding of sound engineering, and is structured as follows. Section 2 presents the series of acquisition experiments wherein mixes were created and evaluated. The resulting data is described in Section 3. Section 4 then demonstrates how this content can be used to efficiently obtain knowledge about music production practices, perception, and preferences. To this end, previous studies and key results based on part of this dataset are listed, and new findings about the influence of subject expertise based on the complete set are presented. Finally, Section 5 offers concluding remarks and suggestions for future research.

2. METHODOLOGY

2.1. Mix creation

Mix experiments and listening tests were conducted at seven institutions located in five different countries. The mix process was maximally preserved in the interest of ecological relevance, while information such as parameter settings was logged as much as possible. Perceptual evaluation further helped validate the content and investigate the perception and preference in relation to mix practices.

Students and staff members from sound engineering degrees at McGill University (McG), Dalarna University (DU), PXL University College (PXL), and Universidade Católica Portuguesa (UCP) created mixes and participated in listening tests. In addition, employees from a music production startup (MG) and researchers from Queen Mary University of London (QM) and students from the institution’s Sound and Music Computing master (SMC) took part in the perceptual evaluation stage as well.

Table 1 lists the songs and the corresponding number of mixes created from the source material, as well as the number of subjects evaluating (a number of) these mixes. Numbers between parentheses refer to additional mixes for which stems, Digital Audio Workstation (DAW) sessions, and parameter settings are not available. These correspond to the original release or analogue mixes, see Section 3.2. Songs with an asterisk (*) are copyrighted and not available online, whereas raw tracks to others can be found via the Open Multitrack Testbed1 [11]. For two songs, permission to disclose artist and song title was not granted. Evaluations with an obelus (†) indicate that subjects included those who produced the mixes. Consistent anonymous identifiers of the participants (e.g. ‘McG-A’) allow exclusion of this segment or examination of the associated biases [12].

The participants produced these mixes in their preferred mixing location, so as to achieve a natural and representative spread of environments without a bias imposed by a specific acoustic space, reproduction system, or playback level. The toolset was restricted somewhat so that each mix could be faithfully recalled and analysed in depth later, with a limited number of software plugins available, typically consisting of those which come with the respective DAWs. All students used Avid Pro Tools 10, an industry standard DAW, except for the PXL group who used Apple Logic Pro X. Instructions explicitly forbade outboard processing, recording new audio, sample replacement, pitch and timing correction, rearranging sections, or manipulating audio in an external editor. Beyond this, any kind of processing was allowed, including automation, subgrouping, and multing.

2.2. Perceptual evaluation

The different mixes were evaluated in a listening test using the interface presented in [13]. With the exception of groups McG and MG, the browser-based version of this interface from the Web Audio Evaluation Tool [14] was used, see Figure 2.

Figure 2: Example interface used to acquire subjective assessments of nine different mixes of the same song, created with the Web Audio Evaluation Tool

As dictated by common practices, this listening test was conducted in a double blind fashion [15], with randomised presentation order [16], minimal visual information [17], and free and immediate switching between time-aligned stimuli [18]. The interface presented multiple stimuli [19] on a single, ‘drag-and-drop’ rating axis [20], and without ticks to avoid build-up around these marks [21]. A ‘reference’ was not provided because it is not defined for this exceedingly subjective task. Indeed, even commercial mixes by renowned mix engineers prove not to be appropriate reference stimuli, as these are not necessarily rated more highly than mixes by students [12].

For the purpose of perceptual evaluation, a fragment consisting of the second verse and chorus was used. With an average length of one minute, this reduced the strain on the subjects’ attention, likely leading to more reliable listening test results. It also placed the focus on a region of the song where the most musical elements were active. In particular, the elements which all songs have in common (drums, lead vocal, and a bass instrument) were all active here. A fade-in and fade-out of one second were applied at the start and end of the fragment [1].

The headphones used were Beyerdynamic DT770 PRO for group MG and Audio Technica M50x for group SMC. In all other cases, the listening tests took place in dedicated, high quality listening rooms at the respective institutions, the room impulse responses of which are included in the dataset. This knowledge could be used to estimate the impact of the respective playback systems, although in these cases the groups differ significantly in other aspects as well.

Comments in languages other than English (DU, PXL, and UCP) were translated by native speakers of the respective languages, who are also proficient in English and have a good knowledge of audio engineering.
Table 1: Overview of mixed content, with number of mixes (left side) and number of subjects (right side) per song
Table 2 shows additional details of the perceptual evaluation experiments.

[Figure: preference ratings of mixes A–I over time (0–300 seconds), on a scale from Bad to Excellent]

3. CONTENT

Raw tracks of the mixes can be found via the Open Multitrack Testbed1. The first two sections of Table 1 are newly presented here and were recorded by Grammy-winning engineers. Six of the ten songs are made available in their entirety under a Creative Commons BY 4.0 license. Raw tracks to songs from the last two sections of the table can be downloaded from Weathervane Mu-
3.4. Comments

A single numerical rating does not convey any detailed information about what aspects of a mix are (dis)liked. For instance, a higher score for mixes with a higher dynamic range [12] may relate to subtle use of dynamic range compression (e.g. preference for substantial level variations), but also to a relatively loud transient source (e.g. preference for prominent snare drums). In addition, the probability of spurious correlation increases as an ever larger number of features is considered. Furthermore, subjects tend to be frustrated when they do not have the ability to express their thoughts on a particular attribute, and isolated ratings do not provide any information about the difficulty, focus, or thought process associated with the evaluation task.

For this reason, free-form text response in the form of comment boxes is accommodated, facilitating in-depth analysis of the perception and preference with regard to various music production aspects. The results of this ‘free-choice profiling’ also allow learning how subjects used and misused the interface. An additional, practical reason for allowing subjects to write comments is that taking notes on shortcomings or strengths of the different mixes helps them to keep track of which fragment is which, facilitating the complex task at hand.

Extensive annotation of the comments is included in the form of tags associated with each atomic ‘statement’ of which the comment consists. For instance, a comment ‘Drums a little distant. Vox a little hot. Lower midrange feels a little hollow, otherwise pretty good.’ comprises four separate statements. Tags then indicate the instrument (‘drums’, ‘vocal’, ...), feature (‘level (high)’, ‘spectrum’, ...), and valence (‘negative’/‘positive’).

The XML structure of the Web Audio Evaluation Tool output files was adopted to share preference ratings, comments, and annotation data associated with the content.

3.5. Metadata

3.5.1. Track labels

Reliable track metadata can serve as a ground truth that is necessary for applications such as instrument identification, where the algorithm’s output needs to be compared to the actual instrument. Providing this data makes this dataset an attractive resource for training or testing such algorithms, as it obviates the need for manual annotation of the audio, which can be particularly tedious if the number of files becomes large.

The available raw tracks and mixes are annotated on the Open Multitrack Testbed1, including metadata describing for instance the respective instruments, microphones, and take numbers. This metadata further allows tracks and mixes to be found through the Testbed’s search and browse interfaces.

3.5.2. Genre

The source material was selected in coordination with the programme’s teachers from the participating institutions, because the songs fit the educational goals, were considered ecologically valid and homogeneous with regard to production quality, and were deemed to represent an adequate spread of genre. Due to the subjective nature of musical genre, a group of subjects was asked to comment on the genres of the songs during the evaluation experiments, providing a post hoc confirmation of the musical diversity. Each song’s most often occurring genre label was added to Table 1 for reference.

3.6. Survey responses and subject demographics

The listening test included a survey to establish the subjects’ gender, age, experience with audio engineering and playing a musical instrument (in number of years and described in more detail), whether they had previously participated in (non-medical) listening tests, and whether they had a cold or condition which could negatively affect their hearing.

4. ANALYSIS

4.1. Prior work

The McG portion of this dataset has previously been used in studies on mix practices and perception, as detailed below.

The mild constraints on the tools used and the availability of parameter settings allow one to compare signal features between different mixes, songs, or institutions, and identify trends. A detailed analysis of tendencies in a wide range of audio features — extracted from vocal, drum (kick drum, snare drum, and other), bass, and mix stems — appeared in [27]. As an example, Figure 4 shows the ITU-R BS.1770 loudness [28] of several processed stems for two songs, as mixed by engineers from two institutions (McG and UCP). No significant differences in balance choices are apparent here.

Correlation between preference ratings and audio features extracted from the total mixes has shown a higher preference for mixes with a relatively higher dynamic range, and for mixes with a relatively strong phantom centre [12].

In [29], relative attention to each of these categories was quantified based on annotated comments. Figure 5 shows the relative proportion of statements referring to detailed feature categories for the complete dataset (all groups).
Figure 4: Stem loudness relative to the total mix of Fredy V’s In The Meantime and The DoneFors’ Lead Me, as mixed by students from the McG and UCP groups

(a) Proportion of negative (red) vs. positive (green) statements

[Figure 5, panel (b): statement proportions per feature category; legend entries include ‘Level’, ‘Reverb’, ‘Other’, shown per group (DU, MG, SMC, QM, PXL, McG, UCP) and per expertise level (Am, Student, Pro)]
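Figure 4's quantity is each stem's loudness relative to the full mix. A faithful measurement would use ITU-R BS.1770 (K-weighting and gating); the sketch below substitutes a plain RMS level in dB to show only the relative-level arithmetic, which is an assumption, not the paper's exact measurement.

```python
import numpy as np

def rms_db(x):
    """RMS level in dB (a simplified stand-in for BS.1770 integrated loudness)."""
    return 20.0 * np.log10(np.sqrt(np.mean(np.square(x))))

def relative_level(stem, mix):
    """Stem level relative to the total mix, in dB (an LU-like difference)."""
    return rms_db(stem) - rms_db(mix)

# A stem at half the mix's amplitude sits about 6 dB below it.
t = np.linspace(0.0, 1.0, 48000)
mix = np.sin(2 * np.pi * 220 * t)
stem = 0.5 * mix
```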
the expectation that they are trained to spot and articulate problems with a mix. Conversely, one could suppose amateur subjects lack the vocabulary or previous experience to formulate detailed comments about unfavourable aspects, instead highlighting features that tastefully grab attention and stand out in a positive sense.

The dataset and potential extensions offer interesting opportunities for further cross-analysis, comparing the practices, perception, and preferences of different groups. At this point, however, the dataset is heavily skewed towards Western musical genres, engineers, and subjects, and experienced music producers. Extension of the acquisition experiments presented here, with an emphasis on content from countries outside of North America and Western Europe, can mitigate this bias and help answer new research questions. In addition, a substantially larger dataset can be useful for analysis which requires high volumes of data, such as machine learning of music production practices [31].

Figure 7: Level of agreement between groups of different expertise

            Amateur  Student  Pro
  Amateur    .767     .775    .801
  Student    .775     .800    .865
  Pro        .801     .865    .885
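The agreement values in Figure 7 are ratios of agreeing to total comparable statement pairs, as defined by Eq. (1). A minimal sketch, assuming a hypothetical (topic, valence) format for each statement:

```python
def agreement(stmts_a, stmts_b):
    """r = a / (d + a): share of cross-group statement pairs on the same
    topic that agree in valence. The (topic, valence) tuple format is an
    assumption for illustration, not the dataset's actual schema."""
    agree = disagree = 0
    for topic_a, val_a in stmts_a:
        for topic_b, val_b in stmts_b:
            if topic_a != topic_b:
                continue  # only pairs about the same topic are comparable
            if val_a == val_b:
                agree += 1
            else:
                disagree += 1
    return agree / (agree + disagree)

# One group statement agrees with one of the other group's two: r = 0.5.
r = agreement([("vocal level", "negative")],
              [("vocal level", "negative"), ("vocal level", "positive")])
```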
r_AB = a_AB / (d_AB + a_AB)    (1)

where a_AB and d_AB are the total number of agreeing and disagreeing pairs of statements, respectively, where a pair of statements consists of a statement from group A and a statement from group B on the same topic.

Between and within the different levels of expertise, agreement increases consistently from amateurs over students to professionals, see Figure 7. In other words, skilled engineers are less likely to contradict each other when evaluating mixes (high within-group agreement), converging on a ‘common view’. This supports the notion that universal ‘best practices’ with regard to mix engineering do exist, and that perceptual evaluation is more reliable and efficient when the participants are skilled. Conversely, amateur listeners tend to make more statements which are challenged by other amateurs, as well as by more experienced subjects (low within-group and between-group agreement). As the within-group agreement of amateurs is lower than any between-group agreement, this result does not indicate any consistent differences of opinion between two groups. For instance, there is no evidence that ‘amateurs want the vocal considerably louder than others’. Such distinctions may exist, but revealing them requires in-depth analysis of the individual statement categories. The type of agreement analysis proposed here can be instrumental in comparing the quality of (groups of) subjects, on the condition that the evaluated stimuli are largely the same.

Further analysis is necessary, for instance to establish which of these differences are significant and not spurious or under the influence of other factors.

5. CONCLUSION

A dataset of mixes of multitrack music and their evaluations was constructed and released, based on contributions from five different countries. The mixes were created by skilled sound engineers, in a realistic setting, and using professional tools, to minimally disrupt the natural mixing process. By including parameter settings and limiting the set of software tools, mixes can be recreated and analysed in detail. Previous studies using this data were listed, and further statistical analysis of the data was presented.

6. ACKNOWLEDGEMENTS

The authors would like to thank Frank Duchêne, Pedro Pestana, Henrik Karlsson, Johan Nordin, Mathieu Barthet, Brett Leonard, Matthew Boerum, Richard King, and George Massenburg for facilitating and contributing to the experiments. Special thanks also go to all who participated in the mix creation sessions or listening tests.

7. REFERENCES

[1] Alex Wilson and Bruno M. Fazenda, “Perception of audio quality in productions of popular music,” Journal of the Audio Engineering Society, vol. 64, no. 1/2, pp. 23–34, Jan/Feb 2016.
[2] Enrique Perez Gonzalez and Joshua D. Reiss, “Automatic mixing: Live downmixing stereo panner,” Proc. Digital Audio Effects (DAFx-07), Sep 2007.
[3] Spyridon Stasis, Ryan Stables, and Jason Hockman, “A model for adaptive reduced-dimensionality equalisation,” Proc. Digital Audio Effects (DAFx-15), Dec 2015.
[4] David Ronan et al., “Automatic subgrouping of multitrack audio,” Proc. Digital Audio Effects (DAFx-15), Nov 2015.
[5] Justin Salamon, “Pitch analysis for active music discovery,” 33rd Int. Conf. on Machine Learning, June 2016.
[6] Chris Greenhalgh et al., “GeoTracks: Adaptive music for everyday journeys,” ACM Int. Conf. on Multimedia, Oct 2016.
[7] Rachel Bittner et al., “MedleyDB: A multitrack dataset for annotation-intensive MIR research,” 15th International Society for Music Information Retrieval Conf. (ISMIR 2014), Oct 2014.
[8] Alex Wilson and Bruno Fazenda, “Variation in multitrack mixes: Analysis of low-level audio signal features,” Journal of the Audio Engineering Society, vol. 64, no. 7/8, pp. 466–473, Jul/Aug 2016.
[9] Alex Wilson and Bruno Fazenda, “101 mixes: A statistical analysis of mix-variation in a dataset of multi-track music mixes,” Audio Engineering Society Convention 139, Oct
In particular, it was shown that expert listeners are more likely 2015.
to contribute negative and specific assessments, and to agree with [10] Mike Senior, Mixing Secrets, Taylor & Francis, www.
others about various aspects of the mix. This is consistent with cambridge-mt.com/ms-mtk.htm, 2012.
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
[11] Brecht De Man et al., "The Open Multitrack Testbed," Audio Engineering Society Convention 137, Oct 2014.

[12] Brecht De Man et al., "Perceptual evaluation of music mixing practices," Audio Engineering Society Convention 138, May 2015.

[13] Brecht De Man and Joshua D. Reiss, "APE: Audio Perceptual Evaluation toolbox for MATLAB," Audio Engineering Society Convention 136, Apr 2014.

[14] Nicholas Jillings et al., "Web Audio Evaluation Tool: A browser-based listening test environment," 12th Sound and Music Computing Conf., July 2015.

[15] Floyd E. Toole and Sean Olive, "Hearing is believing vs. believing is hearing: Blind vs. sighted listening tests, and other interesting things," Audio Engineering Society Convention 97, Nov 1994.

[16] S. Bech and N. Zacharov, Perceptual Audio Evaluation - Theory, Method and Application, John Wiley & Sons, 2007.

[17] Jan Berg and Francis Rumsey, "Validity of selected spatial attributes in the evaluation of 5-channel microphone techniques," Audio Engineering Society Convention 112, Apr 2002.

[18] Neofytos Kaplanis et al., "A rapid sensory analysis method for perceptual assessment of automotive audio," Journal of the Audio Engineering Society, vol. 65, no. 1/2, pp. 130–146, Jan/Feb 2017.

[19] Sean Olive and Todd Welti, "The relationship between perception and measurement of headphone sound quality," Audio Engineering Society Convention 133, Oct 2012.

[20] Christoph Völker and Rainer Huber, "Adaptions for the MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA) for elder and technical unexperienced participants," DAGA, Mar 2015.

[21] Jouni Paulus, Christian Uhle, and Jürgen Herre, "Perceived level of late reverberation in speech and music," Audio Engineering Society Convention 130, May 2011.

[22] Brecht De Man, Kirk McNally, and Joshua D. Reiss, "Perceptual evaluation and analysis of reverberation in multitrack music production," Journal of the Audio Engineering Society, vol. 65, no. 1/2, pp. 108–116, Jan/Feb 2017.

[23] Ute Jekosch, "Basic concepts and terms of 'quality', reconsidered in the context of product-sound quality," Acta Acustica united with Acustica, vol. 90, no. 6, pp. 999–1006, Nov/Dec 2004.

[24] Slawomir Zielinski, "On some biases encountered in modern listening tests," Spatial Audio & Sensory Evaluation Techniques, Apr 2006.

[25] Francis Rumsey, "New horizons in listening test design," Journal of the Audio Engineering Society, vol. 52, pp. 65–73, Jan/Feb 2004.

[26] Stanley P. Lipshitz and John Vanderkooy, "The Great Debate: Subjective evaluation," Journal of the Audio Engineering Society, vol. 29, no. 7/8, pp. 482–491, Jul/Aug 1981.

[27] Brecht De Man et al., "An analysis and evaluation of audio features for multitrack music mixtures," 15th International Society for Music Information Retrieval Conf. (ISMIR 2014), Oct 2014.

[28] Recommendation ITU-R BS.1770-4, "Algorithms to measure audio programme loudness and true-peak audio level," Oct 2015.

[29] Brecht De Man and Joshua D. Reiss, "Analysis of peer reviews in music production," Journal of the Art of Record Production, vol. 10, July 2015.

[30] David Ronan et al., "The impact of subgrouping practices on the perception of multitrack mixes," Audio Engineering Society Convention 139, Oct 2015.

[31] Stylianos Ioannis Mimilakis et al., "New sonorities for jazz recordings: Separation and mixing using deep neural networks," 2nd AES Workshop on Intelligent Music Production, Sep 2016.
that it is increased by a unit value at each rising octave, as

    ρ(f) = 1 + log2(f / f⁻).    (2)

The goal was to bring together standard tools of signal processing (Fourier analysis [10], but also constant-Q [11], wavelet analysis [12], etc.) and a natural musical representation of frequencies, in order to explore its practical applications in a real-time context. A first real-time tool was built, based on a Fourier analysis (see figure 1), and tested.

Figure 1: Basic representation of the spectrum on a spiral abacus (figure 3.18 extracted from [13]): the signal is composed of a collection of pure sines, analyzed with a Hanning window (duration 50 ms). The line thickness is proportional to the magnitude (dB scale on a 100 dB range); the color corresponds to the phase (in its demodulated version, to slow down the color time-variation without information loss, and with a circular colormap to avoid jumps between 2π rad and 0 rad).

The principle can be illustrated by the following analogy: the idea is similar to applying a stroboscope to the rotating phase of each bin, at the bin frequency (phase demodulation), and selecting the magnitude of the sufficiently slow rotations. This approach has several benefits for the visualizer application:

• the targeted accuracy is consistent with musical applications and can be adjusted independently of the analysis window duration;

• it is robust in noisy environments, since the less stationary a component is, the more strongly it is rejected, so that the output stays clean;

• in particular, for tuning tasks, a sustained long note played by an instrumentalist can be significantly enhanced compared to fast notes (played by some neighbours before a repetition) by using a very selective threshold (1 Hz), while the tuning accuracy remains very high.

3. SCIENTIFIC PRINCIPLE

3.1. Analyzer

The analyzer takes as input a monophonic sound, sampled at a frequency Fs. It delivers four outputs to be used in the visualizer: (1) a tuned frequency grid (Freq_v) and, for each frame n, (2) the associated spectrum magnitudes (Amp_v), (3) the demodulated phases (PhiDem_v), and (4) a "phase constancy" index (PhiCstcy_v). Its structure, described in figure 2, is decomposed into 7 basic blocks, labelled (A0) to (A6).

Blocks (A0-A1) are composed of a gain and a Short-Time Fourier Transform with a standard window (Hanning, Blackman, etc.) of duration T (typically about 50 ms) and overlapped frames. The initial time of frame n is t_n = nτ (n ≥ 0), where τ < T is the time step.

Blocks (A2-A3) process interpolations. A frequency grid (block A2) with frequencies

    Freq_v(k) = f* · 2^((m_k − m*)/12)
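The octave coordinate of eq. (2) and the tuned semitone grid above can be sketched as follows (a toy illustration; the reference values f* = 440 Hz and f⁻ = 27.5 Hz are assumptions chosen for this example, not values from the paper):

```python
import math

F_REF = 440.0        # assumed reference frequency f* (A4); illustrative choice
F_MIN = 27.5         # assumed lowest displayed frequency f^- (A0)

def rho(f):
    """Radial octave coordinate of eq. (2): increases by 1 per rising octave."""
    return 1.0 + math.log2(f / F_MIN)

def tuned_grid(m_low=-24, m_high=24, m_ref=0):
    """Equal-tempered grid Freq_v(k) = f* . 2^((m_k - m*)/12), one value
    per semitone index m in [m_low, m_high] relative to the reference m*."""
    return [F_REF * 2 ** ((m - m_ref) / 12) for m in range(m_low, m_high + 1)]

print(rho(F_MIN))       # 1.0 (lowest frequency sits on the first turn)
print(rho(2 * F_MIN))   # 2.0 (one octave up -> one more turn of the spiral)
print(tuned_grid()[36]) # 880.0 (semitone index m = 12, one octave above f*)
```

Frequencies sharing a pitch class then share an angle on the spiral, while rho(f) selects the turn, which is exactly what makes the spiral abacus a natural musical representation.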
Figure 4: C major chord (Fender-Rhodes piano): the line thickness depends on the loudness but is not corrected by the phase constancy index; the color saturation corresponds to the phase constancy index. The grey parts are those to reject.

Figure 5: C major chord (Fender-Rhodes piano): the line thickness depends on the loudness and is corrected by the phase constancy index. The grey parts of figure 4 are rejected.
Figure 7: Sonogram in its standard FFT mode: the brightness is related to the energy (loudness scale). In this display mode, the central parts of the thick lines are colored according to the frequency-precision refinement based on the demodulated phases.

Figure 9: The Snail tuner view, showing the rectified analysis region with the hexagonal shape on top of it. The Music mode analysis is "on", so the frequency lobe is not at its sharpest, indicating a more relaxed analysis.
5. CONCLUSIONS
Figure 11: Overviews of the Snail processes: (a) Analysis process, (b) Display process, (c) Communication process.
[9] Elaine Chew, Towards a Mathematical Model for Tonality, Ph.D. thesis, Massachusetts Institute of Technology, MA, USA, 2000.

[10] L. Cohen, Time-Frequency Analysis, Prentice-Hall, New York, 1995, ISBN 978-0135945322.

[11] J.C. Brown and M.S. Puckette, "An efficient algorithm for the calculation of a constant Q transform," JASA, vol. 92, no. 5, pp. 2698–2701, 1992.

[12] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 3rd edition, 2008.

[13] T. Hélie, Modélisation physique d'instruments de musique et de la voix: systèmes dynamiques, problèmes directs et inverses, Habilitation à diriger des recherches, Université Pierre et Marie Curie, 2013.

[14] F. Auger and P. Flandrin, "Improving the readability of time-frequency and time-scale representations by the reassignment method," IEEE Transactions on Signal Processing, vol. 43, no. 5, pp. 1068–1089, 1995, doi:10.1109/78.382394.

[15] P. Flandrin, F. Auger, and E. Chassande-Mottin, Time-frequency reassignment: From principles to algorithms, chapter 5, pp. 179–203, Applications in Time-Frequency Signal Processing, CRC Press, 2003.

[16] S. Marchand, "Improving Spectral Analysis Precision with an Enhanced Phase Vocoder using Signal Derivatives," in Digital Audio Effects (DAFx), Barcelona, Spain, 1998, pp. 114–118.

[17] B. Hamilton and P. Depalle, "A Unified View of Non-Stationary Sinusoidal Parameter Estimation Methods Using Signal Derivatives," in IEEE ICASSP, Kyoto, Japan, 2012, doi:10.1109/ICASSP.2012.6287893.

[18] B. Hamilton, P. Depalle, and S. Marchand, "Theoretical and Practical Comparisons of the Reassignment Method and the Derivative Method for the Estimation of the Frequency Slope," in IEEE WASPAA, New Paltz, New York, USA, 2009, pp. 345–348.

[19] T. Hélie (inventor) and Centre National de la Recherche Scientifique (assignee), "Procédé de traitement de données acoustiques correspondant à un signal enregistré," French Patent App. FR 3 022 391 A1, 2015 Dec 18 (and Int. Patent App. WO 2015/189419 A1, 2015 Dec 17).

[20] Website, "MIDI Association," https://www.midi.org/specifications/item/table-1-summary-of-midi-message.

[21] Technical Committee ISO/TC 43-Acoustics, "ISO 226:2003, Acoustics - Normal equal-loudness-level contours," 2003-08.

[22] Website, "Upmitt," https://www.behance.net/upmitt.

[23] Website, "JUCE," https://www.juce.com.

[24] Website, "OpenGL," https://www.opengl.org.

[25] Website, "OpenFrameworks," http://openframeworks.cc.

[26] Website, "iOSaddon," http://ofxaddons.com/categories/2-ios.

[27] Website, "VST (Steinberg)," https://www.steinberg.net/.

[28] Website, "AudioUnit," https://developer.apple.com/reference/audiounit.

[29] Website, "AAX," http://apps.avid.com/aax-portal/.

[30] Website, "iOS10," http://www.apple.com/ios/ios-10/.

[31] T. Hélie and C. Picasso, "The snail: a new way to analyze and visualize sounds," in Training School on "Acoustics for violin makers", COST Action FP1302, ITEMM, Le Mans, France, 2017.

[32] T. Hélie, C. Picasso, and A. Calvet, "The Snail: un nouveau procédé d'analyse et de visualisation du son," Pianistik, magazine d'Europiano France, vol. 104, pp. 6–16, 2016.

[33] Website, "Scala Home Page," http://www.huygens-fokker.org/scala/.
ABSTRACT

Loopers are becoming more and more popular due to their growing features and capabilities, not only in live performances but also as a rehearsal tool. These effect units record a phrase and play it back in a loop. The start and stop positions of the recording are typically given by the player's start and stop taps on a foot switch. However, if these cues are not entered precisely in time, an annoying, audible gap may occur between the repetitions of the phrase. We propose an algorithm that analyzes the recorded phrase and aligns the start and stop positions in order to remove audible gaps. Efficiency, accuracy and robustness are achieved by including the phase information of the onset detection function's STFT within the beat estimation process. Moreover, the proposed algorithm satisfies the response time required for the live application of beat alignment. We show that robustness is achieved for phrases of sparse rhythmic content, for which there is still sufficient information to derive the underlying beats.

Figure: (a) in-time; (b) early stop cue.
Figure 2: Qualitative illustration of the basic approach: different tatum grids are placed over the onset detection function O(t). The degree of matching provides information about the presence of different tatums in the signal.
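The grid-matching idea of Figure 2, extended to complex exponentials as in the tempogram of eq. (8), can be sketched as follows (a toy illustration, not the authors' implementation):

```python
import cmath
import math

def tempogram_value(odf, j, tau, L=150, fs_o=100.0):
    """Measure the presence of tatum `tau` (in seconds) at ODF frame j by
    correlating the ODF with a Hann-windowed complex exponential whose
    frequency is 1/tau -- a toy version of the tempogram of eq. (8)."""
    return sum(odf[j + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / L)) *
               cmath.exp(-2j * math.pi * n / (tau * fs_o))
               for n in range(L))

# An impulse train with a 200 ms period scores highest for tau = 0.2 s:
odf = [1.0 if n % 20 == 0 else 0.0 for n in range(200)]
scores = {tau: abs(tempogram_value(odf, 0, tau)) for tau in (0.15, 0.2, 0.3)}
print(max(scores, key=scores.get))  # 0.2
```

Tatums that do not divide the onset period make the complex terms rotate and cancel, so their magnitude stays low; the matching tatum keeps all terms in phase.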
The next section describes the structure of the proposed algorithm with its individual components. The subsequent sections then give a detailed description of these components, followed by a performance evaluation of the algorithm covering real-time capability, perceptual evaluation, and performance examples.

2. OUTLINE OF THE BEAT-ALIGNMENT ALGORITHM

The beat-aligning looper should align the loop's start and stop cues so that no pauses or skips occur when the loop is repeated. The basic idea is to find the tatum of a recorded signal and align the start and stop cues to the corresponding beats. Tatum estimation is done by calculating an onset detection function (ODF), to which different tatum grids are correlated in order to pick the one that fits best (see Fig. 2). Due to possible variability of the tempo within the recorded phrase, the tatum estimation should not be done for the entire sequence at once, but rather in blocks.

There is a variety of onset detectors; [5, 6] give a good overview of mainly energy-based ones. A more sophisticated approach based on recurrent neural networks is presented in [7]. However, as the audio signal from the guitar is expected to be clean, without any effects (e.g. delay, reverb or noise) applied to it, we were able to choose a less complex onset detector. Therefore, the spectral flux log filtered approach proposed by Böck et al. [8] is used (Sec. 3). It is an advancement of the spectral flux algorithm by Masri [9]. The main differences are the processing in Mel frequency bands and the use of the logarithm for the magnitudes to resolve loudness levels appropriately. The latter, combined with calculating the difference between two consecutive frames (see Sec. 3.4), leads to the use of magnitude ratios instead of differences. This compensates for temporal variability (dynamics) of the audio signal, as parts with high amplitudes do not get emphasized in comparison to parts with lower amplitudes.

Additionally, adaptive whitening as suggested by Stowell and Plumbley [10] is used to compensate for variability in both the time and frequency domains. It normalizes the magnitude of each STFT bin with regard to a preceding maximum value and compensates spectral differences, as higher frequencies often have lower energy (spectral roll-off). Without the spectral whitening, lower-frequency content tends to dominate over higher-frequency content that contains important onset information, as the surprisingly good results of the simple HFC (high frequency content) ODF demonstrate [9].

As a basis of beat estimation, the tempo estimation approach proposed by Wu et al. [11] was employed (Sec. 4). In their proposal, an STFT of the ODF is calculated for a subset of frequencies, extending the tatum grids to complex exponential functions. This results in the tempogram, which holds a measure of the presence of a tatum in each ODF frame. Here, instead of picking the tatum with the highest magnitude for each block, a dynamic programming (DP) approach is used to find the optimum tempo path: a utility function is maximized, with a gain for high magnitudes and a penalty for large tatum differences between two consecutive frames. This prevents paths with unlikely large variations in tempo. Unlike existing DP approaches for beat estimation such as [11, 12], we do not tie beats to discrete onsets of the recorded signal, as those onsets may be subject to the inaccuracy of human motor action. For this application it is advantageous to retrieve a beat grid without such fluctuations. This is done by using phase information: the phase shift of the complex exponential function's periodicity (Fourier transformation) can be extracted by calculating the phase of the found optimum path. The phase directly measures the location of the beats, as a beat occurs every time the phase crosses a multiple of 2π. This approach is illustrated in Fig. 3. It leads to an increased precision similar to the instantaneous frequency tracking of [13]. As a result, efficiency is increased without the need for a finely sampled set of tatum candidates (see Sec. 5).

The following modifications were made to enhance the tempogram and the information extraction:

Modified tempogram: The tempogram is enhanced by the information of how well the phase of a frame conforms to its expected value, calculated from the time difference between two consecutive frames and the tatum value.

Phase information: In contrast to the tatum information of the optimum path, the phase of the path is used. It contains all the information needed to calculate the positions of the beats. Especially if the set of tatum grids is chosen to be coarse for a low-cost implementation, the phase can be used to calculate the actual tatum more precisely.

Phase reliability: A measure of the phase's reliability is used to spot frames in which the phase information becomes meaningless. This happens when there are no significant onsets. For unreliable frames the beats are discarded and replaced by an interpolating beat placement.

As a last step, the start and stop cues obtained by pressing a foot switch are aligned to the nearest estimated beat. There are no additional rules for the beat alignment (e.g. that the loop length in beats has to be a multiple of 4), in order to leave enough responsibility for phrasing to the player.
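The adaptive whitening stage described above can be sketched as follows (a simplified re-implementation in the spirit of [10]; the `memory` and `floor` values are illustrative assumptions, not the published parameters):

```python
def adaptive_whitening(mag, memory=0.997, floor=1e-4):
    """Per-bin peak normalisation in the spirit of Stowell & Plumbley [10].

    `mag` is a list of frames, each a list of spectral magnitudes. Every bin
    is divided by a slowly decaying running peak, so quiet (often
    high-frequency) bins are boosted relative to loud ones.
    """
    n_bins = len(mag[0])
    peaks = [floor] * n_bins
    out = []
    for frame in mag:
        # update each bin's running peak: current value, decayed peak, floor
        peaks = [max(m, p * memory, floor) for m, p in zip(frame, peaks)]
        out.append([m / p for m, p in zip(frame, peaks)])
    return out

# A loud low bin and a quiet high bin end up on a comparable scale:
X = [[1.0, 0.01], [1.0, 0.01]]
print(adaptive_whitening(X))  # [[1.0, 1.0], [1.0, 1.0]]
```

Because the peak decays with `memory` < 1, the normalisation adapts over time instead of being fixed by the loudest event in the whole recording.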
The spectral flux log filtered onset detection function consists of a block-wise transformation into the frequency domain followed by adaptive whitening. The signal then runs through a filter bank consisting of 50 Mel frequency bands. Last, after taking the logarithm, the temporal differences are summed up to obtain the ODF.

3.1. Short Time Fourier Transform (STFT)

Prior to the transformation into the frequency domain, the signal x(t) is blocked into N overlapping frames with a length of K = 2048 samples and a hop size of h = 480 samples, resulting in an overlap of roughly 77 %. With fs = 48 000 Hz, subsequent frames are 10 ms apart (resulting in an ODF sampling frequency of fs,O = 100 Hz). Each frame is windowed with a Hann window w(t). The windowed signals x_n(t) can be calculated with

    x_n(t) = x(t + nh) w(t),  n ∈ [0, N−1], t ∈ [0, K−1].    (1)

Each frame is transformed into the frequency domain using the discrete Fourier transform:

    X(n, k) = sum_{t=0}^{K−1} x_n(t) · e^{−i 2πkt/K},  k ∈ [0, K−1]    (2)

with n denoting the frame, k the frequency bin, and t the discrete-time index.

The magnitudes of the spectrogram |X(n, k)| are gathered by filter windows F(k, b), with B = 50 overlapping filters with center frequencies from 94 Hz to 15 375 Hz. Each window has the same width on the Mel scale. The windows are not normalized to constant energy, which yields an emphasis on higher frequencies. The Mel spectrogram

Figure: (a) X(n, k) without whitening; (b) X(n, k) with whitening.

3.4. Difference

The final step to derive the onset detection function O(n) is to calculate the difference between frame n and its previous frame n − 1, with a subsequent summation over all spectral bins. The half-wave rectifier function H(x) = (x + |x|)/2 ensures that only onsets are considered. Altogether, we obtain the following equation:

    O(n) = sum_{b=0}^{B−1} H(X_filt^log(n, b) − X_filt^log(n − 1, b)).    (7)

Fig. 4 depicts the above-mentioned steps for the calculation of the onset detection function.
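The pipeline of eqs. (1), (2) and (7) can be condensed into a toy sketch (tiny frame sizes for illustration only; the paper uses K = 2048 and h = 480, and the Mel filter bank and adaptive whitening stages are omitted here):

```python
import cmath
import math

def odf_spectral_flux_log(x, K=8, h=4):
    """Toy sketch of the spectral-flux ODF of Sec. 3 (eqs. (1), (2), (7))."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / K) for t in range(K)]

    def hwr(v):
        return (v + abs(v)) / 2          # half-wave rectifier H(x), eq. (7)

    prev, odf = None, []
    for n in range((len(x) - K) // h + 1):
        frame = [x[n * h + t] * hann[t] for t in range(K)]          # eq. (1)
        spec = [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / K)
                        for t in range(K)))
                for k in range(K // 2 + 1)]                          # eq. (2)
        logspec = [math.log1p(s) for s in spec]  # compress magnitude range
        if prev is not None:
            # summed, rectified frame-to-frame difference, eq. (7)
            odf.append(sum(hwr(a - b) for a, b in zip(logspec, prev)))
        prev = logspec
    return odf

# Silence followed by a burst produces a clear positive ODF peak:
sig = [0.0] * 16 + [1.0, -1.0] * 8
print(max(odf_spectral_flux_log(sig)) > 0)  # True
```

Only increases in (log) magnitude contribute, so decaying notes do not trigger spurious onsets.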
    M(j, τ) = sum_{n=0}^{L−1} O(n + j) w(n) e^{−i 2πn/(τ fs,O)},  j ∈ [0, J−1]    (8)

with j being the ODF frame index, L the ODF frame length, w(n) the Hann window function, fs,O denoting the ODF sampling frequency (here 100 Hz), and τ ∈ T denoting the tatum out of a set T with different tatum values between 60 ms and 430 ms. An ODF frame length of L = 150 yielded good results, meaning that one tempogram value represents a 1.5 s frame and the beginning of one frame is 1/fs,O = 10 ms apart from the beginning of its predecessor. The total number of time steps j can be expressed as J = N − L + 1.

To emphasize phase continuity, the phase difference between two consecutive frames dΦ(j, τ) is calculated and compared to its expected value dφ̂(τ). The difference of both results in the phase deviation matrix:

    ∆Φ(j, τ) = dΦ(j, τ) − dφ̂(τ)    (9)

with

    dΦ(j, τ) = ∠M(j, τ) − ∠M(j − 1, τ)  and    (10)

    dφ̂(τ) = 2π / (τ fs,O).    (11)

    ∆Φ_mapped = ((∆Φ/π + 1) mod 2) − 1    (12)

maps the values to the range [−1, 1], with a value of 0 indicating a perfect phase deviation of 0. The modified tempogram M′(j, τ)

Figure 5: Effect of different κ values on the modified tempogram. The range of the depicted values is between 0 and 1, with dark blue representing 0 and yellow representing 1. (a) and (b) show (1 − |∆Φ_mapped(j, τ)|)^κ; (c) and (d) show the modified tempograms M′(j, τ), for κ = 20 in (c) and κ = 100 in (d).

4.2. Optimum Tatum Path

The optimum tatum path can be extracted from the modified tempogram by maximizing the utility function U(τ, θ). This function is designed so that high absolute tempogram values |M′(j, τ)| (tatum conformance) are rewarded, while large leaps in tempo/tatum result in a penalty (second term in equation (15)). The goal is to find a series of tatum values τ = [τ_0, ..., τ_j, ..., τ_{J−1}], with τ_j the tatum value for ODF frame j, which maximizes the utility function

    U(τ, θ) = sum_{j=0}^{J−1} |M′(j, τ_j)| − θ sum_{j=1}^{J−1} |1/τ_{j−1} − 1/τ_j|,    (15)
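One way to maximize eq. (15) is a Viterbi-style dynamic program over the tatum candidates, with backtracking over the memorized best predecessors; a toy sketch (not the authors' Matlab implementation):

```python
def optimum_tatum_path(tempogram_mag, taus, theta=1.0):
    """DP search for the tatum path maximising the utility of eq. (15).

    `tempogram_mag[j][i]` is |M'(j, tau_i)|; `taus` lists the tatum
    candidates in seconds. Returns one tatum value per ODF frame.
    """
    J, n_tau = len(tempogram_mag), len(taus)
    D = [tempogram_mag[0][:]]        # best utility of any path ending here
    back = []                        # memorized best predecessor indices
    for j in range(1, J):
        row, brow = [], []
        for i in range(n_tau):
            best, arg = max(
                (D[j - 1][p] - theta * abs(1 / taus[p] - 1 / taus[i]), p)
                for p in range(n_tau))
            row.append(tempogram_mag[j][i] + best)
            brow.append(arg)
        D.append(row)
        back.append(brow)
    i = max(range(n_tau), key=lambda k: D[-1][k])
    path = [taus[i]]
    for brow in reversed(back):      # backtrack from the last frame
        i = brow[i]
        path.append(taus[i])
    return path[::-1]

# Strong evidence for 0.2 s in the outer frames keeps the path at 0.2 s,
# even though the middle frame weakly favours 0.4 s (the tempo-leap
# penalty suppresses the jump):
mag = [[0.1, 1.0], [0.6, 0.5], [0.1, 1.0]]
print(optimum_tatum_path(mag, [0.4, 0.2], theta=0.5))  # [0.2, 0.2, 0.2]
```

With theta = 0, the same code degenerates to picking the per-frame maximum, which is exactly the behaviour the penalty term is there to avoid.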
Figure 6: Example of filling gaps with low phase reliability. The black line represents the phase, with gaps where the phase reliability (purple) drops below 0.1. The estimated beats (blue) are obtained from the phase information; the interpolated beats (orange) are filled in by interpolation. The impression of the phase keeping its value right before and after a gap is an artifact of phase unwrapping.
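The gap-filling illustrated in Figure 6 can be sketched as follows (a toy version; the 1.5x gap criterion is an assumption of this sketch, not a parameter from the paper):

```python
def fill_beat_gaps(beats, reliable, mean_tatum):
    """Bridge unreliable stretches with equally spaced beats.

    `beats` are beat times in seconds, `reliable` flags each beat, and
    `mean_tatum` is the average tatum; gaps left by discarded beats are
    filled with beats spaced `mean_tatum` apart.
    """
    kept = [b for b, ok in zip(beats, reliable) if ok]
    filled = [kept[0]]
    for b in kept[1:]:
        # insert interpolated beats while the gap is clearly too large
        while b - filled[-1] > 1.5 * mean_tatum:
            filled.append(filled[-1] + mean_tatum)
        filled.append(b)
    return filled

beats = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
reliable = [True, True, True, False, False, True, True]
print(fill_beat_gaps(beats, reliable, 0.5))
# the two discarded beats are restored at 1.5 s and 2.0 s
```

The restored beats carry no onset evidence of their own; they merely keep the grid continuous so that a start or stop cue falling inside a quiet passage still snaps to a plausible beat.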
Basically, after initialization of the first frame j = 0, for every tatum τ_j of frame j the algorithm looks for the tatum τ_{j−1} of the previous frame which yields the most rewarding transition τ_{j−1} → τ_j. By memorizing τ_{j−1,max} for every D(j, τ), the optimum path can be found by backtracking τ_{j−1,max}, starting with the tatum arg max_τ D(J − 1, τ) of the last frame.

The optimum path extracted for the previously shown tempogram is depicted in Fig. 7. The path (red line) follows the leaps in tempo.

Figure 7: Resulting optimum tatum path for a hi-hat signal with leaps in tempo.

4.3. Beat Placement and Start/Stop Alignment

As described above, the beat placement uses the phase φ(n) of the optimum tatum path τ = [τ_0, ..., τ_j, ..., τ_{J−1}]. The phase can be obtained from the tempogram M(j, τ_j) by calculating the angle of the complex values. The modified tempogram M′(j, τ) can also be used here, as it holds the same phase information. To find a phase value for every time step n of the ODF, the tempogram time steps j have to be mapped to n:

    j → n = j + L/2.    (18)

The offset of L/2 arises from the phase information being calculated for a frame of length L (see equation (8)), so the center of each frame was chosen for the mapping. Nevertheless, the phase information is valid for the beginning of each frame; therefore the phase itself also has to be adjusted to the frame center by the expected amount (L/2) dφ̂(τ). The extraction of the phase can thus be formulated as follows:

    φ(n) = φ(j + L/2) = ∠M(j, τ_j) + (L/2) dφ̂(τ_j),  for j ∈ [0, J−1].    (19)

The τ(n) values are not bound to those of the tatum set T and, as a consequence, are sampled with a higher precision.⁵ Averaging these values yields the mean tatum τ̄, which can be used for phase extrapolation and the interpolation of beat gaps (described hereafter).

In addition to the angle of the tempogram, the magnitude is used as a measure of phase reliability. If the magnitude is lower than a predefined threshold, the phase information is considered meaningless. Low tempogram magnitudes can occur in frames with only few or no onsets. In that case, the phase information is discarded and the resulting gaps are filled with equally spaced beats corresponding to the mean tatum. An example of interpolated beats is shown in Fig. 6.

The last step is to align the start and stop cues to the estimated beat positions. This is easily done by shifting each cue to the closest beat.

⁴ transition from negative to positive values
⁵ see the example in Section 5.2 for demonstration
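Beat placement from the unwrapped phase, with a beat wherever φ(n) crosses a multiple of 2π, can be sketched as follows (a toy illustration of the Sec. 4.3 idea, with linear interpolation of the crossing time):

```python
import math

TWO_PI = 2 * math.pi

def beats_from_phase(phi, fs_o=100.0):
    """Place a beat wherever the unwrapped phase phi(n) crosses a multiple
    of 2*pi. `phi` holds one phase value per ODF frame, `fs_o` is the ODF
    sampling frequency in Hz."""
    beats = []
    for n in range(1, len(phi)):
        k_prev = math.floor(phi[n - 1] / TWO_PI)
        k_curr = math.floor(phi[n] / TWO_PI)
        if k_curr > k_prev:
            # interpolate the exact crossing time within the frame step
            target = k_curr * TWO_PI
            frac = (target - phi[n - 1]) / (phi[n] - phi[n - 1])
            beats.append((n - 1 + frac) / fs_o)
    return beats

# A linear phase ramp advancing one cycle every 20 frames (i.e. a 200 ms
# tatum at fs_o = 100 Hz) yields beats spaced 0.2 s apart:
phi = [TWO_PI * (n + 0.5) / 20 for n in range(61)]
print([round(b, 3) for b in beats_from_phase(phi)])  # [0.195, 0.395, 0.595]
```

Because the crossing time is interpolated between frames, the beat positions are not quantized to the ODF sampling grid, which mirrors the precision argument made in the text.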
5. PERFORMANCE EVALUATION

Figure: (a) Onset detection function; (b) Tempogram.

5.3. Greatest Common Divisor (GCD) of Existing Tatums

The algorithm tries to find the tatum which fits best to all occurring tatums. This mechanism is demonstrated with a signal consisting of alternating blocks of quaver notes and quaver triplets. At tempo 100 bpm, the time difference between two successive quaver notes is τ_1 = 300 ms, and for quaver triplets τ_2 = 200 ms, respectively. As expected, these tatum values also occur in the tempogram with a high presence measure (see Fig. 10). However, the optimum tatum path was found for a tatum τ_3 which does not occur explicitly, but is implied by the two tatums τ_1 and τ_2 as their greatest common divisor, τ_3 = gcd(τ_1, τ_2) = 100 ms. All occurring events can be assigned to beats placed with that tatum.
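The GCD mechanism of this example can be reproduced directly (times in integer milliseconds):

```python
from math import gcd

# Quaver notes at 100 bpm and quaver triplets imply tatums of 300 ms and
# 200 ms; the common tatum explaining both is their greatest common divisor:
tau1, tau2 = 300, 200          # milliseconds
tau3 = gcd(tau1, tau2)
print(tau3)                    # 100

# Every onset of either rhythmic grid falls on a beat of the 100 ms grid:
onsets = set(range(0, 1200, tau1)) | set(range(0, 1200, tau2))
print(all(t % tau3 == 0 for t in onsets))  # True
```

This is why a tatum that never occurs as an inter-onset interval can still be the best single explanation of the whole phrase.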
Table 1: Results of time profiling of two audio files with different durations and different numbers of tatums n_τ. The algorithm was implemented in Matlab and executed on a 2.3 GHz Intel Core i7 processor. ODF: onset detection function; TG: tempogram calculation; path: optimum path search; BE: beat estimation.

Table 2: Results of validation of the algorithm. Values are given in ms and show the resulting gaps introduced by the algorithm. Marked (*) L16th levels indicate that the algorithm finds the tatum for quaver notes (300 ms); otherwise the tatum of semiquavers (150 ms) is found.
6. CONCLUSIONS AND OUTLOOK

We proposed a real-time capable beat-aligning guitar looper. It is mainly characterized by its ability to support the temporally accurate handling of the looper without any further prior information or additional constraints. Start and stop cue alignment is done automatically during the first repetition of the recorded phrase. Independent of tempo, this adjustment guarantees that the resulting gap stays within rhythmic inaudibility. For traceability, this was evaluated within an ecologically valid setup.

The computation time of the described algorithm is dominated by the optimum path search. Informal tests showed that a divide-and-conquer approach reduces the complexity and computation time. In this approach the optimum path search combines the output of separately analyzed tatum octave sets instead of the entire tatum set.

As the beat-aligning looper retrieves all beats of the recorded phrase, it is a promising basis for further development of an automatic accompaniment (e.g. a rhythm section) for practice or live-performance purposes.

A detailed view on the conducted listening tests and the evaluation of the algorithm can be found in [15].

7. REFERENCES

[1] V. Zuckerkandl, Sound and Symbol, Princeton University Press, 1969.

[2] T. Baumgärtel, Schleifen: zur Geschichte und Ästhetik des Loops, Kulturverlag Kadmos, Berlin, 2015.

[3] M. Grob, "Live Looping - Growth due to limitations," http://www.livelooping.org/history_concepts/theory/growth-along-the-limitations-of-the-tools/, 2009, Last accessed on Jan 11, 2017.

[4] J. Bilmes, Timing is of the Essence: Perceptual and Computational Techniques for Representing, Learning, and Reproducing Expressive Timing in Percussive Rhythm, Ph.D. thesis, Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1993.

[5] S. Dixon, "Onset detection revisited," in Proceedings of the 9th International Conference on Digital Audio Effects (DAFx-06), Montreal, Canada, September 18-20, 2006, pp. 133–137.

[6] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler, "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 1035–1047, 2005.

[7] S. Böck, A. Arzt, F. Krebs, and M. Schedl, "Online real-time onset detection with recurrent neural networks," in Proceedings of the 15th International Conference on Digital Audio Effects (DAFx-12), York, UK, September 17-21, 2012.

[8] S. Böck, F. Krebs, and M. Schedl, "Evaluating the Online Capabilities of Onset Detection Methods," in Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR 2012), Mosteiro S. Bento Da Vitória, Porto, Portugal, October 8-12, 2012, pp. 49–54.

[9] P. Masri, Computer modelling of sound for transformation and synthesis of musical signals, Ph.D. thesis, University of Bristol, 1996.

[10] D. Stowell and M. Plumbley, "Adaptive whitening for improved real-time audio onset detection," in Proceedings of the International Computer Music Conference (ICMC'07), Copenhagen, Denmark, August 27-31, 2007, pp. 312–319.

[11] F. H. F. Wu, T. C. Lee, J. S. R. Jang, K. K. Chang, C. H. Lu, and W. N. Wang, "A Two-Fold Dynamic Programming Approach to Beat Tracking for Audio Music with Time-Varying Tempo," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, Florida, USA, October 24-28, 2011, pp. 191–196.

[12] D. P. W. Ellis, "Beat Tracking by Dynamic Programming," Journal of New Music Research, vol. 36, no. 1, pp. 51–60, Mar. 2007.

[13] B. Boashash, "Estimating and interpreting the instantaneous frequency of a signal—Part I: Fundamentals," Proceedings of the IEEE, vol. 80, no. 4, pp. 520–538, Apr. 1992.

[14] S. Hibi, "Rhythm perception in repetitive sound sequence," Journal of the Acoustical Society of Japan (E), vol. 4, no. 2, pp. 83–95, 1983.

[15] D. Rudrich, "Timing-improved Guitar Loop Pedal based on Beat Tracking," M.S. thesis, Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Jan. 2017.
ABSTRACT

Usually, a sound designer achieves artistic goals by editing and processing pre-recorded sound samples. To assist navigation in the vast amount of sounds, sound metadata is used: it provides small free-form textual descriptions of the sound file content. One can search through the keywords or phrases in the metadata to find a group of sounds that can be suitable for a task. Unfortunately, the relativity of sound design terms complicates the search, making the search process tedious, prone to errors and by no means supportive of the creative flow. Another way to approach the sound search problem is to use sound analysis. In this paper we present a simple method for analyzing the temporal evolution of the "whoosh" sound, based on a per-band piecewise linear approximation of the sound envelope signal. The method uses the spectral centroid and fuzzy membership functions to estimate the degree to which the sound energy moves upwards or downwards in the frequency domain along the audio file. We evaluated the method on a generated dataset consisting of white noise recordings processed with different variations of modulated band-pass filters. The method was able to correctly identify the centroid movement directions in 77% of the sounds from a synthetic dataset.

1. INTRODUCTION

There are two different approaches to sound design according to how a sound is generated and used [1]. A sample-oriented approach is based on the processing of recorded sounds, which are then arranged according to the moving picture content. A procedural approach implies building up signal synthesis and processing chains and making them reactive to happenings on the screen. But the complexity of implementing DSP algorithms for artistic tasks has made this approach rarely used in the industry, hence the sample-oriented method is used most in sound design. In this case the design process starts with searching for sounds in libraries.

To facilitate search and decrease the time spent searching, library manufacturers provide textual annotations for sounds that could be written in filenames, in PDF files, or encoded in spreadsheets for integration with sound metadata management software. In either case, sound search is basically a keyword search in a corpus of text. This format makes it almost impossible to annotate many aspects of sounds, as the annotations would grow too big to be manageable by a human. So, the creators of sound libraries provide annotations that are short and concise. The quality and richness of metadata varies from company to company, but mostly all provide keywords for common sound categories, representing purpose, materials, actions and other high-level sound-related concepts.

The keywords may have synonyms, change meaning depending on a context (polysemy), or even be interpreted differently by different people (relativity) [2, 3]. These aspects lead to a poor user experience in sound search applications, when the search yields a lot of results, and a designer would need to listen through many sounds before even starting the creative work [4]. To narrow down the search, an experienced professional may want to exclude specific libraries from search or use boolean expressions to include synonyms, but all these actions are far from the creative work.

One way to approach this problem could be augmenting the keyword-based search with audio analysis algorithms to estimate some sound parameters that are pertinent to the workflow. There are a number of concepts in sound design which have more or less unambiguous meanings that many people agree on (whooshes, risers, hits or pads, to name a few), and we can model these sounds without doing a substantial amount of knowledge elicitation work. These models can be used in the search context to filter out the sounds which do not fit into the model parameters a user is interested in.

Of course, a comprehensive method to address the problem would include many such models, providing a sound designer with a diverse set of content-based search filters. But in this paper we only consider a model of a whoosh sound, as it has a relatively straightforward definition and is widely used in production. The AudioSet ontology [5] defines a whoosh as "a sibilant sound caused by a rigid object sweeping through the air." In sound design this term is usually interpreted in a more generic way, for example as "movements of air sounds that are often used as transitions, or to create the aural illusion of movement" [6]. This movement is an essential quality of a whoosh sound and can be expressed as volume or spectral movements. The definitive characteristics of this movement are a relatively slow attack and decay, a prominent climax point, and often a noticeable spectral change from the higher frequencies down to the lower ones or the opposite way around. Sounds of this class are widely used, e.g. in game menu navigation, trailer scene transitions, for sounds of space ships passing by, and so forth.

In this paper we describe a whoosh model and a set of audio analysis algorithms to fit an arbitrary sound file into this model. The model is used to estimate a spectral movement descriptor quantifying the degree of the whoosh transition up or down along the spectrum. The model is based on splitting the sound into frequency bands and approximating the volume envelope in each band as a piecewise linear function. This representation is similar to the multi-stage envelopes (also known as breakpoint functions) used in sound synthesizers, so we can easily extract attack and decay values from it. We test our method on a synthetic dataset consisting of different variations of white noise processed with a modulated band-pass filter.

The paper is organized as follows. Section 2 provides an overview of the related work. Section 3 describes the whoosh model
and its centroid metric. In Section 4 we describe procedures to fit an arbitrary sound to the model. Section 5 describes the evaluation procedures, and Section 6 reports the evaluation results. Lastly, in Section 7, we discuss the results, and then conclude the paper in Section 8 by summarizing contributions and possible applications.

2. RELATED WORK

A good introduction to the sound retrieval problem can be found in [7]. This work explored what words subjects used to describe sound, and grouped verbal descriptions into three major classes: the sound itself (onomatopoeia, acoustic cues), the sounding situation (what, how, where, etc.) and the sound impression (adjectives, figurative words, etc.). These classes provide different conceptual tools to organize sound.

The sound impression class deals with the interpretation of how words are connected to sound [8, 9, 10]. This class consists of descriptions that are usually used in informal communication, and they often do not have established interpretations. Thus, building a recognition system for them would require the elicitation of strict definitions (like the authors of [11] and [12] did).

The sounding situation class contains descriptions of the setting in which a sound production occurred. They may include a sound source (a bird, a car), a sounding process (twittering, rumbling), a place (a forest, a garage), a time of production (morning), etc. Commercial sound libraries are mostly annotated with the concepts of this class, but adding all word variations and synonyms to enrich the metadata would sacrifice readability and search simplicity. Moreover, treating linguistic relativity may require a deeper understanding of what is actually recorded in a sound. For example, a long hit sound in a user interface library could be much shorter than a short hit from a trailer library. One could address this issue by adding a trailer keyword to the annotation, but this is just one example of many possible usages, and covering them all may be impossible with limited human resources. To some extent these problems can be addressed with natural language processing or knowledge elicitation [3, 13, 14], but treating the relativity problem with these methods would require a substantial amount of knowledge engineering to categorize and organize sound concepts. There are ontologies that provide some structured information about audio-related concepts [5, 15, 16, 17], and with some adaptations they can assist in the sound search. For example, in [12] the authors describe the implementation of timbre attribute prediction software that uses concepts from the ontology provided in [15].

The sound itself class describes acoustic features of the sound "as one hears it." It includes sound representation models (music theory, spectromorphology, onomatopoeia, etc.). For example, in [18] the authors explore the relevance of sound vocalizations compared to the actual sound. Interestingly, the authors connected vocalizations with morphological profiles of the sound, based on Schaeffer's spectromorphology concepts. There is substantial work done on automatic sound morphological profile recognition [19, 20, 21], but real sound designers rarely have to deal with the morphological concepts.

3. WHOOSH MODEL

As was discussed in the Introduction, the main property of a whoosh sound is the movement, which can be temporal, spectral or both at the same time. This movement is characterized by a prominent peak somewhere in the middle of the sound: the intensity rises to this peak from silence and decays down after passing it. A spectral change can also happen together with the intensity change, often with prominent transitions from the higher to the lower part of the spectrum or the other way around. We build our model on the proposition that these two movements together define the whoosh sound.

We continue with the following assumptions:

1. The sound intensity movement of the whoosh sound can be described using either an attack-decay (AD) or an attack-hold-decay (AHD) envelope. This comes directly from the proposition we mentioned just above.

2. The model will be used in the search context, where a user already knows what sounds are whooshes, and only wants to filter out some of them according to the settings. Thus, a recognition between whoosh and non-whoosh sounds is not needed.

The purpose of the model is to describe the temporal envelope and the spectral movement of a whoosh sound. The model consists of a simplified multi-band representation of the sound intensity envelope and a movement metric, which quantifies the degree to which the frequency content moved up or down in the spectrum.

The attack-decay envelope is an ideal representation of a whoosh movement, but in practice it is often hard to define one single peak, especially in long sounds. In this case an attack-hold-decay envelope can be a better fit. The AHD envelope can be modeled as a piecewise linear function defined somewhere in a sound of length L:

  Ei(x) = a1 x + b1, if x ∈ [xi0, xi1)
          a2 x + b2, if x ∈ [xi1, xi2)      x ∈ [0, L]    (1)
          a3 x + b3, if x ∈ [xi2, xi3]
          0, otherwise

where x is an arbitrary time point in a sound file, xi0 and xi3 are the beginning and the end of the envelope, and xi1 and xi2 are the breaking points. The AD envelope model is a special case of AHD, where xi1 = xi2. A sound can contain several non-overlapping sound envelopes, and i denotes the index of the envelope in the sound.

For each sound we define two fuzzy membership functions for the "beginning" and the "end" linguistic variables:

  mbeg(x) = x / L ∈ [0, 1],    (2)
  mend(x) = 1 − mbeg(x) ∈ [0, 1],    (3)

whose results can be interpreted as "to what extent the time point x is located in the beginning or in the end of a sound." We use these functions to weigh an arbitrary envelope value:

  Vli(x) = ml(x) Ei(x),  l ∈ {beg, end}.    (4)
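The envelope model of Eq. (1) and the membership functions of Eqs. (2)–(3) can be sketched in Python as follows. The function and variable names are illustrative, not taken from the authors' implementation:

```python
import numpy as np

def ahd_envelope(x, x0, x1, x2, x3, coeffs):
    """Piecewise linear AHD envelope, Eq. (1).

    coeffs = ((a1, b1), (a2, b2), (a3, b3)); the envelope is zero
    outside [x0, x3]. The AD model is the special case x1 == x2.
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for (lo, hi), (a, b) in zip(((x0, x1), (x1, x2), (x2, x3)), coeffs):
        m = (x >= lo) & (x < hi)
        y[m] = a * x[m] + b
    a3, b3 = coeffs[2]
    y[x == x3] = a3 * x3 + b3  # the last interval is closed on the right
    return y

def m_beg(x, length):
    """Fuzzy membership of the "beginning" variable, Eq. (2)."""
    return np.asarray(x, dtype=float) / length

def m_end(x, length):
    """Fuzzy membership of the "end" variable, Eq. (3)."""
    return 1.0 - m_beg(x, length)
```

An envelope value at a breaking point is then weighted by `m_beg` and `m_end` to judge to what extent that peak lies in the beginning or in the end of the file.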
We use them to calculate a cumulative peak value for the linguistic variable l:

  pl = Σi (Vli(xi1) + Vli(xi2)),    (5)

whose value can be interpreted as "how much of the sound energy lies in the beginning or in the end of a sound."

We use the piecewise functions (Eq. 1) to compress the envelope signals in each band of a sound passed through a crossover. With a set of models in each frequency band, we can estimate the centroid of the peak values for each linguistic variable, adapting the standard spectral centroid formula [22]:

  Cl = (Σb b plb) / (Σb plb),    (6)

where b is a band index, and plb is the cumulative peak value of band b for the linguistic variable l.

And finally, the centroid descriptor is the difference between the centroids of the two linguistic variables:

  Cd = Cend − Cbeg    (7)

Cd estimates how the centroid changes over time, with negative values suggesting a decrease, and positive values an increase.

4. FITTING A SOUND TO THE MODEL

To construct the model we need to find values for all non-input variables in Eq. 1. Our strategy is to first find the boundaries of the envelope (points xi0 and xi3), and then to find the breaking points xi1 and xi2. Knowing the envelope values at these points, we can calculate the linear function coefficients a and b in each piece by solving a trivial linear equation system. This approach is similar to the one used in [19], where the maximum sound envelope values are connected to both the beginning and the end of the sound, thus forming a two-segment AD model. In our method we construct the AHD model, which can often be a more realistic approximation. The rest of this section explains how exactly all these steps are performed.

In sound libraries a single audio recording can contain several variations of the same sound separated by silence. This is a common practice, e.g. for the sounds of weapon shooting or footsteps. So, we start by looking for long silence segments and use them as separators to split recordings into multiple sounds.

The separated sounds are then split into frequency bands with a crossover filter bank, and the RMS envelope signals are estimated for each band. Based on these signals we split each band into regions separated by silence. A simple threshold processing is used to find the initial set of regions, which may get merged if they are close to each other.

The piecewise models are then built for each resulting region. The beginning and end points (xi0 and xi3) are defined by the regions' bounds, so we need to find and connect climax points to create the linear representation. We do this with essentially the same algorithm as we used in the silence-split operation above, but with a higher threshold value relative to the maximum envelope value in the region. The new operation yields a number of "high-energy" regions, and their maximum values are used as climax points for the multi-segment model. However, in many cases the resulting model will contain a large number of segments due to noise and irregularities in the sound. To overcome this, we discard all points between the first and the last climax points. The two peak points left will be used as xi1 and xi2, thus creating the AHD model. If there is only one peak point found after performing this operation, then we can create the AD model, where xi1 = xi2. The AD model can then be simplified by removing the middle piece from the piecewise model in Eq. 1.

Having found all breakpoints of the piecewise model, determining the a and b coefficients for each linear function is as trivial as finding the equation of a line given two points.

5. EVALUATION

We use the following method parameters in the evaluation:

• The sample rate is set to 44100 Hz.
• The RMS window size for the envelope signal estimation is set to 20 ms.
• For separating sounds in the sound file we set the non-silence split threshold Tss to 0.001 (−60 dB), and the non-silence merge threshold Tsm to 500 ms. This means that all non-silence regions with a value higher than 0.001 will be merged together if the silence gap between them is shorter than half a second.
• The crossover is implemented using a bank of 4th-order Butterworth filters. It can have a different number of frequency bands, but the split frequencies should be spread evenly on the equivalent rectangular bandwidth (ERB) scale [23]. This provides a perceptual weighting for comparing signal energies between different frequency bands.
• To find the envelope bounds in frequency bands, the Tss value is set to 75 ms.
• To find the high-energy regions in envelopes, the peak split threshold Tps is set to 80% of the maximum envelope signal value between the bounds. The peak merge threshold Tpm is set to 5 ms. I.e., high-energy regions with an envelope value of more than 80% of the maximum will be merged into one if the gap between them is shorter than 5 ms. The individual peaks will be identified as the maximum value in each region.

First, we demonstrate the method on a simple signal by feeding in a 1-second sine wave with frequency sweeping from 50 to 10000 Hz (Fig. 1).

We also test the method on a synthetic dataset generated with Csound [24]. It consists of 1-second sound files with variations of white noise processed by a modulated band-pass filter. The modulation linearly changes the center frequency and the bandwidth of the filter from cf1 to cf2, and from bw1 to bw2, respectively. We choose the set of center frequencies and bandwidths for modulation as follows:

• Create a list of tuples (fl, fh) with all possible pairwise combinations of the set {0, 100 . . . 20000}, so that fl < fh. fl and fh are the bottom and top frequencies of the passband, respectively.
• Combine the resulting tuples into a list of pairs ((fl1, fh1), (fl2, fh2)). Each element of the list represents a pair of initial and target modulation parameters. The parameters are transformed into center frequency and bandwidth pairs as follows: bwi = fhi − fli, cfi = fli + 0.5 bwi, where i ∈ {1, 2}.
This procedure leaves us with 365330 variations of the initial and target filter frequency-bandwidth pairs. We pass these values into Csound [24] to generate the sound files³. Our method is evaluated using different numbers of frequency bands. The goal of this experiment is to find in which situations our method fails to correctly predict the spectral direction of whoosh sounds.

6. RESULTS

Fig. 1 shows how the method works on a trivial example: a sine wave with frequency rising linearly from 1000 Hz to 10000 Hz. We see how the algorithm fits the linear model to the envelope signals in each band. The red vertical lines depict peak positions. The values at the peaks are weighted for each linguistic variable as follows (Eq. 4):

  V1beg ≈ 0.002,  V1end ≈ 0
  V2beg ≈ 0.354,  V2end ≈ 0
  V3beg ≈ 0.314,  V3end ≈ 0.04
  V4beg ≈ 0.007,  V4end ≈ 0.347

which yields the following centroids after applying Eq. 6:

  Cbeg ≈ 1.445,  Cend ≈ 2.893

The "end" centroid is higher, suggesting the sound's frequency content is moving up in the spectrum.

Figure 1: A sine sweep analyzed with the described method. The frequency goes up linearly from 1000 to 10000 Hz. The figure presents four graphs with signal envelopes and linear functions for each frequency band (the lowest band is at the top row). Split frequencies are 422, 1487 and 4596 Hz. The Y-axis in each graph is the envelope amplitude. All graphs have identical scales, so only the bottom one's scale is annotated. Black solid lines are the envelope signals. Yellow lines are the piecewise models (Eq. 1) fitted to the envelopes. Red horizontal lines depict the envelope's peak value position. Dotted and dashed gray lines are the fuzzy membership functions for the "beginning" and the "end" linguistic variables, respectively. The signal in the first band is weak compared to the other bands, but it passed the threshold defined by Tss.

In the second experiment we test the method on a set of synthetically prepared sounds, as described in the previous section. From the 365330 variations of filter frequency sweeps of the white noise signal, the centroid movement direction of 282815 sounds (approx. 77%) was identified correctly.

[Figure 2: Number of occurrences vs. abs(Cd); (a) correct and incorrect predictions; (b) abs(Cd) sample mean (n = 30).]

Fig. 2(a) shows the distribution of the absolute centroid descriptor for both correct and incorrect predictions. We see that, on average, the prediction errors happen more often when the centroid movement is low (Fig. 2(b)). There is a slight positive relationship between the absolute centroid descriptor and the prediction correctness (r ≈ 0.319). Approximately the same correlation exists between the difference of the start and end center frequencies of the band-pass filter modulation in the generated sounds (r ≈ 0.317).

We test how the algorithm performs when evaluated with different numbers of frequency bands. The obvious outcome of using fewer bands is less frequency-domain precision and an inability to recognize the whoosh direction when the spectral variations of a sound are completely covered by a single band. So, the ideal method may fail to recognize the direction of some whooshes when evaluated with a few bands, but it will eventually succeed after increasing the

² We decided not to include samples with such a prominent spectral transition, as it may be easy for the algorithm to identify the whoosh direction in such cases.
³ The second-order Butterworth filter was used, see the butterbp opcode, URL: http://csound.github.io/docs/manual/butterbp.html
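The centroids of the sine-sweep example can be recomputed directly from the reported peak weights via Eq. (6), with bands indexed from zero. Because the V values above are themselves rounded, the result matches the reported Cbeg ≈ 1.445 and Cend ≈ 2.893 only approximately:

```python
import numpy as np

# Peak values for the four bands of the sine-sweep example,
# lowest band first, as reported in the text (rounded).
v_beg = np.array([0.002, 0.354, 0.314, 0.007])
v_end = np.array([0.0, 0.0, 0.04, 0.347])

bands = np.arange(4)                        # band indices b = 0..3
c_beg = np.dot(bands, v_beg) / v_beg.sum()  # Eq. (6)
c_end = np.dot(bands, v_end) / v_end.sum()
c_d = c_end - c_beg                         # Eq. (7): positive => "up"
print(c_beg, c_end, c_d)
```

c_end comes out near 2.90 and c_d is clearly positive, consistent with the sweep moving up in the spectrum.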
number of bands. But the test results are not that simple. We observed three groups of sounds:

1. Sounds where the estimated whoosh direction is the same for all numbers of frequency bands.

2. Sounds where the direction estimation is incorrect for a low number of bands, but improves after increasing their number and does not regress anymore.

3. Sounds where the direction estimation varies for different numbers of frequency bands without any apparent pattern (e.g. it estimates correctly for 4 and 6 bands, then regresses at 10 bands, then improves for 18 bands, etc.).

For convenience we call the first two groups of sounds stable and the third one unstable. The unstable subset contains 9506 sounds (including both correct and incorrect predictions), which is 26% of the test data. Fig. 3 compares the bandwise method precision in the two subsets. We see that the precisions are approximately 0.58 and 0.81 for the unstable and stable groups, respectively. The precision grows as the number of frequency bands increases, and stabilizes at approximately 8 bands in both groups.

[Figure 3: Precision vs. number of frequency bands; (a) "Unstable" group, (b) "Stable" group.]

Figure 4: An example of noisy peak detection. The graph axes are the same as in Fig. 1. The noise sound (cf1 = 9292, bw1 = 9697, cf2 = 11413, bw2 = 7071) has been analyzed with the method two times: (a) 6 bands, incorrect prediction, Cbeg ≈ 4.46, Cend ≈ 4.37; (b) 8 bands, correct prediction, Cbeg ≈ 6.06, Cend ≈ 6.57. Note the peak position in the bottom bands: although the envelope signal is essentially the same, the estimated peak positions vary significantly due to noise in the signal. This led to the incorrect direction prediction in case (a).

7. DISCUSSION

In this section we briefly discuss the method performance analysis results from the previous section.

First, there is an evident correlation between the absolute Cd and the method precision. This descriptor is the difference between centroid values weighted for the two linguistic variables representing the temporal placement of an event. This suggests that the method may have worse performance for sounds with subtle spectral centroid movements. Currently, the implementation lacks a notion of a "still" category, which would contain sounds that do not perceivably go "up" nor "down", and many incorrect predictions might go into this class. But the implementation of this class requires conducting qualitative research to determine what a perceptually significant whoosh movement is. This is an important question for content-based sound search, but it is out of scope of this
[14] Eugene Cherny, Johan Lilius, Johannes Brusila, Dmitry Mouromtsev, and Gleb Rogozinsky, "An approach for structuring sound sample libraries using ontology," in Proc. International Conference on Knowledge Engineering and Semantic Web (KESW 2016), Prague, Czech Republic, Sept. 21-23, 2016, pp. 202–214.

[15] Andy Pearce, Tim Brookes, and Russell Mason, "Hierarchical ontology of timbral semantic descriptors," Audio Commons project deliverable D5.1, Surrey, UK, 2016, available at http://www.audiocommons.org/materials.

[16] Tomohiro Nakatani and Hiroshi G. Okuno, "Sound ontology for computational auditory scene analysis," in Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, WI, USA, July 26-30, 1998, pp. 1004–1010.

[17] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al., "DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia," Semantic Web, vol. 6, no. 2, pp. 167–195, 2015.

[18] Guillaume Lemaitre, Olivier Houix, Frédéric Voisin, Nicolas Misdariis, and Patrick Susini, "Vocal imitations of non-vocal sounds," PLoS ONE, vol. 11, no. 12, pp. 1–28, Dec. 2016.

[19] Geoffroy Peeters and Emmanuel Deruty, "Automatic morphological description of sounds," The Journal of the Acoustical Society of America, vol. 123, no. 5, pp. 3801–3801, 2008.

[20] Geoffroy Peeters and Emmanuel Deruty, "Sound indexing using morphological description," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 675–687, 2010.

[21] Lasse Thoresen and Andreas Hedman, "Spectromorphological analysis of sound objects: an adaptation of Pierre Schaeffer's typomorphology," Organised Sound, vol. 12, no. 2, p. 129, 2007.

[22] Wikipedia, "Spectral centroid," accessed June 21, 2017, Available at https://en.wikipedia.org/wiki/Spectral_centroid.

[23] Julius O. Smith and Jonathan S. Abel, "Bark and ERB bilinear transforms," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 6, pp. 697–708, 1999.

[24] Victor Lazzarini, Steven Yi, John ffitch, Joachim Heintz, Øyvind Brandtsegg, and Iain McCurdy, Csound - A Sound and Music Computing System, Springer, 2016.

[25] Eamonn Keogh, Selina Chu, David Hart, and Michael Pazzani, "An online algorithm for segmenting time series," in Proc. IEEE International Conference on Data Mining (ICDM 2001), 2001, pp. 289–296.

[26] Vito M. R. Muggeo, "Segmented: an R package to fit regression models with broken-line relationships," R News, vol. 8, no. 1, pp. 20–25, 2008.

[27] Frederic Font, Gerard Roma, and Xavier Serra, "Freesound technical demo," in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), New York, NY, USA, 2013, pp. 411–412, ACM.

[28] Brian McFee, Matt McVicar, Oriol Nieto, Stefan Balke, Carl Thome, Dawen Liang, Eric Battenberg, Josh Moore, Rachel Bittner, Ryuichi Yamamoto, Dan Ellis, Fabian-Robert Stoter, Douglas Repetto, Simon Waloschek, CJ Carr, Seth Kranzler, Keunwoo Choi, Petr Viktorin, Joao Felipe Santos, Adrian Holovaty, Waldir Pimenta, and Hojin Lee, "librosa 0.5.0," February 2017, Available at https://doi.org/10.5281/zenodo.293021.
ABSTRACT

The same music composition can be performed in different ways, and the differences in performance aspects can strongly change the expression and character of the music. Experienced musicians tend to have their own performance style, which reflects their personality, attitudes and beliefs. In this paper, we present a data-driven analysis of the performance style of two master violinists, Jascha Heifetz and David Fyodorovich Oistrakh, to find out their differences. Specifically, from 26 gramophone recordings of each of these two violinists, we compute features characterizing performance aspects including articulation, energy, and vibrato, and then compare their style in terms of the accents and legato groups of the music. Based on our findings, we propose algorithms to synthesize solo violin audio recordings of these two masters from scores, for music compositions that we either have or have not observed in the analysis stage. To our best knowledge, this study represents the first attempt to computationally analyze and synthesize the playing style of master violinists.

1. INTRODUCTION

Music performance is composed of two essential parts: the compositions of the composers (scores) and the actual interpretation and performance of the performers (sounds) [1]. A piece of music can be interpreted differently depending on the style of the performers, by means such as extending the duration of accents or manipulating the energy of individual notes in a note sequence.

Performance aspects that characterize the style of a violinist may include interpretational factors such as articulation and energy, and playing techniques such as shifting, vibrato, pizzicato and bowing. For example, for the articulation aspect of music, the time duration of an individual note (DR) and the overlap in time between successive notes (or key overlap time, KOT) in the audio performance have been found important in music analysis and synthesis [2]. These two features characterize the speed of the music and the link between successive notes in a note sequence. Maestre et al. [3] used them to develop a system for generating expressive audio from a real saxophone signal, whereas Bresin et al. [4] used them to analyze the articulation strategies of pianists.

Research has also been done concerning the techniques of violin playing. For instance, Matsutani [5] analyzed the relationship between performed tones and the force of bow holding. Ho et al. [6] proposed that good intonation can be achieved while performing vibrato by maintaining temporal coincidence between the intensity peak and the target frequency. Baader et al. [7] presented a quantitative analysis of the coordination between bowing and fingering in violin playing. These studies all collected scientific data, such as the location of the finger on the bow, finger shifting data, pitch values, and finger velocity, that may help amateur violinists improve their performing skills.

The analysis of musician style is also used in sound synthesis to improve the quality of synthetic music pieces. For example, Mantaras et al. [8] studied compositional, improvisational, and performance AI systems to characterize the expressiveness of human-generated music. Kirke et al. [9] studied factors including testing status, expressive representation, polyphonic ability, and performance creativity, to highlight possible future directions for media synthesis. Li et al. [10] analyzed how expressive musical terms of the violin are realized in audio performances and proposed a number of features to imitate human-generated music.

The style differences among violinists have been studied for a long time [11, 12, 13, 14]. For instance, Jung [15] presented an analysis of the playing style of three violinists: Jascha Heifetz, David Fyodorovich Oistrakh and Joseph Szigeti. He argued that the precision in the performance of Heifetz gives listeners the feeling that Heifetz is "cold" and "unemotional." In contrast, Oistrakh was described as "warm" and "capable of communicating emotional feelings." To some extent, the styles of Heifetz and Oistrakh can be considered as two extremes of a spectrum, making the comparison between them interesting.

In this paper, we present a step forward in expressive performance analysis and synthesis of violin music by learning from the music of these two masters, Heifetz and Oistrakh. To this end, we first compile a new dataset consisting of manual annotations of 26 excerpts from famous violin concertos that these two masters have both played (cf. Section 2). As Heifetz and Oistrakh are perceptually quite different in terms of the velocity and accents of their music, we choose to focus on the analysis of the articulation, energy, and vibrato aspects of the music in this dataset and propose a few features (cf. Section 3.1). In particular, we find that DR and KOT are indeed useful in characterizing the differences in the accents of their music (cf. Section 3.2). Then, we combine the vibrato synthesis procedure proposed by Yang et al. [16] with the proposed features to obtain synthetic sounds that imitate the style of these two violinists (cf. Section 4). We hope that this endeavor can bring us closer to the dream of reproducing the work of the two masters via expressive synthesis.

2. DATASET

To compare the styles of Heifetz and Oistrakh, we choose music pieces that both of them have played. In particular, we focus on the work recorded in the prime of their lives, for it is considered to be representative of their style. In addition, we pick only
to be the one with the next longest duration. For instance, a dotted
eighth note will be considered as a quarter note in our analysis.
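The next-longest-duration rule can be sketched as follows (a minimal Python sketch; the note-type table and the assumption that a quarter note equals one beat are ours, not stated in the paper):

```python
# Standard note types and their nominal lengths in beats,
# assuming a quarter note = 1 beat (our assumption).
STANDARD_TYPES = [("sixteenth", 0.25), ("eighth", 0.5),
                  ("quarter", 1.0), ("half", 2.0), ("whole", 4.0)]

def note_type(beats: float) -> str:
    """Map a nominal duration to the standard type with the next-longest
    duration, e.g. a dotted eighth (0.75 beats) is treated as a quarter."""
    for name, length in STANDARD_TYPES:
        if beats <= length:
            return name
    return "whole"  # anything longer is folded into the longest type
```

With this mapping, `note_type(0.75)` yields `"quarter"`, matching the dotted-eighth example in the text.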
where Offset_n and Onset_n are the offset and onset times of this note. The KOT, on the other hand, represents the overlap between two successive notes; a higher KOT indicates a faster transition between notes. We calculate the KOT for the n-th note by:

KOT_n = Offset_n − Onset_{n+1} . (2)

If there is no overlap in time between the n-th note and the next note, we can calculate the key detached time (KDT):

KDT_n = Onset_{n+1} − Offset_n . (3)

In such cases, we consider the KOT as the negative KDT.

The excerpt-level features of DR and KOT for an excerpt are composed of the mean value of DR and KOT for accents and non-accents for each type of note present in the excerpt. For example, the excerpt shown in Figure 3 only contains eighth and sixteenth notes. Therefore, the articulation parameters of this excerpt are computed by averaging the note-level DR and KOT values of the eighth-note accents, eighth-note non-accents, sixteenth-note accents, and sixteenth-note non-accents, respectively.

Please note that we do not normalize DR and KOT by the tempo of each excerpt, because we consider the tempo also to be an indicator of the expression style of a performer.

Figure 3: The spectrograms of Brahms' Violin Concerto, Mov. I, bars 129-132 performed by (a) Heifetz and (b) Oistrakh, with red solid vertical lines and green dashed lines indicating the onset and offset times of accents (red) and their preceding notes, which are typically non-accents (green), respectively. We can see that the KOTs (blue) of (a) Heifetz are larger than those of (b) Oistrakh.

3.1.2. Energy

Energy is an essential characteristic to distinguish playing style. Performers might follow the dynamic notations of the music score in their interpretation. However, oftentimes they will also choose to emphasize a particular note, phrase, or chord by playing it louder, according to their own opinion, to convey different expressions. In this paper, the note-level energy features are the mean energy of each individual note and the mean energy contour for each of the six types of notes. Besides, an excerpt-level feature, the accent energy ratio, is also considered.

Before calculating the energy features, we have to carry out energy normalization due to the different recording environments and equipment of the recordings. The normalization is performed by dividing the energy of an excerpt by its maximum value, making the range of the values [0, 1]. After that, we use the short-time Fourier transform with a Hanning window to calculate the energy contour of each note, based on its corresponding fundamental frequency F0, expressed in dB scale. To be specific, given the spectrogram M(t, k) of a note for each time frame t and frequency bin k, the energy contour of the note with respect to F0 is calculated by:

EC_F0(t) = 20 log(M(t, F0)) . (4)

Then, energy normalization is applied again to the EC_F0(t) of each note, to keep the maximal peak value equal to one. Lastly, the mean energy contour for a note type is computed by averaging the energy contours of all the notes belonging to that note type.

Moreover, the mean energy of an individual note, E_n, is calculated by summing the amplitudes of the spectrogram across all the frequency bins, expressed in dB scale, divided by the note length:

E_n = Σ_t 20 log(Σ_k M(t, k)) / note length . (5)

Finally, an excerpt-level feature, the accent energy ratio (AER), representing the energy ratio of accents to non-accents within an excerpt, is calculated by:

AER = (Σ_{i=1}^{p} EA_i / p) / (Σ_{j=1}^{q} ENA_j / q) , (6)

where EA is the mean energy of an accent, ENA is the mean energy of a non-accent, p is the number of accents, and q is the number of non-accents. For calculating EC_F0(t) and E_n, we use a frame size of 1,024 samples and a hop size of 64 samples.

3.1.3. Vibrato

Vibrato, the frequency modulation of the fundamental frequency, is another essential element in violin performance. We consider the two common note-level features, vibrato rate (VR) and vibrato extent (VE). VR represents the rate of vibrato, i.e. the number of periods in one second. VE represents the frequency deviation between a peak and its nearby valley. We adopt the vibrato extraction process presented by Li et al. [10]. To determine whether a
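To make the articulation and energy features concrete, Equations (2), (3) and (6) can be sketched as follows (a minimal Python sketch; the `Note` container and its field names are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class Note:
    onset: float    # onset time in seconds
    offset: float   # offset time in seconds
    accent: bool    # hypothetical accent annotation
    energy: float   # mean note energy E_n in dB

def dr(note: Note) -> float:
    """DR: time duration of an individual note."""
    return note.offset - note.onset

def kot(cur: Note, nxt: Note) -> float:
    """KOT (Eq. 2): overlap in time between successive notes.
    When the notes do not overlap, this equals the negative
    key detached time (KDT, Eq. 3)."""
    return cur.offset - nxt.onset

def aer(notes) -> float:
    """AER (Eq. 6): mean accent energy over mean non-accent energy."""
    acc = [n.energy for n in notes if n.accent]
    non = [n.energy for n in notes if not n.accent]
    return (sum(acc) / len(acc)) / (sum(non) / len(non))
```

For two notes with `offset = 0.5` and `onset = 0.45`, `kot` returns a positive 0.05 s overlap; for detached notes it goes negative, exactly as the KOT/KDT convention above prescribes.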
Figure 4: (a) KOT and (b) DR values of each individual note of the
same excerpt used in Figure 3. The circles in (a) indicate the notes
right before the accents, while in (b) the circles mark the accents.
3.2. Difference between Heifetz and Oistrakh

The musical notations accent and legato (slur) are considered in the comparison of articulation and energy features between Heifetz and Oistrakh. Specifically, we focus on the comparison of the KOTs of the notes right before accents, the DRs of accents, and the average energy curve of legato groups. Two excerpts with obvious characteristics are taken as running examples below.

3.2.1. KOTs of the notes right before accents

We observe that the KOTs of the notes right before the accents, which are typically non-accents, are important factors to distinguish Heifetz from Oistrakh. Let's take Brahms' Violin Concerto, Mov. I, bars 124-127 as an example. As shown in Figure 3, we mark three accents and the notes right before them by red solid lines and green dashed lines, respectively. According to equation (2), we mark the values of KOTs in blue horizontal lines. It can be seen that the KOTs of Oistrakh are smaller than those of Heifetz. This observation is actually in line with our subjective listening experience: Oistrakh often halts for a moment before playing the accents, while Heifetz, on the contrary, plays without a stop. Figure 4(a) shows the KOT of each note, with the circles representing the KOTs of the notes before accents. Apparently, not only such KOT values but also the mean KOT of Heifetz are higher than those of Oistrakh, implying that Heifetz performs at a faster tempo.

3.2.2. DRs of accents

From Figure 3, we can also observe that, for accents, the performance by Oistrakh has larger DRs than that by Heifetz. Besides, according to the legato notation of bar 130 on the score and its corresponding spectrogram, we can see that the DR of the accent takes a large proportion of the legato group, compared with those of the non-accents. In addition, from Figure 4(b), where we use circles to represent the DRs of accents, we have a similar finding. This indicates that Oistrakh tends to prolong the duration of accents in comparison to Heifetz. As for the non-accents, both Heifetz and Oistrakh use similar speed to perform them.

3.2.3. Average energy curve of legato groups

In addition to the energy contour of an individual note, the excerpt-level energy curve is also an important factor in playing style analysis. Let's take Sibelius' Violin Concerto, Mov. III, bars 133-135 as an example this time. As shown in Figure 5(a), the average energy of each individual note constitutes quite different energy curves for the performances of Heifetz and Oistrakh. We mark the maximal energy differences in green according to the maximal peak value and the minimal valley value within each of these two excerpt-level energy curves. We can see that Heifetz demonstrates an obvious contrast between the energy of the first two bars, whereas Oistrakh uses nearly the same energy for the two bars.

To analyze the energy variation during a set of notes, we use the legato notations on the score and partition all the notes of this excerpt into six legato groups and one non-legato group, which includes the first note and the last three notes. Figure 5(b) shows the mean energy of the six legato groups, from which we also see the gradual tendency of energy increase in the performance of Heifetz, which corresponds to the crescendo notation.

It is hard to say anything from the resulting legato-level energy curves, for they seem to differ legato by legato. To simplify our comparison between the style of the two masters, we take the average of the energy curves for all the legato groups and show
8. REFERENCES

[1] Sandrine Vieillard, Mathieu Roy, and Isabelle Peretz, "Expressiveness in musical emotions," Psychological Research, vol. 76, no. 5, pp. 641–653, 2012.
[2] Roberto Bresin, "Articulation rules for automatic music performance," in Proc. of the International Computer Music Conference (ICMC), 2001.
[3] Esteban Maestre, Rafael Ramírez, Stefan Kersten, and Xavier Serra, "Expressive concatenative synthesis by reusing samples from real performance recordings," Computer Music Journal, vol. 33, no. 4, pp. 23–42, 2009.
[4] Roberto Bresin and Giovanni Umberto Battel, "Articulation strategies in expressive piano performance: analysis of legato, staccato, and repeated notes in performances of the andante movement of Mozart's Sonata in G Major (K 545)," Journal of New Music Research, vol. 29, no. 3, pp. 211–224, 2000.
[5] Akihiro Matsutani, "Study of relationship between performed tones and force of bow holding using silent violin," Japanese Journal of Applied Physics, vol. 42, no. 6R, p. 3711, 2003.
[6] Tracy Kwei-Liang Ho, Huann-shyang Lin, Ching-Kong Chen, and Jih-Long Tsai, "Development of a computer-based visualised quantitative learning system for playing violin vibrato," British Journal of Educational Technology, vol. 46, no. 1, pp. 71–81, 2015.
[7] Andreas P. Baader, Oleg Kazennikov, and Mario Wiesendanger, "Coordination of bowing and fingering in violin playing," Cognitive Brain Research, vol. 23, no. 2, pp. 436–443, 2005.
[8] Ramon Lopez de Mantaras and Josep Lluis Arcos, "AI and music: From composition to expressive performance," AI Magazine, vol. 23, no. 3, p. 43, 2002.
[9] Alexis Kirke and Eduardo Reck Miranda, "A survey of computer systems for expressive music performance," ACM Computing Surveys, vol. 42, no. 1, p. 3, 2009.
[10] Pei-Ching Li, Li Su, Yi-Hsuan Yang, and Alvin W. Y. Su, "Analysis of expressive musical terms in violin using score-informed and expression-based audio features," in Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), 2015, pp. 809–815.
[11] Rafael Ramirez, Esteban Maestre, Alfonso Perez, and Xavier Serra, "Automatic performer identification in Celtic violin audio recordings," Journal of New Music Research, vol. 40, no. 2, pp. 165–174, 2011.
[12] Eitan Ornoy, "Recording analysis of J. S. Bach's G Minor Adagio for solo violin (excerpt): A case study," Journal of Music and Meaning, vol. 6, pp. 2–47, 2008.
[13] Heejung Lee, Violin Portamento: An Analysis of Its Use by Master Violinists in Selected Nineteenth-Century Concerti, VDM Publishing, 2009.
[14] Fadi Joseph Bejjani, Lawrence Ferrara, and Lazaros Pavlidis, "A comparative electromyographic and acoustic analysis of violin vibrato in healthy professional violinists," 1990.
[15] Jae Won Noella Jung, Jascha Heifetz, David Oistrakh, Joseph Szigeti: Their Contributions to the Violin Repertoire of the Twentieth Century, Ph.D. thesis, Florida State University, 2007.
[16] Chih-Hong Yang, Pei-Ching Li, Alvin W. Y. Su, Li Su, and Yi-Hsuan Yang, "Automatic violin synthesis using expressive musical term features," in Proc. of the 19th International Conference on Digital Audio Effects (DAFx), 2016.
[17] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka, "RWC music database: Database of copyright-cleared musical pieces and instrument sounds for research purposes," Transactions of the Information Processing Society of Japan (IPSJ), vol. 45, no. 3, pp. 728–738, 2004.
warm or bright. This shows that the distributions of points from each class are separable along PC 2. In the other PCs, these descriptors occupy very similar ranges, suggesting that the distinction between warm and bright is heavily loaded onto the second PC. To identify the salient features associated with each dimension, we correlate each feature vector with the first two PCs. The audio features with correlations which satisfy |r| > .7 and p < .001 are shown below.

PC 1: Irregularity (r = 0.985), Irregularity_p (r = 0.964), Kurtosis_p (r = 0.935), Skew_s (r = 0.929), Irregularity_h (r = 0.927), Kurtosis_h (r = 0.919), Skew_p (r = 0.890), Std (r = 0.873), RMS (r = 0.873), Skew_h (r = 0.865), Kurtosis_s (r = 0.835), Var (r = 0.812).

PC 2: Centroid_p (r = −0.855), Centroid_h (r = −0.853), Rolloff_s (r = −0.852), Std_h (r = −0.845), …

Where the subscript s denotes features extracted from a magnitude spectrum.

Term    C      Term    C      Term     C
warm    1.5    box     0.7    boom     0.4
deep    1.4    clear   0.5    tin      0.4
bright  1.1    thin    0.5    crunch   0.3
full    0.9    mud     0.4    harsh    0.2
air     0.8    fuzz    0.4    smooth   0.1
cream   0.8    rasp    0.4

Table 1: The confidence scores (C) for terms in the feature difference timbre space.

Figure 2: Transforms labelled with warm and bright in the feature difference timbre space.

2.2. Synonyms

… subjective evaluation. Terms with less than 4 entries are omitted for readability, and the distances between data points are calculated using the Ward criterion [16]. The clusters are illustrated in Figure 3, where the prefix E: represents transforms taken from the equaliser and D: represents transforms taken from the distortion. The position µ_{d,k} of a term d in the k-th dimension of the audio feature space is calculated as the mean value of feature k across all N_d transforms labelled with that descriptor, given in Eq. 1:

µ_{d,k} = (1 / N_d) Σ_{n=1}^{N_d} x̄_{d,n,k} . (1)
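Equation 1 is simply a per-dimension mean over the transforms carrying a given descriptor; it can be sketched as follows (plain Python, with hypothetical feature vectors rather than the SAFE data):

```python
def term_position(feature_vectors):
    """Eq. 1: position of a descriptor term in the audio feature space,
    i.e. the per-dimension mean of the feature vectors of the N_d
    transforms labelled with that term."""
    n_d = len(feature_vectors)
    dims = len(feature_vectors[0])
    return [sum(v[k] for v in feature_vectors) / n_d for k in range(dims)]
```

For example, two transforms with feature vectors [1.0, 2.0] and [3.0, 4.0] place the term at [2.0, 3.0].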
Figure 3: Clustering of descriptors from both the distortion and the equaliser.

The agglomerative clustering process demonstrates that the data points tend to cluster based on relative spectral energy. For example, terms that typically describe signals with increased high frequency energy such as bright, thin, tin and air all have low cophenetic distances to each other. Similarly, terms typically associated with increased low frequency energy such as boom, warm, deep and full fall within the same cluster. In this case, the cluster containing warm and the cluster containing bright are clearly separated, with a minimum cophenetic distance of 22.6.

3. PARAMETERISATION OF SPECTRAL CENTROID

Given the correlation between the warmth / brightness dimension (PC 2) and spectral centroid, we investigate methods for manipulating the feature directly, thus being able to increase brightness by increasing the centroid, or increase warmth by lowering the spectral centroid. A primitive method for moving the centroid towards a given frequency (µ) is to increase the spectral energy at that frequency, either by introducing a sinusoidal component to the signal or by using a resonant filter centred around µ. Whilst this works conceptually, it is destructive to the original structure of the spectrum: as the centroid is moved towards the desired frequency, the spectrum becomes dominated by a sinusoid or resonance at µ.

Less destructive methods include that used by Zacharakis et al. [12], where the spectrum is split into two bands, one above and one below the spectral centroid. The relative amplitudes of these bands can then be altered to manipulate the frequency of the spectral centroid. This more accurately preserves the original signal's structure, as no additional spectral components are introduced and the relative levels of partials within each band remain the same. Using this method, the new spectral centroid will lie somewhere between the respective centroids of the two bands. The relative gains of the two bands required to reproduce a given spectral centroid µ can be calculated using Equation 2. To facilitate precise control of the spectral centroid, these bands should not share any frequency components.

Σ_{n=c+1}^{N} a_n / Σ_{n=1}^{c} a_n = (µ_l − µ) / (µ − µ_u) ,  µ_l ≤ µ < µ_u or µ_u < µ ≤ µ_l , (2)

where µ_l and µ_u are the spectral centroids of the lower and upper bands, and c is the highest frequency spectral component in the lower band.

Alternatively, Brookes et al. [13] employ a spectral tilt to modify the spectral centroid, applying gain to the partials of a signal as a linear function of their frequency. This allows the spectral centroid to be altered in frequency whilst still retaining the frequency content of the signal. A disadvantage of this method is that the change in centroid cannot be easily parameterised, as it depends on the content of the signal being processed.

3.1. Proposed Model

We propose a more flexible method for directly manipulating the spectral centroid of the input signal using a nonlinear device (NLD). The effects of an NLD are more easily predicted for sinusoidal inputs, generating a series of harmonics of the input frequency. To ensure a sinusoidal input to the NLD, the system is restricted to processing only tonal signals with a single pitch, which can be represented by their fundamental frequency (f0). A low-pass filter is applied to isolate the f0, which is then processed with an NLD, generating a series of harmonics relevant to the signal. This is then high-pass filtered, leaving a band that consists solely of generated harmonics. A second band is generated by low-pass filtering the signal at the spectral centroid, and the relative levels are then adjusted in order to manipulate the frequency of the centroid µ_s. Separating the bands at the spectral centroid in this way ensures that their respective spectral centroids sit either side of the input's centroid after processing has been applied. In this instance, we generate harmonics in the high frequency band by applying Full Wave Rectification (FWR) to the isolated fundamental; this ensures the system is positive homogeneous. A schematic overview of the system is presented in Figure 4.

Figure 4: The system employed in the warmth / harshness effect. (x[n] feeds an f0 tracker and a µ tracker; the f0 path is low-pass filtered, full-wave rectified, high-pass filtered and scaled by mH; a low-pass branch at µ is scaled by mL; the two bands are summed to give y[n].)

This is conceptually similar to the two-band method proposed by [12], but allows the second band to contain frequency content which was not in the original signal. This method has advantages over manipulating the amplitudes of two bands using only linear filters. For example, the nonlinear device can still be used to reconstruct the high frequency portion of the signal and the relative gains adjusted similarly to if two filters were used. Alternatively, the properties of the NLD can be altered to change the upper band's centroid. This provides more flexibility, allowing the centroid to
be changed independently of some other features. For example, changing the gains of the two bands will change the spectral slope of the signal. If instead additional partials are introduced to the upper band, with amplitudes determined by the signal's current slope, the centroid can be changed whilst the slope is unaltered.

The effect is controlled using a single parameter p, which ranges from 0 to 1 and is used to calculate the respective gains mL and mH, applied to the low and high frequency bands using Equation 3:

mH = p^3
mL = 1 − mH . (3)

When p = 0 the output is a low-pass filtered version of the input signal, resulting in a lower spectral centroid than the input. This corresponds to transforms described as warm in the SAFE dataset. When p = 0.5 additional harmonic energy is introduced into the signal, meaning the transform should be perceived as bright. To achieve this, the exponent in p^n was set experimentally, so that the Mahalanobis distance between the bright cluster and the transform's feature differences is minimised. When p = 1 the output signal consists primarily of high-order harmonics, resulting in an extreme increase in spectral centroid. This is perceived as harshness, which in the SAFE dataset is defined as an increase in spectral energy at high frequencies, as shown in Figure 5.

Figure 5: Mean Bark spectra of transforms labelled warm, bright and harsh in the SAFE dataset. Here, Bark spectra are used to represent the spectral envelopes of the transforms, as these are the only spectral representations collected in full by the SAFE plug-ins due to data protection.

When transforms are applied using a distortion, bright and harsh are considered to be very similar, with a low cophenetic distance of 10.6 (see Figure 3); however, they exhibit some subtle differences in transforms taken from the equaliser, with a cophenetic distance of 16.5. This is demonstrated in Figure 1, where harsh sits below bright in PC 2.

4. MODEL VALIDATION

The performance of the effect is evaluated using a set of ten test signals, comprising two electric bass guitars (B1 and B2), a flute (F), two electric guitars (G1 and G2), a marimba (M), an oboe (O), a saxophone (S), a trumpet (T) and a violin (V). The signals were adjusted to have equal loudness prior to experimentation. Firstly, the effects are evaluated objectively by comparing them to the analysis performed in Section 2. Secondly, the effects are evaluated subjectively through a series of perceptual listening tests.

4.1. Objective Evaluation

The performance of the effect is evaluated objectively by examining how it manipulates the features of the test signals. The effect is used to process each of the signals with its parameter p set to 0 (warm), 0.5 (bright) and 1 (harsh). The audio features of the unprocessed and processed signals in each of these applications are calculated in the same manner as in the SAFE plug-ins. These audio features are then compared to those taken from the SAFE dataset.

Each combination of descriptor and test signal is measured to find the distance between changes in the feature space caused by the effect and points labelled with the descriptor in the feature difference timbre space. The performance of the effect with a particular parameter setting on a particular test signal is measured by projecting the extracted audio features to a point in the timbre space. The Mahalanobis distance, M(x, d), between this point x and the distribution of transforms labelled with the relevant term d is taken using Equation 4:

M(x, d) = sqrt( (x − µ_d)^T Σ_d^{−1} (x − µ_d) ) . (4)

As each term, d, is represented by two distributions of transforms, one from the distortion and one from the equaliser, the Mahalanobis distance from both distributions is taken and the minimum distance is considered the measure of performance.

4.2. Subjective Evaluation

To assess the performance of the effect subjectively, a series of perceptual listening tests was undertaken. For the purposes of testing, the warmth / brightness effect was implemented as a DAW plug-in. Participants were presented with a DAW session containing a track for each of the test signals. The plug-in was used on each track with a simple interface labelled "Plug-In 1", shown in Figure 6. To mitigate the influence of the plug-in's interface layout on the result of the experiment, the direction of the parameter slider was randomised for each participant. The order of the tracks in the DAW session was also randomised to mitigate any effect the order of tracks may have on results.

For each signal, participants were asked to first audition the effect to become accustomed to how changing the parameter value affects that particular input signal. Once they had investigated the operation of the effect they were asked to label the parameter slider at three positions (p equal to 0, 0.5 and 1) with a term they
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
felt best described the timbre of the effect at that parameter setting. A list of available terms was provided in a drop-down list at each of the 3 parameter values to be labelled, pictured in Figure 6. The available terms were airy, boomy, boxy, bright, clear, creamy, crunchy, deep, full, fuzzy, harsh, muddy, raspy, smooth, thin, tinny and warm. These were chosen for their confidence scores and number of entries in the SAFE dataset. For each combination of signal and parameter positions, there is an intended descriptor (those the effects were designed to elicit) and a descriptor provided by the participant.

Figure 6: The interface used for assessing the performance of the warmth / brightness effect.

We compare the participant responses against the hierarchical clustering performed in Section 2.2. The dendrogram shown in Figure 3 provides information about how similar the transforms described by certain terms are. This information can be used as a metric describing the proximity of the users' responses to the intended response. The proximity of two descriptors is measured as the cophenetic distance between the clusters in which the two descriptors lie. Where a descriptor appears twice in the dendrogram (from both the distortion and equaliser) the combination of points which yields the lowest cophenetic distance is used. All listening tests were undertaken using circumaural headphones in a quiet listening environment. In total 22 participants took part in the listening tests, all of whom reported no known hearing problems. On average participants took 25 minutes to complete the test.

5. RESULTS / DISCUSSION

Figure 7: Mahalanobis distances for the warmth / brightness effect (distance against each test signal B1, B2, F, G1, G2, M, O, S, T and V, for the descriptors warm, bright and harsh).

… to be more statistically significant.

5.2. Listening test Results

The mean cophenetic distances between the participants' annotations of the effect's parameters and the descriptors warm, bright and harsh taken from the SAFE dataset are shown in Figure 8. Here the error bars represent the 95% confidence intervals for each mean. To show the performance of each instrument sample, markers on the y-axis indicate the cophenetic distances that correspond to the cluster heights for the groups containing bright from the distortion, bright from the EQ, and warm from both plug-ins, as per Figure 3.

[Figure 8: plot residue; cophenetic distance against test signal for warm, bright and harsh, with y-axis cluster-height markers warmth, D:bright and E:bright.]
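The distance measure of Equation 4 is straightforward to compute. A minimal NumPy sketch follows; the cluster data here are placeholders, not the SAFE feature differences:

```python
import numpy as np

def mahalanobis(x, cluster):
    """Mahalanobis distance M(x, d) of Equation 4 between a projected
    feature point x and a cluster of transforms labelled with term d."""
    mu = cluster.mean(axis=0)            # cluster mean, mu_d
    cov = np.cov(cluster, rowvar=False)  # cluster covariance, Sigma_d
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Toy 2-D feature-difference cluster standing in for one descriptor.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(50, 2))
near = mahalanobis(cluster.mean(axis=0), cluster)  # zero: at the centroid
far = mahalanobis(np.array([5.0, 5.0]), cluster)   # large: far from the cluster
```

Where a term has both a distortion and an equaliser distribution, both distances would be computed and the minimum kept, as described above.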
        deep  smooth boom  full  air   cream warm  mud   box   clear fuzz  bright rasp  thin  tin   crunch harsh
bright  0.01  0.01   0.01  0.02  0.04  0.05  0.03  0.05  0.02  0.03  0.25  0.10   0.18  0.05  0.03  0.09   0.04
harsh   0.00  0.00   0.00  0.00  0.01  0.01  0.00  0.01  0.01  0.00  0.24  0.02   0.08  0.10  0.10  0.17   0.24
warm    0.06  0.09   0.04  0.04  0.06  0.05  0.20  0.10  0.12  0.13  0.01  0.02   0.00  0.05  0.01  0.00   0.00

Figure 9: A matrix showing subjective mis-classifications from the warmth / brightness effect, organised by the frequency of the mis-classifications across the effect's intended terms.
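The cophenetic distances used as the proximity metric between a participant's term and the intended descriptor can be read off a SciPy dendrogram. A hedged sketch, with made-up per-term feature vectors standing in for the SAFE data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

# Hypothetical per-term feature vectors (placeholders, not the SAFE data).
terms = ["warm", "deep", "bright", "harsh"]
feats = np.array([[0.10, 0.20], [0.15, 0.25], [0.90, 0.80], [1.00, 0.95]])

Z = linkage(feats, method="average")  # hierarchical clustering (dendrogram)
_, coph = cophenet(Z, pdist(feats))   # cophenetic distances, condensed form
D = squareform(coph)                  # D[i, j]: height at which i and j merge

# Proximity of a participant's response to the intended descriptor:
d_same = D[terms.index("bright"), terms.index("harsh")]  # merge low in the tree
d_diff = D[terms.index("warm"), terms.index("harsh")]    # merge near the root
```

Terms that fall in the same cluster merge low in the tree and so have a small cophenetic distance, matching the use of the metric in the listening-test analysis.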
jective accuracies, however two of the instrument samples (oboe and trumpet) have mean distances which are larger than the cluster heights, suggesting there is more ambiguity in their definition.

Figure 9 shows a mis-classification matrix comparing the usage of terms by test participants and the terms that the effects were designed to produce. Each cell in the matrix represents the frequency with which each of the available descriptors (bottom) was used to describe the corresponding timbral effect at a given parameter position. Above the figure is a dendrogram representing the clustering of terms based on their frequency of usage.

From the figure, it is clear that warm is often correctly assigned to the intended transform, but summing the cells of the row shows that participants only used descriptors in the same cluster as warm 54% of the time. This suggests that the addition of low frequency energy to the signal does not necessarily invoke the use of synonyms of warm. When describing the effect, participants used a term related to the intended descriptor 74% of the time for bright and 80% of the time for harsh, suggesting that these transforms were perceptually more similar to the transforms in the dataset.

6. CONCLUSION

We first present empirical findings from our analysis of a dataset of semantically annotated audio effect transforms, and show that the warmth / brightness dimension can be loaded heavily onto a single PC. The variance of this PC is explained predominantly by the centroid of the peak, harmonic peak, and magnitude spectra. We then present a model for the manipulation of timbral features using a combination of filters and nonlinear excitation, and show that the model is able to manipulate the respective warmth and brightness of an audio signal. We verify this by performing objective and subjective evaluation on a set of test signals, and show that subjects describe the transforms with synonymous descriptors 54%, 74% and 80% of the time for warmth, brightness and harshness respectively.

By using an NLD component in the feature manipulation process, we are able to increase the flexibility of the timbral modifier. This is because other audio features can be preserved whilst the spectral centroid is modified independently. Conversely, the algorithm is currently limited to pitched monophonic sound sources due to its reliance on tracking the f0 of the input signal. This issue will be addressed in future work.

7. REFERENCES

[1] J. M. Grey, “Multidimensional perceptual scaling of musical timbres,” Journal of the Acoustical Society of America, vol. 61, no. 5, pp. 1270–1277, 1977.

[2] A. Zacharakis, K. Pastiadis, J. D. Reiss, and G. Papadelis, “Analysis of musical timbre semantics through metric and non-metric data reduction techniques,” in Proceedings of the 12th International Conference on Music Perception and Cognition (ICMPC12), 2012, pp. 1177–1182.

[3] M. Sarkar, B. Vercoe, and Y. Yang, “Words that describe
ABSTRACT

This paper contains the results of a study making use of a set of B-format soundscape recordings, presented in stereo UHJ format as part of an online listening test, in order to investigate the relationship between soundscape categorisation and subjective evaluation. Test participants were presented with a set of soundscapes and asked to rate them using the Self-Assessment Manikin (SAM) and in terms of three soundscape categories: natural, human, and mechanical. They were also asked to identify the important sound sources present in each soundscape. Results show significant relationships between soundscape categorisation and the SAM, and indicate particularly significant sound sources that can affect these ratings.

1. INTRODUCTION

This paper presents the results of a study investigating the relationship between soundscape preference ratings made using the self-assessment manikin (SAM) [1] and three category ratings (natural, human, and mechanical). Following previous studies, the SAM has now been established as an appropriate and powerful tool for the evaluation of soundscapes: for example in [2], where results from the SAM were compared with the use of semantic differential (SD) pairs; or in [3], where it was used to establish the ecological validity of stereo UHJ conversions of B-format soundscape recordings.

This paper presents additional results from the study presented in [3], exploring further the utility of the SAM in soundscape evaluation and research. Here the purpose was to investigate the relationship between soundscape preference ratings, and soundscape sound source identification and categorisation.

A soundscape can be considered as the aural equivalent of a landscape [4], and soundscape research is a field where environmental recordings are used and analysed to give an insight into a location's acoustic characteristics beyond noise level measurement alone. A soundscape can therefore be considered as being comprised of a set of individual sound sources coming together in a particular context to form an environmental experience.

These sound sources can broadly be divided between three categories: natural, human, and mechanical. This test was designed in order to see how the presence of particular sounds belonging to each of these categories can affect the subjective rating of a soundscape in terms of emotional experience, and in terms of how the soundscape as a whole is perceived as belonging to each of these three categories.

It is hypothesised firstly that the soundscapes that are rated as being more mechanical will exhibit low valence and high arousal in the SAM results, and that highly natural soundscapes will exhibit high valence and low arousal. Soundscapes highly rated in the human category are expected to exhibit high arousal, with valence determined by contextual cues from the other sound sources present.

In terms of the sound sources identified, it is hypothesised that birdsong and traffic will be the most commonly identified sources, and that traffic noise and human activity (e.g. conversation or footsteps), when present, will have a significant effect on the category rating.

This paper starts with a presentation of the methods used in the study, including the collection of the soundscape data used in the test, the stereo UHJ conversion process, and the development of the test procedure, including the SAM and category questions. The results of the test are then presented, starting with a brief overview of the demographics of the test participants. The SAM results are then examined, after which a summary of the category ratings for each presented soundscape is presented alongside the sound sources identified. Correlational analysis is then used to indicate the relationships between the different metrics: firstly comparing the SAM results with the category ratings; then comparing category ratings with the percentages of the sound sources identified in each case that belong to each of the three categories. The paper concludes with a consideration of avenues for further research.

2. METHODS

This section begins with an examination of the soundscape data collected for this study, including the motivation behind the choice of locations where the recordings were made and the contents of those recordings. The conversion of the recorded B-format soundscapes to stereo UHJ format is considered before a presentation of the assessment methods used in the listening test: the SAM, the soundscape categorisation and sound source identification questions, and the overall test procedure.

2.1. Data Collection

The data used in this study were collected from various locations around Yorkshire in the United Kingdom, including: Dalby forest, a natural environment; Pickering, a suburban/rural environment; and Leeds city centre, a highly developed urban environment. All of the soundscape recordings were made in B-format using a Soundfield STM 450 microphone [5].

Table 1 gives details of the sound sources present in each of the 16 clips used in the listening test. Each of these clips was 30 seconds long and extracted from the 10 minutes of soundscape recording made at each location. These clips were used in their original B-format in a previous study [2].
Table 1: Details of the sound sources present in the two 30 second long clips (labelled A and B) recorded at each of the eight locations.

2.2. Stereo UHJ Conversion

In order to present the recorded soundscape material online, the B-format signals had to be converted to a suitable two-channel format. It was decided to make use of the UHJ stereo format, where the W, X, and Y channels of a B-format recording are used to translate the horizontal plane of the soundfield into two-channel UHJ format [6]. This is a two-channel format that can be easily shared online and reproduced over headphones, allowing the B-format recordings presented in a previous study to be used here with the spatial content of the W, X, and Y channels preserved in reproduction.

The following equations are used to convert from the W, X, and Y channels of the B-format signal to two stereo channels:

S = 0.9397W + 0.1856X    (1)
D = j(−0.342W + 0.5099X) + 0.6555Y    (2)
L = 0.5(S + D)    (3)
R = 0.5(S − D)    (4)

where j is a +90° phase shift and L and R are the left and right channels respectively of the resultant stereo UHJ signal [7]. Note that the Cartesian reference for B-format signals is given by ISO standard 2631 [8], and the Z channel of the B-format recording is not used.

2.3. Assessment

This section will cover the assessment methods used in this study, starting with the SAM, followed by the category rating and sound source identification questions. This section concludes with a summary of the overall test procedure.

2.3.1. The Self-Assessment Manikin

A previous study [2] made a direct comparison between semantic differential (SD) pairs and the Self-Assessment Manikin (SAM) as methods for measuring a test participant's experience of a soundscape.

The use of SD pairs is a method originally developed by Osgood to indirectly measure a person's interpretation of the meaning of certain words [9]. The method involves the use of a set of bipolar descriptor scales (for example ‘calming-annoying’ or ‘pleasant-unpleasant’) allowing the user to rate a given stimulus. SD pairs are a well-established aspect of listening test methodology in soundscape research [10–17]. Whilst useful in certain scenarios, they can be time-consuming and unintuitive [2]. An alternative subjective assessment tool is the SAM.

The SAM is a method for measuring emotional responses developed by Bradley and Lang in 1994 [18]. It was developed from factor analysis of a set of SD pairs rating both aural [19, 20] and visual stimuli [1] (using, respectively, the International Affective Digital Sounds database, or IADS, and the International Affective Picture System, or IAPS). The three factors developed for rating emotional response to a given stimulus are:

• Valence: How positive or negative the emotion is, ranging from unpleasant feelings to pleasant feelings of happiness.

• Arousal: How excited or apathetic the emotion is, ranging from sleepiness or boredom to frantic excitement.

• Dominance: The extent to which the emotion makes the subject feel they are in control of the situation, ranging from not at all in control to totally in control.

These results were then used by Bradley and Lang to create the SAM itself as a set of pictorial representations of the three identified factors. The version of the SAM used in this experiment (as shown in Fig. 1) contained only the Valence and Arousal dimensions, following results from a previous study [2].

2.3.2. Category Rating

The soundscape recordings used in this test (as detailed in Table 1) were selected in order to cover as wide a range of soundscapes as possible. In order to determine what such a set of soundscape recordings would contain, a review of soundscape research indicated that in a significant quantity of the literature [17, 21–25] three main groups of sounds are identified:
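The stereo UHJ encode of Equations 1–4 (Section 2.2) is simple to prototype offline. In this hedged Python sketch the +90° phase shift j is realised via the analytic signal, one common offline approach assumed here rather than taken from the paper; the B-format test signals are toy placeholders:

```python
import numpy as np
from scipy.signal import hilbert

def phase_shift_90(x):
    # A +90 degree phase shift: cos(wt) -> cos(wt + 90deg) = -sin(wt),
    # i.e. the negated imaginary part of the analytic signal.
    return -np.imag(hilbert(x))

def bformat_to_uhj(W, X, Y):
    """Encode the W, X and Y channels of a B-format signal to
    two-channel stereo UHJ, following Equations 1-4 (Z is unused)."""
    S = 0.9397 * W + 0.1856 * X                                # Eq. (1)
    D = phase_shift_90(-0.342 * W + 0.5099 * X) + 0.6555 * Y   # Eq. (2)
    L = 0.5 * (S + D)                                          # Eq. (3)
    R = 0.5 * (S - D)                                          # Eq. (4)
    return L, R

# Toy B-format frame: a single tone panned slightly off-centre.
n = np.arange(1024)
sig = np.sin(2 * np.pi * 16 * n / 1024)
W, X, Y = sig / np.sqrt(2), 0.9 * sig, 0.4 * sig
L, R = bformat_to_uhj(W, X, Y)
```

Note that by construction L + R = S, the sum signal, which gives a quick sanity check on any implementation.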
2.4. Test Procedure

The listening test in this study was presented online using Qualtrics [26], and the overall procedure was as follows:

1. An introductory demographics questionnaire including questions regarding age, gender, nationality, and profession. From the results of previous studies [2, 3] it is not expected that any of these demographic factors will have an effect on the results.

2. An introduction to the questions accompanying each soundscape, covering the SAM, the categorisation question, and the sound source identification entry field. Two example

Fig. 4 shows a summary of the category ratings results. As with the SAM results, it is clear from this plot that the selected soundscapes cover a wide range of ratings in each category.

3.3. Sound Source Identification

Fig. 5 shows a pie chart of all of the sound sources identified by test participants. In total there were 1369 sound source instances identified, which contained 24 unique sound sources (as listed in Fig. 5). The overall breakdown of these instances between the three categories is as follows: 38.9% natural, 26.9% human, and 34.2% mechanical.
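Tallying identified sources into the three categories is a simple counting exercise. A sketch with a hypothetical source-to-category map and tally (the study itself recorded 1369 instances of 24 unique sources):

```python
from collections import Counter

# Hypothetical mapping from identified source to category.
CATEGORY = {"birdsong": "natural", "wind": "natural",
            "conversation": "human", "footsteps": "human",
            "traffic": "mechanical", "aeroplane": "mechanical"}

# Hypothetical identification responses for one clip.
identified = ["birdsong", "traffic", "birdsong", "conversation", "traffic"]

counts = Counter(CATEGORY[s] for s in identified)
total = sum(counts.values())
percentages = {cat: 100.0 * n / total for cat, n in counts.items()}
# percentages -> {'natural': 40.0, 'mechanical': 40.0, 'human': 20.0}
```

The same per-clip percentages are the metric compared against the category ratings later in the paper.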
[Figure 3: plot residue; mean Valence and Arousal values (−2 to 2) for each soundscape clip, 1A–8B.]

[Figure 4: plot residue; mean category ratings (Nat, Hum, Mec) for each soundscape clip, 1A–8B.]

[Figure 6: plot residue; percentage breakdown (0–100%) of identified sound sources into the Natural, Human and Mechanical categories for each soundscape clip, 1A–8B.]
In order to make a comparison between the sound sources identified and the categorisation of each clip, it was decided that the sound sources identified for each clip should be grouped by category, and the number of sound source instances in each of these categories expressed as a percentage of the total number of sound source instances in each case. Fig. 6 shows the sound source category percentage breakdown for each soundscape recording.

Fig. 7 shows scatter plots of the mean category ratings against the mean valence and arousal ratings. The result of a Pearson's correlation analysis [27] of the data presented in these plots can be seen in Table 2.

Table 2: Correlation coefficients and p-values comparing the valence results with category ratings.
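The correlation analysis behind Table 2 can be reproduced with SciPy's `pearsonr`, which returns both the coefficient R and its p-value. A sketch with synthetic per-clip means (placeholders, not the study's data):

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic mean ratings for 16 clips, with a built-in positive trend.
rng = np.random.default_rng(1)
natural = rng.uniform(-2.0, 2.0, 16)                # mean natural ratings
valence = 0.8 * natural + rng.normal(0.0, 0.3, 16)  # correlated valence means

r, p = pearsonr(natural, valence)
# Under the convention used here, p <= 0.05 would be reported as a
# significant correlation and p <= 0.01 as highly significant.
```

Each cell of Tables 2 and 3 corresponds to one such (R, p) pair for a pair of per-clip variables.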
Figure 7: Scatter plots showing the mean category and SAM ratings for each clip. The ellipses surrounding each data point indicate the
standard error associated with each mean value. Trend lines have been included to reflect the correlation results shown in Table 2.
relationship where p ≤ 0.05 it indicates a statistically significant correlation [27].

The natural and mechanical category ratings show a very strong set of correlations with the SAM results, where the natural category is shown to have a highly significant positive correlation with the valence dimension, and a highly significant negative one with the arousal dimension. The mechanical category exhibits the opposite relationship with each dimension. From this it can be said that the locations rated as highly natural are typically rated as pleasant and relaxing, and locations that are rated as highly mechanical are unpleasant and invoke feelings of stress and anxiety. These are the emotions indicated by these SAM results when considering the circumplex model of affect [28].

The human category rating results also show a (just-significant) negative correlation with valence, and a positive correlation with arousal. This is perhaps due to the fact that most of the soundscapes that include human sounds also contain mechanical sounds, particularly traffic noise. Examining the top middle plot in Fig. 7 gives an indication of why the relationship between the human category ratings and valence is only just significant. Whilst the lowest human category ratings correspond to high valence, as the human category rating increases there is a ‘dip’ where the valence values for human ratings around 1 are in fact lower than those for higher human ratings. This is due to a disproportionate number of the soundscapes containing very few or no human sounds (i.e. clips 1A-2B, as seen in Fig. 6).

The results shown in Table 2 support the hypotheses laid out in Section 1 detailing the expected relationships between the three category ratings and the SAM.

3.4.2. Category ratings and categorised sound source percentages

Table 3 shows the correlation results indicating the relationships between the category ratings and the sound source percentage results. The following section includes a discussion of these and the other results presented in this section.

4. DISCUSSION

This section contains a discussion of the correlation results presented in Table 3, considering the following three areas of the correlation results in turn:

• Comparison of categorisation results.

• Comparison of the categorised sound sources.

• Comparison of categorisation results with the categorised sound sources.
Table 3: Correlation results for the category ratings and sound source percentage results. Indicated here are the R values. The numbers
in boldface indicate correlation results where p ≤ 0.05 (a significant result); and the presence of an asterisk indicates p ≤ 0.01 (a highly
significant result).
Following this discussion, the key sound sources and soundscapes that explain these results are identified and examined in further detail.

There is a significant negative correlation between natural and human, and natural and mechanical ratings, indicating that where soundscapes were rated as more natural they were also rated as less human and less mechanical. Perhaps surprisingly, there is not a significant relationship between the human and mechanical category ratings.

4.2. Sound source percentage results

The correlation results comparing the sound source category percentage results in Table 3 show that the only significant relationship is the negative correlation between the percentage of sound sources identified that are in the natural category, and those belonging to the human category. This is indicated by the sound source categorisation results for clips 6A and 6B as shown in Fig. 6.

For both of these clips the overwhelming majority of sound sources identified (> 70%) were in the human category, with the remaining sound sources belonging to the mechanical category and no identified sounds in the natural category. Likewise the results for clips 1A-2B contain very high percentages of natural sound sources, some mechanical sources, and no human ones.

4.3. Categories and percentages

There is a significant correlation between the category ratings and the percentage of identified sound sources for each of the three categories. This gives some indication that the percentage breakdown of the identified sound sources into each category is a valuable metric that in some way reflects the overall categorisation of a soundscape.

The natural category rating shows significant negative correlation with the human and mechanical sound source percentage metrics, as does the natural sound source percentage metric with the human and mechanical category ratings. The two pairs of variables that show no significant correlation in this group are the human category rating and mechanical percentage metric, and the mechanical category rating and the human percentage metric.

4.4. Key Sound Sources

As identified in Fig. 5, the three most commonly identified sound sources were traffic noise, birdsong, and conversation; sound sources belonging to the mechanical, natural, and human categories respectively. This section will consider each one in turn and see how the presence of these sound sources in the soundscapes has impacted on the SAM and category ratings of those soundscapes.

4.4.1. Birdsong

Table 1 indicates that the soundscape clips recorded in Dalby Forest contain many instances of birdsong. As can be seen in Fig. 4, these soundscapes (clips 1A-2B) were rated as most belonging to the natural category, and, as shown in Fig. 3, received the highest valence and lowest arousal ratings of all the soundscape clips. This result is in accord with the correlation results presented in Table 2 that indicate the same relationship between higher natural category ratings and the SAM results.

It is also interesting to note the difference in category rating between clips 7B and 8A. These clips were recorded at two nearby locations in Leeds: clip 7B was recorded next to a busy road, and clip 8A was recorded in the middle of a small park. These two clips are fairly similar, but the presence of birdsong in the latter is partly responsible for a significant difference in the natural category rating, which contributes to the relative increase in valence rating (and decrease in arousal rating) as seen in Fig. 3.

4.4.2. Traffic

The correlation between the mechanical category ratings and the SAM results (negative for valence, positive for arousal), as indicated by Table 2, is shown clearly in the difference between SAM results and category ratings for clips 2A and 2B. These two clips were recorded in the same location next to a lake in a forest: clip 2A contains the recording of a single car driving on a nearby road where clip 2B does not. As shown in Fig. 4, the presence of this car is responsible for a big increase in the mechanical category ratings, and a big decrease in the natural category rating, for clip 2A relative to clip 2B.

Fig. 3 shows the effect of this car on the SAM ratings for this clip. It is interesting to note how big a difference the presence of a single car can make. This same pattern can be seen to a lesser extent in the results for clips 1A and 1B. These were both recorded at different positions in the forest next to a footpath away from any roads. In this case it is the presence of an aeroplane passing overhead in clip 1B that is not present in clip 1A that results in a similar pattern of differences between the two clips.

It is also worth noting that the clips recorded at roadside locations were rated as having the lowest valence levels (clips 4A-5A and 7A-7B), and that these clips were also rated as most belonging to the mechanical environment category.
4.4.3. Conversation

The most significant sound source identified as belonging to the human category was conversation. Its presence in clip 3B and absence in clip 3A, whilst not responsible for any appreciable difference in SAM results, produced a substantial change in category ratings, resulting in a much higher human category rating for clip 3B. Fig. 6 also indicates the perceptual dominance of conversation when present in a soundscape recording: clips 6A and 6B are the only soundscapes for which no natural sound sources were identified by test participants, despite some birdsong and wind noise being present in those recordings.

5. SUMMARY

This paper has presented the results of an online listening test presenting stereo UHJ renderings of B-format soundscape recordings. Test participants were asked to describe the emotional state evoked by those soundscapes using the SAM, to rate the extent to which each soundscape belonged in three environmental categories, and to identify the key sound sources present in each one.

The purpose of this test was to identify any relationships between the subjective assessment of a soundscape and the extent to which that soundscape is perceived to be natural, human, and mechanical. A secondary aim was to use the sound sources identified by test participants to determine which sources are most important to soundscape perception, both in terms of subjective experience and category rating.

Correlational analysis indicated strong relationships between the natural and mechanical category ratings and the valence and arousal dimensions of the SAM. The natural category ratings were shown to be positively correlated with valence, and negatively correlated with arousal. The mechanical category ratings showed the opposite correlations in each case. The human category ratings showed similar but weaker correlations than the mechanical category ratings, with some ambiguity in the relationship between the human category ratings and the valence dimension of the SAM. These results support the study's hypotheses, which predicted a negative emotional response to highly mechanical soundscapes and a positive emotional response to highly natural soundscapes. The hypothesis regarding emotional response to soundscapes rated as highly human was also supported, with a clear correlation between human rating and arousal and a more ambiguous relationship with valence.

These results indicate that more natural soundscapes invoke a relaxation response (low arousal, high valence), and that more mechanical soundscapes invoke a stress response (high arousal, low valence). The relative lack of significance in the human category rating results indicates a contextual dependence: human sounds are not necessarily disliked or stressful, but are mainly present in soundscapes where traffic noise (i.e. mechanical sound) is also present.

An analysis of the sound sources identified by test participants has also been presented, which indicated a key sound source for each category: conversation, birdsong, and traffic. The identification of these sources supports the hypothesis predicting which sound sources would be most commonly identified, and as such would be the most useful predictors of soundscape categorisation. The presence and absence of these sources in the presented soundscapes was examined by studying particular examples. This examination demonstrated where these sound sources in particular produced differences in the SAM and category ratings consistent with the correlational relationships previously identified.

Future work will include a listening test presenting the same soundscapes used in this study alongside spherical panoramic images of the recording sites. Test participants will again be asked to rate their experience of the environment using the SAM, to categorise the environment, and to identify the key aural and visual features. This test should produce results that answer the following questions:

• Will the presence of different visual features change the SAM results, indicating a change in the evoked emotional states?

• Will the key visual features identified be the same as the aural features?

• Will the presence of visual features change which aural features are identified by test participants?

• Will the presence of visual elements change the category ratings for the presented environments?

The results presented in this paper certainly confirm the SAM as a powerful tool for soundscape analysis, with soundscape categorisation and sound source identification offering suitable methods to pair with the SAM for further insight into which aural features can most dramatically affect emotional responses to environmental sound.

6. ACKNOWLEDGMENTS

This project is supported by EPSRC doctoral training studentship reference number 1509136.
ABSTRACT

This paper describes the development and application of a machine listening system for the automated testing of implementation equivalence in music signal processing effects which contain a high level of randomized time variation. We describe a mathematical model of generalized randomization in audio effects and explore different representations of the effect's data. We then propose a set of classifiers to reliably determine if two implementations of the same randomized audio effect are functionally equivalent. After testing these classifiers against each other and against a set of human listeners we find the best implementation and determine that it agrees with the judgment of human listeners with an F1-Score of 0.8696.

1. INTRODUCTION

When Eventide began development of their new H9000 effects processor, a primary task was the porting of a large signal processing code base from the Motorola 56000 assembly language to C++ targeted to several different embedded processors. While industry-standard cancellation testing was possible for deterministic effects, most of the effects involved some amount of random time variation, causing test results to diverge significantly from the judgment of human listeners. This created a problem when ascertaining that the thousands of presets used in the H8000 were working correctly on the H9000. To solve this problem we developed a machine listening system to replace the first stage of human listening in determining whether two different implementations sound equivalent.

Comparing the similarity of two audio signals is done in many contexts. The field of audio watermarking aims to embed a unique code in an audio stream such that the code is imperceptible to the human ear and is recoverable after the application of several different auditorily transparent transformations such as time scale modification [1][2]. Several different techniques are used to quantify how perceptible the added code is to the listener. These include the Perceptual Audio Quality Measure (PAQM), the Noise to Mask Ratio (NMR), and the Perceptual Evaluation of Audio Quality (PEAQ) [3]. However, while these methods make a perceptual comparison between two recordings, the comparison is primarily interested in determining if two signals sound exactly the same. Unfortunately, these comparisons are not useful to us, as we need to determine if two sets of unique signals are generated by the same random process, even when these sets of signals vary significantly inside each set.

There have also been several attempts to reverse engineer the specific settings of audio effects [4][5][6]; however, these systems are often aimed at specific effect types and rely on a specific model. At the most general, Barchiesi and Reiss assume that the effects are linear and time-invariant [4], which we cannot assume here.

The fields of audio query by example and music recommendation systems have a looser definition of audio similarity which might have some applications to our problem. A comparison of some systems which aim to define this more general similarity is given by Pampalk et al. in [7]. A particularly interesting implementation from Helén et al. [8] uses a perceptual encoder followed by a compression algorithm to determine similarity by the compression ratio of the signals separately vs. combined. Another interesting probabilistic approach is given by Virtanen [9]. These models may have the capacity to model the variation we expect between and within the sets of signals we are evaluating; however, their implementations are very resource intensive, and the nature of these problems is that there is no ground truth by which to compare the algorithms to each other. We suspect that we might be able to solve our problem with a more compact model.

In Section 2 we give a deeper description of the problem. Section 3 describes the theory behind our solution. Section 4 mentions some practical considerations of our system. Section 5 describes the validation experiments we ran, and Section 6 gives results of these experiments and some analysis. Section 7 gives a brief conclusion, Section 8 some acknowledgments, and Section 9 includes references.

2. PROBLEM DESCRIPTION

When testing whether two audio signal processing implementations are functionally equivalent it is common to send a test signal x(t) through each of them and compare the output signals y1(t) and y2(t) [10]. If the magnitude difference z(t) between these two outputs is less than 2η, for some very small η, the two implementations can be considered equivalent. This can be written as

yn(t) = fn(x(t)) (1)

z(t) = |y1(t) − y2(t)|, (2)

z(t) < 2η; ∀t ∈ T, (3)

where fn(x) represents the nth distinct implementation being tested.

This test works well for musical effects that can be characterized by a deterministic system plus a small residual, as long as the system memory is less than the measurement time scale T. This is true because this test relies on the implicit signal model

fn(x(t)) = f(x(t)) + ηNn(t), (4)
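The classical cancellation test of Equations (1)–(3) can be sketched in a few lines. The following is a minimal illustration only, not the production test harness: the gain effect, the function names, and the tolerance η are invented for the example, with the "port" modeled as a single-precision reimplementation of a double-precision reference.

```python
import numpy as np

def impl_reference(x, g=0.5):
    # f_1: a simple gain effect computed in double precision.
    return g * x

def impl_ported(x, g=0.5):
    # f_2: the same effect "ported" to single precision, which
    # introduces small quantization/truncation errors q_n(t), t_n(t).
    return (np.float32(g) * x.astype(np.float32)).astype(np.float64)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 48000)        # 1 s noise test signal at 48 kHz

y1 = impl_reference(x)                   # y_n(t) = f_n(x(t))       (1)
y2 = impl_ported(x)
z = np.abs(y1 - y2)                      # z(t) = |y1(t) - y2(t)|   (2)

eta = 1e-6                               # tolerance, illustrative
equivalent = bool(np.all(z < 2 * eta))   # z(t) < 2*eta for all t   (3)
print(equivalent)
```

For a purely deterministic effect such as this the test is decisive; the sections that follow examine why it breaks down once fn contains uncontrolled randomness.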
where f(x(t)) is the theoretically perfect implementation and Nn(t) is a continuous random process on the range [−1, 1] representing the implementation-specific variation from f(x(t)). The test in Equation (3) places no constraints on Nn(t) other than that it is strictly bounded to the range [−1, 1], nor do we assume any. It is likely that some realizations of Nn(t) produce more audible effects than others, but in practice the cancellation test works by keeping η very small.

A very simple example of a deterministic effect that uses this signal model is a gain effect with a constant gain of g. This system can be represented as

f(x(t)) = gx(t). (5)

When implemented digitally it becomes

fn(x(t)) = g[x(t) + qn(t)] + tn(t), (6)

where qn(t) represents the quantization noise and tn(t) represents the truncation error of the multiply, each in the nth implementation. Both of these types of error can be modeled by additive noise, and with a slight re-arrangement we can show that it fits the signal model of Equation (4):

fn(x(t)) = gx(t) + gqn(t) + tn(t) (7)

fn(x(t)) = f(x(t)) + ηNn(t) (8)

|f1(x(t)) − f2(x(t))| = |ηN1(t) − ηN2(t)|, (9)

and as long as |gqn(t) + tn(t)| < η is strictly true for two different implementations, we can call them equivalent using the cancellation test in Equation (3).

A slightly more complicated deterministic process which still passes the test in Equation (3) is the simple tremolo effect. To create a tremolo effect we define g as a function of time; now the system can be described by

g(t) = sin(w0 t) (10)

f(x(t)) = sin(w0 t)x(t). (11)

By adding an additional sn(t) term to represent the error in generating the sine wave, we can write an implementation-specific version:

fn(x(t)) = (sin(w0 t) + sn(t))(x(t) + qn(t)) + tn(t). (12)

After grouping the additive components we can still factor the implementation-specific components into a simple additive term:

fn(x(t)) = sin(w0 t)x(t) + ηNn(t), (13)

ηNn(t) = sin(w0 t)qn(t) + sn(t)(x(t) + qn(t)) + tn(t). (14)

As long as the implementation errors qn(t), sn(t), and tn(t) are kept small enough, the two implementations can be considered equivalent, as is demonstrated by

|f1(x(t)) − f2(x(t))| = η|N1(t) − N2(t)|. (15)

When considering the addition of intentional randomization to these types of effects it might be tempting to consider them to be additive processes using the existing signal model. For instance, if we consider the implementation error of the sine wave sn(t) in Equation (12) to instead represent an intentional random process integral to the sound of the effect, we might consider trying to reduce its contribution to the measurement error by averaging over several recordings of the effect. If all sources of randomization can be considered purely additive, we could take several measurements of each implementation and compare the mean and variance of each to make a determination, which may prove sufficient to call them functionally equivalent.

Unfortunately, many effects implement processing whose randomization cannot be treated as simple additive noise. For these effects even a relaxation of the cancellation test in Equation (3) is not appropriate, because the signal model used does not have the capacity to discriminate between equivalent implementations and ones that differ. We can show an example of this by adding a random initial phase θn to our tremolo example and deriving the equation for the cancellation test. If we treat θn as a uniformly distributed random variable between −π and π, the system can be described by

fn(x(t)) = (sin(w0 t + θn) + sn(t))(x(t) + qn(t)) + tn(t). (16)

After grouping the additive components we are not able to factor the theoretically perfect f(x(t)) out of the model:

fn(x(t)) = sin(w0 t + θn)x(t) + ηNn(t), (17)

and therefore the difference will depend on θn:

|f1(x(t)) − f2(x(t))| = |(sin(w0 t + θ1)x(t) − sin(w0 t + θ2)x(t)) + (ηN1(t) − ηN2(t))|. (18)

Even if we make the additive sources of error very small, if we cannot control the distribution of θn there will still be a large source of error determined by this initial condition. This might seem to be a contrived comparison, but consider that the two implementations might be running on different hardware, or even implemented in an analog circuit. In these situations we may not be able to synchronize θn. More complicated effects often have several processes with randomized initial conditions, which we will store in the vector Θn, and may even be the result of one or more random processes Ψn(t). We can write this as the more general formulation

yn(t) = fn(x(t), Ψn(t), Θn). (19)

In these instances, Ψn(t) and Θn may not always be able to be controlled between various implementations, and when they are not the same we can expect the test in Equation (3) to fail.

However, in practice we found that both in-house testers and end users are able to reliably detect similarity or difference in these categories of randomized time-variant effects, even when sources of randomization are not controlled. Therefore we hoped to design a machine listening algorithm to make these implementation equivalence determinations in much the same way that humans do. The goal was to create an algorithm that would reliably detect functional equivalence between implementations of both the easily tested deterministic effects and the highly randomized audio effects, without any prior knowledge about the effect under test.

3. THEORY

To extend the simple model in Equation (3) to systems with undefined random variations we will form a probabilistic interpretation
and use several samples of each implementation fn(·) to learn a model of its behavior. Once we learn models for each fn(·) we can compare these models to make a decision about the similarity of the implementations.

3.1. Representing the Data

As is done in the classical problem description above, we chose to represent each fn(·) by testing the appropriate response yn(t) to the common test signal x(t). Therefore the learning problem happens on vectors representing the yn(t) signals rather than the fn(·) processes directly. Because of this, the result of the test depends both on the test method as well as the forcing signal x(t). Specifically, in the case of linear time-invariant systems, yn(t) must be long enough to capture the entire expected impulse response of fn(·), and it must have sufficient bandwidth to measure the expected bandwidth of fn(·). When the system is expected to be nonlinear it must have sufficient changes in amplitude to capture these effects. Finally, when the system is expected to be time-variant, yn(t) must be long enough to capture this time-variance, or there must be enough samples of yn(t) to sufficiently span the expected range of variance.

So given the test signal x(t), we will now learn a model Yn for each implementation fn(·) representing its response yn(t). To learn the Yn we must represent the signal yn(t) as a fixed-length vector, each element of which will be a feature in our probabilistic model. For these representations we consider several choices. We can choose to simply use the discretized samples yn[i] directly. This has the benefit of being similar to the tried and true method in Equation (3) and does not lose any data. Additionally, when using this representation, any purely additive or multiplicative differences between implementations of fn(·) will result in purely additive or multiplicative differences in yn. Therefore, even if we assume the features of Yn are strictly independent, Yn will still be able to capture these differences. However, if fn(·) has memory, errors in the part of the system with memory will be spread across several different features in yn, and we will have to include some dependency structure in Yn to accurately model these differences, which can make the model more complicated.

We can also choose to take the magnitude FFT of yn[i], which has the added benefit of keeping most of the data when the FFT size is large. However, if we do so, convolutions in the fn(·) process will become simple multiplications. Therefore a model that assumes independence will be more robust to small convolutive variations, like a small change in a delay time parameter, at the expense of being less robust to small multiplicative variations, like a small change in a gain coefficient.

A compromise solution might be to take the magnitude STFT of yn and vectorize it. This would combine the benefits of both the sample-wise and FFT representations, while maintaining the locality of differences in both time and frequency. This might put a limit on the amount of dependency the model needs to assume, or the amount of error we'd expect to see by assuming independence.

Finally, considering we're looking to replace human listeners, we might take a cue from speech recognition research [11] and choose to represent the audio as a vectorized set of MFCC coefficients. While this representation does throw away data, depending on the length of the test signal x(t), losing this data may help us with regard to the "curse of dimensionality" [12], which describes how the modeling capacity of a given model can suffer as the dimensionality of the training data increases. In Experiment 1 we will try all four of these data representations to determine which gives the best classification accuracy across a number of effects.

3.2. Forming a Probabilistic Representation

When dealing with sources of variation it is often useful to build the model in terms of a probabilistic formulation. If we believe that M successive applications of fn(·) to x(t) will result in M distinct results ymn(t), and corresponding feature vectors ymn, then we might choose to model the ymn vectors as a random process from which we can draw samples. A commonly used generative model for high dimensional continuous data is the multivariate normal distribution with the assumption of independence between the dimensions [13]. This model has the benefit of fitting many naturally occurring processes, while also having a tractable calculation of the log likelihood that a particular sample has come from the underlying model. This is true even when used with very high dimensional feature vectors, like we expect to have here. To do this, we define

Yn ∼ N(µn, σn²), (20)

µn = (1/M) Σ_{m=1}^{M} ynm, (21)

σn = (1/(M − 1)) Σ_{m=1}^{M} (ynm − µn)(ynm − µn)ᵀ, (22)

where N is the multivariate normal distribution, µn is the mean of the M samples of yn, and σn is the variance vector. In this instance the covariance matrix reduces to a vector because independent features imply that all non-diagonal elements of the covariance matrix are zero.

3.3. Making a Decision

Once we have a model Yn representing the effect fn(·) for each implementation n, we must compare them to measure their similarity. A common method of measuring the similarity between two probability distributions is the Kullback-Leibler divergence DKL(P‖Q) [14]. The Kullback-Leibler divergence from Q to P measures the information gained when one modifies the prior distribution Q to the posterior distribution P.

When P and Q are distributions of a continuous random variable, the Kullback-Leibler divergence is defined by the probability densities of P and Q, respectively p(x) and q(x), as

DKL(P‖Q) = ∫_{−∞}^{∞} p(x) log(p(x)/q(x)) dx. (23)

To determine the similarity of two implementation models Yn we can choose the target implementation to represent the true distribution P and choose the implementation under test to represent the prior distribution Q. The smaller the information gained by modifying Q to match P, the more similar the distributions Q and P are considered. Therefore, by finding the Kullback-Leibler divergence DKL(Ytarget‖Ytest) we can gain a measure of similarity between the known good implementation ftarget(·) and the one being tested ftest(·).

When P and Q are both multivariate normal distributions NP and NQ with independent components, the KL divergence can be simplified to:

DKL(NP‖NQ) = (1/2) Σᵢ [ log(σQ,i/σP,i) + (σP,i + (µP,i − µQ,i)²)/σQ,i − 1 ], (24)

where the sum runs over the independent dimensions i and σP,i, σQ,i are the elements of the variance vectors of Equation (22).
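Equations (20)–(23) suggest a compact recipe: fit a diagonal-covariance Gaussian to the M feature vectors recorded from each implementation, then score the divergence between the fitted models. The following is a sketch under those assumptions only; feature extraction is omitted, and the random vectors, function names, and variance floor are invented for illustration.

```python
import numpy as np

def fit_diagonal_gaussian(Y):
    # Y: (M, D) array holding M feature vectors of dimension D,
    # one per recording of the implementation (Eqs. 21-22).
    mu = Y.mean(axis=0)
    var = Y.var(axis=0, ddof=1) + 1e-12  # floor guards constant dimensions
    return mu, var

def kl_divergence(mu_p, var_p, mu_q, var_q):
    # Closed-form D_KL(P || Q) for multivariate normals with
    # diagonal covariance (independent components).
    return float(0.5 * np.sum(np.log(var_q / var_p)
                              + (var_p + (mu_p - mu_q) ** 2) / var_q
                              - 1.0))

rng = np.random.default_rng(1)
M, D = 10, 128
target = rng.normal(0.0, 1.0, (M, D))   # known-good implementation
same = rng.normal(0.0, 1.0, (M, D))     # same underlying process
other = rng.normal(1.0, 3.0, (M, D))    # a mismatched implementation

mu_t, var_t = fit_diagonal_gaussian(target)
d_same = kl_divergence(mu_t, var_t, *fit_diagonal_gaussian(same))
d_other = kl_divergence(mu_t, var_t, *fit_diagonal_gaussian(other))
print(d_same, d_other)
```

A lower divergence marks the implementation under test as closer to the known-good one; thresholding this score is the decision step evaluated in Experiment 1.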
Table 1: Evaluation of feature representations and decision methods: Are the implementations equivalent?
Bold indicates agreement.
Preset Human Samples Samples FFT FFT STFT STFT MFCC MFCC
Number Listeners Vote % log KL Vote % log KL Vote % log KL Vote % log KL
3524 no 10% 16.65 10% 25.95 10% 24.32 10% 16.25
3526 no 10% 16.94 10% 26.66 10% 24.62 10% 16.35
615 no 30% 13.44 10% 26.90 10% 22.27 0% 16.78
7613 no 100% 13.71 0% 18.78 10% 18.93 0% 10.62
5012 no 100% 12.10 100% 13.65 100% 12.93 0% 9.47
4311 no 50% 12.97 0% 15.39 0% 15.10 0% 11.65
1411 no 0% 15.60 0% 28.39 0% 22.82 0% 17.61
5430 no 100% 12.63 0% 13.63 0% 23.62 0% 14.67
2211 no 0% 15.24 0% 27.47 0% 22.02 0% 17.90
221 no 100% 12.13 100% 14.31 90% 13.45 100% 9.97
225 no 100% 11.93 0% 17.89 100% 13.04 100% 9.58
4720 yes 100% 14.75 0% 13.78 0% 13.95 0% 12.72
5809 yes 80% 14.79 0% 13.41 10% 16.51 90% 9.64
224 yes 50% 11.93 40% 17.23 60% 13.99 100% 11.80
1413 yes 50% 11.24 60% 12.56 100% 11.34 80% 9.59
815 yes 100% 11.87 0% 12.52 100% 11.76 100% 8.21
1116 yes 0% 16.97 0% 19.61 0% 18.39 100% 11.63
4510 yes 100% 14.14 0% 12.64 100% 11.80 100% 8.97
12 yes 100% −4.79 100% 7.82 100% 2.26 100% 7.41
Correct % 63.2% 73.7% 57.9% 78.9% 68.4% 73.7% 84.2% 68.4%
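The agreement figures in the bottom row of Table 1 can be recomputed directly from the tabulated data. As a sanity check for the MFCC voting classifier (scoring a Vote % of at least 50% as "equivalent" and comparing against the human verdicts), a short sketch:

```python
# (preset, human verdict, MFCC Vote %) transcribed from Table 1.
table1_mfcc = [
    (3524, "no", 10), (3526, "no", 10), (615, "no", 0), (7613, "no", 0),
    (5012, "no", 0), (4311, "no", 0), (1411, "no", 0), (5430, "no", 0),
    (2211, "no", 0), (221, "no", 100), (225, "no", 100),
    (4720, "yes", 0), (5809, "yes", 90), (224, "yes", 100),
    (1413, "yes", 80), (815, "yes", 100), (1116, "yes", 100),
    (4510, "yes", 100), (12, "yes", 100),
]

# Count presets where the >= 50% voting rule matches the human verdict.
agree = sum(1 for _, human, vote in table1_mfcc
            if (vote >= 50) == (human == "yes"))
print(f"{100 * agree / len(table1_mfcc):.1f}%")  # -> 84.2%
```

This reproduces the 84.2% "Correct %" reported for the MFCC voting classifier.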
has been trained as a musician and recording engineer and should be considered an expert in the field of listening to musical effects. Neither listener was aware of the results from the other listener, or of the results of the QuAALM listening machine, at the time of their determination.

The QuAALM listening machine was then run on the same set of recordings from each of the 19 effects and reported a Kullback-Leibler divergence information gain and a Voting Classifier percentage for each feature type. The results of this test are reported in Section 6.

5.2. Experiment 2: Validation and Comparison against Human Listeners

Based on the results of Experiment 1, six listeners meeting the same qualifications as in Experiment 1 were asked to make independent determinations on an additional 18 randomly chosen effects. These determinations were analyzed for consistency across the human listeners, and compared to the QuAALM machine results using the Voting Classifier test and MFCC features. This analysis and its results are included in Section 6.

6. RESULTS

6.1. Experiment 1: Determination of Best Signal Representation and Decision Algorithm

The results of Experiment 1 are shown in Table 1. For each preset tested, the Vote % and log KL divergence are reported for each of the four feature types. The natural logarithm of the KL divergence is reported because the range of the KL divergence for such high dimensional data makes these values hard to display. After seeing the results, a threshold for the log KL divergence was chosen to optimize the results for each feature type. In a final implementation the threshold would be considered a user parameter, and for this test an oracle threshold was chosen to evaluate this test at its optimal possible performance. A reported log KL divergence below this threshold is scored as an equivalent implementation, and a value above this threshold is scored as not equivalent. These thresholds are reported in Table 2.

Table 2: Log KL divergence thresholds

Samples  FFT  STFT  MFCC
12       14   13    10

Similarly, the Vote % represents the average of the votes for which the implementations were equivalent in the 10 runs of the voting classifier. A Vote % greater than or equal to 50% is scored as an equivalent implementation; below 50% is not. In Table 1, scores which agree with the judgment of the human listeners are in boldface, while those that disagree are not. Finally, the percentage of presets for which QuAALM agrees with the human listeners is calculated for each test and reported at the bottom of Table 1.

From these results we can see that both the KL divergence and voting classifier methods have some success depending on the features used. However, the clear standout is the voting classifier method when used with MFCC features. Not only does it show the highest percentage of correct classifications, but for the instances where it passed, the classification margins are wider. This implies that there might be a higher robustness to noise using this test. Because of these results, we chose the voting classifier method operating on MFCC features for our deployment, and ran the validation tests in Experiment 2 using this method.
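Both decision rules described above reduce to one-line comparisons once the scores exist. A minimal sketch follows; the helper names are invented, while the threshold of 10 and the two log KL values are taken from Tables 1 and 2 for the MFCC features.

```python
def kl_rule(log_kl, threshold):
    # Oracle-threshold rule: equivalent when the reported log KL
    # divergence falls below the per-feature threshold (Table 2).
    return log_kl < threshold

def vote_rule(votes):
    # Voting rule: equivalent when at least half of the 10
    # classifier runs vote that the implementations are the same.
    return sum(votes) / len(votes) >= 0.5

MFCC_THRESHOLD = 10  # Table 2

print(kl_rule(8.21, MFCC_THRESHOLD))   # preset 815, MFCC   -> True
print(kl_rule(16.25, MFCC_THRESHOLD))  # preset 3524, MFCC  -> False
print(vote_rule([1] * 9 + [0]))        # 90% vote           -> True
```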
Table 3: Validation and comparison against human listeners: Are the implementations equivalent?
Preset Number Listener 1 Listener 2 Listener 3 Listener 4 Listener 5 Listener 6 Consensus QuAALM Verdict
234 yes yes yes yes yes yes yes yes pass
322 yes yes yes yes yes yes yes yes pass
536 yes yes yes yes yes yes yes no fail
1333 yes yes yes no yes yes yes yes pass
1910 yes yes yes yes yes yes yes yes pass
2317 no no no no no no no no pass
3211 yes yes yes yes yes yes yes yes pass
4034 yes yes yes yes yes no yes yes pass
4138 yes yes yes yes yes yes yes yes pass
4727 no yes no yes no no no no pass
4814 no yes no yes yes yes yes no fail
5729 yes yes yes no yes yes yes yes pass
5911 yes yes yes yes no yes yes yes pass
6663 yes yes yes yes yes yes yes yes pass
6910 no no no no no no no no pass
7211 yes yes yes yes yes yes yes no fail
7419 yes no no no no yes no no pass
8214 no no no no no no no no pass
This leads to a Precision score of 1, a Recall score of 0.625, and an F1 score of 0.8696. These values are shown in Table 4.

Table 4: Validation scores

Precision  Recall  F1
1.0        0.625   0.8696

As mentioned earlier, as a practical consideration, the perfect Precision score and relatively high F1 score satisfied our goals and allowed us to put QuAALM into service testing real implementations.
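The scores in Table 4 follow from the standard confusion-matrix definitions. A minimal sketch, treating "equivalent" as the positive class (our reading of the setup, not stated explicitly in the paper):

```python
def precision_recall_f1(verdicts, truth):
    """Precision/recall/F1 from boolean predictions vs. ground truth."""
    tp = sum(1 for v, t in zip(verdicts, truth) if v and t)
    fp = sum(1 for v, t in zip(verdicts, truth) if v and not t)
    fn = sum(1 for v, t in zip(verdicts, truth) if not v and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```

Note that false negatives (QuAALM rejecting a preset the listeners accepted) lower recall but leave precision at 1, which matches the conservative behaviour reported above.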
6.2. Experiment 2: Validation and Comparison against Human Listeners

For Experiment 2, we randomly chose 18 presets from the H8000, making sure that no two were from the same bank, and tested them on both the H8000 and H9000 hardware. As in Experiment 1, we used a digital audio connection and tested at a 48 kHz sampling rate; the test signal consisted of a 3-second Gaussian white noise burst, 3 seconds of silence, a 3-second exponential sine chirp, and a final 3 seconds of silence. Ten independent recordings were made of both the H8000 and the H9000 for each preset.

The 20 recordings were each listened to by the six listeners, independently and without sharing their results, and a determination was made as to whether the two implementations sounded equivalent. For 11 of the 18 presets, all listeners agreed on their conclusions: 8 times that the implementations were the same, and 3 times that they were different. On a further 4, a single listener was in the minority, while the final 3 presets were a vote of 4 to 2. No presets came to a split decision among the 6 listeners. We believe that this shows the listeners have a high degree of consistency when determining whether implementation differences were perceptually equivalent. These preset numbers and results are tabulated in Table 3.

The final ground truth consensus was reached by a vote of the human listeners and is also tabulated in Table 3.

For each preset, the same 20 recordings were fed into the QuAALM machine running the voting classifier method on MFCC features, and its judgment was recorded in Table 3, along with a comparison of its judgment and the ground truth consensus.

We can see that the QuAALM machine disagreed with the human listeners for only 3 of the 18 presets, and in each of these instances it reported a false negative rather than a false positive.

7. CONCLUSIONS

In this paper, we developed an automated listening machine which meets the qualifications to serve as the first line of defense for quality assurance and automated testing of our effects hardware. In doing so, we also developed a more robust framework for analyzing the similarity of effects that use some amount of random time variation. This framework may have applications to other listening-intensive work, such as the automatic labeling of effect type.

The specific implementation described here was sufficient to serve our purposes; however, many improvements could be considered. Some potential improvements might come from exploring different test signals, or relaxing the strict independence criteria in the probability model, as we expect that there is likely some dependence between these samples. Additionally, given the limited test data reported here, we expect that the QuAALM machine might be failing for some specific types of effects. A larger data set will allow further investigation into these effect types and may lead us to new ways to improve the system.

Additionally, the comparisons made here were between two implementations of the same digital effects. However, the underlying technique should be useful in determining the quality of analog models for certain types of randomized effects such as chorus, flanger, phaser, or rotary speaker emulations. It would be interesting to see if this were the case.

8. ACKNOWLEDGMENTS

Thank you to Jeff Schumacher, Pete Bischoff, Patrick Flores, and Nick Solem for performing listening tests.
Figure 1: Block diagram of applause separation into distinctive, individually perceivable foreground claps C(k, m) and noise-like background N(k, m). (The diagram shows the energy extraction stage producing Φ(m) and Φ̄(m), the energy-ratio stage producing Ψ(m), and the gating/thresholding stage producing the separation gain gs(m).)

The applause separation proposed in this paper is a modified version based on the approaches used in [9, 11]. Figure 1 depicts a block diagram describing the basic structure of the applause separation processing. Within the energy extraction stage, an instantaneous energy estimate Φ(m) as well as an average energy estimate Φ̄(m) is derived from the input applause signal. The instantaneous energy is given by Φ(m) = ‖A(k, m)‖₂, where ‖·‖₂ denotes the L2-norm. The average energy is determined by a weighted sum of the instantaneous energies around the current block and is given by

    Φ̄_A(m) = [ Σ_{µ=−M..M} Φ_A(m − µ) · w(µ + M) ] / [ Σ_{µ=−M..M} w(µ + M) ],    (2)

where w(µ) denotes a weighting window (squared sine-window) with window index µ and length Lw = 2M + 1. In the next stage, the ratio of instantaneous and average energy Ψ(m) is computed. It serves as an indicator whether a discrete clap event is present and is given by

    Ψ(m) = Φ_A(m) / Φ̄_A(m).    (3)

If the current block contains a transient, the instantaneous energy is large compared to the average energy and the ratio is significantly greater than one. On the other hand, if there is only noise-like background at the current block, instantaneous and average energy are almost equal and, consequently, the ratio is approximately one. The basic separation gain ĝs(m) can be computed according to

    ĝs(m) = max( √(1 − g_N / Ψ(m)), 0 ),    (4)

which is a signal-adaptive gain 0 ≤ ĝs(m) ≤ 1, solely dependent on the energy ratio Ψ(m). To prevent dropouts in the noise-like background signal, the constant g_N is introduced. It determines the amount of the input signal's energy remaining within the noise-like background signal during a clap. The constant was set to g_N = 1, which corresponds to the average energy remaining within the noise-like background signal. Since for the upmixing we are mainly interested in the directional sound component of a clap (i.e., the attack phase), thresholding or gating is applied to Ψ(m) according to

    gs(m) = { 0       if Ψ(m) < τ_attack  }  if gs(m − 1) = 0
            { ĝs(m)   if Ψ(m) ≥ τ_attack  }
                                                                    (5)
            { 0       if Ψ(m) < τ_release }  if gs(m − 1) ≠ 0.
            { ĝs(m)   if Ψ(m) ≥ τ_release }

This means the separation gain gs(m) is only different from zero after the energy ratio has surpassed an attack threshold τ_attack, and only as long as it is above a release threshold τ_release. When Ψ(m) falls below τ_release, the separation gain is set back to zero. For the separation, the attack and release thresholds τ_attack = 2.5 and τ_release = 1 were used. The final separated signals are obtained according to

    C(k, m) = gs(m) · A(k, m)    (6)
    N(k, m) = A(k, m) − C(k, m).    (7)

Figure 2 depicts waveforms and spectrograms of an exemplary applause signal on the left and the corresponding separated clap signal in the middle, as well as the noise-like background signal on the right. Sound examples are available at https://www.audiolabs-erlangen.de/resources/2017-DAFx-ApplauseUpmix.

3. APPLAUSE UPMIX

After having the input applause signal separated into individual claps and noise-like background, the signal parts are upmixed separately. Upmixing the noise-like background was realized by using the original background signal as the left channel and a decorrelated version N̂(k, m) as the right channel of the upmix. Decorrelation was achieved by scrambling the temporal order of the original noise-like background signal. This method was originally proposed in [4], where the time signal is divided into segments which are themselves divided into overlapping subsegments. Subsegments are windowed and their temporal order is scrambled. Applying overlap-add yields the decorrelated output signal. The processing for the noise-like background signal was modified in the sense that it operates in the short-time Fourier transform (STFT) domain and with a small segment size. A segment size of 10 blocks, corresponding to 13 ms, was used, where each block represented a subsegment.

Upmixing of foreground claps makes use of inter-clap relations. This means the upmixing process incorporates the assumptions that claps originating from a certain direction should sound similar, as well as that claps originating from a certain direction should exhibit some sort of periodicity.

In the beginning, arbitrary discrete directions φ_d within the stereo panorama are chosen, where d = 0...D − 1 denotes the index of a distinctive direction within the direction vector Φ = [φ_0, φ_1, ..., φ_{D−1}]. The total number of directions is given by D. Furthermore, the mean clap spectrum for each detected clap is computed, whereby a clap is considered as a set of consecutive blocks, each of which has non-zero energy and is framed by at least one block to each side containing zero energy. With the start and stop block indices γ_s(c) and γ_e(c) of clap c, the claps' mean spectra are given by

    Ĉ_c(k) = [ 1 / (γ_e(c) − γ_s(c) + 1) ] Σ_{m=γ_s(c)..γ_e(c)} |C(k, m)|².    (8)

The upmix process operates on a per-clap basis rather than on individual blocks. Considering the first clap to be upmixed, the target direction Θ(c) is chosen randomly from the vector of available directions Φ(d). The spectrum of the current clap is stored in the matrix S(k, d), which holds the mean spectra of the last clap assigned to a distinctive direction:

    S(k, d) = { Ĉ_c(k)                     if S(k, d) = 0
              { 0.5 · (S(k, d) + Ĉ_c(k))   else.    (9)
Figure 2: Waveforms and spectrograms of an applause signal (left) and the respective separated clap signal (middle) and residual noise-like background (right). In the figure, the spectrograms are only plotted in the range of 0 to 10 kHz. (Panels show amplitude in the range ±0.4 and time from 0 to 0.5 s.)
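The separation that produces signals like those in Figure 2 (Eqs. 2–7: windowed average energy, energy ratio, adaptive gain, and attack/release gating) can be prototyped from a sequence of per-block energies. This sketch is ours; the parameter defaults follow the values quoted in the text (g_N = 1, τ_attack = 2.5, τ_release = 1), while the window half-length M is an assumed placeholder.

```python
import numpy as np

def applause_separation_gain(phi, M=50, g_N=1.0, tau_attack=2.5, tau_release=1.0):
    """Block-wise separation gain gs(m) from instantaneous energies phi (Eqs. 2-5)."""
    # squared-sine weighting window of length Lw = 2M + 1 (Eq. 2)
    L = 2 * M + 1
    w = np.sin(np.pi * (np.arange(L) + 0.5) / L) ** 2
    # average energy: normalized weighted moving average of phi (Eq. 2)
    phi_bar = np.convolve(phi, w / w.sum(), mode="same")
    # energy ratio (Eq. 3)
    psi = phi / np.maximum(phi_bar, 1e-12)
    # adaptive gain (Eq. 4); clip keeps the radicand non-negative
    g_hat = np.sqrt(np.clip(1.0 - g_N / np.maximum(psi, 1e-12), 0.0, None))
    # hysteresis gating (Eq. 5): open at tau_attack, hold until below tau_release
    g = np.zeros_like(phi, dtype=float)
    for m in range(len(phi)):
        thresh = tau_attack if (m == 0 or g[m - 1] == 0.0) else tau_release
        g[m] = g_hat[m] if psi[m] >= thresh else 0.0
    return g
```

Applying the returned gain per block to A(k, m) gives the clap signal C(k, m) of Eq. (6); the background N(k, m) is the residual of Eq. (7).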
Additionally, the start block number of the current clap is stored in a vector T(d) holding the respective start block number of the last clap assigned to a direction:

    T(d) = γ_s(c).    (10)

For all further claps, it is determined how well the respective current clap fits to the claps distributed previously in each direction. This is done with respect to timbral similarity as well as with respect to temporal periodicity. As a measure of timbral similarity, the log spectral distance of the current clap to the previously stored mean spectra of the claps attributed to a direction is computed according to

    LSD(d) = √( [1 / (K_u − K_l + 1)] Σ_{k=K_l..K_u} [ 10 log₁₀( S_d(k) / Ĉ_c(k) ) ]² ),    (11)

where K_l and K_u denote the lower and upper bin of the frequency region relevant for similarity. Similarity is mostly influenced by the spectral peaks/resonances resulting from the air cavity between the hands, which is a consequence of the positioning of the hands relative to each other while clapping; these peaks are in the region up to 2 kHz [1, 7, 10]. Based on this observation and including some additional headroom, frequency bins corresponding to the frequency region between 200 Hz and 4 kHz were considered in the log spectral distance measurement.

To determine how well the current clap fits into a periodicity scheme within a certain direction, the time difference (i.e., in blocks) of the current clap to the last distributed clap in every direction is computed:

    ∆(d) = γ_s(c) − T(d).    (12)

The resulting time differences are compared to a target clap frequency. In an experiment with more than 70 participants, Néda et al. [12] found clap rates of 2 to 6 Hz with a peak at around 4 Hz. Based on this experimental data and some internal test runs, we chose a target clap rate of δ_t = 3 Hz, corresponding to a target block difference of δ_m = 250 blocks, with an additional tolerance scheme of ±1 Hz. This means the time difference has to be within the range of δ_m − 62 and δ_m + 125, i.e., 188 and 375 blocks, to be considered as a periodic continuation of claps in a direction. In the case that two or more claps occur simultaneously, the applause separator detects only one single clap, which leads to a gap in the periodicity pattern of the directions the other (masked) claps would have belonged to. To compensate for this effect, the search for periodicity is extended to multiples of the expected target block difference, specifically to 3 · δ_m. The actually used time difference value is the one with the smallest absolute difference to the considered multiples of the target frequency. If the current raw time difference is outside of the tolerance scheme, it is biased with a penalty.

In the next step, log spectral distances as well as time differences are normalized to have zero mean and a variance of one:

    LSD̂(d) = (LSD(d) − µ_LSD) / σ_LSD    (13)
    ∆̂(d) = (∆(d) − µ_∆) / σ_∆.    (14)

Means µ and standard deviations σ are computed using the respective raw time differences and log spectral distances corresponding to the last 25 assigned claps. Finding the most suitable direction for the current clap can be considered as the problem of finding the direction d where the length of the vector [LSD̂(d), ∆̂(d)]ᵀ is minimal. The vector norm Λ(d) for every direction is computed by

    Λ(d) = √( LSD̂(d)² + ∆̂(d)² ).    (15)

The direction to which the current clap fits best is determined as
the index d_new where Λ(d) is minimal:

    d_new = arg min_d Λ(d).    (16)

The current clap is then assigned to the direction Φ(d_new), and the buffers for the respective mean and standard deviation computations, as well as S(k, d_new) and T(d_new), are updated.

As long as not every available direction has been assigned at least one clap, it is additionally checked whether the vector norm Λ(d_new) is larger than a threshold

    τ_Λ = √( (τ_LSD / σ_LSD)² + (τ_∆ / σ_∆)² ),

with τ_LSD = 1.9 and τ_∆ = 5.5. If so, the current clap is considered as too different from the claps in the so-far assigned directions and, consequently, is assigned to a new or 'free' direction. The new direction is chosen randomly from the available free directions.

Finally, the blocks corresponding to the current clap, γ_s(c) ≤ m ≤ γ_e(c), are scaled with the corresponding panning coefficient g(φ) [13] to appear under the determined direction φ_{d_new}. The left and right clap signals are obtained according to

    C_L(k, m) = g(φ_{d_new}) · C(k, m)    (17)
    C_R(k, m) = √(1 − g(φ_{d_new})²) · C(k, m).    (18)

The final upmixed left and right signals are obtained by superposing the respective left and right upmixed clap and noise signals:

    L(k, m) = C_L(k, m) + (1/√2) · N(k, m)    (19)
    R(k, m) = C_R(k, m) + (1/√2) · N̂(k, m).    (20)

4. SUBJECTIVE EVALUATION

To subjectively evaluate the performance of the proposed blind upmix method, a listening test was performed, where the plausibility-criteria-driven upmix was compared to a context-agnostic upmix.

4.1. Stimuli

Two sets of stimuli were used for the listening test: one set consisted of synthetically generated applause signals with controlled parameters, whereas the other set consisted of naturally recorded applause signals. All signals were sampled at a sampling rate of 48 kHz. Seven synthetic signals were generated based on the method proposed in [2] and with respective numbers of virtual people P̂_Σ = [2, 4, 8, 16, 32, 64, 128]. The signals had a uniform length of 5 seconds.

The set of naturally recorded signals was a subset of the stimuli used in [11]. For reasons of comparability, the same stimulus numbering was applied in this paper. In particular, stimuli with numbers 1, 7, 10, 11, 13, 14, and 18 were used, each of which has a uniform length of 4 seconds. Both stimulus sets covered an applause density range from very sparse to medium-high.

Stimuli of both sets were blindly upmixed using different processing schemes. In the first scheme, upmixing was done according to the processing proposed above; this will be denoted as proposedUpmix. In the second scheme, the applause signals were also separated into claps and noise-like background, but the claps were distributed in a context-agnostic manner, i.e., claps were randomly assigned to the available directions within the stereo panorama. This type of upmix is denoted as randomUpmix. For both upmixing schemes, 13 discrete directions were available; these were in particular Φ = [±30, ±25, ±20, ±15, ±10, ±5, 0]. To ensure equal loudness, all stimuli were loudness-normalized to −27 LUFS (loudness units relative to full scale). Exemplary stimuli are available at https://www.audiolabs-erlangen.de/resources/2017-DAFx-ApplauseUpmix.

4.2. Procedure

The listening test procedure followed a forced-choice preference test methodology. This means that in each trial, subjects were presented with two stimuli in randomized order as hidden conditions. Subjects had to listen to both versions and were asked which stimulus sounds more plausible. The order of trials was randomized as well.

Before the test, subjects were instructed that applause can be considered as a superposition of distinctive and individually perceivable foreground claps and more noise-like background. It was furthermore stated that claps usually exhibit certain temporal, timbral, and spatial structures, e.g., claps originating from the same person do not vary considerably in spatial position and clap frequency, etc. As the listening task, subjects were asked to focus on the plausibility of foreground claps.

After the instructions, there was a training phase, firstly to familiarize subjects with the concept of foreground claps and noise-like background, and secondly with the plausibility of foreground claps. In the first case, the same procedure as in [9, 11] was also used here: four exemplary stimuli of varying density, accompanied by additional supplementary explanations regarding foreground clap density, were provided. For the second, two synthetically generated stereo stimuli with P̂_Σ = 6 were presented, whereby in one of them the timbre and time intervals between consecutive claps were modified to decrease plausibility. Subjects were provided with supplementary information regarding the stimuli and their expected plausibility.

It should be noted that instructing subjects on such a detailed level appears to bear a risk of biasing them in a certain direction. However, this came as a result of a pre-test where subjects were simply asked to rate the naturalness of applause sounds without providing any further information. In this pre-test, it was found in interviews that subjects based their ratings of naturalness on quite different aspects of the stimuli. For example, for some subjects, naturalness was predominantly influenced by the room/ambient sound; others focused more on imperfections of the applause synthesis and did not take spatial aspects into account. The plurality of influencing factors made it impossible to obtain a reasonably consistent rating between the subjects. Thus, it was decided to focus the listening test on the notion of foreground claps. Furthermore, it emerged from the subject interviews that asking for plausibility potentially puts more focus on properties of the clap sounds themselves than using the more broadly defined term naturalness.

The listening test was conducted in an acoustically optimized sound laboratory at the International Audio Laboratories Erlangen. Stimuli were presented via KSdigital ADM 25 loudspeakers.

4.3. Subjects

A total of 17 subjects, among which 3 were female and 14 male, took part in the listening test. The subjects' average age was 33.1 (SD = 8) years, ranging from 23 to 53 years. All subjects were students or employees either at Fraunhofer IIS or at the International Audio Laboratories Erlangen, with varying degrees of experience. In a questionnaire after the listening test, subjects reported how many listening tests they had participated in so far; a number of up to 5 was considered low-experienced (2 subjects), a number between 6 and 14 was considered medium-experienced (4 subjects), and anything above was considered an expert listener (11 subjects).

4.4. Results

Figure 3: Histogram of subjects' preference for synthetically generated (top panel) and naturally recorded (bottom panel) stimuli. Additionally, the pooled data across stimuli is provided.

A general preference towards proposedUpmix is visible in the pooled results. Considering the individual stimuli, the data suggests that, except for stimulus number 18, where preference for both methods was about similar, proposedUpmix was preferred. There is also a weak trend visible in the data indicating that preference for proposedUpmix decreases with increasing density. In no case of either stimulus set was randomUpmix clearly preferred over proposedUpmix.

The indicated trend of preferences in both stimulus sets makes sense given that, at high densities, clap events occur in a more chaotic and pseudo-random fashion and the event rate gets too dense to be evaluated by the human auditory system on a per-clap basis. Instead, the denser the clapping becomes, the more it can be considered as a sound texture and, in further consequence, general signal statistics gain more relevance for perception than properties of individual clap events [14].

In addition to the visual evaluation, statistical test results are provided in Table 1. For every stimulus as well as the respective pooled data, a one-tailed binomial test was carried out which tested against the hypothesis that the upmix methods were chosen by chance, i.e., an expected relative frequency of 0.5. Considering the individual stimuli, only results for P̂_Σ ∈ [2, 16] of the synthesized set and none of the recorded stimulus set are significant, where a significance level of α = 0.05 was used. However, the overall results show that for both stimulus sets, proposedUpmix was clearly and statistically significantly preferred over randomUpmix.

Table 1: P-values of the statistical analysis by means of binomial testing of preference data.

Synthesized
P̂_Σ      2     4     8     16    32    64    128   pooled
p-value  .001  .072  .685  .025  .072  .166  .166  <.001

Recorded
Stimulus  10    1     18    11    14    13    7     pooled
p-value   .072  .072  .685  .166  .166  .315  .315  .005

5. CONCLUSION

A blind upmix approach for applause-like signals incorporating quasi-periodicity and timbral similarity of consecutive claps from individual spatial directions was proposed and evaluated. The input signal was firstly decomposed into distinctive and individually perceivable foreground claps and more noise-like background. While the background signal was simply decorrelated, the foreground claps were distributed amongst random positions in the stereo panorama based on timbral similarity and temporal periodicity of claps. The proposed upmix was evaluated by means of a preference test based on synthetically generated as well as naturally recorded applause stimuli. Results showed that the perceptual quality with respect to the plausibility of the spatial scene produced by the proposed upmix was clearly preferred over that of an upmix where foreground claps were distributed in a context-agnostic manner. Statistical analysis proved the subjects' overall preference to be significant.
[3] S. Disch and A. Kuntz, "A Dedicated Decorrelator for Parametric Spatial Coding of Applause-like Audio Signals," in Microelectronic Systems, A. Heuberger, G. Elst, and R. Hanke, Eds. Springer-Verlag Berlin Heidelberg, 2011, pp. 363–371.

[4] G. Hotho, S. van de Par, and J. Breebaart, "Multichannel Coding of Applause Signals," EURASIP Journal on Advances in Signal Processing, vol. 2008, 2008.

[5] F. Ghido, S. Disch, J. Herre, F. Reutelhuber, and A. Adami, "Coding of Fine Granular Audio Signals Using High Resolution Envelope Processing (HREP)," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, 2017.

[6] C. Uhle, "Applause Sound Detection," J. Audio Eng. Soc., vol. 59, no. 4, pp. 213–224, 2011.

[7] L. Peltola, C. Erkut, P. R. Cook, and V. Välimäki, "Synthesis of Hand Clapping Sounds," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1021–1029, 2007.

[8] W. Ahmad and A. M. Kondoz, "Analysis and Synthesis of Hand Clapping Sounds Based on Adaptive Dictionary," in Proceedings of the International Computer Music Conference, Huddersfield, UK, 2011, pp. 257–263.

[9] A. Adami, L. Brand, and J. Herre, "Investigations Towards Plausible Blind Upmixing of Applause Signals," in 142nd International Convention of the AES, Berlin, Germany, 2017.

[10] A. Jylhä and C. Erkut, "Inferring the Hand Configuration from Hand Clapping Sounds," in Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, 2008.

[11] A. Adami and J. Herre, "Perception and Measurement of Applause Characteristics: Wahrnehmung und Messung von Applauseigenschaften," in Proceedings of the 29th Tonmeistertagung (TMT29), Cologne, Germany: Verband Deutscher Tonmeister e.V., 2016, pp. 199–206.

[12] Z. Néda, E. Ravasz, T. Vicsek, Y. Brechet, and A.-L. Barabási, "Physics of the rhythmic applause," Phys. Rev. E, vol. 61, no. 6, pp. 6987–6992, 2000.

[13] V. Pulkki, "Virtual Sound Source Positioning Using Vector Base Amplitude Panning," J. Audio Eng. Soc., vol. 45, no. 6, pp. 456–466, 1997.

[14] J. H. McDermott and E. P. Simoncelli, "Sound Texture Perception via Statistics of the Auditory Periphery: Evidence from Sound Synthesis," Neuron, vol. 71, no. 5, pp. 926–940, 2011.
Alex Wilson
Acoustics Research Centre,
University of Salford
Greater Manchester, UK
a.wilson1@edu.salford.ac.uk
co-authorship network of the DAFx proceedings and pays particu- 37 authors, indicating how the main component grows by merging
lar attention to the largest connected component. with smaller components, through the process of collaboration.
Herein, comparative data is presented for the DAFx confer-
2.2. Bibliometrics of DAFx proceedings 1998–2009 ence series, after 19 years. The number of authors in the largest
connected component was of particular interest, as was the detec-
A previous submission to the DAFx conference series examined tion of communities within that component.
the bibliometrics of the series, for the first twelve years [7]. The
following is a brief summary of that work.
3. DATA COLLECTION AND PROCESSING
• Background to the DAFx conference series — list of loca-
tions and organisers Bibtex files from each of the first 12 conferences were already
• There had been 722 submissions by 767 unique authors. available [7]. Unfortunately, the file for DAFx-98 only listed one
author per paper — this was manually corrected. JabRef2 was used
• Number of submissions per year, individual and cumulative
for Bibtex editing. Bibtex files for the seven subsequent confer-
• Authorship distribution — number of authors per paper (the ences were created manually for this paper, from the conference
modal value was 2) proceedings listed on each conference website. Several editions
• A list of the most cited papers of the DAFx conference proceedings (from 2005 to 2012) are in-
dexed on Scopus and DAFx-03 is indexed on Web Of Science.
• A list of the 20 most frequent authors
These entries were checked against the manually created entries.
• Confirmation that the conference submissions followed From the combined proceedings it was possible to construct
Lotka’s law, shown by a log-log plot of publications against a co-authorship network, showing which authors had directly col-
number of authors. laborated in the writing of a paper. The following is a description
The current submission attempts not to repeat any of these con- of the process by which the network was created from the Bibtex
tributions, save for necessary updates to data after an additional data.
seven years of submissions. • All individual Bibtex files were merged and this file was
2.3. Bibliometrics & scientometrics of other conference series

Another paper by the author of the initial DAFx study focused on the IEEE Transactions on Software Engineering [8]. This included further types of analysis not reported in the study of DAFx, such as collaboration between countries. Of course, patterns of international collaboration can change with time. When considering the period 1990-2000, the growth of the European Union and its funding for science was credited as a significant change in the scientific environment [9]. It was in this climate that the DAFx series began.

After its first nine years, a bibliometric analysis of the ISMIR (International Society for Music Information Retrieval) conference series was published [10]. This included a limited co-author network, in which only 22 authors were labelled. This paper suggested that the European research labs were tightly interconnected. The ISMIR community does contain some overlap with DAFx, in terms of authors. A recent follow-up examined the role of female authors in the ISMIR conference proceedings [11]. At this time, after 16 years, the total number of unique authors was 1,910. This paper included a brief co-author analysis. Nine clusters of authors are shown — clusters containing female authors with at least five co-authors. This showed that the two most prolific female authors were members of the same cluster. However, it is not clear whether these clusters can be connected to one another, i.e. whether they were drawn from one large connected component or from numerous connected components. It can be inferred from [10] that some connections can be made (as a component cannot become smaller over time), yet insufficient data is presented.

It has been reported that the main component in the co-authorship network of the PACIS (Pacific Asia Conference on Information Systems) reached a size of 663 authors after a period of 15 years, which accounted for 33% of the total network [12]. This compared well to a value of 29% in a study of the ECIS (European Conference on Information Systems) after a period of twelve years [13]. Within ECIS, the second largest component contained only

imported into Network Workbench (NWB), a tool for network analysis and scientometrics [14]. The co-author network was extracted from the Bibtex file using the supplied routine.

• The names of authors will not always be consistent throughout all of their publications. This can be due to the use of initials in place of full names, deliberate changes in name, reversal of first-name/family-name conventions, or simply human error in transcription. The merging of duplicate nodes is crucial for the accurate determination of graph metrics. Possible duplicates were highlighted by applying the Jaro distance metric [15, 16]. Through trial and error, authors were merged at 0.92 similarity provided the first two letters were in common, and it was noted when two entries were above 0.85 similarity. The data was then manually inspected. Where node labels contained only a first initial and not the full name, Google Scholar was used to identify whether this node matched any others. If ‘both’ authors had similar co-authors and wrote about similar topics, then the likelihood of them being duplicates was considered great enough to make a correction.

• The network was updated by merging nodes that were flagged as duplicates.

• The graph was split into separate, comma-separated files: one for nodes and one for edges.

• These node and edge tables were imported into Gephi [17]. Visual inspection of the graph was able to identify further duplicate nodes. An iterative approach was taken to the correction of duplicate nodes, by repeating these steps until all had been accounted for. The need for an iterative approach to data cleaning has also been described previously [12]. Of course, these methods were not able to detect any deliberate changes in name, such as by marriage.

² http://www.jabref.org/
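The Jaro-based merging rule described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the script used for the study; the function names and helper structure are our own:

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1] (Jaro, 1989)."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    window = max(max(n1, n2) // 2 - 1, 0)
    used = [False] * n2              # characters of s2 already matched
    m1 = []                          # matched characters of s1, in order
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not used[j] and s2[j] == c:
                used[j] = True
                m1.append(c)
                break
    m = len(m1)
    if m == 0:
        return 0.0
    m2 = [s2[j] for j in range(n2) if used[j]]
    t = sum(a != b for a, b in zip(m1, m2)) / 2   # transpositions
    return (m / n1 + m / n2 + (m - t) / m) / 3

def should_merge(a, b, threshold=0.92):
    """The merging rule above: similarity >= 0.92,
    provided the first two letters are in common."""
    return a[:2].lower() == b[:2].lower() and jaro(a, b) >= threshold
```

For example, `jaro("MARTHA", "MARHTA")` gives 17/18 ≈ 0.944, so such a pair would clear the 0.92 threshold and be flagged for merging.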
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Table 1: Summary statistics of the entire co-author network.

Number of nodes           1281
Number of edges           2058
Average degree            3.213
Average weighted degree   3.911
Connected components      269
...which are isolates     132
...which are dyads        58
...which are triads       37

The largest connected component consists of 667 nodes.

Table 2: Isolates (authors without co-authors) who have made more than one submission.

Name                        Nworks
Sinan Bökesoy               3
Niels Bogaards              2
Christian Müller-Tomfelde   2
David Kim-Boyle             2
Richard Hoffmann-Burchardi  2
Angelo Farina               2
Tor Halmrast                2
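The component breakdown in Table 1 (isolates, dyads, triads) can be reproduced from the node and edge tables with a short union-find pass. A minimal sketch, assuming the cleaned network is available as plain Python lists (the function name is ours):

```python
from collections import Counter

def component_size_histogram(nodes, edges):
    """Union-find over an undirected edge list.
    Returns a Counter mapping component size -> number of components."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    root_sizes = Counter(find(v) for v in nodes)
    return Counter(root_sizes.values())
```

Isolates, dyads and triads are then `hist[1]`, `hist[2]` and `hist[3]` respectively, and the largest key is the main component's size.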
Figure 1: Largest connected component, consisting of 667 nodes. Node size is proportional to number of works created by that author.
Edge thickness is proportional to the number of co-authored works between nodes. Colour represents the communities detected using the
Louvain algorithm [19]. Node positioning was achieved using a force-directed layout [20].
DAFX-4
DAFx-505
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), Edinburgh, UK, September 5–9, 2017
[Figure 2 panel labels: (a) Nnodes = 88; (b) Nnodes = 73; Community #13, Nnodes = 53; Community #16, Nnodes = 49]
Figure 2: The four largest communities detected within the main component. Note that most nodes shown here have many more edges
beyond these communities. Colours used are as in Figure 1. Node positioning was achieved using a force-directed layout [20], run
separately for each community.
Table 3: Authors with the greatest weighted degree.

Name              Nworks  Degree  Weighted degree
Vesa Välimäki     36      40      62
Udo Zölzer        44      29      55
Victor Lazzarini  24      28      54
Laurent Daudet    11      28      35
Mark Sandler      21      22      34
Axel Röbel        20      27      34
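The Degree and Weighted degree columns of Table 3 can be derived directly from a weighted edge list in which each edge carries the number of co-authored works. A minimal sketch (the function name is ours):

```python
from collections import defaultdict

def degree_tables(weighted_edges):
    """weighted_edges: (author_a, author_b, n_joint_works) per co-author pair.
    Returns (degree, weighted_degree): the number of distinct co-authors,
    and co-authors counted once per joint work."""
    degree = defaultdict(int)
    weighted_degree = defaultdict(int)
    for a, b, w in weighted_edges:
        degree[a] += 1
        degree[b] += 1
        weighted_degree[a] += w
        weighted_degree[b] += w
    return degree, weighted_degree
```

Since every edge weight is at least 1, an author's weighted degree is always greater than or equal to their degree.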
of the total number of authors. While the other connected components are relatively small, each containing fewer than 15 nodes, these include active research communities in Japan and China.

4.2. Node degree

In an undirected graph, the degree of a node refers to the number of other nodes to which it is connected. In a co-authorship network this is simply the number of co-authors. Ranking co-authors by degree gives (perhaps) undue preference to authors who have managed to author many works while perhaps contributing little to each. If the edge weights are considered, then the weighted degree takes into account the number of times a pair of co-authors have worked together. Table 3 shows the ten authors with the greatest weighted degree.

4.3. Centrality

In attempting to measure the influence of nodes within a network, a number of centrality measures have been developed [21]. This section will report on three of these: closeness, betweenness and eigenvector centrality. These three scores were calculated using Matlab R2016a.

4.3.1. Closeness

Closeness centrality uses the inverse sum of the distance from a node to all other nodes in the graph. Assuming that not all nodes can be reached (which is true for this network), the centrality of node i is:

    closeness(i) = \left( \frac{A_i}{N-1} \right)^2 \frac{1}{D_i}    (1)

Here, A_i is the number of nodes that can be reached from node i (not counting i itself), N is the number of nodes in the graph G, and D_i is the sum of distances from node i to all reachable nodes. If no nodes are reachable from node i, then closeness(i) is zero. This expression assumes that all edge weights are equal to 1. The reciprocal of the actual edge weights (the number of co-authored works) was introduced as the ‘cost’ used in the centrality calculations. This is suitable because one can deduce that co-authors with many co-authored works exchange information more readily than those with fewer co-authored works, and so the ‘cost’ associated with traversing this edge is lower. The authors with the highest weighted closeness centrality values are displayed in Fig. 3. Closeness generally increases with the number of submissions but there are notable exceptions, as highlighted.

Figure 3: Scatterplot showing number of submissions vs. weighted closeness, which is normalised to the range [0 1]. The greater the value of closeness, the fewer steps (on average) are required to reach another author in the network.

4.3.2. Betweenness

The betweenness measure illustrates the importance of an author by assessing the flow of “traffic” that passes through that node. This is achieved by measuring the number of shortest paths, from all nodes to all other nodes, which involve passing through the node in question.

    betweenness(i) = \sum_{s,t \neq i} \frac{n_{st}(i)}{N_{st}}    (2)

Here, n_{st}(i) is the number of shortest paths from source s to target t that pass through node i, and N_{st} is the total number of shortest paths from s to t. As with closeness, the reciprocal of the edge weights was used as a cost in calculating weighted betweenness. The values of weighted betweenness centrality are displayed in Fig. 4.

4.3.3. Eigenvector centrality

The eigenvector centrality measure assumes that a connection to other high-scoring nodes counts for more than a connection to a lower-scoring node. When eigenvector centrality was computed without weighting, the top 10 authors were not just all members of the same community (shown in Fig. 2d) but all co-authors of the same publication. For many of these authors, this has been their sole contribution to DAFx. When eigenvector centrality is calculated with edge weights taken into account, the results are plotted in Fig. 5.
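Equations (1) and (2) can be illustrated with a small self-contained sketch. For brevity this version is unweighted (hop counts via breadth-first search); the weighted variant used in the paper would substitute Dijkstra distances over reciprocal edge weights. All function names here are ours:

```python
from collections import deque

def bfs(adj, src):
    """Hop distances and shortest-path counts from src."""
    dist, sigma = {src: 0}, {src: 1}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                q.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]      # v lies one hop beyond u
    return dist, sigma

def closeness(adj, i):
    """Eq. (1): (A_i / (N - 1))^2 * 1 / D_i; zero if nothing is reachable."""
    dist, _ = bfs(adj, i)
    reach = {v: d for v, d in dist.items() if v != i}
    if not reach:
        return 0.0
    A, N, D = len(reach), len(adj), sum(reach.values())
    return (A / (N - 1)) ** 2 / D

def betweenness(adj, i):
    """Eq. (2): sum over ordered pairs s, t != i of n_st(i) / N_st."""
    dist, sigma = {}, {}
    for s in adj:
        dist[s], sigma[s] = bfs(adj, s)
    total = 0.0
    for s in adj:
        for t in adj:
            if i in (s, t) or s == t:
                continue
            if t not in dist[s] or i not in dist[s]:
                continue                  # pair unreachable, or i elsewhere
            # shortest s-t paths pass through i iff the distances add up;
            # they then number sigma_si * sigma_it
            if dist[s][i] + dist[i][t] == dist[s][t]:
                total += sigma[s][i] * sigma[i][t] / sigma[s][t]
    return total
```

On the three-node path a–b–c, for instance, the middle node carries every shortest path between the endpoints, while the endpoints carry none.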
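The weighted eigenvector centrality of Sec. 4.3.3 can be sketched by power iteration on the weighted adjacency matrix (the paper's values came from Matlab; this is an illustrative reimplementation with names of our choosing). Iterating on A + I rather than A avoids oscillation on bipartite components while leaving the dominant eigenvector unchanged:

```python
def eigenvector_centrality(nodes, weighted_edges, iters=200):
    """Weighted eigenvector centrality by power iteration on A + I.
    weighted_edges: (a, b, weight) per co-author pair.
    Scores are normalised so that the largest is 1."""
    score = {v: 1.0 for v in nodes}
    for _ in range(iters):
        nxt = {v: score[v] for v in nodes}       # the +I (self) term
        for a, b, w in weighted_edges:
            nxt[a] += w * score[b]
            nxt[b] += w * score[a]
        norm = max(nxt.values())
        score = {v: x / norm for v, x in nxt.items()}
    return score
```

On a two-edge star a–c–b the centre node converges to the top score, with the leaves at 1/√2 of it, illustrating how a link to a high-scoring node is worth more than a link to a low-scoring one.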
Figure 4: Scatterplot showing number of submissions vs. weighted betweenness, which is normalised to the range [0 1]. High betweenness scores indicate an author who co-authors submissions with a large number of communities.

[Figure 5: Nworks vs. weighted eigenvector centrality (norm.)]

The three authors with the highest score according to this metric are all from Maynooth University, and have frequently collaborated. There are many infrequent collaborators with relatively high scores, while some frequently contributing authors in more isolated communities (according to this metric) are also highlighted.

5. DISCUSSION

...rative efforts that causes the network to expand. At this stage, a series of questions is posed to the community, for discussion at conferences and beyond.

1. How could the network be used to identify potential collaborators?

3. There are very few frequently-contributing authors who lack co-authors. Could more be done to support authors who do not collaborate?

4. Would the size of the network be increased by the use of a double-blind review process, in which submissions are evaluated without knowing the names of the authors?

As shown by Table 3 and Figs. 3, 4 and 5, the relationship between these measures of node importance and more straightforward measures, such as the number of works, is not clear. It is possible for an author with few works to be considered highly important to the network as a whole. Each centrality measure has strengths and weaknesses and provides different insights into the topology of the network. Closeness centrality rewards authors who are prominent in communities which are themselves well connected to others, often achieved by high degree scores — the top three authors are members of the community in Fig. 2a. Betweenness centrality highlighted the efforts of a number of authors who have worked with a number of communities, and authors who act as bridges — six communities are represented in the top 10 authors. In contrast, eigenvector centrality rewards groups of authors who frequently collaborate with one another.

6. CONCLUSIONS

The aim of the study was to examine the nature of community and collaboration within the DAFx conference proceedings, after two decades. By collecting Bibtex archives for each conference, a co-authorship network was created. This revealed a large connected component — 52% of all authors are connected to one another by a number of intermediate co-authors. Twenty-four communities were detected within this component, heavily influenced by geographical proximity. Communities are connected by the movement
of researchers between research groups and by the interactions and discussions at conferences. This network could be displayed online, allowing authors interactive access to the data. This would also facilitate the updating of the data with each new set of conference proceedings.

There have now been two papers explicitly describing the DAFx conference proceedings: one on basic bibliometrics and one describing the details of the network. There is scope for further work, in the context of DAFx and also more generally. The Bibtex entries for the DAFx conference could contain information on submission type: keynote, tutorial, oral presentation or poster presentation. It would be interesting to examine whether the choice of poster or oral presentation has an impact on the formation of collaborative links. Of course, DAFx does not exist in isolation, and its contributors also make submissions to other conference series. Larger co-authorship networks can be constructed by the merger of related conference proceedings. This could reveal the extent of the overlap and the interdisciplinary nature of the research groups involved.

7. REFERENCES

[1] Wolfgang Glänzel and András Schubert, “Analysing scientific networks through co-authorship,” in Handbook of Quantitative Science and Technology Research, pp. 257–276. Springer, 2004.

[2] Wolfgang Glänzel and Bart Thijs, “Does co-authorship inflate the share of self-citations?,” Scientometrics, vol. 61, no. 3, pp. 395–404, 2004.

[3] Katy Börner, Shashikant Penumarthy, Mark Meiss, and Weimao Ke, “Mapping the diffusion of scholarly knowledge among major US research institutions,” Scientometrics, vol. 68, no. 3, pp. 415–426, 2006.

[4] Howard D White, Barry Wellman, and Nancy Nazer, “Does citation reflect social structure?: Longitudinal evidence from the Globenet interdisciplinary research group,” Journal of the American Society for Information Science and Technology, vol. 55, no. 2, pp. 111–126, 2004.

[5] Jinseok Kim and Jana Diesner, “A network-based approach to coauthorship credit allocation,” Scientometrics, vol. 101, no. 1, pp. 587–602, 2014.

[6] Howard D White and Katherine W McCain, “Visualizing a discipline: An author co-citation analysis of information science, 1972–1995,” Journal of the American Society for Information Science, vol. 49, no. 4, pp. 327–355, 1998.

[7] Brahim Hamadicharef, “Bibliometric study of the DAFx proceedings 1998–2009,” in Proceedings of the International Conference on Digital Audio Effects, 2010, pp. 427–430.

[8] Brahim Hamadicharef, “Scientometric study of the IEEE Transactions on Software Engineering 1980–2010,” in Proceedings of the 2nd International Congress on Computer Applications and Computational Science, pp. 101–106. Springer, 2012.

[9] Caroline S Wagner and Loet Leydesdorff, “Mapping the network of global science: Comparing international co-authorships from 1990 to 2000,” International Journal of Technology and Globalisation, vol. 1, no. 2, pp. 185–208, 2005.

[10] Jin Ha Lee, M Cameron Jones, and J Stephen Downie, “An analysis of ISMIR proceedings: Patterns of authorship, topic, and citation,” in Proceedings of the 10th International Conference on Music Information Retrieval, 2009, pp. 57–62.

[11] Xiao Hu, Kahyun Choi, Jin Ha Lee, Audrey Laplante, Yun Hao, Sally Jo Cunningham, and J Stephen Downie, “WiMIR: An informetric study on women authors in ISMIR,” in Proceedings of the 17th International Conference on Music Information Retrieval (ISMIR), New York, USA, 2016.

[12] France Cheong and Brian J Corbitt, “A social network analysis of the co-authorship network of the Pacific Asia Conference on Information Systems from 1993 to 2008,” PACIS Proceedings, 2009.

[13] Richard Vidgen, Stephan Henneberg, and Peter Naudé, “What sort of community is the European Conference on Information Systems? A social network analysis 1993–2005,” European Journal of Information Systems, vol. 16, no. 1, pp. 5–19, 2007.

[14] NWB Team et al., “Network Workbench tool,” Indiana University, Northeastern University, and University of Michigan, 2006.

[15] Matthew A Jaro, “Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida,” Journal of the American Statistical Association, vol. 84, no. 406, pp. 414–420, 1989.

[16] Matthew A Jaro, “Probabilistic linkage of large public health data files,” Statistics in Medicine, vol. 14, no. 5–7, pp. 491–498, 1995.

[17] Mathieu Bastian, Sebastien Heymann, Mathieu Jacomy, et al., “Gephi: An open source software for exploring and manipulating networks,” ICWSM, vol. 8, pp. 361–362, 2009.

[18] Robert Tarjan, “Depth-first search and linear graph algorithms,” SIAM Journal on Computing, vol. 1, no. 2, pp. 146–160, 1972.

[19] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, no. 10, 2008.

[20] Thomas MJ Fruchterman and Edward M Reingold, “Graph drawing by force-directed placement,” Software: Practice and Experience, vol. 21, no. 11, pp. 1129–1164, 1991.

[21] Tore Opsahl, Filip Agneessens, and John Skvoretz, “Node centrality in weighted networks: Generalizing degree and shortest paths,” Social Networks, vol. 32, no. 3, pp. 245–251, 2010.