
Localization of Sound Sources

Studies on Mechatronics

Spring 2009

Christian Lenz

Supervisors: Ambroise Krebs, Laurent Kneip


Professor: Prof. Dr. Roland Siegwart, ASL

Autonomous Systems Lab


ETH Zürich
Preface
The studies on mechatronics presented in this work have been performed at the Au-
tonomous Systems Laboratory of the Federal Institute of Technology in Zurich during
the spring semester 2009. I would like to thank Ambroise Krebs and Laurent Kneip
for their support and their work concerning my studies. Another thank-you goes to Rote
Fabrik Zürich, which supported me with some literature and graphics.

Declaration of Independence
I, Christian Lenz, hereby declare that this Studies on Mechatronics with the title
“Localization of Sound Sources” was written by myself. All sources used are declared
and citations are clearly marked.

Zürich
May 10, 2009

Christian Lenz

Abstract
The goal of this work is to give an overview of the field of artificial sound localization
techniques. For this purpose the approach is decomposed into three parts treated in this
document: firstly the data acquisition, secondly the signal processing and thirdly the
microphones and their physical arrangement.
The most commonly used technique, the inter-aural time difference (ITD), will be dis-
cussed, as well as techniques concerning inter-aural level difference (ILD), beamforming
(BF), microphone directivity (MD) and head related transfer functions (HRTF). These
techniques mainly concern the data acquisition. Also important are different ways of
signal processing: the influence of different correlation techniques and filtering proces-
sors that help to separate the desired signal from noise will be discussed. The third part
describes the influence of different microphone arrangements.
In a last section the “Mosquito Localization Problem” will be discussed, and possible
solutions are imagined to localize a flying mosquito in order to “blind” it.

Contents

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Area Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 Scheme of Sound Source Localization . . . . . . . . . . . . . . . . 2
1.3 Assumptions and Restrictions . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Sound Localization – Data Acquisition and Signal Processing 4


2.1 Inter-aural Time Difference – ITD . . . . . . . . . . . . . . . . . . . . . 4
2.2 Inter-aural Level Difference – ILD . . . . . . . . . . . . . . . . . . . . . 6
2.3 Beamforming – BF / Steered Response Power – SRP . . . . . . . . . . . 7
2.4 Microphone Directivity – MD . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Head Related Transfer Functions – HRTF . . . . . . . . . . . . . . . . . 9
2.6 Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.7 Generalized Cross-Correlation – GCC . . . . . . . . . . . . . . . . . . . 11
2.7.1 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.7.2 Some Observations on Ry1 y2 . . . . . . . . . . . . . . . . . . . . . 12
2.7.3 Pre-Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Hardware 14
3.1 Classification of Microphones . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 Differentiation by Principles . . . . . . . . . . . . . . . . . . . . . 14
3.1.2 Type of Construction and Directivity . . . . . . . . . . . . . . . 15
3.1.3 Additional Structures . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Number and Arrangement of Microphones . . . . . . . . . . . . . . . . . 17
3.2.1 One Microphone . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 Two Microphones . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.3 Arrays of Microphones . . . . . . . . . . . . . . . . . . . . . . . . 18

4 Mosquito Localization Problem 19


4.1 The Male Mosquito Hearing System . . . . . . . . . . . . . . . . . . . . 19
4.2 Problem Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Choice of Most Appropriate Technique . . . . . . . . . . . . . . . . . . . 20
4.3.1 Proposal of Near-field Solution . . . . . . . . . . . . . . . . . . . 20

5 Conclusion 22

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

6 Bibliography 23

List of Figures

1.1 Scheme of Sound Localization Process . . . . . . . . . . . . . . . . . . . 2

2.1 ITD – time shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


2.2 ITD – far-field assumption . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 ILD – Inter-aural Level Differences . . . . . . . . . . . . . . . . . . . . . 6
2.4 Delay-and-Sum Beamformer . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5 Scheme of Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.1 Usual Directional Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . 16

List of Tables

2.1 Pre-filtering processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Nomenclature
ASL Autonomous Systems Lab

BF Beam Forming

DSBF Delay-and-Sum Beamforming

Eckart Eckart Filtering Function

ETH Eidgenössische Technische Hochschule

FBS Frequency Band Selection

FD Frequency Domain

GCC Generalized Cross-Correlation

HRTF Head Related Transfer Function

HT Hannan Thomson Weighting Function

IID Inter-aural Intensity Difference

ILD Inter-aural Level Difference

IPD Inter-aural Phase Difference

ITD Inter-aural Time Difference

MD Microphone Directivity

MEMS Micro-Electro-Mechanical Systems

ML Maximum Likelihood

PHAT Phase Transform Weighting Function

RIR Roth Impulse Response

SCOT Smoothed Coherence Transform Weighting Function

SNR Signal-to-Noise Ratio

SoM Studies on Mechatronics

Sonar sound navigation and ranging

SRP Steered Response Power

SRP-PHAT Steered Response Power Phase Transform Weighting Function

TD Time Domain

TDA, TDOA Time-Difference Of Arrival

UCC Unfiltered Cross-Correlation

1 Introduction

1.1 Motivation
The motivation to do the Studies on Mechatronics (SoM) on the localization of sound
sources has its origin in my job as a sound engineer. Working regularly at live concerts,
and thus trying to listen precisely to every sound, made me wonder about artificial
ways to “listen”. Human hearing is easily influenced by psychoacoustic effects. In
live concert applications this is desired; in other situations it could be interesting to do
measurements without being “deceived”.
Looking for a possibility to write the SoM I had a conversation with Ambroise Krebs,
PhD student at the ASL at ETH Zürich. He knew of Laurent Kneip, also a PhD student
at the ASL, who had already done some work on artificial sound source localization
and finally had the idea of the “Mosquito Localization Problem”. Once I received the
project description, the work began.

1.2 Problem Area Overview


The human hearing system is a very complex and sophisticated organ. It enables human beings
to locate and analyze sounds with astonishing precision in direction, elevation and even
distance. Surprisingly, this works even with only one ear!
It is somewhat difficult to pinpoint the origin of the sound localization problem. Perhaps
the greatest efforts in this field have been made in robotics, trying to give robots
maximally human-like behaviour or to make them perform special tasks involving sound
localization. Also important in this context is the interaction between robots and
humans, which brings up the tasks of computational voice detection and computa-
tional linguistics. Another very important field is certainly the development of artificial
hearing aids for persons with a hearing disorder. The armed forces are also interested
in passive ways of sound localization in order to fulfil anti-aircraft missions or to quickly
locate snipers.
So far, only passive ways of sound localization were mentioned. Strongly re-
lated are active acoustic localization techniques. Worth mentioning in this context are
Sonar, an echo-sounding technique to measure the depth of the sea, and in general
echo-location techniques as used by whales or bats. This again leads to biology, which
in turn stands as an example for artificial solutions.

1.2.1 Scheme of Sound Source Localization
For a better understanding of the task of sound source localization different process
steps have to be identified. Under the assumption of listening passively to the envi-
ronment without use of any active localization technique the following scheme can be
established:

Figure 1.1: Schematics of the Sound Source Localization Process

To describe a specific situation where a sound source has to be localized, three
main items have to be considered: the environment, noise and the source itself. The
environment could be quiet, noisy, reverberant or, in specific laboratory applications, even
acoustically dead. Since a signal has to propagate through a medium, it can easily be
influenced by noise. This noise could either be other sounds or reflections of the
source itself from walls. Note that only external noise is considered here. Finally, the
source itself carries attributes like directivity, constancy or the type of signal, such as ran-
dom noise (white, pink, brown, ...), a (sine) tone, a measurement signal (special noise,
impulse) or a ‘real’ sound (e.g. speech or music). This brings a lot of uncertainties
a passive localization system has to cope with. For simplification the sound source is
supposed to be an omnidirectional point source.

By listening with one or several ‘ears’ to a situation as described above, several
problems, as illustrated in figure 1.1, arise. The first is to extract information
that can be used to compute the sound source position. This contains the application
of one or several microphones to get the data and later on the use of different physical
principles to extract specific information for the following steps. In this text, this task
is called data acquisition. Then signal processing has to be done in order to extract the
wanted information and later on to compute the spatial position of the sound source.
Finally, there are many possibilities to implement a sound source localization system
physically. This is treated as a third scope of duties, named physical implementation.
The differentiation between the three described areas, data acquisition, signal processing
and physical implementation, is sometimes very difficult since in reality the areas overlap.
As an example one could mention the use of highly direction-sensitive
microphones: here, already with the data acquisition, information about the direction is
gained. By choosing the overall strategy, and therefore the physical principle one would
like to use, one often already imposes a restriction, or even a ‘partial solution’, on
one or several of the other problem fields.

There exist alternative approaches to sound source localization. One of them
could be to know the environment exactly through pre-calculated impulse responses. By
listening with an unknown arrangement of the microphones, the position can be com-
puted backwards [1]. System analysis is usually done by using white noise to
compute the spectral frequency response, and later on this data can be used for
further calculations. Other approaches are based very closely on biological analogies,
for example on neural networks [2]. Other authors propose to combine classical sound
localization techniques with other techniques like video imaging, radar or IR [3]. Since
we want a system that could be applied in every environment, purely artificial and fully
passive, these techniques are neglected.

The structure of this paper matches the schematic of figure 1.1. The first part focuses
on data acquisition: different physical principles to acquire data are discussed. In the second
part the focus is on ways of signal processing in order to compute the position
of a sound source or to make the system more robust to noise. The third part
is about different microphone types and varying ways of arranging them. Finally,
the last part discusses the mosquito localization problem in more detail and possible
approaches are proposed.

1.3 Assumptions and Restrictions


To narrow down the huge field of sound localization, some restrictions are made:

artificial Only artificial ways of sound localization are discussed. Ideas like neural
networks and other biologically based approaches are beyond the scope of this text.

passive It is assumed that the localization system only passively listens to a generic
unknown situation. No active techniques of localizing obstacles like sonar or echo
location are discussed.

source Since the sound source has no defined shape, all techniques which combine pas-
sive sound localization with other techniques like video imaging, radar
or IR are omitted. The sound source is supposed to be an omnidirectional point
source.

post-processing Post-processing steps (like post-filtering algorithms) to improve the sys-
tem’s accuracy and to make it more stable are not objectives of this paper.

2 Sound Localization – Data Acquisition
and Signal Processing
2.1 Inter-aural Time Difference – ITD
The most common sound source localization technique is the inter-aural time
difference (ITD). Synonyms for this technique are Inter-aural Phase Dif-
ference (IPD) and Time-Difference Of Arrival (TDA or TDOA). Since it is relatively
simple to obtain the phase shift between two signals, this technique has been investigated
a lot and has found its way into many applications.

The basic idea of ITD is based on time shifts between received signals due to the
finite speed of sound. A sound wave front propagating through a medium travels at the
speed of sound, which for a gas depends on the gas itself and the temperature:
a = √(γRT) ≈ 344 m/s for air at 295 K. Due to this fact the wave front arrives at different times at
different locations. Thus, differently placed microphones receive more or less the
same signal but with a small time shift. As shown later, the computation used to
estimate the direction to the sound source (the distance, in contrast, is not independent
of the speed of sound) does not depend on the speed of sound itself. This
makes the ITD principle highly adaptive to different environments: it can be applied
in a gas as well as in a liquid.
Figure 2.1: Illustration of the time shift between two microphones induced by a single,
immovable sound source.

The mathematical model proposed below is based on the far-field assumption [4, 5].
The far-field assumption simply states that the distance l from the sound source to the
microphones is much larger than the distance d between the two microphones forming
a pair. Therefore the two incident “sound rays” can be assumed to be parallel, which allows
a simple calculation. Further, we assume that there is no diffraction involved.

Figure 2.2: The far-field assumption l >> d allows simple calculations. (Source: [5])

Since most sound signals can be approximated as stationary for short periods of time
it can be assumed that the sound source is immovable during computations. As an
example, for speech this period is in the order of 20 − 30 ms [1].

$$\cos\Phi = \frac{c\,\tau_{ij}}{\|\vec{p}_i - \vec{p}_j\|} = \frac{(\vec{p}_i - \vec{p}_j)\cdot\vec{u}}{\|\vec{p}_i - \vec{p}_j\|\,\|\vec{u}\|} = \sin\Theta \qquad (2.1)$$

$$\Rightarrow \quad \tau_{ij} = \frac{1}{c}\,(\vec{p}_i - \vec{p}_j)\cdot\vec{u} \qquad (2.2)$$

where $\vec{u}$ is the unit vector ($\|\vec{u}\| = 1$) denoting the direction to the sound source and $c$ is
the speed of sound. A highly significant parameter is the distance between the microphones
$\|\vec{p}_i - \vec{p}_j\|$: making it larger generally yields higher accuracy in the estimated
angle. This can easily be seen in formula 2.3 as derived in [6]:

$$\cos\Phi = \frac{d_L - d_R}{2k} \qquad (2.3)$$

where $\Phi$ denotes the angle to the direction of the sound source, $d_L$, $d_R$ the distances
to the sound source measured from the left/right microphone, and $k$ the distance be-
tween the two microphones. It is irrelevant to the principle, but worth mentioning here,
that there exists an approach which takes the sampling rate $F_s$ into account for the calculation
of $\tau$ [5]. The time difference, then expressed in samples, becomes

$$\tau_{ij} = \frac{F_s}{c}\,(\vec{p}_i - \vec{p}_j)\cdot\vec{u} \qquad (2.4)$$

where $F_s$ is the sampling rate.
It is important to understand that the equations stated above only describe the prin-
ciple of sound source localization via ITD. Further processing has to be done in order
to compute the sound source’s position. By using correlation techniques, $\tau$ can be de-
termined. If a beamformer is used (see beamforming, section 2.3), this leads directly to
an estimation of the source’s position. If no beamformer is used, there have to be other
ways of computing the position. One is proposed by Kneip in [6]: a purely geometric
model to deduce the direction to the source, or the position of the sound source respectively.
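To make the ITD principle more concrete, the following sketch (NumPy assumed; the function name and parameters are hypothetical and not taken from [6]) estimates the TDOA of a microphone pair by cross-correlation and converts it into an angle with the far-field relation of equation 2.1.

```python
import numpy as np

def estimate_itd_angle(x1, x2, fs, mic_distance, c=344.0):
    """Estimate the far-field direction angle from the signals of one microphone pair.

    x1, x2       : sampled signals of the microphone pair
    fs           : sampling rate in Hz
    mic_distance : spacing d between the two microphones in metres
    c            : assumed speed of sound in m/s
    """
    # Cross-correlate the two channels and locate the lag of maximum similarity.
    corr = np.correlate(x1, x2, mode="full")
    lag = np.argmax(corr) - (len(x2) - 1)
    tau = -lag / fs      # TDOA in seconds; tau > 0 if the wavefront reaches mic 1 first

    # Far-field relation of eq. 2.1: cos(Phi) = c * tau / d (clipped to a valid range).
    cos_phi = np.clip(c * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arccos(cos_phi))

# Example: white noise that reaches microphone 2 five samples later than microphone 1.
fs = 48000
s = np.random.randn(4096)
print(estimate_itd_angle(s, np.roll(s, 5), fs, mic_distance=0.2))
```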

2.2 Inter-aural Level Difference – ILD


Although ILD was intensively investigated in a biological context, it has rarely found its
way into artificial applications [7]. Most past research is based on ITD (see page
4, section 2.1). ILD is completely based on the relative energy difference between
the two signals of a microphone pair. In contrast to most other techniques, the phase of the
received signals is of no importance and can therefore be neglected. Another often
heard synonym for this technique is Inter-aural Intensity Difference (IID).

Figure 2.3: ILD cues are purely based on relative energy differences between two or
several ears. (Source: Thomson Higher Education, 2007)

A sound source $s(t)$ is assumed. While it propagates, the sound is corrupted by noise
from the environment. As sensors, we suppose $N$ microphones. The signal received
by the $i$th microphone can then be modeled as

$$x_i(t) = \frac{s(t)}{d_i} + \xi_i(t) \qquad (2.5)$$

where $d_i$ is the constant distance of the $i$th microphone from the sound source and
$\xi_i(t)$ is white Gaussian noise simulating the noisy environment.
In order to compute the relative energy difference between microphone pairs we have
to make some simplifications. The position of the source $s(t)$ as well as the positions of the
microphones are supposed to be constant during processing. This time window is defined
as the interval $[0, W]$ where $W$ is the window size. The signal must be audible.
The energy received by the $i$th microphone can be computed as

$$E_i = \int_0^W x_i^2(t)\,dt
     = \int_0^W \left(\frac{s(t)}{d_i} + \xi_i(t)\right)^{\!2} dt
     = \int_0^W \left(\frac{s^2(t)}{d_i^2} + 2\,\frac{s(t)}{d_i}\,\xi_i(t) + \xi_i^2(t)\right) dt
     = \frac{1}{d_i^2}\int_0^W s^2(t)\,dt + \int_0^W \xi_i^2(t)\,dt \qquad (2.6)$$

where it is assumed that integrating the non-squared noise $\xi_i(t)$ yields a mean
value of zero, so the cross-term can be neglected. As the energy is inversely
proportional to the square of the distance from the source to the microphone
($E_i \propto 1/d_i^2$), this relation is known as the inverse-square law [7].
For further discussion, a planar problem with only two microphones is considered.
Using equation (2.6) for two microphones leads to the following simple relation between
energies and distances

$$E_1 d_1^2 = E_2 d_2^2 + \eta \qquad (2.7)$$

where $\eta = \int_0^W \left(\xi_1^2(t) - \xi_2^2(t)\right) dt$ is a zero-mean random variable if the variance of $\xi_i(t)$
is constant.
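As a small numerical illustration of relation 2.7, the minimal sketch below (hypothetical names, NumPy assumed) compares the windowed energies of two microphone signals: by the inverse-square law their ratio yields the ratio of the source distances.

```python
import numpy as np

def frame_energy(x):
    """Discrete version of the energy integral of equation 2.6 over one window."""
    return float(np.sum(x ** 2))

def distance_ratio_from_ild(x1, x2):
    """Estimate d2/d1 from the inverse-square law E1*d1^2 = E2*d2^2 (eq. 2.7)."""
    return np.sqrt(frame_energy(x1) / frame_energy(x2))

# Example: the same source attenuated with 1/d plus a little independent noise.
rng = np.random.default_rng(0)
s = rng.standard_normal(48000)
d1, d2 = 1.0, 2.0
x1 = s / d1 + 0.01 * rng.standard_normal(s.size)
x2 = s / d2 + 0.01 * rng.standard_normal(s.size)
print(distance_ratio_from_ild(x1, x2))   # close to d2/d1 = 2.0
```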

2.3 Beamforming – BF / Steered Response Power – SRP


Beamforming is a widely applied and well-known technique. It is based on inter-aural
time differences and is therefore strongly related to the classical ITD approach. The main
difference to ITD is that there is no exact calculation but a position estimation obtained by
steering the beamformer through space and looking for the maximal output. This leads
to the alternative expression steered response power (SRP), where power describes
the output of the beamformer, the energy $E$. Another synonym is Delay-and-Sum
Beamforming (DSBF). By using multiple microphones arranged as an array, the system’s
robustness to noise and stability are increased.

Steering a beamformer and looking for the highest output can simply be done with a so-
called delay-and-sum beamformer. Having a certain number of microphones positioned
in space in a fixed structure (an array of microphones) allows the time delays between any two
microphones to be determined exactly. In a delay-and-sum beamformer all received signals are aligned (corrected
in phase) and added. Since the microphone signals are in phase, they add con-
structively in the summation, while the noise, assumed white, cancels out.
The output of a delay-and-sum beamformer as in figure 2.4 with M microphones is
defined as:
$$y(n) = \sum_{m=0}^{M-1} x_m(n - \tau_m) \qquad (2.8)$$

where $x_m(n)$ represents the received signal of the $m$th microphone and $\tau_m$ the corre-
sponding time delay of arrival [8]. Finding the maximum output can be done by
computing the energy over a frame of length $L$:

$$E = \sum_{n=0}^{L-1} \Bigl[ x_0(n - \tau_0) + \ldots + x_{M-1}(n - \tau_{M-1}) \Bigr]^2 \qquad (2.9)$$

$E$ will be maximized if all the $\tau_m$ are such that the signals are in phase. Note that it is
assumed that there is only one sound source present; otherwise the possibility of two
energy peaks of different signals overlapping is highly increased, and in that case it
would of course be impossible to differentiate them.

Figure 2.4: An M-microphone delay-and-sum beamformer. M signals are received, de-
layed and then summed.

The further signal processing often requires a lot of computational power, especially
if the problem is handled in the time domain. It is often better to transform to the frequency
domain to reduce the computational cost. A description of a delay-
and-sum beamformer is also given in [8].
By expanding equation 2.9 the beamformer can be described as:

$$E = \sum_{m=0}^{M-1} \sum_{n=0}^{L-1} x_m^2(n - \tau_m) \;+\; 2 \sum_{m_1=0}^{M-1} \sum_{m_2=0}^{m_1-1} \sum_{n=0}^{L-1} x_{m_1}(n - \tau_{m_1})\, x_{m_2}(n - \tau_{m_2}) \qquad (2.10)$$

And in terms of cross-correlations:

$$E = K + 2 \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{n_1-1} R_{x_{n_1} x_{n_2}}(\tau_{n_1} - \tau_{n_2}) \qquad (2.11)$$

$K = \sum_{m=0}^{M-1}\sum_{n=0}^{L-1} x_m^2(n - \tau_m)$ is assumed to be constant and is therefore neglected when
maximizing the output energy $E$. In the frequency domain one can approximate the
cross-correlation function as

$$R_{ij}(\tau) \approx \sum_{k=0}^{L-1} X_i(k)\,\overline{X_j(k)}\,e^{j 2\pi k \tau / L} \qquad (2.12)$$

where $X_i(k)$ is the discrete Fourier transform of $x_i(n)$, $X_i(k)\overline{X_j(k)}$ is the cross-
spectrum of $x_i(n)$ and $x_j(n)$, and $\overline{(\,\cdot\,)}$ denotes the complex conjugate. The reduction in
required computational power is described with an example in [5]; here, only a
very short summary is given. By pre-computing the $R_{ij}(\tau)$, it takes just $N(N-1)/2$
lookup and accumulation operations, whereas in the time domain the computation requires
$2L(N+2)$ operations.

The main problem when using a delay-and-sum beamformer is that the energy peaks are
relatively wide. This makes the resolution poorer and it therefore becomes more difficult to
identify sources or to separate them from other sources or noise. One way to improve accuracy
is to ‘whiten the signals’, as explained in section 2.7.
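The following sketch illustrates the steered response power idea of equations 2.8 and 2.9 (NumPy assumed; the function and variable names are hypothetical, and integer-sample delays with circular shifts are used for brevity): for each candidate far-field direction the channels are delayed, summed and the frame energy is evaluated, and the direction with the maximum energy is returned.

```python
import numpy as np

def srp_delay_and_sum(frames, mic_positions, directions, fs, c=344.0):
    """Steered response power search with a delay-and-sum beamformer.

    frames        : (M, L) array with one signal frame per microphone
    mic_positions : (M, 3) microphone coordinates in metres
    directions    : (K, 3) unit vectors of candidate source directions
    fs            : sampling rate in Hz
    Returns the candidate direction with the highest output energy.
    """
    M, L = frames.shape
    best_dir, best_energy = None, -np.inf
    for u in directions:
        # Relative far-field delays per microphone (eq. 2.4), in whole samples.
        taus = np.round(fs / c * (mic_positions @ u)).astype(int)
        taus -= taus.min()                   # make all alignment delays non-negative
        # Align every channel (circular shift for brevity) and sum them (eq. 2.8).
        y = np.zeros(L)
        for m in range(M):
            y += np.roll(frames[m], taus[m])
        energy = np.sum(y ** 2)              # frame energy of eq. 2.9
        if energy > best_energy:
            best_dir, best_energy = u, energy
    return best_dir, best_energy
```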

2.4 Microphone Directivity – MD


Another approach is to use fixed, highly directional microphones for sound localization.
Each microphone covers a specific region of the room. By looking for the maximal
signal received by the microphones one can locate the sound source within the corre-
sponding sector. Additional structures as shown in chapter 3 may be used for better
results. Another similar approach is to make the highly directional microphones
movable and have them scan the room for a sound source. Its position can then
be computed based on the geometrical information available from the aligned micro-
phones. These approaches exist, but the problem with them is that the resolution is
generally low.

2.5 Head Related Transfer Functions – HRTF


The ILD / ITD approaches need at least three pairs of microphones to avoid
ambiguities. For an applicable system this brings a relatively high technical
complexity. One very smart solution comes along with the approach of head related
transfer functions. The main disadvantage is that HRTFs need a lot of computational
power because of their long impulse responses.

The main idea of head related transfer functions is to look closely at nature. The
human hearing system is a very complex organ with amazing possibilities to locate
sound sources. Even with the full capabilities of our hearing system available (including
inter-aural level difference and inter-aural time delay), we can locate sound precisely
even with one ear shut; this works mainly for frequencies above 5 kHz, below which the
human auditory system uses ILD as its main cue. The reason is that our hearing system
performs a spectral shaping of the sound, and HRTFs are a way to model this shaping
of the sound as it enters the ear.

Incoming sound is shaped by the whole hearing system: the pinnae and the
ear canal, but also the head and torso, spectrally shape the sound. This shaping
strongly depends on the direction of the sound source relative to the hearing system.
Since shorter wavelengths correspond to higher frequencies, sound localization systems based
on HRTFs show better results when locating higher frequencies, mainly above 5 kHz [9],
where the shaping of the sound is much stronger than at lower frequencies.

For technical applications two approaches are known: a binaural one [9] (two ears, like the
human auditory system, each containing one microphone) and a monaural one [10] (two
microphones implemented as one ear, one placed inside and one outside). The first approach,
as proposed by Keyrouz, works with a database of the well-known KEMAR HRTFs as a
look-up table. The measured signal is correlated against this table and the most likely
position estimate is then chosen. This needs relatively low computational power. The
second approach, as proposed by Hao, requires much more computational power due to
the lack of closed-form algorithms. This is why he proposes to narrow down the huge
dataset of KEMAR transfer functions by first applying IIR / FIR filters to the data set,
and secondly to apply self-learning neural networks to cope with the high complexity.
All this is beyond the scope of this paper; the reader is referred to [10].
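As an illustration of the look-up-table idea, the sketch below compares measured ear spectra against a database of HRTF magnitude responses and returns the best-matching direction. It is only a simplified outline under assumed data structures (the dictionary hrtf_db and all names are hypothetical); the actual processing of the KEMAR set in [9, 10] is considerably more involved.

```python
import numpy as np

def locate_with_hrtf_table(left, right, hrtf_db, n_fft=512):
    """Pick the direction whose stored HRTF magnitudes best match a measurement.

    left, right : signals recorded at the two ears
    hrtf_db     : dict mapping (azimuth, elevation) -> (|H_left|, |H_right|),
                  magnitude responses sampled on n_fft // 2 + 1 frequency bins
    """
    # Magnitude spectra of the measured ear signals.
    mag_l = np.abs(np.fft.rfft(left, n_fft)) + 1e-12
    mag_r = np.abs(np.fft.rfft(right, n_fft)) + 1e-12
    # Inter-aural spectral ratio, so that the unknown source spectrum cancels out.
    measured = np.log(mag_l / mag_r)

    best_dir, best_score = None, -np.inf
    for direction, (h_left, h_right) in hrtf_db.items():
        template = np.log((h_left + 1e-12) / (h_right + 1e-12))
        # Correlation coefficient between measured and stored spectral ratios.
        score = np.corrcoef(measured, template)[0, 1]
        if score > best_score:
            best_dir, best_score = direction, score
    return best_dir
```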

2.6 Signal Processing

In this chapter signal processing is the focus of interest. Due to noise, reverberation
and system inaccuracy, poor signal quality is often encountered in sound localization,
which makes it hard to run the localization cue. To increase system stability and robustness
to noise, specific signal processing can be done. The most common way to do so in
sound localization tasks is presented below.
The goal of pre-filters is to accentuate the signal where the highest signal-to-noise ratio
(SNR) can be found; noise, in contrast, shall be suppressed. There exist also other
possibilities to handle and process the acquired data which are not mentioned here.

2.7 Generalized Cross-Correlation – GCC
As ITD cues are the most commonly used, most algorithms are developed for applica-
tions in combination with microphone arrays and therefore for steered response power.
In general, the task is to compute an estimate of the time delay of arrival $D$.
Cross-correlation and pre-filtering can be done in the time domain (TD),
but it takes much less computational power to do it in the fre-
quency domain (FD). The derivations shown here are mainly based on the paper by
Knapp and Carter [11].

[Block diagram: the signals x1 and x2 are filtered by H1 and H2 (one branch delayed), multiplied, integrated over T and squared; a peak detector then selects the delay.]

Figure 2.5: Schematic of the signal processing in sound source localization tasks.
(Source: [11])

A signal $s_1(t)$ is assumed to be the only signal in the room, and to be uncorrelated
with the noise. Noise and reverberation are accounted for by $n_i(t)$, simply called ‘noise’
and assumed to be white. The signals $x_i(t)$ received by the microphones can therefore
be modelled as in equations 2.13 and 2.14:

$$x_1(t) = s_1(t) + n_1(t) \qquad (2.13)$$

$$x_2(t) = \alpha\,s_1(t + D) + n_2(t) \qquad (2.14)$$

where for simplification only two microphones are assumed, $\alpha$ is the attenuation and
$D$ the delay (phase shift) between the two received signals $x_i(t)$. As a relatively slowly
varying environment is assumed, the signal $s_1(t)$, the noise $n_i(t)$, the attenuation $\alpha$ and
the delay $D$ are supposed to be stationary during the observation time $T$. Further, it
is necessary to have a sufficiently high signal-to-noise ratio to be able to detect the
sound source. This will be discussed later.

2.7.1 Principle
The received signals being defined, this subsection shows how two signals can be cross-
correlated. As a reminder, the goal is to obtain an estimate $\hat{D}$ of the true delay $D$. The
correlation function is defined by

$$R_{x_1 x_2}(\tau) = E\left[x_1(t)\,x_2(t - \tau)\right] \qquad (2.15)$$

where $E$ denotes the expectation value. The argument $\tau$ that maximizes (2.15) yields the value
of the delay $D$. An estimate of the cross-correlation is given by

$$\hat{R}_{x_1 x_2}(\tau) = \frac{1}{T - \tau} \int_{\tau}^{T} x_1(t)\,x_2(t - \tau)\,dt \qquad (2.16)$$

Again, the maximizing $\tau$ is taken as the delay estimate. It is important to understand that this
can only be an estimate because of the finite observation time. The correlation can
be computed either in the time domain or in the frequency domain. As already mentioned, it is
normally more reasonable to correlate in the FD because of the lower computational cost. Since
the link between TD and FD is given by the Fourier transform, the correlation can
be written as:

$$R_{x_1 x_2}(\tau) = \int_{-\infty}^{\infty} G_{x_1 x_2}(f)\,e^{i 2\pi f \tau}\,df \qquad (2.17)$$

An unfiltered cross-correlation is obtained with $H_1(f) = H_2(f) = 1$ (see figure 2.5).
When filtering is applied, the cross-spectrum of the filter outputs can be written as:

$$G_{y_1 y_2}(f) = H_1(f)\,\overline{H_2}(f)\,G_{x_1 x_2}(f) \qquad (2.18)$$


where $\overline{H_2}(f)$ denotes the complex conjugate of $H_2(f)$. This leads to an expression
for the generalized cross-correlation:

$$R^{(g)}_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} \Psi_g(f)\,G_{x_1 x_2}(f)\,e^{i 2\pi f \tau}\,df \qquad (2.19)$$

where $\Psi_g(f) = H_1(f)\,\overline{H_2}(f)$ is the generalized frequency weighting. Again, because
of the finite observation time $T$ we only have an estimated cross-spectrum $\hat{G}_{x_1 x_2}(f)$, whereupon
the estimated generalized correlation function becomes:

$$\hat{R}^{(g)}_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} \Psi_g(f)\,\hat{G}_{x_1 x_2}(f)\,e^{i 2\pi f \tau}\,df \qquad (2.20)$$

Pre-filtering is done by choosing $\Psi_g(f)$ well; it may also have to be estimated,
depending on the amount of information available about the signal to be located.
The cross-correlation is evaluated for a number of values around the expected peak,
and a peak selector then chooses the location of the peak, which is the estimate $\hat{D}$ of the delay.

2.7.2 Some Observations on Ry1 y2


In the following, some observations on $R_{y_1 y_2}$ are made, assuming ideal conditions.
Taking the Fourier transform of the cross-correlation function yields the cross-power
spectrum:

$$R_{x_1 x_2}(\tau) = \alpha R_{s_1 s_2}(\tau) + \underbrace{R_{n_1 n_2}(\tau)}_{0}, \qquad
G_{x_1 x_2}(f) = \alpha\,G_{s_1 s_2}(f)\,e^{-i 2\pi f D} + G_{n_1 n_2}(f) \qquad (2.21)$$

where $n_1(t)$ and $n_2(t)$ are assumed to be uncorrelated. Since a multiplication in the frequency
domain corresponds to a convolution in the time domain:

$$R_{x_1 x_2}(\tau) = \alpha R_{s_1 s_2}(\tau) \otimes \delta(\tau - D) \qquad (2.22)$$

The term $\delta(\tau - D)$ is ‘spread’ or ‘smeared’ by the convolution with $R_{s_1 s_2}(\tau)$. This is
how the peaks in the correlation become broadened. As long as there is only one single delay,
peak broadening is no big problem. With multiple delays or even multiple sources,
the resolution goes down and it may become impossible to distinguish the peaks or delay
times:

$$R_{x_1 x_2}(\tau) = R_{s_1 s_2}(\tau) \otimes \sum_i \alpha_i\,\delta(\tau - D_i) \qquad (2.23)$$

Good time delay resolution means choosing $\Psi_g(f)$ so as to ensure a large, sharp peak
in $R_{y_1 y_2}(\tau)$. A sharper peak, however, makes the system more sensitive to errors (for
example due to the finite observation time $T$), especially at low SNR. The choice of
$\Psi_g(f)$ is therefore a tradeoff between stability and resolution.

2.7.3 Pre-Filtering
As mentioned above, the choice of $\Psi_g(f) = H_1(f)\cdot\overline{H_2}(f)$ is an important step in the
design of a pre-filter. How it is best chosen depends strongly on the problem the filter
will be applied to. Knapp and Carter present possible processors in their paper [11];
they are listed below:

Processor Name   Weighting $\Psi_g(f) = H_1(f)\cdot\overline{H_2}(f)$                              Used by (References)
UCC              $1$                                                                                ...
RIR              $1 / G_{x_1 x_1}(f)$                                                               [no application found]
SCOT             $1 / \sqrt{G_{x_1 x_1}(f)\, G_{x_2 x_2}(f)}$                                       [no application found]
PHAT             $1 / |G_{x_1 x_2}(f)|$                                                             [5, 8, 4, 12, 13, 14]
Eckart           $G_{s_1 s_2}(f) / \bigl[G_{n_1 n_1}(f)\, G_{n_2 n_2}(f)\bigr]$                     [no application found]
ML or HT         $|\gamma_{12}(f)|^2 / \bigl(|G_{x_1 x_2}(f)|\,[1 - |\gamma_{12}(f)|^2]\bigr)$      [14]

Table 2.1: Some pre-filtering processors proposed by Knapp and Carter [11].

It is relatively simple to apply an unfiltered cross-correlation to a set of data. De-
pending on the computational power available, one can achieve much better results in
sound source localization if pre-filtering is done. Since in most sound source localization
problems reverberation is the main problem, PHAT has emerged as the standard
[14]. Pre-filtering can also be applied to beamforming: each channel is filtered before
it is summed up. This is then called SRP-PHAT [14]. Of course there are also other
approaches to pre-filtering than those listed in the table above. A different combination of
an algorithm with beamforming is proposed by Sasaki [15] and Tamai [16]: they use
Frequency Band Selection (FBS) to improve the obtained results by attenuating noise.
Post-processing steps like post-filtering algorithms, as for example proposed in [17] or
applied in [18], are beyond the scope of this paper and therefore not further studied.
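To make the PHAT entry of table 2.1 concrete, a minimal sketch of the generalized cross-correlation with PHAT weighting is given below (NumPy assumed; the names are hypothetical): the cross-spectrum is normalized by its magnitude, i.e. $\Psi_g(f) = 1/|G_{x_1 x_2}(f)|$, transformed back to the time domain, and the lag of the peak gives the delay estimate $\hat{D}$.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay between x1 and x2 with the PHAT-weighted GCC (table 2.1)."""
    n = len(x1) + len(x2)                      # FFT length for a linear correlation
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    G = X1 * np.conj(X2)                       # cross-spectrum G_x1x2(f)
    G /= np.abs(G) + 1e-12                     # PHAT weighting: keep only the phase
    r = np.fft.irfft(G, n)                     # generalized cross-correlation
    max_shift = n // 2 if max_tau is None else min(int(max_tau * fs), n // 2)
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))   # centre the lag axis
    lag = np.argmax(np.abs(r)) - max_shift
    return lag / fs                            # delay of x1 relative to x2, in seconds

# Example: x2 is x1 delayed by 40 samples, so the estimate is about -40/fs = -2.5 ms.
fs = 16000
x1 = np.random.randn(8192)
print(gcc_phat(x1, np.roll(x1, 40), fs))
```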

3 Hardware

3.1 Classification of Microphones


3.1.1 Differentiation by Principles
In principle a microphone is nothing other than an inverted loudspeaker: it transforms sound
waves into an electric signal. The goal of this subsection is to give a quick overview
of common electroacoustic transducers and their most important attributes. General
information given in this chapter is mainly based on personal knowledge and partially
on the encyclopedia “Die Audio-Enzyklopädie – Ein Nachschlagewerk für Tontechniker”
[19].

carbon microphone These were the first microphones built, well known from the
mouthpieces of the first telephones. As a sensor for sound pressure, a capsule
filled with carbon granules is used; the resistance of the capsule depends on the
actual pressure on the membrane.

dynamic microphone Based on the principle of induction, dynamic microphones are


built with an electric conductor in a magnetic field. U = −B · l · v = −B · l · dx/dt
is the formula mainly describing the output voltage. Since the actual voltage is
directly proportional to the particle velocity (U ∝ v), dynamic microphones are
‘velocity receivers’ and therefore depend on the velocity of the air particles. Dynamic
microphones have the advantages of being simple and robust,
and they do not need a supply voltage. However, they need relatively
high sound pressure levels to deliver a sufficiently high output voltage, they
have a low sensitivity and a low-pass character, and their transient response is
poor, owing to the high inertia of the moving mass and to mutual
induction.

condenser microphone The capsule of a condenser microphone contains, in principle,


a small capacitor charged via a high-value resistor. This arrangement ensures
that the charge of the capsule is kept constant below its critical
frequency $f_{crit} = 1/(2\pi \cdot C \cdot R)$; the resistor and capacitor work
as a voltage divider. The microphone’s membrane is one plate of the capacitor,
so its deflection directly influences the capacitance
$C = (A \cdot \varepsilon)/d$. Since a condenser microphone responds directly proportionally to
the deflection of the air particles, they are called ‘elongation receivers’. Condenser
microphones have a very good transient response. They usually need a supply voltage of 60 V.

electret microphone These work based on the same principle as condenser micro-
phones. The difference is that they do not need any supply voltage for the sound
conversion. The charging is already ‘frozen’ on the membrane. These micro-
phones are cheap and therefore found in a lot of simple applications. As ‘normal’
electret microphones are not so low-noise, they usually do not find their way into
demanding applications.

piezoelectric microphone A membrane is directly coupled to a piezo crystal. De-


flection of the membrane therefore generates an output voltage. This type of
microphone is normally only used as a pulse sensor.

silicon microphone A relatively new development is the silicon microphone. These
are fabricated directly by micromachining technologies and are therefore also called
MEMS microphones. Their poor sound quality makes them interesting only for mass-
market applications, for example mobile phones.

laser microphone A laser beam is directed at an object’s surface which is affected
by the sound one would like to listen to. The fine vibrations of the surface in the
sound field produce interference between the emitted laser beam and its detected
reflection; by demodulation it is possible to reconstruct the sound.

3.1.2 Type of Construction and Directivity


Microphones can be built either as ‘pressure transducer microphones’ or as ‘pressure
gradient microphones’. Complex microphones for professional applications normally
combine these two techniques. The type of construction and the physical shape of the
capsule also have a strong effect on the microphone’s directivity. A pressure transducer
microphone generally has a spherical (omnidirectional) directivity: propagating sound can reach the
membrane only from one side, and the space behind the membrane is enclosed,
with only a very small hole to allow adaptation to barometric variations. Therefore
this type of microphone is sensitive to the pressure changes of an arriving sound.
Pressure gradient receivers have a membrane which can be reached by incoming sound
from either the front side or the back side. This makes this type of microphone
sensitive to pressure gradients, i.e. the pressure difference between the front and back side
of the membrane. They generally have a figure-of-eight directivity, where the front side
counts positive and the back side negative. Figure 3.1 shows some typical directivities.

3.1.3 Additional Structures


Lobe microphones, also named ‘shotguns’, are the most directional ‘normal’
microphones. The microphone has the shape of a relatively long
tube, which allows listening specifically in one direction. The use of a capsule with a
figure-of-eight polar pattern makes incoming sound from the front add up positively
while noise from the back cancels out. One difficulty
to cope with is the relatively high portion of high frequencies recorded by lobe
microphones. These microphones are often used in noisy environments where one
would like to capture a specific sound, for example on film sets or for TV interviews.

Figure 3.1: Usual combinations of directivities for professional microphones.
(Source of images: Wikipedia)

Parabolic microphones or ‘big ears’ are the extension of the above-mentioned lobe mi-
crophones. They are based on the same principle as parabolic antennas for radio
waves: a parabolic reflector focuses incoming sound waves on the microphone
placed at the focal point. This allows listening precisely in a specific direction
over long distances. The low-frequency limit is set by the diameter of the reflector,
which should not be smaller than the longest wavelength to be received. This fact makes
the device unattractive for high-quality sound recordings, because sufficiently large
structures would be very unhandy.

Interfacial microphones can be built by integrating a microphone into a pla-
nar surface. This gives a half-spherical directivity and the advantage of more
direct sound compared to a free-standing microphone, because the reflections
from the wall itself do not have any influence on the recorded sound.

Free-standing structures like a cylinder may be used in combination with a micro-
phone array to get better directivity characteristics of the system and to suppress
aliasing effects [20]. Other structures are needed with HRTF as it can be found
in [18, 9].

3.2 Number and Arrangement of Microphones


Since sound localization systems usually use omnidirectional microphones, one good way
to characterize them may be the number of microphones. From a technical point of view
it may seem difficult to locate sound with one microphone; nevertheless there is good
reason to briefly discuss one-microphone cues and some related problem fields. With
two microphones, in principle a linear (1D) arrangement is possible. We will also have
a short glance at a suggestion to use two microphones within one ear. With three or
more microphones, in analogy to the 1D case, it is possible to build planar (2D) and
finally spatial (3D) structures. In general one can state that using more microphones
brings higher accuracy and stability to the localization system.

3.2.1 One Microphone


By a system with one microphone, a system which processes the data
of one single sensor is meant. This is not to be mistaken for systems with one
ear: systems with one ear may contain multiple microphones (see page 17, subsection
3.2.2). Given the data recorded by one microphone, the task
that is usually performed is to detect a sound source within the rest of the recorded sounds
and noises. This is used in computational auditory scene analysis. There are cues to
separate sounds from noise, as for example proposed in [21]. This separation task is
associated with sound localization: if applied in noisy environments, separating
the interesting sound from the rest must be part of the sound source localization cue.

3.2.2 Two Microphones


Two microphones of course provide two sets of data. The geometrical approach of
Kneip [6] using ITD is helpful as an introduction to the topic. One measurement with
two microphones allows the definition of a so-called ‘cone of confusion’ on which the sound
source must be located. With another measurement in a rotated position it is possible
to narrow down the position of the sound source to the intersection of two cones of
confusion. To complete the determination of the position, a third measurement after a
translation is necessary to obtain the distance. It is obvious that having more than two
microphones would directly provide multiple pairs of measured data, which in turn
allows multiple constraints on the sound source’s position to be computed directly.
Using the more biology-like approach of HRTFs, Keyrouz [9] recently proposed a solution
for listening to a scene with one ear while still locating sound sources precisely and in a very
straightforward way. An artificial ear contains two microphones, one placed inside and
one outside (see page 9, section 2.5).

3.2.3 Arrays of Microphones
Three or more microphones may in principle be arranged linearly, planarly or spatially. This
list already covers all different types of microphone arrays. The specific arrangement
fitting a given problem of course strongly depends on the problem itself and the
principle of sound source localization one would like to use. In principle every localiza-
tion technique can be implemented in combination with a microphone array. Having
more microphones generally brings more accuracy and stability to the system.

Linear arrays immediately restrict the localization possibilities, since they only provide a ‘linear
arrangement’ of cones of confusion [6]. However, such
an arrangement may still be interesting because it allows the distance to the sound source
to be determined.

Planar arrays are the best-known microphone arrays since they are commonly used
for beamforming and are often applied in wind tunnels to examine the formation of
noise on an aerodynamic structure. Applications on mobile robots are proposed
by Sasaki using different 32-channel arrays [22, 15]. Another approach with
microphones on three circles is proposed by Tamai [16]. As shown by Kneip [6],
in principle four planarly arranged microphones are needed to completely define
the source’s position by geometrical calculations, neglecting the usual front-back
ambiguity.

Spatial arrays allow many different combinations of microphones and therefore can
simply be applied to all kind of techniques, then having the whole spectrum of
possibilities. Using an array of eight microphones in the corner of a cube for
example is proposed by Valin [4, 23, 24, 5, 8]. It makes an important difference
if the array is used to listen to a sound source in the far-field, in the near-field, or
if the source is even within the array. In general, far-field
conditions are assumed when using a spatial array. There are approaches which
place microphones in the walls of a room to locate a sound within this array
[14, 25, 26].

Another difference can be found by looking at the structure the array is embedded in.
Normally, one only uses free-standing arrays. Having microphones placed in the walls
of a room can in principle be regarded as using an additional structure with an embedded
array. There are also approaches which use special structures to improve the array’s
directivity and aliasing characteristics (see page 15, subsection 3.1.3), as for example
proposed by Dedieu [20].

4 Mosquito Localization Problem
In this chapter an effort is made to characterize the mosquito localization problem
by giving a brief overview of the most important specifications. In a second part,
different possibilities are discussed in order to choose the most adequate technique.
The main idea behind the mosquito localization problem is the quest for a possibility
to localize and annihilate a flying mosquito within a room. Assuming that a laser
or another method with high enough accuracy to shoot the mosquito down is available, the main
focus is on the localization of the mosquito. To narrow down the field again, only
techniques that work passively on the sound of the flying mosquito will be considered.

4.1 The Male Mosquito Hearing System


For male mosquitos it is essential to localize females for reproduction. Since
the sound of a flying mosquito is very quiet, it becomes a challenging task to
listen precisely to that sound, especially when there is also environmental noise. Nature
solved this problem with a special construction of the male mosquito hearing system. The
mosquito hearing organ is a receiver of movement (see ‘pressure gradient microphones’, page
15, subsection 3.1.2) and not a pressure-sensitive organ. The biological solution in the
male mosquito hearing system is the plumose antenna with its flagellar hairs. The
antennae are resonantly tuned to the fundamental frequency of flying female mosquitos
at 380 Hz, while the flagellar hairs are tuned to frequencies between 2600 Hz and 3100 Hz.
This system behaves as a damped harmonic oscillator. Stiff coupling of the hairs and
the antenna allows received vibrations to be transmitted to the auditory system. [27]

The most important difficulty to be discussed is the difference between near- and
far-field applications. The natural mosquito auditory system normally works in the
near-field. The positive effect of being in the near-field is that the signal-to-noise ratio
becomes high enough for artificial techniques to work; the negative one is that the sound
characteristics (magnitude and phase of the air particle velocity) strongly differ with the relative
distance of the sound source to the microphone, since the wavelength of the sound remains relatively constant.

4.2 Problem Specifications


The most important specification is indeed the sound of a flying mosquito. It is assumed
that only one mosquito is in the room. Further, the mosquito is assumed to be a point
source of sound. The sound of a flying mosquito can be simulated by band-limited
random noise in the range of 230 − 3100 Hz [27]. Experience teaches that the sound
of a flying mosquito is relatively quiet. No precise values for the volume of a
mosquito could be found, and it will likely vary between mosquitos; therefore, a
‘good’ volume is assumed.
For the localization system it is simpler to localize a static source than a moving
one. Therefore the sound source is assumed to be quasi-static, meaning that during
the computation time the sound source will not move significantly. These assumptions
should be realistic enough for a system that works in the end.
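For simulation purposes, such a test signal can be generated as band-limited random noise in the 230–3100 Hz range, as sketched below (NumPy and SciPy assumed; the names are hypothetical).

```python
import numpy as np
from scipy.signal import butter, lfilter

def mosquito_noise(duration, fs, band=(230.0, 3100.0), seed=0):
    """Band-limited random noise approximating the sound of a flying mosquito."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(int(duration * fs))
    # 4th-order Butterworth band-pass restricting the noise to the mosquito band.
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    return lfilter(b, a, white)

# One second of 'mosquito' noise at 16 kHz, e.g. as test input for a localizer.
test_signal = mosquito_noise(1.0, 16000)
```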

4.3 Choice of Most Appropriate Technique


If the system is to be installed in a normal room, it will be
working in the far-field, which is usually a basic assumption for all the proposed artificial
sound localization techniques. If a single free-standing sound source at constant
volume is assumed, the sound intensity will drop by 6 dB when doubling the distance to the
source (the usual $1/r^2$ law for energy quantities). This makes it impossible to work with one
of the proposed artificial techniques, which usually assume signal-to-noise ratios of 10 dB or even more.

4.3.1 Proposal of Near-field Solution


To work in the near-field, a possibility is needed to keep the mosquito within a cube
(whose dimensions depend on the frequency the system is tuned to), in order
to localize and annihilate it. This has several positive side effects. Firstly, it becomes
less difficult to control the laser or water beam: the beam can work within a defined
range and therefore has less effect on the environment. Secondly, by using an attrac-
tant to lure the mosquito into the ‘laser cube’, all the
mosquitos are gathered where they hurt nobody, even if the system does not work well.
With the goal of achieving the highest possible signal-to-noise ratio, it makes sense to use
interfacial microphones to minimize the influence of reflected sound waves. This re-
quires planar walls, which could be realized as the bottom, two walls and the top of a
cube. Since exceptionally high accuracy is needed, it makes sense to use highly sensitive
microphones like condensers. It is difficult to assume an environ-
ment that is quiet enough for our localization methods to work; in reality, the required
signal-to-noise ratios could never be achieved. This problem can best be dealt
with by using a beamforming technique in combination with a microphone array for
localization. Using many microphones generally increases the system’s stability and
accuracy, and by beamforming an attenuation of the environmental noise can be achieved,
so the signal-to-noise ratio can be increased locally. In one of his papers, Valin
presents a couple of experiments with an array that is able to localize a sound source
within the near-field. It only requires a small adaptation of the sound source direction vector $\vec{u}$,
which in the near-field case has a norm smaller than unity and therefore must be
normalized [4].
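A minimal sketch of such a near-field search is given below (NumPy assumed; the names are hypothetical, and integer-sample delays with circular shifts are used for brevity): instead of far-field directions, candidate points inside the cube are scanned, the per-microphone propagation delays are computed from the point-to-microphone distances, and the delay-and-sum energy is maximized.

```python
import numpy as np

def nearfield_srp(frames, mic_positions, candidate_points, fs, c=344.0):
    """Delay-and-sum search over candidate source points inside the cube.

    frames           : (M, L) array with one signal frame per microphone
    mic_positions    : (M, 3) microphone coordinates in metres
    candidate_points : (K, 3) grid of candidate source positions in metres
    """
    M, L = frames.shape
    best_point, best_energy = None, -np.inf
    for p in candidate_points:
        # Near-field: exact propagation delay from the candidate point to each mic.
        dists = np.linalg.norm(mic_positions - p, axis=1)
        # Delay the earlier channels so that all arrivals line up (whole samples).
        taus = np.round((dists.max() - dists) / c * fs).astype(int)
        y = np.zeros(L)
        for m in range(M):
            y += np.roll(frames[m], taus[m])   # circular shift used for brevity
        energy = np.sum(y ** 2)
        if energy > best_energy:
            best_point, best_energy = p, energy
    return best_point
```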

If a completely quiet environment could be assumed, it would make sense to combine the
beamforming technique with an ILD approach. One arising problem is that a flying
mosquito is only poorly modeled by a point source of sound: it is very likely that a
mosquito shows a direction-dependent sound emission. This would make it difficult
to use an ILD approach directly. Since mosquitos often show very unpredictable flight
paths, it becomes very difficult to implement an algorithm that accounts for the direc-
tional sound emission and tries to predict the mosquito’s flight path.

The use of highly directional microphones like lobe microphones or parabolic micro-
phones is interesting for permanently monitoring one or several points within a defined
space. If a mosquito is detected, it is immediately shot down by an appropriate device
like a laser beam or a water jet.

Finally, it can be stated that further experiments are necessary for a more precise
proposal of a solution to the mosquito localization problem. Most important of all is
to find an answer to the question “What does a mosquito sound like?”, closely followed
by “How does the spectrum of the mosquito sound develop with respect to the
distance?”. It could also be interesting to investigate the male mosquito
auditory system more closely; it might be possible to extract ideas for the artificial
localization of very quiet sounds in a relatively noisy environment.

5 Conclusion
Passive sound localization techniques are an interesting and not yet exhausted field of
research. On the one hand, most applications use ITD sound source localization techniques
and are applied to far-field situations; this is therefore surely the best-known technique,
mostly combined with microphone arrays due to the higher accuracy and stability. On
the other hand, techniques like HRTF, MD and especially ILD are not yet used often in
sound source localization, and further research there is indeed important.
Signal processing allows the system’s accuracy, stability or frequency selectivity to be
increased significantly. It is important to choose carefully the most appropriate filtering
processor to be applied to a given system.
A wide variety of microphone characteristics and spatial arrangements completes the
list of possibilities. In principle, it would be possible to combine almost every technique
with every algorithm or microphone.
As soon as it comes to system design for a specific problem, it is important to define
the boundary conditions carefully. Since these are not well known for the mosquito
localization problem, it is very difficult to propose a well-founded solution. Further
research or even experiments have to be done to get more detailed specifications for
this problem.

6 Bibliography
[1] AMIDA Augmented Multi party Interaction with Distance Access. State-of-the-
art overview — localization and tracking of multiple interlocutors with multiple
sensors. January 2006.

[2] N. Bhadkamkar and B. Fowler. A sound localization system based on biological


analogy. Neural Networks, 1993., IEEE International Conference on, pages 1902–
1907 vol.3, 1993.

[3] Jie Huang, T. Supaongprapa, I. Terakura, N. Ohnishi, and N. Sugie. Mobile


robot and sound localization. Intelligent Robots and Systems, 1997. IROS ’97.,
Proceedings of the 1997 IEEE/RSJ International Conference on, 2:683–689 vol.2,
Sep 1997.

[4] J.-M. Valin, F. Michaud, J. Rouat, and D. Letourneau. Robust sound source
localization using a microphone array on a mobile robot. In Proc. IEEE/RSJ In-
ternational Conference on Intelligent Robots and Systems (IROS 2003), volume 2,
pages 1228–1233, 27–31 Oct. 2003.

[5] J.-M. Valin. Auditory System for a Mobile Robot. PhD thesis, University of Sher-
brooke - Faculte de genie, Genie electrique et genie informatique, august 2005.

[6] Laurent Kneip and Claude Baumann. Binaural model for artificial spatial sound
localization based on interaural time delays and movements of the interaural axis.
J. Acoust. Soc. Am., (124):3108–3119, November 2008.

[7] S.T. Birchfield and R. Gangishetty. Acoustic localization by interaural level dif-
ference. In Proc. IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP ’05), volume 4, pages iv/1109–iv/1112 Vol. 4, 2005.

[8] Jean-Marc Valin, Francois Michaud, and Jean Rouat. Robust localization and
tracking of simultaneous moving sound sources using beamforming and particle
filtering. Robotics and Autonomous Systems, 55(3):216 – 228, 2007.

[9] Keyrouz, Bou Saleh, and Diepold. A novel approach to robotic monaural sound local-
ization. page 5, May 2007.

[10] Ma Hao, Zhou Lin, Hu Hongmei, and Wu Zhenyang. A novel sound localization
method based on head related transfer function. In Proc. 8th International Confer-
ence on Electronic Measurement and Instruments ICEMI ’07, pages 4–428–4–432,
2007.

[11] C. Knapp and G. Carter. The generalized correlation method for estimation of time
delay. IEEE Transactions on Acoustics, Speech and Signal Processing, 24(4):320–
327, 1976.

[12] M.D. Gillette and H.F. Silverman. A linear closed-form algorithm for source lo-
calization from time-differences of arrival. 15:1–4, 2008.

[13] D. Nguyen, P. Aarabi, and A. Sheikholeslami. Real-time sound localization using


field-programmable gate arrays. In Proc. International Conference on Multimedia
and Expo ICME ’03, volume 2, pages II–829–32 vol.2, 2003.

[14] B. Mungamuru and P. Aarabi. Enhanced sound localization. Systems, Man, and
Cybernetics, Part B, IEEE Transactions on, 34(3):1526–1540, June 2004.

[15] Yoko Sasaki, Satoshi Kagami, and Hiroshi Mizoguchi. Multiple sound source map-
ping for a mobile robot by self-motion triangulation. Intelligent Robots and Sys-
tems, 2006 IEEE/RSJ International Conference on, pages 380–385, Oct. 2006.

[16] Y. Tamai, Y. Sasaki, S. Kagami, and H. Mizoguchi. Three ring microphone ar-
ray for 3d sound localization and separation for mobile robot audition. In Proc.
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS
2005), pages 4172–4177, 2005.

[17] I. McCowan and H. Bourlard. Microphone array post-filter for diffuse noise field.
1:905–908, 2002. IDIAP-RR 01-39.

[18] Fakheredine Keyrouz and Klaus Diepold. An enhanced binaural 3d sound local-
ization algorithm. In Proc. IEEE International Symposium on Signal Processing
and Information Technology, pages 662–665, 2006.

[19] Andreas Friesecke. Die Audio-Enzyklopädie - Ein Nachschlagewerk für Tontech-


niker. K. G. Saur Verlag München, first edition, 2007.

[20] S. Dedieu, P. Moquin, and R. Goubran. Sound measurement in noisy environment


using optimized conformal microphone arrays. In Proc. IEEE Instrumentation and
Measurement Technology Conference IMTC 2005, volume 1, pages 748–751, 16–19
May 2005.

[21] Sam T. Roweis. One microphone source separation. 2000.

[22] Y. Sasaki, S. Kagami, and H. Mizoguchi. Main-lobe canceling method for multiple
sound sources localization on mobile robot. Advanced intelligent mechatronics,
2007 ieee/asme international conference on, pages 1–6, Sept. 2007.

[23] J.-M. Valin, F. Michaud, B. Hadjou, and J. Rouat. Localization of simultane-


ous moving sound sources for mobile robot using a frequency- domain steered
beamformer approach. In Proc. IEEE International Conference on Robotics and
Automation ICRA ’04, volume 1, pages 1033–1038, 2004.

[24] J.-M. Valin, J. Rouat, and F. Michaud. Enhanced robot audition based on micro-
phone array source separation with post-filter. In Proc. IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS 2004), volume 3, pages 2123–
2128, 28 Sept.–2 Oct. 2004.

[25] K. Nakadai, H. Nakajima, M. Murase, S. Kaijiri, K. Yamada, T. Nakamura,


Y. Hasegawa, H.G. Okuno, and H. Tsujino. Robust tracking of multiple sound
sources by spatial integration of room and robot microphone arrays. In Proc. IEEE
International Conference on Acoustics, Speech and Signal Processing ICASSP
2006, volume 4, pages IV–IV, 14–19 May 2006.

[26] K. Nakadai, H. Nakajima, M. Murase, H.G. Okuno, Y. Hasegawa, and H. Tsujino.


Real-time tracking of multiple sound sources by integration of in-room and robot-
embedded microphone arrays. In Proc. IEEE/RSJ International Conference on
Intelligent Robots and Systems, pages 852–859, 9–15 Oct. 2006.

[27] Martin C. Göpfert, Hans Briegel, and Daniel Robert. Mosquito hearing: Sound-induced
antennal vibrations in male and female Aedes aegypti. The Journal of Experimental
Biology, 202:2727–2738, 1999.

[28] Atsushi Ikeda et al. 2D sound source localization in azimuth & elevation from
microphone array by using a directional pattern of element. 2007.

[29] Yousefian et al. Enhancing accuracy of source localization in high reverberation


environment with microphone array. 2008.

[30] Benjamin Schmidt. Robotic sound localization. 2004.

[31] W. M. Hartmann. Localization of sound in rooms. J. Acoust. Soc. Am., 74, 1983.

[32] Jack Hebrank and D. Wright. Spectral cues used in the localization of sound
sources on the median plane. The Journal of the Acoustical Society of America,
56(6):1829–1834, 1974.

[33] J.M. Loomis. Some research issues in spatial hearing. In Proc. IEEE ASSP Work-
shop on Applications of Signal Processing to Audio and Acoustics, pages 67–71,
1995.

[34] Yong Rui and D. Florencio. New direct approaches to robust sound source local-
ization. Multimedia and Expo, 2003. ICME ’03. Proceedings. 2003 International
Conference on, 1:I–737–40 vol.1, July 2003.

[35] T. Kobayashi, Y. Kameda, and Y. Ohta. Sound source localization with non-calibrated
microphones. 2008.

[36] Sampo Vesa. Sound source distance learning based on binaural signals. In Proc.
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics,
pages 271–274, 2007.

[37] H.-W. Wei and S.-F. Ye. Comments on a linear closed-form algorithm for source
localization from time-differences of arrival. 15:895–895, 2008.

