[05-1-3] Paper
Abstract—The ability to localize acoustic sources can greatly improve the perception of smart devices (e.g., a smart speaker like
Amazon Alexa). In this work, we study the problem of concurrently localizing multiple acoustic sources with a single smart device. Our
proposal called Symphony is the first complete solution to tackle the above problem, including method, theory, and practice. The
method stems from the insight that the geometric layout of microphones on the array determines the unique relationship among signals
from the same source along the same arriving path. We also establish the theoretical model of Symphony, which reveals the relation
between localization performance (resolution and coverage) and impacting factors (sampling rate, array aperture, and array-wall
distance). Moreover, the ability to separate and localize multiple sources is also studied theoretically and numerically. We implement
Symphony with different types of commercial off-the-shelf microphone arrays and evaluate its performance under different settings.
The results show that Symphony has a median localization error of 0.662 m.
1 INTRODUCTION

Smart devices with sound recognition are now proliferating in our daily life. For example, smart speakers like Amazon ECHO, Google Home, Apple HomePod, and Alibaba Tmall Genie support various attractive applications, including voice control of home appliances, man-machine dialogue, and entertainment center.

With the fast development of smart home and office applications, there is an increasing need for acoustic source localization on smart devices. Whether the acoustic sources can be localized largely affects the capability and quality of the smart device’s interactive functions, which include but are not limited to the following cases: (1) The ability of localization enables a smart speaker to process voice commands with user location awareness. When the user is lying in bed and says ‘Turn on the light’, the smart speaker can smartly switch off the ceiling lamp and turn on the bedside lamp, if the user’s (namely the acoustic source) location is provided. (2) Localizing the acoustic source enables a smart device, e.g. the smart safeguard device, to better perceive the real situation. For example, the device may remind the parents of possible danger when it hears abnormal sounds of windows or doors from the baby’s room. (3) Knowing the source location helps to authenticate voice commands. Recent studies have uncovered the vulnerabilities of smart speakers against inaudible and malicious voice commands [1], [2], [3], [4]. To defend against these threats, the smart speaker can leverage the knowledge of voice location to accept only the commands that originate from the real locations.

Conventional approaches of acoustic source localization require the deployment of multiple distributed microphone arrays. Based on the estimation of the source’s time difference of arrival (TDOA) or direction of arrival (DoA) at the arrays, the source can be localized via triangulation [5], [6], [7], [8], [9], [10], [11]. However, those solutions cannot be applied to localization with a device like the smart speaker, which is usually equipped with only a single microphone array.

Acoustic source localization with a single array is a non-trivial problem. Note that the typical size of a microphone array is several centimeters at most, which is negligible with respect to the distance between the source and the array. As a result, the acoustic signal’s propagation rays to the microphones are nearly parallel. Due to limited spatial resolution (array size or aperture) and temporal resolution (sampling rate of the microphone), a commercial array cannot distinguish DoAs of nearly parallel rays. This is the so-called far-field effect.

Exploiting the multi-path propagation paves a way to tackle the above problem, however, only in the scenario of localizing a single source. In addition to the line-of-sight path (denoted as LOS), VoLoc [12] leverages an additional arriving path by exploiting the nearby wall reflection (denoted as ECHO), and then localizes the far-field source after estimating DoAs of LOS and ECHO.

It is worth noticing that VoLoc assumes there is only one source in the sound field, which largely restricts its applicability in the real world. Usually, there are multiple acoustic sources in practice. For example, in the home environment, there may be family voices, television, washing machine, and microwave oven. These sources, including the voice commands, will interfere with each other, making it very difficult for VoLoc to localize them.

• W. Wang, J. Li, Y. He, and Y. Liu are with the School of Software, Tsinghua University, Beijing, China. E-mail: {wwg18, li-jm19}@mails.tsinghua.edu.cn, heyuan@mail.tsinghua.edu.cn, yunhao@greenorbs.com. Yuan He is the corresponding author.
This work is supported by National Natural Science Fund of China No. U21B2007.
IEEE TRANSACTIONS ON MOBILE COMPUTING (PRE-PRINT) 2
Fig. 1: An illustration of Symphony. [Diagram labels: source locations, smart speaker, wall, LOS, ECHO, virtual path, θ1, θ2.]

Fig. 2: (a) The virtual array is created by viewing the wall as a mirror. (b) Before the arrival of ECHO, there will be a short window of the clean signal. [Diagram labels: virtual array, real array, silence, clean signal, time.]
[Figure: system workflow. Raw audio passes through Pre-Processing; Self-Calibration provides the wall distance and orientation; DoA Estimation consists of Geometric Filtering of DoAs and Coherence-based Refinement of DoA; DoA Recognition consists of Identification of Homologous DoAs and Discrimination between LOS and ECHO; Reverse Ray-Tracing takes the DoAs for each source and each path and outputs the locations.]

Fig. 4: The propagation model of two sources, SA and SB. [Diagram labels: Mic. Array with microphones 0, 1, 2 at spacing d; paths SA LOS, SA ECHO, SB LOS, SB ECHO; angles θ1 to θ4.]
lobe effectively by discarding the amplitude of the cross-spectrum and keeping the phase [16], [17]. Consider two signals received by microphone Mn and Mm; the cross-correlation function (CCF)¹ between yn and ym is defined as

$$ \mathrm{Cor}_{n,m}(\tau) = \mathrm{GCC}\left[ y_n(t - \tau)\, y_m(t) \right], \tag{4} $$

Fig. 5: Channel impulse response. [Axes: Amplitude against Time [ms], 0 to 25 ms; the ECHO peak is labeled.]

1. Unless otherwise specified, CCF is a function of time shift, and the corresponding value denotes the correlation between the shifted versions of the inputs.
2. The energy loss resulting from wall reflection is negligible. Typically, more than 95% signal energy remains after reflection due to the high impedance mismatch between the air and the wall.
Fig. 6: (a) CCF between pairs ⟨0, 1⟩, ⟨0, 2⟩ and ⟨0, 3⟩. (b) Pure peaks across three pairs can fit a straight line passing (0, 0, 0). [Axes: Cor. against Time Shift [Sample], from -20 to 20, for each Microphone Pair; the pure peaks of one path lie at τ, 2τ and 3τ.]
our idea can be applied to arrays with different layouts, including the circular array [18].

We observe that for microphone pair ⟨n, m⟩, the time shifts of pure peaks are directly proportional to the subtraction of two microphone serial numbers, m − n. If we revisit the cell ⟨LOS-LOS, SA⟩ in Table 1, we will find that the relation between the time shift τ and the variable m − n is a linear function: τ = k(m − n), where the slope $k = \frac{d}{c} \cos\theta_1$. This linear relation only holds for pure peaks, rather than hybrid peaks. We exploit this relation to find pure peaks.

We conduct the following proof-of-concept experiment to validate our idea. In a living room, we let two speakers simultaneously play two recorded voice commands (denoted as SA and SB) at different places. A uniform linear array with 4 microphones is placed 30 cm away from the wall to record signals. Fig. 6(a) shows the cross-correlation functions between three microphone pairs ⟨0, 1⟩, ⟨0, 2⟩ and ⟨0, 3⟩. In these plots, the markers with a triangular shape denote the local maximums of correlation, which are the pure peak candidates. The markers with the shape of circle and cross are the ground truths of pure peaks. The ground truths are obtained by inserting pseudo-random white noise (i.e., pre-known signals) immediately preceding voice commands. Fig. 6 (a) highlights the time shifts of pure peaks for LOS SA (red circle), which change linearly with the subtraction of the microphone serial numbers, m − n.

To be clearer, we incorporate these three cross-correlation functions into the same coordinate system, as shown in Fig. 6 (b). It is very interesting to see that, if we sequentially connect the markers of the ground truths that correspond to the same path, the pure peaks can nearly form a straight line passing through origin (0, 0, 0). This phenomenon motivates us to exploit this linear relation to find pure peaks. However, before introducing our method, it is worthwhile to analyze the following problems.
• Problem 1: Peak Overlap. Some markers of the ground truths are too close to be separated (e.g., ① in Fig. 6 (a)), or the marker is no longer the local maximum because it was suppressed by the adjacent peak (e.g., ②).
• Problem 2: Peak deviation. Some markers (e.g., ③) of the ground truths are not at the local maximum, and instead are one-sampling-point away from the nearest peak.

In fact, the above problems are caused by limited spatial resolution. Recall that the value we can measure is the time shift of peak, and the value we intend to obtain is the DoA θ. However, the sampling rate Fs limits the resolution of time shifts measured. The continuous DoA θ ∈ [0, π] is mapped into $2\left\lfloor \frac{(m-n)d}{v} F_s \right\rfloor + 1$ discrete bins. Such mapping introduces an additional conversion error, thus introducing the deviation between the ground truth and the peaks (e.g., ③). If two DoAs are so close that they fall into the same discrete bin, it is impossible for the array to separate them (e.g., ①). On the other hand, when m − n increases, the number of discrete bins mapped also increases, thus providing higher spatial resolution. This explains why the markers of SA no longer stay in the same bin in pair ⟨0, 2⟩, and also explains why Problems 1 and 2 are unlikely to occur for pair ⟨0, 3⟩.

Term     Brief description
Pn,m     The set of time shifts of peak candidates in pair ⟨n, m⟩.
τn,m     The time shift of peak candidates, τn,m ∈ Pn,m.
ci       Possible combination of peak candidates across multiple pairs. For 4-mic ULA, ci ∈ P0,1 × P0,2 × P0,3.
wn,m     The penalty factor of pair ⟨n, m⟩, equal to |m − n|.

TABLE 2: Definition of Terminology.

i    ci [sample]       J(ci)    k∗        pure peak   source
1    (4, 8, 12)        0.000    4.000     Yes         SA
2    (6, 13, 20)       0.191    6.633
3    (-1, -3, -5)      0.191    -1.633
4    (-4, -7, -10)     0.191    -3.367    Yes         SB
5    (-5, -10, -16)    0.258    -5.276    Yes         SB
6    (2, 6, 9)         0.328    2.990     Yes         SA
7    (2, 6, 7)         1.004    2.439
8    (-4, -10, -13)    1.004    -4.439

TABLE 3: The ranking list of ci using metric J (Top 8).

The analysis above suggests that it is unrealistic to apply a strict criterion to directly identify pure peaks according to the linear relation. To tolerate such imperfections, we formulate DoAs estimation into a curve-fitting problem using a fitting metric. This metric evaluates how well peak candidates across each pair fit a curve. Formally, the metric is defined as follows (Table 2 defines the terminology):

$$ J(c_i) = \min_{k} \frac{1}{|c_i|} \sum_{\tau_{n,m} \in c_i} w_{n,m} \left[ k(m - n) - \tau_{n,m} \right]^2. \tag{5} $$
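The fitting metric of Eq. (5) can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper names `fit_metric` and `rank_combinations` are hypothetical, the minimization over k is solved in closed form as a weighted least-squares line through the origin, and the weights follow Table 2 (w = |m − n|). Because the text does not spell out the solver that was actually used, the numbers produced here need not match Table 3 digit for digit.

```python
from itertools import product

def fit_metric(c, ms=(1, 2, 3)):
    """Evaluate Eq. (5) for one combination c of time shifts (in samples).

    ms holds m - n for each microphone pair, e.g. (1, 2, 3) for the pairs
    <0,1>, <0,2>, <0,3>. Returns (J, k_star), where k_star is the slope of
    the weighted least-squares line through the origin.
    """
    w = [abs(m) for m in ms]                               # penalty factors
    num = sum(wi * mi * ti for wi, mi, ti in zip(w, ms, c))
    den = sum(wi * mi * mi for wi, mi in zip(w, ms))
    k_star = num / den                                     # minimizer of the weighted sum
    j = sum(wi * (k_star * mi - ti) ** 2
            for wi, mi, ti in zip(w, ms, c)) / len(c)
    return j, k_star

def rank_combinations(peak_sets, ms=(1, 2, 3)):
    """Score every element of the Cartesian product of the peak sets.

    Returns a list of (J, combination) sorted by ascending J.
    """
    return sorted((fit_metric(c, ms)[0], c) for c in product(*peak_sets))
```

For instance, `fit_metric((4, 8, 12))` yields a zero residual with slope 4, because the three shifts lie exactly on the line τ = 4(m − n).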
Fig. 7: Coherence-based refinement.

Intuitively, we select one candidate peak from each peak set for each microphone pair (Pn,m), and each possible selection constitutes a combination ci. The whole selection space is the Cartesian product of the peak sets of pairs (P0,1 × P0,2 × P0,3). The metric J(ci) evaluates the shortest distance between candidate peaks of ci and the regression line y = k(m − n). The ci with the smaller J(ci) is more likely to be the combination of pure peaks. As we discussed before, as m − n increases, the resolution of pair ⟨n, m⟩ also increases, which means the pair with larger m − n tends to have smaller error variance. This is a classic heteroscedasticity problem [19]. Following the idea of weighted least squares, we assign different pairs with different penalty factors wn,m inversely proportional to their error variances. In this way, we leverage results from pairs with different error variances, and can estimate DoA more accurately.

We apply the metric (Eq. (5)) to the previous proof-of-concept experiment. Specifically, we calculate J of each ci in Fig. 6. Table 3 ranks each ci in the ascending order of J(ci). The slopes k∗ of the lines fitted by ci are also included. Based on the ground truths in Fig. 6, we mark the entries that belong to pure peaks. We can see that pure-peak entries get high ranks: first, fourth, fifth, and sixth, respectively. However, some entries not corresponding to pure peaks also get relatively high ranks: second and third, which may mislead the identification of pure peaks. The results show that this method can eliminate many ambiguities by ruling out low-ranking entries, but is incapable of identifying pure peaks confidently due to some outliers.

5.2 Coherence-Based Refinement of DoA

Next, we will refine the ranking results to finally determine pure-peak combinations.

Besides the geometric redundancy, the intrinsic coherence among signals received by microphones can also be exploited to identify pure peaks. Recall that pure peaks capture a certain path’s relative delays among microphones. If we can compensate for these arriving delays based on pure peaks, signals received by microphones can be coherent with respect to a certain path. Based on this fact, we refine each entry ci by checking whether the time shifts of ci can make signals coherent with respect to a certain path. Next, we take a 4-mic linear array as an example to introduce our method. Algorithm 1 describes the refinement procedure.

In Algorithm 1, we expect that pure peaks in CCF increase. Specifically, if ci is a pure-peak combination of a certain path, after aligning and averaging the first three signals y0, y1 and y2 (Line 4 and 5), we can accurately compensate for the arriving delays of this path and constructively enhance the path. The enhanced version of the signal is denoted by y<0,1,2>. Theoretically, when we correlate the enhanced signal y<0,1,2> with another signal y3, the pure peak corresponding to this path (located at $c_i^{(3)}$) in CCF Cor<0,1,2>,3 will rise significantly, compared with the original CCF Cor0,3 between y0 and y3. If so (Line 7), ci is identified as a pure-peak combination.

We validate this idea in the previous proof-of-concept experiment. We compute Cor<0,1,2>,3 using the pure-peak combination c5 in Table 3, and the non-pure-peak combination c2. As expected in Fig. 7(a), the value of Cor<0,1,2>,3 at $c_5^{(3)}$ (sample -16) has a significant increase, while in Fig. 7 (b), no increase is observed at $c_2^{(3)}$ (sample -20). In summary, if there is an increase in Cor<0,1,2>,3 at $c_i^{(3)}$, ci can be classified as a pure-peak combination.

6 DOA RECOGNITION

After identifying all pure-peak combinations, we identify which two pure-peak combinations belong to the same source (Section 6.1). In Section 6.2, we determine their types.

6.1 Homologous Identification of DoA

A basic fact is that for a certain source, the ECHO path is a delayed version of the LOS path. These two paths are coherent because they come from the same source. Symphony exploits hybrid peaks to capture such inter-path coherence, thus identifying DoAs coming from the same source.

Without loss of generality, let’s assume ci and cj belong to the same source, and correspond to LOS and ECHO respectively. Fig. 8(a) illustrates the definition of ci and cj. Similar to Algorithm 1, after aligning and averaging the first three signals based on ci, we can obtain y<0,1,2> where the LOS signal is constructively enhanced, as shown in Fig. 8(b).

Note that the other received signal y3 also receives the signals of LOS and ECHO of this source, which are both coherent with the enhanced LOS path of y<0,1,2>. When we correlate y<0,1,2> with y3, two peaks of CCF Cor<0,1,2>,3 will increase: (1) LOS-LOS, which is the correlation between the enhanced LOS in y<0,1,2> and the LOS in y3; (2) LOS-ECHO, which is the correlation between the enhanced LOS
Fig. 8: The illustration of homologous identification. [Panel (a): the LOS and ECHO components in y0 to y3, with relative delays given by ci and cj. Panel (b): y1 and y2 are shifted based on ci, aligned with y0 and averaged into y<0,1,2>, which enhances LOS. Panel (c): the same procedure based on cj, which enhances ECHO.]
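The align-and-average step used in the coherence-based refinement (Section 5.2) and illustrated in Fig. 8 can be sketched as follows, assuming NumPy. Algorithm 1 itself is not reproduced in this text, so this is only an approximation of its idea: the helper names `shift`, `corr_at` and `is_pure_peak`, the use of a plain normalized correlation at a single lag instead of a full CCF, and the `gain` threshold are all assumptions made for illustration.

```python
import numpy as np

def shift(y, s):
    """Delay y by s samples (advance it if s < 0), zero-filling the ends."""
    out = np.zeros_like(y)
    if s > 0:
        out[s:] = y[:-s]
    elif s < 0:
        out[:s] = y[-s:]
    else:
        out[:] = y
    return out

def corr_at(a, b, lag):
    """Normalized correlation between a(t - lag) and b(t)."""
    a_s = shift(a, lag)
    denom = np.linalg.norm(a_s) * np.linalg.norm(b) + 1e-12
    return float(np.dot(a_s, b) / denom)

def is_pure_peak(y, c, gain=1.0):
    """Check whether combination c makes y0, y1, y2 coherent for one path.

    y: list of four signals y0..y3; c: time shifts of pairs <0,1>, <0,2>, <0,3>.
    Returns (decision, correlation before, correlation after).
    """
    y0, y1, y2, y3 = y
    # Undo the path's relative delays so that it adds up constructively.
    enhanced = (y0 + shift(y1, -c[0]) + shift(y2, -c[1])) / 3.0
    before = corr_at(y0, y3, c[2])        # value of Cor_{0,3} at c^(3)
    after = corr_at(enhanced, y3, c[2])   # value of Cor_{<0,1,2>,3} at c^(3)
    return after > gain * before, before, after
```

With two synthetic sources whose relative delays are (2, 4, 6) and (-3, -6, -9), the correlation at the third shift rises for the combination (2, 4, 6) and drops for the mixed combination (2, 4, -9), which mirrors the behavior described for Fig. 7.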
[Figure: (a) CCF computed with c5 = (-5, -10, -16), showing the LOS-LOS and LOS-ECHO peaks with a 61-sample interval marked. (b) CCF computed with c4 = (-4, -7, -10), showing the ECHO-LOS and ECHO-ECHO peaks with a 61-sample interval marked. (c) The LOS-ECHO and ECHO-LOS peaks, each with a 61-sample interval marked.]

By comparing Eq. (6) and (7), we can find an interesting observation: the locations of hybrid peaks that get enhanced (i.e., LOS-ECHO in Eq. (6) and ECHO-LOS in Eq. (7)) are associated by a term $\tau_3^{los} - \tau_3^{echo}$. This is because the combinations ci and cj correspond to the same source, and thus both enhanced hybrid peaks capture the same arriving delay of the source between LOS and ECHO.

This association actually provides us an additional constraint to determine whether two pure-peak combinations ci and cj (i ≠ j) belong to the same source. We take the following steps to apply this constraint:
1) Fetch $\mathrm{Cor}_{<0,1,2>,3}$ and $\mathrm{Cor}'_{<0,1,2>,3}$, which have already been computed in Algorithm 1 by using ci and cj.
2) Shift $\mathrm{Cor}_{<0,1,2>,3}$ and $\mathrm{Cor}'_{<0,1,2>,3}$ by $c_i^{(3)}$ and $c_j^{(3)}$, and denote the results as $\widetilde{\mathrm{Cor}}_{<0,1,2>,3}$ and $\widetilde{\mathrm{Cor}}'_{<0,1,2>,3}$.
Fig. 10: | tan θ1 | should be smaller than | tan θ2 |. [Diagram labels: real array at the origin, virtual array at (0, 2d0), wall on y = d0.]

Fig. 11: The illustration of self-localization. [Panel (a): speaker at O, its mirror point O′ = (0, 2d0), wall on y = d0, microphones M0 and M1, orientation α. Panels (b) and (c): amplitude over time at M0 and M1, with the LOS peak and the ECHO peak.]

respectively. To make sure that these two lines intersect in Quadrant III or IV of this coordinate plane, the absolute value of the slope of the blue line should be smaller than that of the red line, namely | tan θ1 | < | tan θ2 |.

Based on this observation, we propose a simple but effective approach to distinguish LOS and ECHO: After recognizing two pure-peak combinations that belong to the same source and obtaining their DoAs, we compare | tan | of these two DoAs. The one with smaller | tan | is identified as LOS, and the other is ECHO.

7 REVERSE RAY-TRACING

We localize sources via reverse ray-tracing. Again, we construct the coordinate system with the array at the origin and the nearby wall on the line y = d0, as illustrated in Fig. 10. According to the plane mirror imaging principle, the points of the real array and the virtual array are symmetrical about the wall, and the virtual array is at point (0, 2d0). The two paths from the source to the real array and the virtual array can be formulated as:

$$ \begin{cases} y = \tan(\theta_1 + \alpha)\, x, & \text{(LOS)} \\ y = \tan(\theta_2 - \alpha)\, x + 2 d_0, & \text{(ECHO)} \end{cases} \tag{8} $$

where α is the array’s orientation with respect to the wall. Therefore, the point of intersection of these two lines is the source location. Before reverse ray-tracing, the distance d0 and the orientation α need to be calibrated. We cover it in the next section.

ECHO. Therefore, the smart speaker will detect two strong correlation peaks (LOS peak and ECHO peak), as illustrated in Fig. 11(b) and (c). Because the transmission of FMCW is also known, the smart speaker can calculate the propagation times that FMCW takes to arrive at Mi via the LOS and ECHO paths, i.e., $\Delta_i^{los}$ and $\Delta_i^{echo}$.

As shown in Fig. 11(a), we construct the coordinate system by taking the speaker as the origin and placing the nearby wall on the line y = d0. Clearly, the length of the ECHO path from the speaker to Mi equals $\|O' M_i\|_2$, where O′ is the symmetric point of the speaker (O) about the wall, and $\|\cdot\|_2$ denotes the Euclidean norm. Therefore, we can build an equation for Mi:

$$ \|O' M_i\|_2 = v \times \Delta_i^{echo}. \tag{9} $$

Note that the coordinates of O′ and Mi are determined only by the distance d0 and the orientation α, respectively. This means that there are only two unknowns (d0 and α) in Eq. (9). Since each microphone can provide one constraint in the form of Eq. (9), as long as a smart speaker has more than one microphone, we can build a determined or overdetermined equation set to solve the distance and the orientation.

9 LOCALIZABILITY OF Symphony

Here, the theoretical analysis and the numerical results of the localization capability are provided, which may provide us guidance on system deployment and adjustment.
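As a concrete illustration of the reverse ray-tracing step in Section 7, the intersection of the two lines in Eq. (8) has a simple closed form. The sketch below assumes angles in radians and distances in meters; the function name `locate_source` and the parallel-ray check are illustrative additions rather than part of the described system.

```python
import math

def locate_source(theta1, theta2, alpha, d0):
    """Intersect the LOS and ECHO lines of Eq. (8).

    theta1, theta2: DoAs of LOS and ECHO (radians);
    alpha: array orientation with respect to the wall (radians);
    d0: array-wall distance (meters). Returns the source location (x, y).
    """
    k_los = math.tan(theta1 + alpha)     # slope of the line through the real array
    k_echo = math.tan(theta2 - alpha)    # slope of the line through the virtual array
    if math.isclose(k_los, k_echo):
        raise ValueError("LOS and ECHO rays are parallel; no unique intersection")
    # Solve k_los * x = k_echo * x + 2 * d0 for x, then substitute into the LOS line.
    x = 2.0 * d0 / (k_los - k_echo)
    y = k_los * x
    return x, y
```

For example, with d0 = 0.5 m, α = 0, tan θ1 = -0.5 and tan θ2 = -1, the two lines meet at (2, -1), which lies in Quadrant IV and satisfies | tan θ1 | < | tan θ2 | as required in Section 6.2.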
Fig. 13: The distributions of the discrete positions with different settings of the sampling rate Fs and the array-wall distance d0. [Panels show the real and virtual arrays; Fs = 16 kHz gives 294 locations, Fs = 32 kHz gives 588 locations, and Fs = 48 kHz gives 882 locations.]

Fig. 15: (a) Probability of localizing all sources successfully vs. sampling rate (dmax = 15 cm). (b) Probability of localizing all sources successfully vs. array aperture (Fs = 48 kHz). [Axes: Probability against Sampling Rate (kHz), 8 to 48, and against Aperture (cm), 5 to 15; curves for K = 2, 3, 4, 5, 6.]

[Figure: surface plot of Probability against Aperture (cm), 5 to 15, and Sampling Rate (kHz), 8 to 48.]
Fig. 14: When SA is at (xA, yA), one of the DoAs of SB will overlap with that of SA if SB is in Region I, II, III or IV. [Panels (a) and (b) show the wall, the real array, SA at (xA, yA), and Regions I to IV.]

of SB, their corresponding pure peaks will be overlapped. Therefore, in this situation it is difficult to separate the DoAs of both sources.

Fig. 14 (a) and (b) illustrate this overlapping problem. Given the limited spatial resolution of the commercial array, if SB is in region I, the DoAs of SB’s LOS and SA’s LOS will be too close to be separated. Similarly, if SB is in region II, their DoAs of ECHO will also be overlapped. Further, if SB is in region III or IV, the DoAs of SA’s LOS and SB’s ECHO will be merged, or the DoAs of SA’s ECHO and SB’s LOS will be merged, respectively.

Formally, let lA and lB denote the coordinates (i.e., locations) of SA and SB, and let Regionl denote the union of regions Il, IIl, IIIl and IVl (these four regions are determined by the source coordinate l). Therefore, the probability of successfully localizing both SA and SB is equal to

$$ P(S_A, S_B) = \iint \mathrm{pdf}(l_A, l_B)\, \mathbb{1}\left[ l_B \text{ not in } \mathrm{Region}_{l_A} \right] dl_A\, dl_B, \tag{14} $$

where pdf(lA, lB) is the joint probability density function of lA and lB. If we further assume SA and SB are independently and uniformly distributed in the room, Eq. (14) can be rewritten as follows

$$ \begin{aligned} P(S_A, S_B) &= \iint \mathrm{pdf}(l_A)\, \mathrm{pdf}(l_B)\, \mathbb{1}\left[ l_B \text{ not in } \mathrm{Region}_{l_A} \right] dl_B\, dl_A \\ &= \int \mathrm{pdf}(l_A)\, P\left( l_B \text{ not in } \mathrm{Region}_{l_A} \right) dl_A \\ &= \int \mathrm{pdf}(l_A) \left( 1 - \frac{\text{area of } \mathrm{Region}_{l_A}}{\text{area of the room}} \right) dl_A. \end{aligned} \tag{15} $$

Eq. (15) reveals a fact that the performance of separating and localizing multiple sources strongly depends on the spatial resolution of the array: When we increase the array aperture dmax or increase the sampling rate Fs, the area of RegionlA will be reduced⁴, and then the probability of localizing both SA and SB will increase.

9.2.2 General Case

Here, we study the probability of successfully localizing all K sources, S1, S2, . . . SK. Similar to Eq. (14), the probability can be calculated as

$$ P(S_1, S_2, \ldots, S_K) = \int_{L} \mathrm{pdf}(L)\, \mathbb{1}\left[ \forall\, l_A, l_B \in L,\ l_B \text{ not in } \mathrm{Region}_{l_A} \right] dL, \tag{16} $$

4. This is because an array with a higher spatial resolution can estimate a finer DoA, which means Region I, II, III and IV in Fig. 14 will be narrower.
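The two-source probability of Eq. (15) can be approximated numerically. The following Monte Carlo sketch rests on several assumptions that are not taken from the text: the array is parallel to the wall (α = 0) with the real array at the origin and the virtual array at (0, 2d0), a DoA is quantized by rounding the time shift dmax·Fs·cos θ / v to the nearest sample, the speed of sound is 343 m/s, and the room is a rectangle on the side of the array opposite to the wall. Two sources are counted as overlapping when any of the four DoA pairs (LOS-LOS, ECHO-ECHO, LOS-ECHO, ECHO-LOS) falls into the same bin, which plays the role of Regions I to IV. The names `doa_bins` and `prob_both_localized` are hypothetical.

```python
import math
import random

V_SOUND = 343.0  # m/s, assumed speed of sound

def doa_bins(loc, d0, d_max, fs):
    """Return the (LOS bin, ECHO bin) of a source at loc = (x, y)."""
    x, y = loc
    bins = []
    for ax, ay in ((0.0, 0.0), (0.0, 2.0 * d0)):   # real array, virtual array
        r = math.hypot(x - ax, y - ay)
        cos_theta = (x - ax) / r
        # Time shift across the aperture, rounded to the nearest sample.
        bins.append(round(d_max * fs * cos_theta / V_SOUND))
    return tuple(bins)

def prob_both_localized(d0, d_max, fs,
                        room=((-3.0, 3.0), (-4.0, -0.5)),
                        trials=20000, seed=0):
    """Monte Carlo estimate of P(S_A, S_B) for uniformly placed sources."""
    rng = random.Random(seed)
    (x0, x1), (y0, y1) = room
    ok = 0
    for _ in range(trials):
        a = doa_bins((rng.uniform(x0, x1), rng.uniform(y0, y1)), d0, d_max, fs)
        b = doa_bins((rng.uniform(x0, x1), rng.uniform(y0, y1)), d0, d_max, fs)
        if not set(a) & set(b):        # no DoA of B shares a bin with a DoA of A
            ok += 1
    return ok / trials
```

Under these assumptions, comparing a call with a small aperture and low sampling rate against a call with a larger aperture and higher sampling rate shows the trend stated after Eq. (15): finer bins leave fewer overlapping source pairs.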
Fig. 19: Heatmap of Symphony’s localization error. [Labels: Mic Array, Blind Region, Television; distance from the array 0.5 m to 2.5 m; horizontal positions 1.0 m to 6.0 m.]

play voice commands via a portable speaker at different positions.

Localization under clean condition. We place the 4-mic array 0.4 m away from the wall. We conduct experiments at night, when it is very quiet in the living room. The volume of background noise is lower than 20 dB SPL. Fig. 21 shows the localization errors of Symphony and VoLoc. The median error of VoLoc is 0.314 m, which is slightly better than that of Symphony, 0.387 m. The slight gap in performance arises because VoLoc uses a fine-grained but exhaustive search method to localize the source, which produces more accurate results in the ideal case.

Localization under noisy condition. We place the 6-mic array 0.4 m away from the wall. We conduct the experiment

Fig. 21: Localization error on ideal conditions (4-mic). [CDF against Localization Error (m), 0 to 3.0; curves: Symphony, VoLoc.]
Fig. 22: Localization error w/ or w/o noise (6-mic). [CDF against Localization Error (m), 0 to 3.0; curves: Symphony w/o Noise, VoLoc w/o Noise, Symphony w/ Noise, VoLoc w/ Noise.]
Fig. 23: Localization error of three sources. [CDF against Localization Error (m), 0 to 3.0; curves: Source A, Source B, Source C.]
Fig. 24: The frequency of overlapping.
Fig. 25: DoA estimation error.
Fig. 26: Processing time.
measure the DoAs from the collided signals.

Localization with Multiple Arrays or Anchors. Distributed microphone / antenna arrays have been used to localize various sources, including smartphones [34], [35], WiFi clients [5], RFID tags [6], birds [7], and bumblebees [8].

DoA Estimation. Symphony uses a popular method, GCC-PHAT [16], to compute CCF. In fact, Symphony might be extended to other DoA estimation algorithms like Multiple Signal Classification (MUSIC [36]) and Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT [37]), because the underlying problem is the same: the ambiguity of peaks, whether they come from the CCF (GCC-PHAT) or from the pseudo-spectrum (MUSIC and ESPRIT).

12 DISCUSSION

• 3D Localization. Symphony virtually contains two microphone arrays, the real one and the virtual one, and thus only supports localization in 2D.
• Non-Line-of-Sight. A source is localized as the intersection of the LOS path and the ECHO path. When the LOS path is blocked, Symphony will fail to localize the source.
• The Array-Wall Distance. If the array is far away from the wall, some problems will arise: (1) ECHO would experience more attenuation and its strength would not be comparable to LOS. (2) ECHO may not be the second arriving path: the path reflected by the wall close to the source may be shorter than ECHO. On the other hand, if the array is too close to the wall, Symphony will also perform poorly because the real array is close to the virtual array, and the far-field effect happens again.
• Number of Microphones. Symphony requires at least 3 microphones to exploit geometric redundancy. For smart speakers with only two microphones (e.g., Google Home), Symphony might not be applicable.
• Moving Objects. Symphony assumes that sources are static while they are active. Localizing moving objects is a separate problem, which is out of our scope.
• Number of Sources. Symphony does not explicitly guarantee that a certain number of sources must be localized; it only provides a "best-effort" service. The localization procedure does not require that a certain number of sources be found. Instead, Symphony finds as many pure-peak combinations as possible, and thus localizes as many sources as possible. This implicitly determines the number of sources.
• Identification of Sources. Symphony enables an array only to localize the sources, but not to identify the sources (i.e., it does not determine which device or person each source corresponds to).

13 CONCLUSION

We demonstrate the feasibility of using a single microphone array to localize multiple acoustic sources concurrently. We believe Symphony will enable new applications for location-aware services. In our design, we passively exploit the ECHO path to tackle the problem of far-field effect. We may further develop this idea by actively customizing the surroundings of the array, thus introducing more predictable multi-paths. Based on these, we may be able to extract more spatial information and thus localize the source more precisely.

REFERENCES

[1] G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang, and W. Xu, “Dolphinattack: Inaudible voice commands,” in Proceedings of ACM Conference on Computer and Communications Security, 2017.
[2] D. Kumar, R. Paccagnella, P. Murley, E. Hennenfent, J. Mason, A. Bates, and M. Bailey, “Skill squatting attacks on Amazon Alexa,” in Proceedings of USENIX Security Symposium, 2018.
[3] N. Roy, H. Hassanieh, and R. R. Choudhury, “Backdoor: Making microphones hear inaudible sounds,” in Proceedings of ACM MobiSys, 2017.
[4] T. Sugawara, B. Cyr, S. Rampazzi, D. Genkin, and K. Fu, “Light commands: Laser-based audio injection attacks on voice-controllable systems,” in Proceedings of USENIX Security Symposium, 2020.
[5] J. Xiong and K. Jamieson, “Arraytrack: A fine-grained indoor location system,” in Proceedings of USENIX NSDI, 2013.
[6] J. Wang, D. Vasisht, and D. Katabi, “RF-IDraw: Virtual touch screen in the air using RF signals,” 2014.
[7] T. C. Collier, A. N. G. Kirschel, and C. E. Taylor, “Acoustic localization of antbirds in a Mexican rainforest using a wireless sensor network,” The Journal of the Acoustical Society of America, vol. 128, no. 1, pp. 182–189, 2010.
[8] V. Iyer, R. Nandakumar, A. Wang, S. B. Fuller, and S. Gollakota, “Living IoT: A flying wireless platform on live insects,” in Proceedings of ACM MobiCom, 2019.
[9] P. Corbalán, G. P. Picco, and S. Palipana, “Chorus: UWB concurrent transmissions for GPS-like passive localization of countless targets,” in Proceedings of ACM/IEEE IPSN, 2019.
[10] I. . W. Group, “IEEE standard for local and metropolitan area networks, part 15.4: Low-rate wireless personal area networks (LR-WPANs),” IEEE STD, vol. 802, pp. 4–2011, 2011.
[11] Q. Lin, Z. An, and L. Yang, “Rebooting ultrasonic positioning systems for ultrasound-incapable smart devices,” in Proceedings of ACM MobiCom, 2019.
[12] S. Shen, D. Chen, Y. Wei, Z. Yang, and R. R. Choudhury, “Voice localization using nearby wall reflections,” in Proceedings of ACM MobiCom, 2020.
[13] Govee, “Govee 32.8ft LED strip lights works with Alexa Google Home,” https://www.amazon.com/Govee-Wireless-Control-Kitchen-Million/dp/B07WHP2V77/, 2020, accessed: 2020-10-02.
[14] BlissLights, “home theater lighting,” https://www.amazon.com/BlissLights-Sky-Lite-Projector-Bedroom/dp/B084DCF429/, 2020.
[15] J. C. Curlander and R. N. McDonough, Synthetic aperture radar. Wiley, New York, 1991.
[16] C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976.
[17] J. Benesty, J. Chen, and Y. Huang, Microphone array signal processing. Springer Science & Business Media, 2008, vol. 1.
[18] W. Wang, J. Li, Y. He, and Y. Liu, “Symphony: localizing multiple acoustic sources with a single microphone array,” in Proceedings of ACM SenSys, 2020.
[19] S. M. Goldfeld and R. E. Quandt, Nonlinear methods in econometrics. North-Holland Pub. Co., 1972.
[20] W. Mao, J. He, and L. Qiu, “Cat: high-precision acoustic motion tracking,” in Proceedings of ACM MobiCom, 2016, pp. 69–81.
[21] Seeed, “Respeaker 4-mic linear array kit for Raspberry Pi,” https://wiki.seeedstudio.com/ReSpeaker_4-Mic_Linear_Array_Kit_for_Raspberry_Pi/, 2020.
[22] Seeed, “Respeaker 6-mic circular array kit,” https://wiki.seeedstudio.com/ReSpeaker_6-Mic_Circular_Array_kit_for_Raspberry_Pi/, 2020.
[23] M. Wang, W. Sun, and L. Qiu, “MAVL: multiresolution analysis of voice localization,” in USENIX NSDI, April 12-14, 2021, 2021.
[24] C. Cai, H. Pu, P. Wang, Z. Chen, and J. Luo, “We hear your PACE: passive acoustic localization of multiple walking persons,” ACM IMWUT, vol. 5, no. 2, pp. 55:1–55:24, 2021.
[25] I. An, M. Son, D. Manocha, and S. Yoon, “Reflection-aware sound source localization,” in Proceedings of IEEE International Conference on Robotics and Automation, 2018.
[26] I. Dokmanić, R. Parhizkar, A. Walther, Y. M. Lu, and M. Vetterli, “Acoustic echoes reveal room shape,” Proceedings of the National Academy of Sciences, vol. 110, no. 30, pp. 12186–12191, 2013.
[27] M. Krekovic, I. Dokmanic, and M. Vetterli, “EchoSLAM: Simultaneous localization and mapping with acoustic echoes,” in Proceedings of IEEE ICASSP, 2016.
[28] C. Zhang, F. Li, J. Luo, and Y. He, “iLocScan: harnessing multipath for simultaneous indoor source localization and space scanning,” in Proceedings of ACM SenSys, Memphis, Tennessee, USA, November 3-6, 2014, 2014, pp. 91–104.
[29] P. Hu, P. Zhang, and D. Ganesan, “Laissez-faire: Fully asymmetric backscatter communication,” in Proceedings of ACM SIGCOMM, 2015.
[30] J. Ou, M. Li, and Y. Zheng, “Come and be served: parallel decoding for COTS RFID tags,” in Proceedings of ACM MobiCom, 2015.
[31] M. Jin, Y. He, X. Meng, Y. Zheng, D. Fang, and X. Chen, “Fliptracer: Practical parallel decoding for backscatter communication,” in Proceedings of ACM MobiCom, 2017.
[32] M. Jin, Y. He, X. Meng, D. Fang, and X. Chen, “Parallel backscatter in the wild: When burstiness and randomness play with you,” in Proceedings of ACM MobiCom, 2018.
[33] M. Jin, Y. He, C. Jiang, and Y. Liu, “Fireworks: Channel estimation of parallel backscattered signals,” in Proceedings of ACM/IEEE IPSN, 2020.
[34] D. B. Haddad, W. A. Martins, M. d. V. Da Costa, L. W. Biscainho, L. O. Nunes, and B. Lee, “Robust acoustic self-localization of mobile devices,” IEEE Transactions on Mobile Computing, vol. 15, no. 4, pp. 982–995, 2015.
[35] K. Liu, X. Liu, and X. Li, “Guoguo: Enabling fine-grained smartphone localization via acoustic anchors,” IEEE Transactions on Mobile Computing, vol. 15, no. 5, pp. 1144–1156, 2015.
[36] R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, 1986.
[37] R. H. Roy III and T. Kailath, “ESPRIT-estimation of signal parameters via rotational invariance techniques,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 984–995, 1989.

Yuan He is an associate professor in the School of Software and BNRist of Tsinghua University. He received his B.E. degree in the University of Science and Technology of China, his M.E. degree in the Institute of Software, Chinese Academy of Sciences, and his PhD degree in Hong Kong University of Science and Technology. His research interests include wireless networks, Internet of Things, pervasive and mobile computing. He is a member of IEEE and ACM.