
Sound Source Localization Using Deep Residual Networks

*Nelson Yalta (Waseda University), Kazuhiro Nakadai (Honda Research Institute Japan Co., Ltd), and Tetsuya Ogata
(Waseda University)

1. Introduction

Machines and robots have become part of humans' everyday life. For natural interaction with humans, robots should therefore have auditory functions [1] that can be used for Sound Source Localization (SSL), where it is necessary not only to recognize sound locations but also to detect sound events. However, environmental factors such as background noise, moving sound sources, and room reverberation change dynamically in the real world [2], and the number of microphones and the acoustic properties of the robot's body further complicate SSL.

Conventional methods to improve the performance of robot SSL use subspace-based methods such as Multiple Signal Classification (MUSIC) [3]; however, the performance of these methods depends on the number of microphones used for the task and degrades as the signal-to-noise ratio decreases.

Recently, Deep Neural Network (DNN) and Deep Convolutional Neural Network (DCNN) approaches have shown performance that even exceeds human capability on different tasks. In this paper, we refine a DCNN structure using deep residual learning for SSL tasks.

2. Sound Source Localization

2.1. MUSIC
Multiple Signal Classification (MUSIC) [3] is a method for SSL in which the localization process uses representations of the energy of the signals and of the time difference of arrival of the signal. These representations, called steering vectors, can be obtained in the frequency domain using measurements or physical models. Implementations of MUSIC are based on Standard Eigenvalue Decomposition (SEVD-MUSIC). Because of its low computation cost, SEVD-MUSIC is easy to implement on robots and provides reliable, easily detectable peaks [4], but it works only when the assumption that the target sound sources have sufficiently high power compared to the noise is satisfied. Generalized Eigenvalue Decomposition (GEVD-MUSIC) solves the mislocalization problem caused by high-powered noise sources by determining the correlation matrix in terms of the noise. Because the sounds and the noises are cross-correlated in most robot SSL cases, GEVD can be adopted in the definition of the MUSIC spatial spectrum. With this, all noises are suppressed, making GEVD-MUSIC a noise-robust method [4].

2.2. Task
MUSIC is a robust method for SSL in relatively noisy environments; however, its performance depends on the number of microphones in the array, and it can drop drastically when the signal-to-noise ratio is low.

3. Implementation

3.1. Deep Residual Network
Recent research gives evidence that network depth is important for challenging tasks. However, the problem of vanishing/exploding gradients presents an obstacle to implementing very deep networks. As the network depth increases, the training accuracy saturates, which presents itself as a degradation problem. Deep Residual Networks (DRN) were introduced in [5] to address this degradation. Residual learning can be denoted as

F(x) := H(x) - x,  (1)

where H(x) is the desired underlying mapping. Residual learning lets the layers fit a residual mapping F(x), instead of hoping that each few stacked layers directly fit a desired underlying mapping. The original mapping is then recast as F(x) + x. This formulation can be implemented by a feedforward neural network using shortcut connections (Fig. 1). A shortcut is presented as an identity mapping that skips one or more layers; because it adds neither computational complexity nor extra parameters, the network can be trained end-to-end using a common library. To limit training time, deeper bottleneck blocks are also introduced in [5]: a stack of 3 layers is used for each residual function F, with 1x1, 3x3, and 1x1 kernel convolutional layers used to reduce and then restore the dimensions.

3.2. Audio Feature Extraction
The input data was prepared using the Short-Time Fourier Transform (STFT), and a label with the angle was set as the target. For the input, we transformed normalized N-channel audio with a 16 kHz sampling rate into STFT features, extracting frames with a length of 400 samples (25 ms) and a hop of 160 samples (10 ms).
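Eq. (1) in Sec. 3.1 reduces to a one-line identity shortcut: the layers learn only the residual F, and the block outputs F(x) + x. The sketch below illustrates that idea in NumPy; the toy branch F, its random weights, and the ELU helper are our illustrative stand-ins, not the paper's trained network.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU activation (the activation used in most of the networks here)."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def residual_block(x, f):
    """y = F(x) + x: the identity shortcut adds the input back to the
    residual branch F, so the layers only have to fit F(x) = H(x) - x.
    The shortcut itself adds no parameters and no extra computation."""
    return f(x) + x

# Toy residual branch F; the weights are random placeholders for
# illustration, not the paper's trained parameters.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 8))
W2 = rng.standard_normal((8, 8))
F = lambda v: 0.01 * (W2 @ elu(W1 @ v))  # small scale keeps outputs tame

x = rng.standard_normal(8)
y = residual_block(x, F)           # same shape as x
# If F were identically zero, the block would be an exact identity.
```

Note that when the residual branch outputs zero, the block degenerates to the identity mapping, which is what makes very deep stacks trainable without degradation.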

Fig. 1 Residual Learning


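The framing described in Sec. 3.2 can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: the Hann window and the 512-point FFT (which yields the 257 frequency bins mentioned in Sec. 4.2) are choices of ours, not stated in the paper.

```python
import numpy as np

def stft_power(signal, frame_len=400, hop=160, n_fft=512):
    """Power spectrogram of a mono 16 kHz signal.

    A 400-sample frame (25 ms) and a 160-sample hop (10 ms) match the
    settings in Sec. 3.2; n_fft=512 is an assumption that yields the
    257 frequency bins mentioned in Sec. 4.2.
    """
    window = np.hanning(frame_len)  # window choice is our assumption
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)  # (n_frames, 257)
    return np.abs(spectrum) ** 2                     # keep power only

# One second of 16 kHz audio -> 98 frames of 257 bins each
x = np.random.randn(16000)
print(stft_power(x).shape)
```

Per Sec. 3.2, only the power of the complex spectrum is kept; the per-frequency-bin normalization to [0.1, 0.9] and the stacking of H consecutive frames into the N x W x H network input would follow this step.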
From the STFT features we used only the power information, which, after removing the imaginary part, was normalized over the W frequency-bin dimension to a range between 0.1 and 0.9. Since a single STFT frame of the audio carries information corrupted by the noise, we stacked H frames, so the input of the network has a dimension of N x W x H.

3.3. Proposed Model
We used two DRNs, implemented with 1 residual block (ResNet1) and 2 residual blocks (ResNet2), trained with supervised learning. After the input, two 1x1 kernel convolutional layers with TanH activation are used, and they are then connected to the residual block. Each residual block contains 3 bottlenecks, making 9 convolutional layers per residual block. Then, 6 convolutional layers with an empirically chosen kernel are used; the empirical kernel size was set to 45x4. With this configuration, ResNet1 implements 17 convolutional layers.

Each convolutional layer is followed by a Batch Normalization layer, and the ELU activation was used throughout the rest of the networks. Nevertheless, a TanH activation was also evaluated on the first residual block of ResNet2. A max-pooling layer was used in the middle of the network and another at the end of the convolutional layers. Finally, after the second max-pooling layer, 2 fully connected layers were implemented. To evaluate the performance of residual learning, we also trained a plain network with the same number of layers as ResNet1, for the same number of iterations.

4. Evaluation

4.1. Audio Dataset
For training and test we used the Acoustical Society of Japan - Japanese Newspaper Article Sentence (ASJ-JNAS) corpora, which include Japanese utterances of different lengths from 216 speakers for training and 42 speakers for test. To prepare the training dataset, we used the impulse responses of the HEARBO robot, equipped with a 16-channel microphone array, recorded every 5 degrees in a room of 4x7 meters with 200 ms of reverberation. A clean single-channel utterance was convolved with the multichannel impulse response of a random angle, and white noise was then added to each channel at an SNR ranging from clean data down to -30 dB. The noise differs across channels in order to approximate real environmental noise.

4.2. Networks Training Process
For our experiments, the STFT of the audio data gives an input of 16 audio channels x 257 frequency bins x 20 frames. We trained 4 different networks using the ADAM solver and SoftMax with cross entropy as the loss function. The initial alpha of the solver was set to 1e-4, and the target label ranged from 0 to 355 degrees. However, the output of the network was set to 361 dimensions: outputs 0 to 359 activate for their respective angle labels, and output 360 activates when the input data contains no audio information. In this way, the network can not only locate the sound-source angle but also detect whether a sound source is present.

Fig. 2 Results

4.3. Results
The results of our experiments are shown in Fig. 2, where the mean block accuracy of the median angle was evaluated with a tolerance of 0 degrees. For the evaluation, 50 audio files (25 from females and 25 from males) were used and tested every 5 degrees from 0 to 355 degrees. A similar white noise was added to all channels in an SNR range from clean data down to -35 dB, so the evaluated noise differs from that used in the training process. We used HARK to evaluate SEVD-MUSIC. From each audio file, we obtained the predicted median angle and compared it with the true angle.

5. Conclusion
In this paper, we proposed an SSL method using a DRN. We showed not only that a DCNN can achieve better block accuracy than SEVD-MUSIC in noisy environments, but also that a DRN performs better than a plain DCNN with the same number of layers. The performance of the networks does not drop even when the noise differs from that on which the networks were trained.

ACKNOWLEDGEMENTS
This work has been supported by MEXT Grant-in-Aid for Scientific Research (A) 15H01710.

REFERENCES
[1] K. Nakadai, T. Lourens, G. O. Hiroshi, and H. Kitano, "Active audition for humanoid," Proc. Natl. Conf. Artif. Intell., pp. 832-839, 2000.

[2] K. Nakadai, H. Nakajima, Y. Hasegawa, and H. Tsujino, "Sound source separation of moving speakers for robot audition," ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., pp. 3685-3688, 2009.
[3] R. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Trans. Antennas Propag., vol. 34, no. 3, pp. 276-280, 1986.
[4] K. Nakamura, K. Nakadai, F. Asano, Y. Hasegawa, and H. Tsujino, "Intelligent Sound Source Localization for Dynamic Environments," pp. 664-669, 2009.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," Arxiv.Org, vol. 7, no. 3, pp. 171-180, 2015.
