This study proposes the use of a deep neural network to localize a sound source using an array of microphones in a reverberant environment. During the last few years, applications based on deep neural networks have performed various tasks, such as image classification or speech recognition, at levels that exceed even human capabilities. In our study, we employ deep residual networks, which have recently shown remarkable performance in image classification tasks even when the training period is shorter than that of other models. Deep residual networks are used to process audio input in a manner similar to multiple signal classification (MUSIC) methods. We show that with end-to-end training and generic preprocessing, the performance of deep residual networks not only surpasses the block-level accuracy of linear models in nearly clean environments but also shows robustness to challenging conditions by exploiting the time delay in the power information.

Keywords: sound source localization, deep learning, deep residual networks

1. Introduction

Speech is one of the most important means of communication between humans. To participate in a conversation, humans usually require information about the source of the speech. Regardless of the number of sources, humans look in the direction of the source(s) to continue interacting. In audio processing, the search for the location of the source (i.e., sound source localization (SSL)) is a generic problem formulated as part of the cocktail party effect, which humans can naturally solve by obtaining the source location and then filtering the desired speech information.

Machines and robots have become part of everyday life. Thus, for natural interactions with humans, robots should have auditory functions [1] that can be used for SSL, for which recognizing sound locations and detecting sound events are necessary. Environmental factors such as background noise, sound sources in motion, and room reverberations change dynamically in the real world [2]. In addition, the number of microphones and the acoustic properties of a robot complicate SSL.

Conventional methods to implement and improve the performance of SSL include subspace-based methods such as multiple signal classification (MUSIC) [3]. In these methods, the localization process employs representations of the energy of the signals as well as the time difference of arrival of those signals. These representations are called steering vectors and can be obtained in the frequency domain by means of measurements [4] or physical models [5]. While MUSIC uses only sound information, other methods employ audio-visual data for tracking sound sources in real time [6]. One study revealed that when using both visual data and the pitch extracted from a binaural auditory system, a conversation between multiple sources can be tracked [6]. However, a binaural system can be used only for two-dimensional localization. By contrast, a microphone array can be used for an SSL task in three dimensions [7]. In that study, the time delay of arrival was used to estimate the source location, but the system worked only when the source was located within 3 to 5 m of the array. The use of a pitch-cluster-map for sound identification was introduced in [8], in which the sound source localization was based on delay-and-sum beamforming using a microphone array of 32 channels, and main-lobe fitting was employed for multiple sources. The system could obtain the location of not only speech events but also non-voice sounds. MUSIC was implemented in [4, 9] based on standard eigenvalue decomposition (SEVD-MUSIC) and generalized eigenvalue decomposition (GEVD-MUSIC). These implementations revealed that performing SSL was possible even in noisy environments while incurring a relatively low computation cost.

Recently, deep neural network (DNN) and deep convolutional neural network (DCNN) approaches have led to major breakthroughs in different signal processing fields such as speech recognition [10, 11], natural language processing [12], and computer vision [13]. Many improvements have occurred in the field of computer vision using deep learning.
Deep residual networks (DRNs) [14] have shown the best performance on the ImageNet Large Scale Visual Recognition Challenge 2015.

In this study, we refined a method for performing SSL tasks by replacing the MUSIC method, which in some cases requires a pre-calculation of the environment, with deep learning (DL), while maintaining a robust implementation in challenging environments without a major increase in the learning cost (e.g., the amount of memory usage and the computational time of the optimization process).

The different implementations of MUSIC have performed real-time SSL in noisy environments. However, the performance of these methods depends on the number of microphones used for the task and diminishes as the signal-to-noise ratio (SNR) decreases. In addition, to improve the performance of SSL using GEVD-MUSIC, the correlation matrix for noise must be pre-calculated for known noises. In the case of unknown noises, the performance of GEVD-MUSIC may drop to the same accuracy level as that of SEVD-MUSIC, thus showing low robustness at a low SNR.

A DNN-based model used in a flexible arrangement of a microphone array was introduced in [15] to implement SSL tasks. In that study, the power and phase information of the audio was exploited to improve SSL. The use of both power and phase information is more important for a multiple-channel audio source because the location calculation is affected by noise and reflections. However, implementing a model with greater input information requires a larger number of parameters, which increases the processing time and the learning cost. Moreover, DCNNs have performed better than DNNs in speech signal processing tasks; in addition, because of their shared weights, the number of parameters to be trained can be reduced. Even with a lack of sound reflections and when using only the delay information between channels, the study in [16] showed that obtaining the direction of arrival (DOA) is possible using a DCNN. However, using a single frame increases the difficulty of finding the source location because of the lack of phase information in environments having a high noise level or a greater number of reflections. Therefore, for a time-delay-based model, using multiple frames is recommended to improve the performance in noisy environments [17]. Furthermore, although DCNN approaches perform well at classification tasks [13, 18], the slow learning process, the long training time caused by possible fine-tuning requirements, and the fact that a DCNN model may not be reusable in other tasks make DCNN approaches less suitable for a flexible task such as SSL.

In this study, we focus on the following three aspects of DL implementation for SSL tasks:

1. Replacement: A subspace-based method such as MUSIC, which is commonly used for SSL, requires a steering vector from a multiple-channel signal to process the localization. This method can be replaced with a DCNN model by using the power information from frequency-domain frames of a multiple-channel audio signal as input and by providing the location of the sound source as the target. For this replacement, the model does not require information about the environment or the input noise.

2. Robustness: Real-world environments change dynamically, and their noises affect audio signals. Thus, DCNNs should perform accurate SSL at different SNR levels.

3. Learning efficiency: The performance of a DL model also depends on the training time. However, a larger training set does not ensure good performance because of problems that arise in the training process, such as overfitting and training accuracy saturation [14].

The remainder of this paper is organized as follows. We first present our proposed method for sound source localization and then explain residual learning and its use for SSL. We next describe the implementation and training of DRNs, and demonstrate the effectiveness of DRNs experimentally. Finally, we present the conclusion of our study and offer suggestions for future research.

2. Proposed Method

2.1. Deep Convolutional Neural Network

We implement a DCNN to replace MUSIC, using real-world impulse responses and end-to-end training to perform the SSL task. Our DCNN uses only the power information from several frequency-domain frames of a multiple-channel audio stream to obtain the location of a sound source; adding environmental information is not required to perform the task. During the last decade, factors such as the commercialization of high-performance, low-cost graphics processing units (GPUs), the development of improved optimization algorithms, and the availability of open-source libraries have enabled the employment of DL for classification and recognition problems. DL, which attempts to model high-level abstractions, uses artificial neural networks with multiple hidden layers of units between the input and output layers to model complex nonlinear representations. This structure is known as a DNN. A principal characteristic of a DNN is its ability to self-organize sensory features from extensive training data.

Recently, convolutional layers, which employ a different layer configuration, were implemented in networks and were successful at several signal processing tasks [10–13]. The neurons of convolutional layers, instead of being fully connected, are arranged in spatially organized feature maps. A configuration that uses one or two convolutional layers inside a neural network is known as a convolutional neural network (CNN). A DCNN, as defined previously, uses several convolutional layers. The CNN proposed in [19] employed a typical configuration consisting of:

a) convolutional layers, in which feature maps were obtained by convolving the input x with a kernel filter W,
Fig. 1. DCNN structure.

Fig. 2. [Residual learning block: weight layers computing F(x), an identity shortcut carrying x, and output F(x) + x.]
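As a minimal illustration of such a convolutional layer (our sketch, not the authors' code; the input shape is a hypothetical single-channel spectrogram), one feature map can be computed as:

```python
import numpy as np
from scipy.signal import correlate2d

# One feature map: slide a kernel filter W over the input x, then apply a
# nonlinearity. Deep learning "convolutions" are typically cross-correlations.
x = np.random.randn(257, 20)   # e.g., one channel: frequency bins x frames
W = np.random.randn(3, 3)      # a 3x3 kernel filter
feature_map = np.maximum(correlate2d(x, W, mode="valid"), 0.0)  # ReLU nonlinearity
print(feature_map.shape)       # (255, 18)
```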
Even in challenging conditions with higher noise, models implemented with the novel activation functions perform better than conventional methods.

The introduction of novel nonlinear activation functions has improved the accuracy and performance of DCNNs (Fig. 3) [18, 29], allowing DCNNs to surpass human classification performance. In contrast to conventional sigmoid-like activations, rectifier nonlinearities (e.g., the rectified linear unit (ReLU)) have been used with considerable success for computer vision [13] and sound tasks [30, 31]. The ReLU activation function, f(x) = max(0, x), has led to better solutions. However, because its activation outputs are non-negative, the mean activation is greater than zero; thus, the function is not centered, which slows down learning [32]. Therefore, to speed up learning and improve performance, a generalization of ReLU called the parametric rectified linear unit (PReLU) was introduced in [18]. That study claimed that using PReLU in a DCNN can surpass human performance. PReLU is defined as:

$$f(x_i) = \begin{cases} x_i & \text{if } x_i \ge 0 \\ a_i x_i & \text{if } x_i < 0, \end{cases} \qquad (6)$$

where $a_i$ is a coefficient that controls the slope of the negative part and can be a learnable parameter or a fixed value. PReLU can improve model accuracy by including negative values in the activation output, with minimal overfitting risk and negligible additional computational cost. Nevertheless, PReLU does not ensure a robust deactivation state. The exponential linear unit (ELU) [29] was introduced as a nonlinearity that can improve learning characteristics compared with other linear activation functions. ELU is defined as:

$$f(x) = \begin{cases} x & \text{if } x \ge 0 \\ a(\exp(x) - 1) & \text{if } x < 0, \end{cases} \qquad (7)$$

where exp represents the exponential function. A network implemented with ELU activations considerably accelerates learning and shows more robust generalization performance than ReLUs and PReLUs.
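For reference, Eqs. (6) and (7) can be implemented directly; the following is a minimal NumPy sketch of ours, where the slope and scale coefficients are arbitrary example values rather than trained parameters:

```python
import numpy as np

def prelu(x, a=0.25):
    """Parametric ReLU, Eq. (6): x if x >= 0, else a * x."""
    return np.where(x >= 0, x, a * x)

def elu(x, a=1.0):
    """Exponential linear unit, Eq. (7): x if x >= 0, else a * (exp(x) - 1)."""
    return np.where(x >= 0, x, a * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(x))  # [-0.5 -0.125 0. 1.5]
print(elu(x))    # approx. [-0.8647 -0.3935 0. 1.5]
```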
Fig. 4. Network architecture. Left: plain network. Right: residual network. The dotted line shortcut represents a 1 × 1 convolutional layer. [Layer labels, top to bottom: 1×1 conv, 32; 3×3 conv, 32; 1×1 conv, 32; 45×4 conv, 32; max pool 1×4; 45×4 conv, 32; 45×4 conv, 64; 45×4 conv, 256; 27×2 conv, 512; fc 2048; fc 361.]

2.4. Model Architecture

DCNNs and convolutional layers trained for residual learning can improve the performance and robustness of a plain network without additional learning costs. Therefore, we implement models (Fig. 4) with 1–4 residual blocks, named ResNet1–4, respectively (see Table 1), and train them by means of supervised learning. After the input, two 1 × 1 kernel convolutional layers with TanH and ELU activations are stacked and then connected to the residual blocks. We also evaluate larger models using only a convolutional layer after the input. According to [14], a
deeper bottleneck can accelerate learning. Therefore, we use three bottlenecks as a residual block, in which each bottleneck contains a set of three layers (Fig. 2 right). We then add a plain network with six convolutional layers having an empirically sized kernel, which we evaluated prior to conducting the experiments. The empirical kernel size is set to 45 × 4 for the first five layers, and then a 27 × 2 kernel convolutional layer is stacked. With this configuration, a ResNet with one residual block is implemented with 19 layers.

Each convolutional layer is followed by a batch normalization [27] layer, and an ELU [29] activation is used for the rest of the network. A TanH activation is evaluated on the first residual block of the ResNet with two residual blocks. A max-pooling layer with a kernel size of 1 × 4 is used after the first 45 × 4 kernel convolutional layer, and another max-pooling layer with a kernel size of 4 × 1 is used after the third 45 × 4 convolutional layer. To evaluate the performance of residual learning, we also train a plain network (PlainNet) with the same number of layers and iterations as ResNet1.
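As a concrete sketch (ours, under stated assumptions, not the authors' implementation), one bottleneck of the kind described above can be written in PyTorch as follows; the 1 × 1/3 × 3/1 × 1 layer pattern and the 32-channel width follow the labels of Fig. 4, while the padding is our choice to keep the feature-map size fixed:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """One 1x1 -> 3x3 -> 1x1 bottleneck with an identity shortcut,
    batch normalization, and ELU activations (output: F(x) + x)."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ELU(),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.body(x) + x)  # residual addition

x = torch.randn(1, 32, 257, 20)  # (batch, channels, frequency bins, frames)
print(Bottleneck(32)(x).shape)   # torch.Size([1, 32, 257, 20])
```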
3. Experiments

3.1. Data Preparation and Preprocessing

Fig. 5. Array of microphones. Left: HEARBO robot. Right: microcone.

For training and evaluation, we used the Acoustic Society of Japan-Japanese Newspaper Article Sentence (ASJ-JNAS) corpora, which include Japanese utterances of different lengths from 216 speakers for training and from 42 speakers for evaluation. To prepare the training dataset, we used impulse responses from a HEARBO robot (Fig. 5 left) equipped with microphone arrays of 8 and 16 channels, obtained every 5° in a 4 × 7 m room with 200 ms of reverberation. A clean single-channel utterance was convoluted with the multiple-channel impulse response of a random angle, and white noise was added to each channel in a range between clean data and −30 dB of SNR. The noise added to each channel was different in order to simulate real environmental noise. The input data for the network were prepared using the short-time Fourier transform (STFT), and a label pointing to the angle or to silence was set as the target. The input was preprocessed according to the following steps:

• A normalized N-channel audio signal with a 16-kHz sampling rate was transformed into STFT features, extracting frames with a length of 400 samples (25 ms) and a hop of 160 samples (10 ms). The length and hop values were based on previous research on speech tasks [22].

• From the STFT, we used only the power information. The power was normalized along the W frequency-bin axis, in a range between 0.1 and 0.9.

• Because one STFT frame from the audio contains information corrupted by noise, we stacked H frames. Thus, the input of the network became an N × W × H dimensional file.
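The input preprocessing above can be sketched as follows (a minimal Python illustration under our assumptions: an FFT size of 512, which yields the 257 frequency bins mentioned in Section 3.2, and min-max scaling along each frequency bin; function and variable names are ours):

```python
import numpy as np
import librosa

def preprocess(wav_paths, H=20, eps=1e-8):
    """Build one network input: N channels x 257 frequency bins x H frames,
    using only the STFT power, scaled into the range [0.1, 0.9]."""
    feats = []
    for path in wav_paths:                       # one file per microphone channel
        y, _ = librosa.load(path, sr=16000)      # 16-kHz sampling rate
        # 400-sample (25 ms) window, 160-sample (10 ms) hop; n_fft=512 -> 257 bins
        spec = librosa.stft(y, n_fft=512, win_length=400, hop_length=160)
        power = np.abs(spec[:, :H]) ** 2         # keep only power; stack H frames
        lo = power.min(axis=-1, keepdims=True)   # normalize along each frequency bin
        hi = power.max(axis=-1, keepdims=True)
        feats.append(0.1 + 0.8 * (power - lo) / (hi - lo + eps))
    return np.stack(feats)                       # shape: (N, 257, H)
```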
Regarding the output, we prepared the angle label as follows (Fig. 6):

• From the clean audio used to prepare the input multiple-channel audio, an STFT file with the same dimensions as the input but with only one channel (1 × W × H) was obtained.

• From this file, we evaluated the root mean squared (RMS) power of all frames.

• If several frames F had an RMS value of −120, the target label pointed to the silence label index. Otherwise, the target label was set to the angle used to prepare the input audio.

Fig. 6. Output preprocessing. Top: STFT clean audio. Bottom: speech-angle activation label.

In addition, to evaluate the performance of residual learning, we trained a model using the impulse responses from a microcone (Fig. 5 right) in an environment with many reflections (Fig. 7). The microcone is a seven-channel microphone array manufactured by BIAMP.

3.2. Training Method

For our experiments, we used audio data employing STFT to prepare multiple training files, the input for which was an array having dimensions of 7/8/16 × 257 × 20 (i.e., audio channels, frequency bins, and frames); the corresponding label output had an integer value. From each audio file, a random number of prepared files were selected. To maintain a uniform distribution between the angle targets, 33750 files were selected for each angle. We thus prepared a dataset of 2430000 files. For the distribution, we mixed files with sound and no-sound information. Simultaneously, noise with a random SNR level was added to each audio file. The SNR values of the input audio were selected from Clean, 30, 10, 5, 0, −5, −10, −20, and −30 dB.

We trained six end-to-end networks using the ADAM solver [33] and SoftMax with cross entropy as the loss function. No fine-tuning or previous training was used for the experiments. The initial alpha for the solver was set to 10^−4, and the target label was set from 0 to 359°. The output of the network was set to 361 dimensions, where outputs 0 to 359 were activated with their respective angle labels, and the 360th output was activated when the input data contained no audio information. The value of F for evaluating no-audio input data was empirically set to 13 frames. As a result, the network could not only locate the sound source angle but also activate the correct output in the absence of sound information. For the experiments, we did not use the dropout function on the fully connected layers. We trained shorter models for five epochs and larger models for two epochs, using a mini-batch of only 70 files. Each epoch required approximately 24 h on an NVIDIA Titan X GPU.
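A minimal sketch of this training configuration (ours, not the original code; the `model` below is a stand-in placeholder for the networks of Section 2.4):

```python
import torch
import torch.nn as nn

NUM_CLASSES = 361                      # outputs 0-359 degrees plus a "no audio" class
model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 257 * 20, NUM_CLASSES))  # placeholder
criterion = nn.CrossEntropyLoss()      # SoftMax with cross entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # ADAM, alpha = 10^-4

x = torch.randn(70, 8, 257, 20)        # one mini-batch of 70 prepared files
t = torch.randint(0, NUM_CLASSES, (70,))  # angle label, or 360 for silence
optimizer.zero_grad()
loss = criterion(model(x), t)
loss.backward()
optimizer.step()
```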
3.3. Evaluation Criteria

We evaluated the networks using two different measures. First, we evaluated the accuracy of an audio file regardless of the number of frames generated from it. In this block accuracy evaluation, we forwarded the input files preprocessed from an audio file and then stacked the outputs. From the outputs, we evaluated the median angle with respect to the target angle of the file using a confusion matrix, and then calculated the mean accuracy of each network at the corresponding SNR level. We did not consider whether the input file contained sound or not.
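As we read the block accuracy evaluation, it can be sketched as follows (our interpretation; the exact-match criterion and all names are our assumptions):

```python
import numpy as np

def block_prediction(frame_logits):
    """Stack the per-frame network outputs of one audio file and take the
    median of the per-frame argmax angles as the block-level estimate."""
    angles = np.argmax(frame_logits, axis=1)     # one predicted class per input
    return int(np.median(angles))

def block_accuracy(files_logits, targets):
    preds = [block_prediction(f) for f in files_logits]
    return np.mean(np.array(preds) == np.array(targets))

# Example: 3 audio files, each producing several 361-dimensional outputs.
rng = np.random.default_rng(0)
files = [rng.random((5, 361)) for _ in range(3)]
print(block_accuracy(files, targets=[10, 90, 200]))
```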
Table 4. Microcone (array of 7 microphones) block accuracy.

SNR (dB)   SEVD-MUSIC   ResNet1
−35        1.39         1.39
−30        1.39         1.39
−25        1.39         1.39
−20        1.42         1.39
−15        2.64         2.06
−10        9.10         6.42
−5         21.16        16.94
0          36.12        33.61
5          50.99        49.69
10         66.75        64.61
15         81.26        78.11
30         97.42        95.36
45         98.01        97.39
Clean      98.17        97.50

Table 5. HEARBO detection rate.

SNR (dB)   Plain Network   ResNet1   ResNet2 TanH/ELU   ResNet3   ResNet4
−35        −86.13          13.16     −8.26              −10.59    −47.06
−30        −80.72          15.92     −3.43              −8.92     −19.47
−25        −71.65          21.11     8.75               −1.08     0.72
−20        −48.6           32.64     28.64              −4.02     14.29
−15        0.18            53.23     49.37              7.18      27.59
−10        34.45           68.95     59.53              36.74     27.65
−5         51.93           74.93     64.16              55.00     25.23
0          61.73           76.26     66.8               63.04     32.04
5          67.69           76.19     70.02              68.93     46.94
10         71.99           76.75     72.66              73.58     61.84
15         75.55           78.74     75.08              76.98     72.17
30         81.21           82.82     82.44              83.26     83.03
45         84.22           84.94     85.54              85.24     85.33
Clean      84.14           84.75     85.23              85.11     84.66
…greater than 50% and then fell). These results are very similar to those of the block accuracy evaluation, which can be used as a reference for future works. Most of the models showed large negative accuracy values at lower SNR levels; thus, the networks either displaced the estimated location or detected ghost sources.

5. Discussion

5.1. Reverberation Rooms

When distant microphones were used to capture an audio stream, the signals were affected by environmental noise and the room's reverberation. This observation can be modeled as follows [9]:

$$x(\omega) = D(\omega)s(\omega) + n(\omega), \qquad (10)$$

where x(ω) is the observed microphone signal vector with M observations at frequency ω, denoted [x_1(ω) x_2(ω) ··· x_M(ω)]^T; D(ω) is the transfer function matrix between the array of microphones and a sound source; s(ω) is the clean speech signal vector with N sources at frequency ω, denoted [s_1(ω) s_2(ω) ··· s_N(ω)]^T, where ^T denotes the transpose operator; and n(ω) is a noise vector containing diffuse and dynamically changing colored noise. Note that n(ω) is statistically independent of s(ω).
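A toy numerical illustration of Eq. (10) (ours; the dimensions, transfer functions, and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 1                                   # 8 microphones, 1 sound source
D = rng.normal(size=(M, N)) + 1j * rng.normal(size=(M, N))      # transfer functions
s = rng.normal(size=(N,)) + 1j * rng.normal(size=(N,))          # clean source spectrum
n = 0.1 * (rng.normal(size=(M,)) + 1j * rng.normal(size=(M,)))  # independent noise
x = D @ s + n                                 # observed microphone signals, Eq. (10)
print(x.shape)                                # (8,)
```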
Table 3 suggests that the DL model also uses the reverberation of the room, or that having an input with a similar or higher time dimension could improve the SSL. We compared HEARBO arrays with eight microphones at different distances using the same reverberation time (Table 3). In this case, the accuracy of the performance was reduced when the source was closer. However, Fig. 10 shows that the DL models used most of the power from the direct sound and compared the level of that sound between the channels, rather than measuring the power of the reverberations at each channel.

Fig. 10. ResNet1 output features. Left: input data. Center: conv1 output. Right: conv2 output.

Fig. 10 presents the features from conv1 and the residual block conv2 of ResNet1 and shows that, after training, the model boosted the power from the direct sound. In addition, after conv2, the additional information (i.e., reverberation or noise) was nearly reduced to zero.

After comparing the model with the others given in Table 4, where the reverberation time of the former is greater (approximately 500 ms), we observed that the performance of the model also decreased. In this case, it is possible that the model required a longer window or more stacked frames at the input to obtain a length similar to the reverberation time. However, this makes increasing the hyperparameters of the DL models necessary, thereby increasing both the learning cost and the real-time processing time.

6. Conclusion

In this study, we proposed an SSL method that uses a DRN to exploit the time delay between channels to predict the DOA. Replacing MUSIC with a DRN model using ELU activations for SSL not only reduced the learning cost but also made the performance more robust to untrained noises at lower SNR. We showed not only that a DCNN can achieve better block accuracy than SEVD-MUSIC in noisy environments, but also that a DRN with ELU performs better than a plain DCNN network with the same number of layers. The performance of the networks did not diminish even when a noise different from that used during training was applied. However, stacking additional residual layers improved neither the convergence of the loss nor the accuracy rate of deeper models during a short training period. DRN models can achieve acceptable performance within a shorter training period, even when the models are very deep. However, to obtain robust performance in challenging environments, the models require an additional learning cost.

We plan to study residual learning in the application of multiple-source SSL and to improve the accuracy in environments with greater reflections and fewer microphones. In addition, as future work, we plan to implement sound source separation tasks using the models employed for SSL.
Name:
Kazuhiro Nakadai

Affiliation:
Honda Research Institute Japan Co., Ltd.
Tokyo Institute of Technology

Address:
8-1 Honcho, Wako-shi, Saitama 351-0188, Japan
2-12-1-W30 Ookayama, Meguro-ku, Tokyo 152-8552, Japan

Brief Biographical History:
1995 Received M.E. from The University of Tokyo
1995-1999 Engineer, Nippon Telegraph and Telephone and NTT Comware
1999-2003 Researcher, Kitano Symbiotic Systems Project, ERATO, JST
2003 Received Ph.D. from The University of Tokyo
2003-2009 Senior Researcher, Honda Research Institute Japan Co., Ltd.
2006-2010 Visiting Associate Professor, Tokyo Institute of Technology
2010- Principal Researcher, Honda Research Institute Japan Co., Ltd.
2011- Visiting Professor, Tokyo Institute of Technology
2011- Visiting Professor, Waseda University

Main Works:
• K. Nakamura, K. Nakadai, and H. G. Okuno, "A real-time super-resolution robot audition system that improves the robustness of simultaneous speech recognition," Advanced Robotics, Vol.27, Issue 12, pp. 933-945, 2013 (Received Best Paper Award).
• H. Miura, T. Yoshida, K. Nakamura, and K. Nakadai, "SLAM-based Online Calibration for Asynchronous Microphone Array," Advanced Robotics, Vol.26, No.17, pp. 1941-1965, 2012.
• R. Takeda, K. Nakadai, T. Takahashi, T. Ogata, and H. G. Okuno, "Efficient Blind Dereverberation and Echo Cancellation based on Independent Component Analysis for Actual Acoustic Signals," Neural Computation, Vol.24, No.1, pp. 234-272, 2012.
• K. Nakadai, T. Takahashi, H. G. Okuno et al., "Design and Implementation of Robot Audition System "HARK"," Advanced Robotics, Vol.24, No.5-6, pp. 739-761, 2010.
• K. Nakadai, D. Matsuura, H. G. Okuno, and H. Tsujino, "Improvement of recognition of simultaneous speech signals using AV integration and scattering theory for humanoid robots," Speech Communication, Vol.44, pp. 97-112, 2004.

Membership in Academic Societies:
• The Robotics Society of Japan (RSJ)
• The Japanese Society for Artificial Intelligence (JSAI)
• The Acoustic Society of Japan (ASJ)
• Information Processing Society of Japan (IPSJ)
• Human Interface Society (HIS)
• International Speech and Communication Association (ISCA)
• The Institute of Electrical and Electronics Engineers (IEEE)

Name:
Tetsuya Ogata

Affiliation:
Professor, Faculty of Science and Engineering, Waseda University

Address:
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan

Brief Biographical History:
1997- Research Fellow, The Japanese Society for the Promotion of Science (JSPS)
1999- Research Associate, Waseda University
2001- Research Scientist, Brain Science Institute, RIKEN
2003- Lecturer, Kyoto University
2005- Associate Professor, Kyoto University
2012- Professor, Waseda University

Main Works:
• "Repeatable Folding Task by Humanoid Robot Worker using Deep Learning," IEEE Robotics and Automation Letters (RA-L), Vol.2, No.2, pp. 397-403, 2016.
• "Dynamical Integration of Language and Behavior in a Recurrent Neural Network for Human-Robot Interaction," Frontiers in Neurorobotics, July 15, 2016.
• "Multimodal Integration Learning of Robot Behavior using Deep Neural Networks," Robotics and Autonomous Systems, Vol.62, No.6, pp. 721-736, 2014.

Membership in Academic Societies:
• The Robotics Society of Japan (RSJ)
• The Japanese Society for Artificial Intelligence (JSAI)
• The Japan Society of Mechanical Engineers (JSME)
• Information Processing Society of Japan (IPSJ)
• The Society of Instrument and Control Engineers (SICE)
• Society of Biomechanisms Japan
• The Institute of Electrical and Electronics Engineers (IEEE)