Domain Information Control at Inference Time For Acoustic Scene Classification
arXiv:2306.08010v1 [cs.SD] 13 Jun 2023

Abstract—Domain shift is considered a challenge in machine learning as it causes significant degradation of model performance. In the Acoustic Scene Classification (ASC) task, domain shift is mainly caused by different recording devices. Several studies have already targeted domain generalization to improve the performance of ASC models on unseen domains, such as new devices. Recently, the Controllable Gate Adapter (ConGater) has been proposed in Natural Language Processing to address the biased training data problem. ConGater allows controlling the debiasing process at inference time. ConGater's main advantage is the continuous and selective debiasing of a trained model during inference. In this work, we adapt ConGater to the audio spectrogram transformer for an acoustic scene classification task. We show that ConGater can be used to selectively adapt the learned representations to be invariant to device domain shifts such as recording devices. Our analysis shows that ConGater can progressively remove device information from the learned representations and improve the model generalization, especially under domain shift conditions (e.g. unseen devices). We show that information removal can be extended to both device and location domain. Finally, we demonstrate ConGater's ability to enhance specific device performance without further training¹.

Index Terms—Acoustic Scene Classification, Domain Adaptation, Transformers, Adapters, ConGater

This work received financial support by the State of Upper Austria and the Federal Ministry of Education, Science, and Research through grant LIT-2021-YOU-215 and basic funding of the LIT AI Lab.
¹ Source Code: https://github.com/ShawMask/dcase22 congater

I. INTRODUCTION

Domain Generalization is a critical topic in Deep Neural Networks (DNN). The performance of conventional DNN methods drastically degrades under domain shift conditions when evaluated on new domains [1]. Therefore, the ability of machine learning models to generalize to these new and unseen domains is crucial in real-world applications. Methods to improve the generalization of DNNs to a new domain are well studied in different fields, such as computer vision [2], audio perception [3] and natural language processing [4].

This problem of domain generalization has drawn the attention of the Detection and Classification of Acoustic Scenes and Events (DCASE) community, and datasets were constructed to test the generalization of common machine learning models under domain shift conditions [5].

Convolutional Neural Networks (CNNs) have dominated the acoustic scene classification literature [3], [6]. More recently, vision transformers were adapted to audio tasks and shown to outperform CNNs on ASC tasks [7]. Schmid et al. [8] fine-tuned a transformer model, pre-trained on a large dataset, on an ASC dataset and used it as a teacher for knowledge distillation to low-complexity CNNs. Kim et al. [9] adapted a CNN model to this task. In order to address the domain generalization problem, Schmid et al. [8] used Frequency Mixstyle data augmentation to improve the performance on unseen devices. Kim et al. [3] introduced ResNorm in BC-ResNet and Relaxed Instance Frequency Normalization (RFIN) as a new normalization method to achieve state-of-the-art results on unseen devices. Another approach to tackle the Domain Generalization issue is unsupervised Domain Adaptation (DA). Using Wasserstein distance, Drossos et al. [10] learned domain invariant feature representations and improved the result of the ASC model on an unseen domain. Gharib et al. [11] also used adversarial training to learn domain invariant representations to improve the performance of the ASC model on unseen devices.

Invariant representation learning is also widely studied in the field of Natural Language Processing (NLP) on scenarios such as domain adaptation [4], domain transfer [12], and mitigation of societal biases. In particular, the bias mitigation methods aim at removing sensitive information (e.g. Gender) from the DNN embeddings to improve fairness. The introduced approaches share many conceptual and methodological commonalities with domain invariant representation in DA, such as the utilization of adversarial techniques [13]. These approaches are widely studied in NLP literature [14]–[16]. Recent studies in this field focus on modular neural networks, where end-users can choose between debiased and biased models. These methods approach this by adding a separate module such as Adapters [17], [18], or sparse subnetworks [19], [20] to the network. In this case, instead of training the whole model, these separate modules are trained to improve training efficiency and add a modular capability to the network [18]. More recently, the Controllable Gate Adapter [21] (ConGater) expands the mentioned work by providing the ability for continuous sensitive information removal from the trained model.

In this work, we adopt the ConGater idea from NLP and apply it to audio tasks. By adding ConGater in between audio spectrogram transformer layers, we aim to remove device and location information to improve the generalization of audio spectrogram transformers on ASC tasks under domain shift conditions. We show that (1) ConGater can effectively
control the amount of information of device and location in the network on a continuous range; (2) by removing information from the device, we achieve on average 0.7% higher overall accuracy compared to the baseline model, and 1.1% accuracy improvement on unseen devices, indicating the better generalization of the network; (3) our method can simultaneously be applied to both device and location domains; additionally, we demonstrate ConGater's ability to fine-tune on a specific device by selecting suitable hyper-parameters at inference time without any need for further training.

II. METHOD

For our experiments, we choose a transformer-based model, PaSST [7], that performs very well on a wide range of audio tasks, including ASC tasks [22]. PaSST has a similar structure to Vision Transformers [23]. PaSST works by extracting patches of an input spectrogram and linearly projecting them to a higher dimension corresponding to the embedding size. The positional encodings, consisting of time and frequency positional encodings, are added to these embeddings. Patchout [7] is then applied to speed up the training and for regularization. The sequence is then passed through several self-attention layers and a feed-forward classifier. In our experiments, we use the model pre-trained on Audioset [24] and then fine-tune it on the ASC task. Since ConGater [21] was introduced as modules added to the transformer encoder layers of the BERT [25] language model, we similarly extend PaSST with ConGater by adding the ConGater modules after the attention blocks of PaSST, as explained in the following.

ConGater [21] extends the core idea of Adapter networks [17] by introducing a novel controllable gating mechanism, applied to input embeddings to deliver the desired learning objective such as learning a task or mitigating bias. Each ConGater layer consists of a feed-forward network followed by a novel activation function called trajectory-sigmoid (t-sigmoid). t-sigmoid is similar to the normal sigmoid function, but with the extra sensitivity parameter ω, as formulated below:

$$\text{t-sigmoid}(v) = 1 - \frac{\log_2(\omega + 1)}{1 + e^{v}}, \quad \omega \in [0, 1] \tag{1}$$

The parameter ω can be manipulated manually at inference time and smoothly changes the shape of the activation function. At ω = 0, t-sigmoid is the constant function y = 1. As ω increases, the activation function smoothly reshapes to a sigmoid function, adding a stronger non-linearity to the input.

In order to remove information of several domains, a ConGater is dedicated to each domain and added to the network. For instance, for the location domain, the gating vector g_location is obtained by applying the feed-forward layer followed by t-sigmoid. The overall gate output is the element-wise multiplication of the vectors, which in our case is defined as g = g_device ⊙ g_location. The final output of the layer is defined as the self-gate between the original input to ConGater (namely the output of the attention layer, denoted by x) and the final gating vector: output = x ⊙ g. Note that the self-gate between any input and the ConGater of a domain with ω_d = 0 does not affect the input. By increasing ω_d, the effect of removing this specific domain, independent of the other domain(s), increases. In our transformer-based architecture, as shown in Figure 1, we add one ConGater layer after each of the attention blocks.

Fig. 1: Our proposed model using ConGater modules, the location of the ConGater layers, and the trainable parameters in each training step. The green blocks are trained with the loss of the task (L_t), the blue blocks are trained adversarially to remove device information using the task loss and the reversed gradient loss of the device (L_d), and the orange blocks are also trained adversarially to remove location information using the task loss and the reversed gradient location loss (L_l). For each domain, the average loss of the three adversarial heads is considered as the domain loss.

The training sequence for the proposed model consists of three steps, executed sequentially as each batch of training data arrives. The first step is Task Training, where the parameters of PaSST in addition to the task head are trained with ω_device = 0 and ω_location = 0. The second step is Device Removal, in which the parameters of the device ConGater in addition to the task head are trained with ω_device = 1 and ω_location = 0 to learn the task and remove the device-related information. Finally, in the third step, Location Removal, the parameters of the location ConGater in addition to the ones of the task head are trained with ω_device = 0 and ω_location = 1 in order to learn the task and remove location-related information. We utilized the Domain Adversarial Neural Network (DANN) [13] to remove information from embeddings. In this method, an additional sub-network (adversarial head) is added to predict device/location labels from the original network embeddings. The reversed gradient loss is added to the network as an extra objective (L_total = L_task + L_adv). This objective, because of the gradient reversal, aims at removing
the domain information by maximizing the domain confusion in the original network embedding.

III. EXPERIMENT SETUP

Dataset. The dataset used for our experiments is the adapted version of the TAU Urban Acoustic Scenes 2022 Mobile development dataset used for the Acoustic Scene Classification (ASC) task [5] in the DCASE 2022 challenge. The dataset contains 1-second audio recordings of 10 different acoustic scenes. The development set consists of audio recordings from 9 different devices: 3 real devices (A, B, C) and 6 simulated devices (S1-S6). The audio is recorded in 12 different European cities. We split 70% of the development set for training, in which only 6 devices (A, B, C, S1-S3) are used for training and the other 3 (S4-S6) are only seen during validation, which we refer to as unseen devices. Similarly for the cities, during training 10 cities are available and 2 cities are available only during validation.

Preprocessing. We use a sampling rate of 32 kHz. We apply the Short Time Fourier Transformation (STFT) with a window size of 800 and a hop size of 320 (40% of the window size) to generate the spectrograms. We apply a mel filter bank with 128 bands in a setup similar to [8]. The final spectrograms have 128 mel-frequency bins and 100 time frames.

Training. We train each model with a batch size of 100 with cross-entropy loss, the AdamW optimizer, a weight decay of 0.001, and a max learning rate of 1 × 10^−5 for the PaSST model and 1 × 10^−4 for the ConGater layers. We use a PaSST model pre-trained on Audioset [24]. We initialize all ConGater layers with zero weights and bias b = 5. This initialization forces the output of the ConGater layers to be approximately 1 at the beginning of training. Each model is trained for 25 epochs with the following learning rate scheduler: for the first 3 epochs, the learning rate increases exponentially to its max value; in the next 3 epochs, the model is trained with the max learning rate; and finally, for the next 10 consecutive epochs, the learning rate decreases to reach 0.01 of the max value. Adversarial training is used for the ConGater layers to remove device/location information. In order to have a smooth adversarial gradient flow during training, we average the loss of three adversarial heads. No normalization, augmentation, or dataset balancing is applied, in order to be able to check the direct effect of information removal on the baseline model where ω_device = 0 and ω_location = 0.

In this section, we discuss and analyze the results of our experiments. As mentioned in the previous sections, we train our proposed model, and the two independent ConGater modules are trained to remove the domain information of the device and location. More concretely, at inference time each ConGater parameter ω is used to remove the corresponding domain information from the embeddings of the network by increasing its value from zero to one. The parameters ω_device and ω_location control the intensity of the domain adaptation process for the device and the location, respectively. In our experiments, we mainly focus on the effect of ω_device on the network, namely in terms of device information leakage and model performance evaluated on all and unseen devices. In addition, we report and analyze the results when controlling simultaneously over both device and location domains. To account for the possible variations in the results, we repeat each experiment three times and report the average and standard deviation of the evaluation results.

A. Device Information

In order to check whether domain information is indeed removed from the network, we evaluate the degree of information leakage using additional classifier heads. This method is commonly referred to as probing in the NLP literature [15]. Each probe is a two-layer feed-forward network with a ReLU activation function and is trained on the output embedding of the last attention layer of the network, before the task head. The probe aims to classify device labels from the embeddings. We retrain the probe for each value of ω_device. The probes are trained for 5 epochs with a learning rate of 1 × 10^−4. We calculate the balanced accuracy of the device classification on the validation set and report the average and standard deviation of the results attained from three independent runs.

[Figure 2 panels: (a) Device probe balanced accuracy, (b) Task accuracy on unseen devices, (c) Task accuracy on all devices; x-axis: ω_device from 0.0 to 1.0.]
Fig. 2: (a) Balanced accuracy of the device probe as we increase ω_device. (b) Model accuracy on unseen devices (S4-S6) as we increase only ω_device. (c) Overall accuracy of the model on the validation set as we increase only ω_device.

Figure 2a shows that at ω_device = 0, namely the Baseline with no information removal from the model, the average balanced accuracy of the probe for the device label is 60.7%. This indicates that the embeddings of the model are highly informative about the recording device of the incoming audio. As we increase ω_device and retrain the probes, the balanced accuracy consistently drops, indicating that the device information is decreasing in the embeddings. At ω_device = 1, we reach the lowest balanced accuracy of 17.9%, showing that the ConGater module has successfully removed most of the
information about the device, as – in contrast to the Baseline

                        ------- Unseen Devices -------   ------------------------ Seen Devices -------------------------
Model                   S4         S5         S6         A          B          C          S1         S2         S3         Unseen     Overall
Baseline                56.9±0.6   57.3±0.4   50.9±1.3   72.6±0.3   63.5±0.5   67.5±0.4   58.0±0.9   55.6±0.8   58.2±0.5   55.1±0.2   60.1±0.1
ω_d = 0.6, ω_l = 0.0    57.4±0.5   58.1±0.4   51.9±0.8   72.7±1.2   63.5±0.4   67.7±0.3   59.4±0.7   56.2±0.6   58.7±0.3   55.8±0.1   60.6±0.2
ω_d = 0.7, ω_l = 0.0    57.3±0.4   58.0±0.2   52.2±1.0   72.6±1.1   63.5±0.6   67.8±0.3   59.5±0.5   56.2±0.6   58.8±0.3   55.9±0.3   60.7±0.1
ω_d = 0.9, ω_l = 0.0    57.5±0.3   57.9±0.4   52.7±0.9   72.8±0.7   63.6±0.2   67.5±0.4   59.7±0.6   56.6±0.7   59.2±0.6   56.1±0.3   60.8±0.1
ω_d = 0.9, ω_l = 0.1    57.4±0.2   58.1±0.5   52.7±1.0   72.7±0.7   63.8±0.6   67.6±0.4   59.5±1.0   56.3±0.4   59.0±0.4   56.1±0.5   60.8±0.3
ω_d = 1.0, ω_l = 0.1    57.7±0.3   58.1±0.7   53.0±1.0   72.8±0.3   63.7±0.7   67.3±0.5   59.4±0.9   56.1±0.4   59.3±0.6   56.3±0.6   60.8±0.3
Device-specific Tuning  57.9±0.3   58.1±0.4   53.0±1.0   72.8±0.3   63.8±0.6   67.8±0.3   59.7±0.6   56.6±0.7   59.3±0.6   56.3±0.6   -
achieved in [ω_d, ω_l]  [1.0,0.1]  [0.6,0.0]  [1.0,0.1]  [1.0,0.1]  [0.9,0.1]  [0.7,0.0]  [0.9,0.0]  [0.9,0.0]  [1.0,0.1]  [1.0,0.1]  -

TABLE I: Comparison of the performance of our ConGater-based model on target devices under various values of the ω parameters (accuracy in %, mean ± standard deviation). We group devices as Seen and Unseen, indicating whether they exist during training or only appear in evaluation. The subscripts d and l refer to device and location, respectively. Baseline is the case with ω_d = ω_l = 0.0, and Device-specific Tuning refers to dynamically finding the best performing ω combination for a specific target device.
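The Device-specific Tuning row of Table I amounts to picking, for each target device, the ω combination with the highest accuracy. The following is a minimal sketch of that selection, assuming the per-device accuracies are already available as a dictionary. The `results` layout and the helper `best_setting` are hypothetical names used only for illustration. The numbers are the mean accuracies for devices C, S1 and S6 copied from Table I, restricted to the ω combinations listed there.

```python
# Mean accuracies (%) copied from Table I for three devices.
# Keys are (omega_d, omega_l) pairs; this dictionary layout is an assumption.
results = {
    (0.0, 0.0): {"C": 67.5, "S1": 58.0, "S6": 50.9},  # Baseline
    (0.6, 0.0): {"C": 67.7, "S1": 59.4, "S6": 51.9},
    (0.7, 0.0): {"C": 67.8, "S1": 59.5, "S6": 52.2},
    (0.9, 0.0): {"C": 67.5, "S1": 59.7, "S6": 52.7},
    (0.9, 0.1): {"C": 67.6, "S1": 59.5, "S6": 52.7},
    (1.0, 0.1): {"C": 67.3, "S1": 59.4, "S6": 53.0},
}


def best_setting(results, device):
    """Return the (omega_d, omega_l) pair with the highest accuracy for `device`."""
    return max(results, key=lambda omegas: results[omegas][device])


# No retraining is involved: only the inference-time omegas change per device.
best = {device: best_setting(results, device) for device in ("C", "S1", "S6")}
for device, omegas in best.items():
    print(device, omegas, results[omegas][device])
```

For these three devices the selected pairs are (0.7, 0.0), (0.9, 0.0) and (1.0, 0.1), which agrees with the "achieved in" row of Table I.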
of information removal can be decided at inference time over a continuous range.
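To make the continuous control concrete, here is a minimal sketch of the t-sigmoid from Eq. (1) and of the self-gate output = x ⊙ g with g = g_device ⊙ g_location. It is written in plain Python and is not the authors' implementation. The function names and the toy vectors are hypothetical: `x` stands in for an attention-block output, and `v_device` and `v_location` stand in for the pre-activations produced by the two feed-forward layers.

```python
import math


def t_sigmoid(v: float, omega: float) -> float:
    """Trajectory-sigmoid from Eq. (1): 1 - log2(omega + 1) / (1 + e^v)."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    return 1.0 - math.log2(omega + 1.0) / (1.0 + math.exp(v))


def gate(x, v_device, v_location, omega_device, omega_location):
    """Self-gate x * g_device * g_location, applied element-wise."""
    return [
        xi * t_sigmoid(vd, omega_device) * t_sigmoid(vl, omega_location)
        for xi, vd, vl in zip(x, v_device, v_location)
    ]


# Hypothetical toy values, chosen only for illustration.
x = [0.5, -1.2, 2.0]
v_device = [1.5, -0.3, 0.0]
v_location = [0.2, 2.0, -1.0]

# Sweep omega_device over a continuous range while omega_location stays at 0.
for omega_device in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(omega_device, gate(x, v_device, v_location, omega_device, 0.0))
```

At ω_device = 0 the gate is the constant 1 and the input passes through unchanged. At ω_device = 1 the device gate equals the ordinary sigmoid of its pre-activation, and intermediate values interpolate between the two.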