Domain Information Control at Inference Time For Acoustic Scene Classification


Shahed Masoudian (1,2), Khaled Koutini (2), Markus Schedl (1,2), Gerhard Widmer (1,2), Navid Rekabsaz (1,2)
(1) Johannes Kepler University Linz, Institute of Computational Perception
(2) Linz Institute of Technology (LIT), AI Lab
Austria
arXiv:2306.08010v1 [cs.SD] 13 Jun 2023

Abstract: Domain shift is considered a challenge in machine learning as it causes significant degradation of model performance. In the Acoustic Scene Classification (ASC) task, domain shift is mainly caused by different recording devices. Several studies have already targeted domain generalization to improve the performance of ASC models on unseen domains, such as new devices. Recently, the Controllable Gate Adapter (ConGater) has been proposed in Natural Language Processing to address the biased training data problem. ConGater allows controlling the debiasing process at inference time. ConGater's main advantage is the continuous and selective debiasing of a trained model during inference. In this work, we adapt ConGater to the audio spectrogram transformer for an acoustic scene classification task. We show that ConGater can be used to selectively adapt the learned representations to be invariant to device domain shifts such as recording devices. Our analysis shows that ConGater can progressively remove device information from the learned representations and improve the model generalization, especially under domain shift conditions (e.g. unseen devices). We show that information removal can be extended to both device and location domains. Finally, we demonstrate ConGater's ability to enhance specific device performance without further training¹.

Index Terms: Acoustic Scene Classification, Domain Adaptation, Transformers, Adapters, ConGater

¹ Source Code: https://github.com/ShawMask/dcase22 congater

This work received financial support by the State of Upper Austria and the Federal Ministry of Education, Science, and Research through grant LIT-2021-YOU-215 and basic funding of the LIT AI Lab.

I. INTRODUCTION

Domain Generalization is a critical topic in Deep Neural Networks (DNN). The performance of conventional DNN methods drastically degrades under domain shift conditions when evaluated on new domains [1]. Therefore, the ability of machine learning models to generalize to these new and unseen domains is crucial in real-world applications. Methods to improve the generalization of DNNs to a new domain are well studied in different fields, such as computer vision [2], audio perception [3] and natural language processing [4]. This problem of domain generalization has drawn the attention of the Detection and Classification of Acoustic Scenes and Events (DCASE) community, and datasets were constructed to test the generalization of common machine learning models under domain shift conditions [5].

Convolutional Neural Networks (CNNs) have dominated the acoustic scene classification literature [3], [6]. More recently, vision transformers were adapted to audio tasks and shown to outperform CNNs on ASC tasks [7]. Schmid et al. [8] fine-tuned a transformer model pre-trained on a large dataset on an ASC dataset and used it as a teacher for knowledge distillation to low-complexity CNNs. Lee et al. [9] adapted a CNN model to this task. In order to address the domain generalization problem, Schmid et al. [8] used Frequency Mixstyle data augmentation to improve the performance on unseen devices. Kim et al. [3] introduced ResNorm in BC-ResNet and Relaxed Instance Frequency Normalization (RFIN) as a new normalization method to achieve state-of-the-art results on unseen devices. Another approach to tackle the Domain Generalization issue is unsupervised Domain Adaptation (DA). Using Wasserstein distance, Drossos et al. [10] learned domain invariant feature representations and improved the result of the ASC model on an unseen domain. Gharib et al. [11] also used adversarial training to learn domain invariant representations to improve the performance of the ASC model on unseen devices.

Invariant representation learning is also widely studied in the field of Natural Language Processing (NLP) on scenarios such as domain adaptation [4], domain transfer [12], and mitigation of societal biases. In particular, the bias mitigation methods aim at removing sensitive information (e.g. gender) from the DNN embeddings to improve fairness. The introduced approaches share many conceptual and methodological commonalities with domain invariant representation in DA, such as the utilization of adversarial techniques [13]. These approaches are widely studied in NLP literature [14]–[16]. Recent studies in this field focus on modular neural networks, where end-users can choose between debiased and biased models. These methods achieve this by adding a separate module such as Adapters [17], [18], or sparse subnetworks [19], [20] to the network. In this case, instead of training the whole model, only these separate modules are trained, which improves training efficiency and adds a modular capability to the network [18]. More recently, the Controllable Gate Adapter [21] (ConGater) expands the mentioned work by providing the ability for continuous sensitive information removal from the trained model.

In this work, we adopt the ConGater idea from NLP and apply it to audio tasks. By adding ConGater in between audio spectrogram transformer layers, we aim to remove device and location information to improve the generalization of audio spectrogram transformers on ASC tasks under domain shift conditions. We show that (1) ConGater can effectively
control the amount of information of device and location in the network on a continuous range; (2) by removing information from the device, we achieve on average 0.7 percentage points higher overall accuracy compared to the baseline model, and a 1.1 percentage point accuracy improvement on unseen devices, indicating the better generalization of the network; (3) our method can simultaneously be applied to both device and location domains; additionally, we demonstrate ConGater's ability to tune for a specific device by selecting a suitable hyper-parameter at inference time without any need for further training.

Fig. 1: Our proposed model using ConGater modules, the location of the ConGater layers, and the trainable parameters in each training step. The green blocks are trained with the loss of task (L_t), the blue blocks are trained adversarially to remove device information using task loss and reversed gradient loss of the device (L_d), the orange blocks are also trained adversarially to remove location information using task and reversed gradient location loss (L_l). For each domain, the average loss of the three adversarial heads is considered as domain loss.

II. METHOD

For our experiments, we choose a transformer-based model, PaSST [7], that performs very well on a wide range of audio tasks, including ASC tasks [22]. PaSST has a similar structure to Vision Transformers [23]. PaSST works by extracting patches of an input spectrogram and linearly projecting them to a higher dimension corresponding to the embedding size. Positional encodings, consisting of time and frequency positional encodings, are added to these embeddings. Patchout [7] is then applied to speed up the training and for regularization. The sequence is then passed through several self-attention layers and a feed-forward classifier. In our experiments, we use the model pre-trained on Audioset [24] and then fine-tune it on the ASC task. Since ConGater [21] was introduced as modules added to the transformer encoder layers of the BERT [25] language model, we similarly extend PaSST with ConGater by adding the ConGater modules after the attention blocks of PaSST, as explained in the following.

ConGater [21] extends the core idea of Adapter networks [17] by introducing a novel controllable gating mechanism, applied to input embeddings to deliver the desired learning objective such as learning a task or mitigating bias. Each ConGater layer consists of a feed-forward network followed by a novel activation function called trajectory-sigmoid (t-sigmoid). The t-sigmoid is similar to the standard sigmoid function, but with an extra sensitivity parameter ω, as formulated below:

\[ \text{t-sigmoid}(v) = 1 - \frac{\log_2(\omega + 1)}{1 + e^{v}}, \qquad \omega \in [0, 1] \tag{1} \]

The parameter ω can be manipulated manually at inference time and smoothly changes the shape of the activation function. At ω = 0, the t-sigmoid is the constant function y = 1. As ω increases, the activation function smoothly reshapes into a sigmoid function (reached exactly at ω = 1), adding a stronger non-linearity to the input.

In order to remove information of several domains, a ConGater is dedicated to each domain and added to the network. For instance, for the location domain, the gating vector g_location is obtained by applying the feed-forward layer followed by t-sigmoid. The overall gate output is the element-wise multiplication of the vectors, which in our case is defined as g = g_device ⊙ g_location. The final output of the layer is defined as the self-gate between the original input to ConGater (namely the output of the attention layer, denoted by x) and the final gating vector: output = x ⊙ g. Note that the self-gate between any input and the ConGater of a domain with ω_d = 0 does not affect the input, because the corresponding gate is then constant at 1. By increasing ω_d, the effect of removing this specific domain, independent of the other domain(s), increases. In our transformer-based architecture, as shown in Figure 1, we add one ConGater layer after each of the attention blocks.

The training sequence for the proposed model consists of three steps, executed sequentially as each batch of training data arrives. The first step is Task Training, where the parameters of PaSST in addition to the task-head are trained with ω_device = 0 and ω_location = 0. The second step is Device Removal, in which the parameters of the device ConGater in addition to the task-head are trained with ω_device = 1 and ω_location = 0 to learn the task and remove the device-related information. Finally, in the third step, Location Removal, the parameters of the location ConGater in addition to the ones of the task-head are trained with ω_device = 0 and ω_location = 1 in order to learn the task and remove location-related information. We utilize the Domain Adversarial Neural Network (DANN) [13] approach to remove information from embeddings. In this method, an additional sub-network (adversarial head) is added to predict device/location labels from the original network embeddings. The reversed gradient loss is added to the network as an extra objective (L_total = L_task + L_adv). This objective, because of the gradient reversal, aims at removing
Fig. 2: (a) Balanced accuracy of the device probe as we increase ω_device. (b) Model accuracy on unseen devices (S4-S6) as we increase only ω_device. (c) Overall accuracy of the model on the validation set (all devices) as we increase only ω_device. In all three panels the horizontal axis is ω_device, from 0.0 to 1.0.
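Fig. 2 sweeps ω_device from 0 to 1 while ω_location stays at 0. What such a sweep does to a single ConGater layer follows from Eq. (1) and the gating rule output = x ⊙ g with g = g_device ⊙ g_location given in Section II. The following is a minimal pure-Python sketch of that rule, not the authors' implementation: the function names, the 3-dimensional embedding and the gate pre-activations are hypothetical values chosen for illustration (in the model, the pre-activations come from the per-domain feed-forward layers).

```python
import math


def t_sigmoid(v, omega):
    """Eq. (1): 1 - log2(omega + 1) / (1 + e^v), with omega in [0, 1]."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega is expected to lie in [0, 1]")
    # No overflow handling: math.exp raises OverflowError for very large v.
    return 1.0 - math.log2(omega + 1.0) / (1.0 + math.exp(v))


def congater_output(x, v_device, v_location, omega_device, omega_location):
    """output = x * g with g = g_device * g_location, all element-wise."""
    g_device = [t_sigmoid(v, omega_device) for v in v_device]
    g_location = [t_sigmoid(v, omega_location) for v in v_location]
    return [xi * gd * gl for xi, gd, gl in zip(x, g_device, g_location)]


# Hypothetical embedding and gate pre-activations (illustration only).
x = [0.5, -1.0, 2.0]
v_device = [2.0, 0.0, -2.0]
v_location = [1.0, 1.0, 1.0]

# Sweep omega_device over the same grid as the horizontal axis of Fig. 2.
for omega_device in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    out = congater_output(x, v_device, v_location, omega_device, 0.0)
    print(omega_device, [round(o, 3) for o in out])
```

With both ω values at 0 every gate equals 1 and the layer returns its input unchanged, which corresponds to the Baseline setting. At ω_device = 1 the device gate coincides with the ordinary sigmoid of its pre-activation, so dimensions with strongly negative pre-activations are suppressed the most.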
the domain information by maximizing the domain confusion in the original network embedding.

III. EXPERIMENT SETUP

Dataset. The dataset used for our experiments is the adapted version of the TAU Urban Acoustic Scenes 2022 Mobile development dataset used for the Acoustic Scene Classification (ASC) task [5] in the DCASE 2022 challenge. The dataset contains 1-second audio recordings of 10 different acoustic scenes. The development set consists of audio recordings from 9 different devices: 3 real devices (A, B, C) and 6 simulated devices (S1-S6). The audio is recorded in 12 different European cities. We use 70% of the development set for training, in which only 6 devices (A, B, C, S1-S3) appear; the other 3 (S4-S6) are only seen during validation, and we refer to them as unseen devices. Similarly for the cities, 10 cities are available during training and 2 cities are available only during validation.

Preprocessing. We use a sampling rate of 32 kHz. We apply the Short Time Fourier Transformation (STFT) with a window size of 800 samples and a hop size of 320 samples (40% of the window) to generate the spectrograms. We apply a mel filter bank with 128 bins in a similar setup to [8]. The final spectrograms have 128 mel-frequency bins and 100 time frames.

Training. We train each model with a batch size of 100, CrossEntropy loss, the AdamW optimizer, a weight decay of 0.001, and a max learning rate of 1 × 10^-5 for the PaSST model and 1 × 10^-4 for the ConGater layers. We use a PaSST model pre-trained on Audioset [24]. We initialize all ConGater layers with zero weights and bias b = 5. This initialization forces the output of the ConGater layers to be approximately 1 at the beginning of training. Each model is trained for 25 epochs with the following learning rate scheduler: for the first 3 epochs, the learning rate increases exponentially to its max value; in the next 3 epochs the model is trained with the max learning rate; and finally, over the next 10 consecutive epochs, the learning rate decreases to reach 0.01 of the max value. Adversarial training is used for the ConGater layers to remove device/location information. In order to have a smooth adversarial gradient flow during training, we average the loss of three adversarial heads. No normalization, augmentation, or dataset balancing is applied, in order to be able to check the direct effect of information removal relative to the baseline model, where ω_device = 0 and ω_location = 0.

In this section, we discuss and analyze the results of our experiments. As mentioned in the previous sections, we train our proposed model, and the two independent ConGater modules are trained to remove the domain information of the device and location. More concretely, at inference time each ConGater parameter ω is used to remove the corresponding domain information from the embeddings of the network by increasing its value from zero to one. The parameters ω_device and ω_location control the intensity of the domain adaptation process for the device and the location, respectively. In our experiments, we mainly focus on the effect of ω_device on the network, namely in terms of device information leakage and model performance evaluated on all and unseen devices. In addition, we report and analyze the results when controlling both device and location domains simultaneously. To account for the possible variations in the results, we repeat each experiment three times and report the average and standard deviation of the evaluation results.

A. Device Information

In order to check whether domain information is indeed removed from the network, we evaluate the degree of information leakage using additional classifier heads. This method is commonly referred to as probing in the NLP literature [15]. Each probe is a two-layer feed-forward network with a ReLU activation function and is trained on the output embedding of the last attention layer of the network, before the task head. The probe aims to classify device labels from the embeddings. We retrain the probe for each value of ω_device. The probes are trained for 5 epochs with a learning rate of 1 × 10^-4. We calculate the balanced accuracy of the device classification on the validation set and report the average and standard deviation of the results attained from three independent runs.

Figure 2a shows that at ω_device = 0, namely the Baseline with no information removal from the model, the average balanced accuracy of the probe for the device label is 60.7%. This indicates that the embeddings of the model are highly informative about the recording device of the incoming audio. As we increase ω_device and retrain the probes, the balanced accuracy consistently drops, indicating that the device information is decreasing in the embeddings. At ω_device = 1, we reach the lowest balanced accuracy of 17.9%, showing that the ConGater module has successfully removed most of the
                         ------- Unseen devices -------   ------------------------ Seen devices -------------------------
Model                    S4         S5         S6         A          B          C          S1         S2         S3         Unseen     Overall
Baseline                 56.9±0.6   57.3±0.4   50.9±1.3   72.6±0.3   63.5±0.5   67.5±0.4   58.0±0.9   55.6±0.8   58.2±0.5   55.1±0.2   60.1±0.1
ω_d = 0.6, ω_l = 0.0     57.4±0.5   58.1±0.4   51.9±0.8   72.7±1.2   63.5±0.4   67.7±0.3   59.4±0.7   56.2±0.6   58.7±0.3   55.8±0.1   60.6±0.2
ω_d = 0.7, ω_l = 0.0     57.3±0.4   58.0±0.2   52.2±1.0   72.6±1.1   63.5±0.6   67.8±0.3   59.5±0.5   56.2±0.6   58.8±0.3   55.9±0.3   60.7±0.1
ω_d = 0.9, ω_l = 0.0     57.5±0.3   57.9±0.4   52.7±0.9   72.8±0.7   63.6±0.2   67.5±0.4   59.7±0.6   56.6±0.7   59.2±0.6   56.1±0.3   60.8±0.1
ω_d = 0.9, ω_l = 0.1     57.4±0.2   58.1±0.5   52.7±1.0   72.7±0.7   63.8±0.6   67.6±0.4   59.5±1.0   56.3±0.4   59.0±0.4   56.1±0.5   60.8±0.3
ω_d = 1.0, ω_l = 0.1     57.7±0.3   58.1±0.7   53.0±1.0   72.8±0.3   63.7±0.7   67.3±0.5   59.4±0.9   56.1±0.4   59.3±0.6   56.3±0.6   60.8±0.3
Device-specific Tuning   57.9±0.3   58.1±0.4   53.0±1.0   72.8±0.3   63.8±0.6   67.8±0.3   59.7±0.6   56.6±0.7   59.3±0.6   56.3±0.6   -
achieved in [ω_d, ω_l]   [1.0,0.1]  [0.6,0.0]  [1.0,0.1]  [1.0,0.1]  [0.9,0.1]  [0.7,0.0]  [0.9,0.0]  [0.9,0.0]  [1.0,0.1]  [1.0,0.1]  -

TABLE I: Comparison of the performance (accuracy in %, mean ± standard deviation) of our ConGater-based model on target devices under various values of the ω parameters. We group devices as Seen and Unseen, indicating whether they exist during training or only appear in evaluation. The subscripts d and l refer to device and location, respectively. Baseline is the case with ω_d = ω_l = 0.0, and Device-specific Tuning refers to dynamically finding the best performing ω combination for a specific target device.
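The device-specific tuning rows of Table I amount to picking, for each device, the (ω_d, ω_l) setting with the highest accuracy. The sketch below illustrates that selection on the mean accuracies of three devices copied from the table. The dictionary layout and the function name `best_setting` are assumptions made for this example, and the table's own tuning row may stem from a finer grid than the six settings listed.

```python
# Mean task accuracy (%) per (omega_d, omega_l) setting, copied from Table I
# for three devices. The data layout is an assumption made for illustration.
TABLE_I = {
    (0.0, 0.0): {"S6": 50.9, "S1": 58.0, "S2": 55.6},  # Baseline
    (0.6, 0.0): {"S6": 51.9, "S1": 59.4, "S2": 56.2},
    (0.7, 0.0): {"S6": 52.2, "S1": 59.5, "S2": 56.2},
    (0.9, 0.0): {"S6": 52.7, "S1": 59.7, "S2": 56.6},
    (0.9, 0.1): {"S6": 52.7, "S1": 59.5, "S2": 56.3},
    (1.0, 0.1): {"S6": 53.0, "S1": 59.4, "S2": 56.1},
}


def best_setting(device, results=TABLE_I):
    """Return ((omega_d, omega_l), accuracy) with the highest accuracy for one device."""
    setting = max(results, key=lambda s: results[s][device])
    return setting, results[setting][device]


for device in ("S6", "S1", "S2"):
    print(device, best_setting(device))
```

For these three devices the selected settings agree with the "achieved in" row of Table I: [1.0, 0.1] for S6 and [0.9, 0.0] for S1 and S2. In practice the accuracies would be measured on a small held-out subset of the target device rather than read from a table.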
information about the device, since the probe, in contrast to the Baseline, is not able to predict device labels with high accuracy. Our results show that ConGater can effectively remove domain information from the model, and the amount of information removal can be decided at inference time over a continuous range.
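The probe results above are reported as balanced accuracy. The paper does not spell out the definition; the sketch below assumes the common one, the mean of per-class recalls, and uses hypothetical labels to show why it is a stricter measure than plain accuracy when some devices dominate the validation set.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical probe outputs on an imbalanced validation set:
# device "A" dominates, and the probe always answers "A".
y_true = ["A"] * 8 + ["S1"] * 1 + ["S4"] * 1
y_pred = ["A"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 8 of 10 correct
print(balanced_accuracy(y_true, y_pred))  # recalls 1, 0, 0 averaged over 3 classes
```

Under this definition, a probe that always predicts the same device scores 1/9 ≈ 11.1% when all nine devices occur in the validation set, which gives a reference point for the 17.9% reached at ω_device = 1.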
B. Model Performance

The main purpose of domain adaptation in ASC is to achieve better performance on unseen domains by generating domain invariant feature representations. We now turn our attention to the task accuracy of the model on unseen devices (S4-S6). Figure 2b shows the task accuracy of the network on unseen devices as we increase ω_device. At the Baseline position ω_device = 0 (where no information about the device is removed), the average task accuracy on unseen devices starts at 55.1%. By increasing ω_device and removing the device information, the accuracy of the network on unseen devices continuously improves and reaches its maximum of 56.2% at ω_device = 1. The results on unseen devices indicate that increasing ω_device and removing more device information leads to improved model generalization across unseen devices. We also evaluate the overall (all devices) performance of the model with the same procedure as before. Looking at Figure 2c, we see that by increasing ω_device, task accuracy increases and peaks at ω_device = 0.9, where the average task accuracy is 60.8%, followed by a negligible drop at ω_device = 1. All in all, the results clearly indicate that by removing device information from the network we obtain embeddings that are less informative about the device, which leads to better generalization on unseen devices as well as better overall model performance².

² Extensive 2D analysis of the domains is available in the GitHub repository.

Fig. 3: Balanced accuracy of the probes, measuring the information leakage of device (a) and location (b) in our model, when increasing the device and location ω parameters. Panels: (a) Device probing, (b) Location probing. In both panels the horizontal axis is ω_location and the vertical axis is ω_device, each from 0.0 to 1.0.

C. Device Specific Tuning

In this section, we discuss the benefits of controllable information removal at inference, particularly in the scenario of finding the optimal model configuration for a specific target domain. As shown in Figure 3, and similar to previous experiments (Figure 2a), the ConGater-based model can adjust the information of both domains without additional training. Interestingly, the simultaneous removal of both factors, in the top-right of the plot (ω_device = 1 and ω_location = 1), leads to less information leakage of both device and location domains than information removal of one domain alone.

Despite the successful removal of domain-specific information, our evaluation of task accuracy shows that removing only the location (in contrast to the device) does not result in a meaningful performance improvement. However, ConGater's ability to manipulate domain information can be leveraged to enhance model performance for a specific domain. We observed that by careful selection of ω_device and ω_location, we can improve the model's performance with regard to one or several devices. This ability can be particularly advantageous when we introduce a new unknown device with a small subset of audio. Instead of re-training the whole model for a new device, we can simply select the suitable ω for
the same pre-trained model at inference without any need for further training. The selection process can be based either on expert knowledge of the source and the target domain or on empirically validating model performance on a small subset of the target domain at inference (grid search over the ConGater parameter ω). Table I shows the specific values of ω_device and ω_location and the model performance on each device. As can be seen from the table, there are specific ω_device and ω_location values which result in better performance of the same model for one particular device. For instance, even though the best ω values for unseen devices are ω_device = 1, ω_location = 0.1, we can see that for device S2 the model performs 0.5 percentage points better on average at ω_device = 0.9, ω_location = 0.0. Note that with device-specific tuning, it is possible to optimize for a specific device, albeit at the expense of the performance on other devices.

IV. CONCLUSION

In this paper, we adopt ConGater from the NLP domain and implement it on audio spectrogram transformers. Training ConGater on an ASC dataset for multi-device domain adaptation, we observe that our model using ConGater modules is effective at continuous device/location information removal from the network embedding at inference time. We observe that by increasing the sensitivity parameter ω_device, the embeddings of the network effectively lose device information (domain labels), as shown by the decrease in the balanced accuracy of the probes. We observe an improvement in both unseen-device accuracy (1.1 percentage points) and overall accuracy (0.7 percentage points) for an already trained model by adjusting ω_device from 0 to 1. This observation indicates that removing device information leads to a more robust embedding for unseen devices. We observe that removing location information alone does not significantly improve the task accuracy or the accuracy on unseen devices. Finally, we demonstrate that a suitable selection of the ConGater hyper-parameters for location and device leads to device-specific performance improvement.

REFERENCES

[1] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, vol. 312, pp. 135–153, 2018.
[2] H. Venkateswara and S. Panchanathan, Domain adaptation in computer vision with deep learning. Springer, 2020.
[3] B. Kim, S. Yang, J. Kim, H. Park, J. Lee, and S. Chang, “Domain generalization with relaxed instance frequency-wise normalization for multi-device acoustic scene classification,” in Interspeech, 2022.
[4] D. Jin, Z. Jin, Z. Hu, O. Vechtomova, and R. Mihalcea, “Deep learning for text style transfer: A survey,” Computational Linguistics, vol. 48, no. 1, pp. 155–205, 2022.
[5] I. Martín-Morató, F. Paissan, A. Ancilotto, T. Heittola, A. Mesaros, E. Farella, A. Brutti, and T. Virtanen, “Low-complexity acoustic scene classification in DCASE 2022 challenge,” in Proceedings of the 7th Workshop on Detection and Classification of Acoustic Scenes and Events 2022, DCASE 2022, Nancy, France, November 3-4, 2022, 2022.
[6] K. Koutini, F. Henkel, H. Eghbal-zadeh, and G. Widmer, “CP-JKU submissions to DCASE’20: Low-complexity cross-device acoustic scene classification with rf-regularized cnns,” Tech. Rep., DCASE2020 Challenge, 2020.
[7] K. Koutini, J. Schlüter, H. Eghbal-zadeh, and G. Widmer, “Efficient training of audio transformers with patchout,” Interspeech, 2022.
[8] F. Schmid, S. Masoudian, K. Koutini, and G. Widmer, “Knowledge distillation from transformers for low-complexity acoustic scene classification,” in Proceedings of the 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France, November 2022.
[9] J.-H. Lee, J.-H. Choi, P. M. Byun, and J.-H. Chang, “Multi-scale architecture and device-aware data-random-drop based fine-tuning method for acoustic scene classification,” in Proceedings of the 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France, November 2022.
[10] K. Drossos, P. Magron, and T. Virtanen, “Unsupervised adversarial domain adaptation based on the wasserstein distance for acoustic scene classification,” in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2019.
[11] S. Gharib, K. Drossos, E. Cakir, D. Serdyuk, and T. Virtanen, “Unsupervised adversarial domain adaptation for acoustic scene classification,” in Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2018, Surrey, UK, November 19-20, 2018, M. D. Plumbley, C. Kroos, J. P. Bello, G. Richard, D. P. W. Ellis, and A. Mesaros, Eds., 2018.
[12] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, “Transfer learning in natural language processing,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials. Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 15–18.
[13] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The journal of machine learning research, vol. 17, no. 1, pp. 2096–2030, 2016.
[14] D. Kashyap, S. K. Aithal, C. Rakshith, and N. Subramanyam, “Towards domain adversarial methods to mitigate texture bias,” in ICML 2022: Workshop on Spurious Correlations, Invariance and Stability.
[15] D. Kumar, O. Lesota, G. Zerveas, D. Cohen, C. Eickhoff, M. Schedl, and N. Rekabsaz, “Parameter-efficient modularised bias mitigation via adapterfusion,” in Proceeding of 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023.
[16] H. Yuan, J. Zheng, Q. Ye, Y. Qian, and Y. Zhang, “Improving fake news detection with domain-adversarial and graph-attention neural network,” Decision Support Systems, vol. 151, p. 113633, 2021.
[17] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP,” in International Conference on Machine Learning, vol. 97. Proceedings of Machine Learning Research, 09–15 Jun 2019.
[18] A. Lauscher, T. Lueken, and G. Glavaš, “Sustainable modular debiasing of language models,” in Findings of the Association for Computational Linguistics: EMNLP 2021. Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021.
[19] L. Hauzenberger and N. Rekabsaz, “Parameter efficient diff pruning for bias mitigation,” arXiv preprint arXiv:2205.15171, 2022.
[20] J. M. Meissner, S. Sugawara, and A. Aizawa, “Debiasing masks: A new framework for shortcut mitigation in NLU,” in Proceeding of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
[21] S. Masoudian, O. Lesota, M. Schedl, and N. Rekabsaz, “Controllable attribute removal for continuous bias mitigation at inference time,” 2023, preprint under review. [Online]. Available: https://openreview.net/pdf?id=aQ1gwltnlP
[22] F. Schmid, K. Koutini, and G. Widmer, “Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation,” arXiv preprint arXiv:2211.04772, 2022.
[23] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[24] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017.
[25] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2019.
