SILENCE IS SWEETER THAN SPEECH:

SELF-SUPERVISED MODEL USING SILENCE TO STORE SPEAKER INFORMATION

Chi-Luen Feng, Po-chun Hsu, Hung-yi Lee

National Taiwan University

ABSTRACT

Self-Supervised Learning (SSL) speech models achieve decent performance on a wide range of downstream tasks. However, how SSL models store various information in hidden representations is still poorly understood. In this paper, we focus on speaker information and find that the SSL model stores speaker information in representations whose positions correspond to silences in a waveform. There are several pieces of evidence. (1) We find that the silence part of an utterance contributes more to the speaker identification (SID) task. (2) If we only use the representation of one part of the utterance for SID, the silence part yields higher accuracy than the other parts. (3) According to gradient saliency analysis, the silence part has a greater influence on the results of SID. Our findings not only contribute to a better understanding of SSL models but also improve performance. By simply adding silence to waveforms with little silence, HuBERT improves its performance on speaker-related tasks by nearly 5%.

Index Terms— Speech Self-Supervised Learning, Representation Learning, Speaker Identification

1. INTRODUCTION

Self-Supervised Learning (SSL), which utilizes unlabeled data to learn, has achieved many milestones in different fields. In the speech field, several outstanding SSL models have been proposed in recent years [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. We can leverage representations from these SSL models with a simple downstream model and perform well in different speech tasks [15]. This shows that SSL models have universal characteristics. The representations integrate different information, such as content, semantic, and speaker information, into low-dimensional representations. Despite the achievements of SSL models in various speech tasks, there is only limited research and analysis on SSL models and their representations. Previous works focus mainly on the performance of different layers or types of models [16, 17, 18, 19]. There is currently little research on how SSL models store different aspects of speech information in their representations without interfering with each other.

In this work, we choose speaker information as the entry point to demystify the SSL model. We adopt the Speaker Identification (SID) task and the Automatic Speaker Verification (ASV) task to measure speaker information in the representations. Instead of layer-wise analysis as in previous works, we propose to analyze the speaker information through the position in the waveform. We find that the SSL model tends to store speaker information in the last fragment, which contains more silence.

We hypothesize that SSL models store speaker information in the representations corresponding to silence. To gather more evidence, we add silence to utterances, split them into fragments, and use only one fragment at a time to train the SID task. We find that the fragment corresponding to the silence part obtains the highest score among all fragments. Moreover, the outcome of the gradient saliency analysis shows that silence has a great influence on the speaker-related task. All the evidence points out that silence helps the SSL model store speaker information.

We then further analyze the relationship between silence percentage and SID accuracy and find that the amount of silence in an utterance is highly correlated with its accuracy on the SID task. This result is yet more evidence that silence is related to the speaker information of the SSL model. It further suggests that the SID/ASV performance of low-silence utterances can be improved by adding silence. This simple approach does lead to better performance on SID/ASV tasks.

To our knowledge, this is the first work to investigate the hypothesis that SSL models store different speech information in representations corresponding to different parts of an utterance. Although we only focus on speaker information here, this work opens up a new direction for analyzing SSL representations. It is a critical step toward uncovering how SSL models store different information in their representations.

2. RELATED WORK

Following the success of SSL models in recent years, more and more papers are trying to analyze the key to their success. However, the current papers that analyze SSL models mainly focus on the text field [20, 21, 22, 23, 24, 25, 26].
Fig. 1. (a) Experiment structure in Section 3 and Section 4.1. The black dotted squares represent the silence fragment, which is added only in Section 4.1. The input of the downstream model is the position-wise weighted sum of the HuBERT representations. (b) Experiment structure in Section 4.2. The downstream model takes only one fragment as input at a time.

For example, [27] analyzes how attention maps in BERT process linguistic knowledge. As for [28], it focuses on how the SSL model learns language information during the training process. There are still few papers contributing to the study of SSL models in the speech field.

There are two research directions in the limited literature analyzing speech SSL models. The first direction is comparing speech SSL models by training criteria. [16] shows that training criteria have a strong relationship with the performance of downstream tasks. The second direction is to study the SSL model from the perspective of content. For example, [17] uses Canonical Correlation Analysis (CCA) to investigate which information is encoded by each layer of the SSL model. Inspired by these prior works, we want to explore the logic behind the SSL model. However, unlike previous works, we mainly focus on investigating how the information is stored in the representations of the SSL model and how the input waveform interacts with the SSL model. To our knowledge, this is the first work to investigate the hypothesis that SSL models store different speech information in representations corresponding to different parts of an utterance.

3. WHERE IS THE SPEAKER INFORMATION

Some previous works have shown that SSL models tend to store speaker information in the middle layers [18, 19]. However, how the information is stored in different positions within an utterance remains unexplored. In this section, we first analyze the representations from different positions for the SID task.

The setup of the SID task here is modified from SUPERB [15]. We follow the setting in [15] to split the VoxCeleb dataset [29] into training and testing sets. We use HuBERT as the upstream model. Unlike the experiments in the SUPERB benchmark, to analyze each layer, we use a single layer representation instead of the weighted sum of all layers.

Fig. 2. Norm weights of the SID task for each fragment. The l-th line represents using the representations extracted from the l-th layer of HuBERT Base to train the SID task. There are 12 lines corresponding to the 12 layers in HuBERT Base.

In the original setting of the SID task in SUPERB, the downstream model is a pooling layer and a linear layer. Here we modify the pooling layer to further investigate whether SSL models store speaker information in all representations equally. The modified SID process is shown in Figure 1(a). We first equally segment the sequence of representations from a specific layer in the upstream model into 10 fragments. We do mean pooling among the frame-level representations in each fragment to obtain a representation for each fragment f_i (i = 1 to 10). Then we assign a learnable weight w_i to the i-th fragment, which is learned with the downstream model. We multiply the representation of each fragment by the learnable weight, and all the fragment representations are added to obtain a single representation f for the whole utterance:

f = \sum_{i=1}^{10} w_i f_i    (1)

Such a design allows us to observe whether representations in different positions of an utterance have various contributions to SID.
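As a rough PyTorch sketch of this position-wise pooling (illustrative names only, not the original implementation), assuming frame-level features from a single HuBERT layer are already available as a (T, D) tensor:

    import torch
    import torch.nn as nn

    class FragmentWeightedPooling(nn.Module):
        """Split an utterance into N fragments, mean-pool each one,
        and combine them with learnable per-fragment weights, as in Eq. (1)."""

        def __init__(self, num_fragments: int = 10):
            super().__init__()
            self.num_fragments = num_fragments
            # one learnable scalar weight w_i per fragment
            self.weights = nn.Parameter(torch.ones(num_fragments))

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (T, D) frame-level representations of one utterance
            chunks = torch.tensor_split(frames, self.num_fragments, dim=0)
            # f_i: mean of the frames inside the i-th fragment
            frag_reps = torch.stack([c.mean(dim=0) for c in chunks])   # (N, D)
            # f = sum_i w_i * f_i
            return (self.weights.unsqueeze(1) * frag_reps).sum(dim=0)  # (D,)

The pooled vector f is then fed to the linear SID classifier; the learned weights w_i are what the analysis below inspects.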
Fig. 3. Norm weights of the SID task for the shuffle experiment. We shuffle the waveform fragments before feeding them to HuBERT Base. The x-axis indexes represent the original fragment numbers in Figure 2.

Fig. 4. The percentage of silence in each fragment over the entire VoxCeleb dataset. The lighter the color, the less silence. The first fragment has the least silence.

Fig. 5. (a) Norm weights of the SID task with added silence. Two SSL models are used in the experiment: HuBERT Base and HuBERT Large. (b) Norm weights of the ASV task with added silence. HuBERT is used in the ASV setting. On the x-axis, the red S represents the silent fragment. Red lines are silence added at the front, blue lines are silence added in the middle, and green lines are silence added at the end.
If a specific fragment has a larger weight than the others, we can hypothesize that it is more critical to the SID task.

However, the absolute value of a weight alone does not always indicate that the corresponding fragment is critical. If a fragment's representation has small values, its weight may become large to balance the small values in the representation. Therefore, we consider the values of the weights and the representations together and introduce the norm weight w̄_i to indicate the importance of a fragment. The main idea of the norm weight w̄_i is to multiply the learnable weight w_i of each fragment by the L2 norm of the fragment's representation ∥f_i∥:

w̄_i = w_i ∥f_i∥    (2)
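A minimal sketch of how the norm weights of Eq. (2) could be read out of the hypothetical pooling module above (again our illustration, not released code); the per-utterance normalization matches the description below.

    import torch

    def norm_weights(pooling: "FragmentWeightedPooling", frames: torch.Tensor) -> torch.Tensor:
        """Return normalized norm weights w_i * ||f_i|| for one utterance."""
        chunks = torch.tensor_split(frames, pooling.num_fragments, dim=0)
        frag_reps = torch.stack([c.mean(dim=0) for c in chunks])      # (N, D)
        bar_w = pooling.weights.detach() * frag_reps.norm(dim=1)      # w_i * ||f_i||
        return bar_w / bar_w.sum()   # normalize the sum within the utterance to 1

Averaging these per-utterance vectors over the test set gives the fragment-wise curves reported in Figure 2.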
Figure 2 shows the average of the norm weights w̄_i of the fragments across the testing set. For a fair comparison between utterances, we normalize the sum of all norm weights in the same utterance to 1 before taking the average. We find that the last fragment of the waveform always contributes the most among all the fragments, while the first fragment contributes the least. To investigate the reason behind this phenomenon, we utilize the librosa toolkit to measure the silence percentage of each fragment. We test all waveforms in VoxCeleb and average them. The result is shown in Figure 4, in which the numbers represent the silence percentage of each fragment. Based on Figure 4, we find that the last fragment contains a higher portion of silence than the other fragments, and the first fragment has the lowest portion of silence.

To further verify that this observation is caused by the content rather than the order of the fragments, we randomly perturb the order of the fragments in each testing utterance and evaluate them with the same pipeline. Figure 3 shows the results. All utterances undergo the same perturbation. After the perturbation, the original last fragment (fragment 10) is swapped to the middle of the utterance while still having the highest norm weight. The other perturbed fragments also retain weights close to their original values, showing that it is not the position but the content of a segment that affects its norm weight. Based on this finding, we hypothesize that silence fragments may be more related to the storage of speaker information.

4. WHERE CAN WE FIND THE SPEAKER? SILENCE

At the end of Section 3, we hypothesized that silence correlates with speaker information. In this section, we carry out more experiments to verify their relationship. In Sections 4.1 and 4.2, we insert silence into waveforms and show the importance of the silence part to the SID task. After that, we show silence's contribution to the final result by gradient saliency in Section 4.3.

4.1. Silence is Important for the SSL Model to Store Speaker Information

To measure the importance of silence to speaker information, we add a silence fragment to the waveform, whose length is 1/10 of the original waveform. Here the silence fragments have all-zero values. The positions where we insert the silence fragment into the waveform are the front, middle, and end of the original waveform, respectively. After that, following the analysis method in Section 3, we segment a representation sequence into 11 fragments (10 corresponding to the original waveform and 1 corresponding to the silence part we insert) and do the same analysis as in Section 3.
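The insertion itself happens on the raw waveform before it enters the SSL model. A small NumPy sketch under the assumptions above (an all-zero fragment whose length is 1/10 of the utterance; the function name is ours):

    import numpy as np

    def insert_silence(wav: np.ndarray, position: str = "front", ratio: float = 0.1) -> np.ndarray:
        """Insert an all-zero fragment whose length is `ratio` of the waveform."""
        silence = np.zeros(int(len(wav) * ratio), dtype=wav.dtype)
        if position == "front":
            return np.concatenate([silence, wav])
        if position == "end":
            return np.concatenate([wav, silence])
        mid = len(wav) // 2                    # "middle": split the waveform in half
        return np.concatenate([wav[:mid], silence, wav[mid:]])

The padded waveform is then passed through HuBERT, and the resulting representation sequence is segmented into 11 fragments as described above.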
Table 1. SID accuracy using different fragments of HuBERT representations. The first column is the fragment ID, including the silence fragment (S). The second and third columns correspond to silence inserted at the front and at the end of the input waveform, respectively. The fourth column is the waveform without added silence; the percentage in parentheses is the difference in accuracy between No Silence and Silence Front. We use McNemar's test to check whether there is a significant difference between each pair of Silence Front and No Silence, and † denotes that the results are significantly different.

F_i | Silence Front | Silence End | No Silence
S   | 0.499         | X           | X
1   | 0.337         | 0.382       | 0.418 (+18%)†
2   | 0.364         | 0.364       | 0.418 (+12%)†
3   | 0.374         | 0.415       | 0.440 (+14%)†
4   | 0.376         | 0.425       | 0.425 (+11%)†
5   | 0.378         | 0.425       | 0.434 (+12%)†
6   | 0.373         | 0.428       | 0.430 (+13%)†
7   | 0.380         | 0.431       | 0.439 (+13%)†
8   | 0.400         | 0.433       | 0.423 (+5%)†
9   | 0.410         | 0.430       | 0.429 (+4%)
10  | 0.416         | 0.354       | 0.423 (+1%)
S   | X             | 0.536       | X

Table 2. Gradient saliency statistics for different models and settings. In each cell, the first number is the sum of the silence fragment gradient divided by the sum of the non-silence fragment gradient. The second number is for the condition in which silence is NOT added to the waveforms; for comparison with the first number, we split the waveforms into two partitions and divide the sum of the gradient of one partition by the sum of the remaining gradient. For example, the second number of the Front - 1/5 cell takes the sum of the first 1/5 of the gradient divided by the sum of the remaining 4/5 of the gradient.

Silence position with length | HuBERT Base | HuBERT Large | wav2vec2
Front - 1/5  | 0.984/0.228 | 0.605/0.202 | 0.973/0.212
Front - 1/10 | 0.413/0.130 | 0.223/0.107 | 0.502/0.125
Front - 1/20 | 0.363/0.072 | 0.203/0.057 | 0.281/0.031
End - 1/5    | 0.829/0.200 | 0.626/0.202 | 1.295/0.202
End - 1/10   | 0.496/0.101 | 0.362/0.100 | 0.613/0.102
End - 1/20   | 0.316/0.052 | 0.196/0.048 | 0.331/0.038
We hope to confirm whether the silence fragments contribute more to the SID task by observing the norm weights. The dataset we use is again the test set of VoxCeleb, as in Section 3. To confirm that our results are general, we use two models in this section: HuBERT Base and HuBERT Large. Furthermore, we use the weighted sum of the outputs of each layer as the representation of these two models.

The outcome is shown in Figure 5. We find that no matter whether we insert silence at the front, in the middle, or at the end of the original waveform, the norm weight of the silence fragment is always the largest, which means that the downstream model mainly uses this fragment to classify the speakers. We also do the same analysis on Automatic Speaker Verification (ASV). The setup of the ASV task here is the same as in SUPERB, except that the pooling layer is modified as in Section 3. Moreover, we only test the HuBERT model. We observe the same trend in the ASV task, which is also shown in Figure 5. The result shows that the representations corresponding to silence fragments store more speaker information than the other representations.

4.2. Silence Fragment: Best Fragment to Extract Speaker Information

In Section 4.1, the norm weights show that the silence fragments are essential for speaker information. This section provides more experiments to verify that the silence fragments store more speaker information than the others. This subsection also separates a representation sequence into multiple fragments, as in the previous experiments. Still, unlike the earlier experiments, we only pick one of the fragments as the input of the SID downstream model. Besides, we pick HuBERT as our upstream model and use the weighted sum of the outputs of each layer as the representation. The process of the experiments in this subsection is shown in Figure 1(b).

The results are shown in Table 1. In the columns labelled "Silence Front" and "Silence End", we added the silence at the front or the end of the original waveform, respectively. Then we separate the representations into 11 fragments (10 corresponding to the original waveform and 1 corresponding to the silence part we insert). The silence fragment outperforms the other fragments by about 10% accuracy, no matter whether silence is added at the front or at the end. This means that the silence fragments contain more speaker information, so their SID performance is better than that of the other fragments.

In the column labeled "No Silence", we do not add silence to the original waveform. After getting the representations, we separate a representation sequence into 10 fragments and likewise pick one of them to train the downstream model. Compared to the accuracies of the non-silence fragments in the waveform without added silence (the fourth column), those with silence inserted at the beginning (the second column) are lower. The results show that the silence fragments contain the speaker information and aggregate speaker information from the other fragments. The observations suggest that adding silence can serve as an alternative way of disentangling speaker and content information. We leave this idea as future work.
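A sketch of this single-fragment probing (Figure 1(b)), reusing the hypothetical fragment split from the earlier snippets; only the mean-pooled representation of one chosen fragment is passed to the linear SID classifier.

    import torch

    def fragment_representation(frames: torch.Tensor, k: int, num_fragments: int = 11) -> torch.Tensor:
        """Mean-pooled representation of the k-th of `num_fragments` fragments.
        With inserted silence, one of the 11 fragments corresponds to the silence part."""
        chunks = torch.tensor_split(frames, num_fragments, dim=0)
        return chunks[k].mean(dim=0)           # this vector alone is fed to the SID linear layer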
Fig. 6. Sample result for gradient saliency. The result is extracted by HuBERT Base. (a) Silence of 1/10 length added at the front; (b) silence of 1/10 length added at the end; (c) nothing added.
4.3. Gradient Saliency Analysis

From the previous sections, we already know the importance of silence to speaker information. In order to understand whether the silence fragment affects the final results, we introduce gradient saliency. Gradient saliency measures the gradient, i.e., the derivative of the final output with respect to each sample of the input wav file. The value indicates the impact of the sample on the output result. The larger the gradient norm, the greater the influence.

In the experiment setting, we adopt VoxCeleb to obtain the gradient of each wav file. We then normalize it, which means each gradient is divided by the sum of all gradients. A sample result is shown in Figure 6. For visualization, we split the wav into 100 fragments. We first do the normalization and then aggregate the gradient in each fragment. It is obvious that the gradient accumulates in the silence fragment when we add additional silence. On the other hand, the gradient is distributed all over the wav file without additional silence.

We also show the statistics over the whole dataset in Table 2. We test three different SSL models for gradient saliency. Each cell has two numbers. The first number is the sum of the silence fragment gradient divided by the sum of the non-silence fragment gradient. For example, in the Front - 1/5 setting, we take the first 1/6 of the gradient, which is the gradient of the silence part, and divide it by the remaining gradient. The second number corresponds to the condition in which silence is not added to the wavs; for comparison with the first number, we split the wav into two partitions and divide the sum of the gradient of one partition by the sum of the remaining gradient. For example, the second number of the Front - 1/5 cell takes the sum of the first 1/6 of the gradient divided by the sum of the remaining 5/6. The results are averaged across the dataset. We can clearly see that in every setting, the first number is far larger than the second, which means the silence fragment contributes most of the gradient compared to the original setting. This result shows that the silence fragment does have a large impact on the final result.
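As a minimal sketch of the saliency computation (our illustration, assuming a differentiable pipeline model(wav) that maps a raw waveform tensor to SID logits; the function name and fragment count are only for illustration):

    import torch

    def gradient_saliency(model, wav: torch.Tensor, target: int, n_frag: int = 100) -> torch.Tensor:
        """Per-fragment share of the input gradient for the target speaker logit."""
        wav = wav.detach().clone().requires_grad_(True)
        logit = model(wav)[target]             # scalar output for the target speaker
        logit.backward()
        g = wav.grad.abs()
        g = g / g.sum()                        # normalize so the gradients sum to 1
        # aggregate the normalized gradient inside each of the 100 fragments
        return torch.stack([c.sum() for c in torch.tensor_split(g, n_frag)])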

Fig. 7. Relationship between the percentage of silence in a waveform and the accuracy of the SID task.

5. SILENCE CAN HELP TO IMPROVE THE SID TASK

5.1. Amount of Silence vs. SID Task Performance

After investigating the relationship between silence and speaker information, we want to know whether the amount of silence in an utterance affects its SID accuracy. If silence indeed has some relationship with speaker information, the amount of silence and the accuracy on the SID task should be positively correlated. In this experiment, we use the test set of VoxCeleb as the measurement object. We use the librosa toolkit to measure, for each utterance, the ratio of the length of speech below 10 dB to the length of the original speech. The final result is shown in Figure 7. We find that if the silence ratio is less than a threshold of about 5%, performance is reduced by about 30% to 50% compared to the other cases. With the results of this experiment, we observe that silence strongly correlates with the accuracy of the SID task. This means that the silent part of the waveform is indeed related to the speaker's information.
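One way to obtain such a silence ratio with librosa (a sketch of the measurement, not necessarily the exact thresholding used here) is to treat everything outside the detected non-silent intervals as silence:

    import librosa

    def silence_ratio(path: str, top_db: float = 10.0) -> float:
        """Fraction of samples lying outside librosa's non-silent intervals."""
        wav, sr = librosa.load(path, sr=16000)
        intervals = librosa.effects.split(wav, top_db=top_db)   # (start, end) pairs of non-silent audio
        voiced = sum(int(end - start) for start, end in intervals)
        return 1.0 - voiced / len(wav)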
5.2. Adding Silence: The Key to Improving Performance on Speaker-Related Tasks

In Section 5.1, we already saw that there should be enough silence in an utterance to achieve good SID performance. This inspired a straightforward way to improve the accuracy of speaker-related tasks: adding silence to the utterances without sufficient silence.
Table 3. The improvement for utterances with low silence. For each model, we test three settings; "X-Y" means that the utterance's silence percentage falls in this range. The evaluation metric for SID/ASV is accuracy/EER.

Dataset / task | Silence position with length | HuBERT Base (0-5 / 5-10 / 10-15) | HuBERT Large (0-5 / 5-10 / 10-15) | wav2vec2 (0-5 / 5-10 / 10-15)
VoxCeleb SID | Front - 1/5  | +0.02 / +0.01 / -0.06 | +0.02 / +0.02 / +0.01 | +0.04 / +0.03 / +0.01
VoxCeleb SID | Front - 1/10 | +0.03 / +0.02 / +0.01 | +0.02 / +0.01 / +0.01 | +0.02 / +0.04 / +0.02
VoxCeleb SID | Front - 1/20 | +0.05 / +0.02 / -0.01 | +0.03 / +0.02 / +0.01 | +0.04 / +0.04 / +0.03
VoxCeleb SID | End - 1/5    | +0.01 / +0.02 / -0.07 | +0.01 / +0.01 / +0.04 | +0.03 / +0.02 / -0.06
VoxCeleb SID | End - 1/10   | +0.01 / +0.03 / +0.02 | -0.01 / +0.01 / -0.03 | -0.01 / +0.03 / -0.04
VoxCeleb SID | End - 1/20   | +0.02 / +0.02 / +0.01 | +0.05 / +0.03 / +0.03 | +0.04 / +0.03 / +0.02
SITW ASV | Front - 1/5  | +0.01 / +0.01 / -0.04 | +0.02 / +0.03 / +0.02 | +0.02 / +0.00 / +0.01
SITW ASV | Front - 1/10 | +0.02 / +0.02 / +0.01 | +0.02 / +0.02 / +0.02 | +0.03 / +0.02 / -0.01
SITW ASV | Front - 1/20 | +0.05 / +0.04 / +0.00 | +0.03 / +0.02 / -0.01 | +0.04 / +0.04 / +0.02
SITW ASV | End - 1/5    | -0.01 / +0.00 / -0.01 | +0.01 / +0.01 / -0.04 | +0.02 / +0.01 / -0.01
SITW ASV | End - 1/10   | +0.01 / +0.02 / +0.04 | +0.01 / +0.02 / -0.01 | +0.03 / +0.01 / -0.03
SITW ASV | End - 1/20   | +0.02 / +0.02 / +0.04 | +0.02 / +0.01 / -0.00 | +0.03 / +0.05 / +0.01
VOiCES ASV | Front - 1/5  | +0.01 / +0.01 / -0.06 | +0.04 / +0.02 / -0.05 | +0.02 / +0.00 / +0.01
VOiCES ASV | Front - 1/10 | +0.04 / -0.01 / -0.04 | +0.02 / +0.01 / -0.02 | +0.03 / +0.02 / +0.01
VOiCES ASV | Front - 1/20 | +0.02 / +0.03 / +0.02 | +0.04 / +0.02 / -0.03 | +0.05 / +0.03 / +0.01
VOiCES ASV | End - 1/5    | +0.03 / +0.01 / -0.01 | +0.03 / -0.07 / -0.03 | +0.03 / +0.04 / -0.01
VOiCES ASV | End - 1/10   | +0.02 / +0.01 / +0.04 | +0.03 / +0.02 / +0.00 | +0.03 / +0.02 / -0.01
VOiCES ASV | End - 1/20   | +0.03 / +0.01 / +0.00 | +0.04 / +0.02 / +0.05 | +0.04 / -0.00 / -0.02

This approach only modifies the preprocessing when an utterance enters the SSL model during inference, and it is unrelated to the pre-training and fine-tuning stages. Here we show that utterances with insufficient silence can benefit from this simple preprocessing.

We choose three target datasets: VoxCeleb, SITW [30], and VOiCES [31]. For the SITW dataset, we follow previous work and use the core-core setting of the eval set. As for VOiCES, we use the eval set of the VOiCES 2019 challenge.¹ We use the librosa toolkit to identify the silence amount of each utterance. We consider three groups of utterances, in which the silence amounts are 0%-5%, 5%-10%, and 10%-15%, to measure whether adding silence can help. Besides, the place we add silence is the front or the end, and the length of the silence is equal to 1/5, 1/10, or 1/20 of the length of the original waveform. We select HuBERT Base, HuBERT Large, and wav2vec2 as testing models.

¹ For the SITW and VOiCES datasets, we directly utilise the models trained on VoxCeleb to extract the features from the SITW and VOiCES enrollment data. The features are the model's final outputs, which have 1,251 dimensions.

The results are shown in Table 3. In almost all the settings, adding silence yields an improvement. Moreover, we find that when the silence length is equal to 1/20, the performance is better than in the other settings. Besides, most of the 0%-5% settings improve more than the 5%-10% and 10%-15% settings. This means that an utterance can be improved even more when the amount of silence in the original utterance is very small. This result tells us that an utterance needs enough silence to achieve good performance and also shows that silence plays a pivotal role in speaker information.
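Put together, the inference-time fix amounts to a few lines of preprocessing; a sketch under the assumptions above (the 5% threshold and a 1/20-length zero pad at the front; the silence-ratio measurement and padding helper are the hypothetical functions sketched earlier, not an official API):

    import numpy as np

    def pad_if_low_silence(wav: np.ndarray, sil_ratio: float,
                           threshold: float = 0.05, pad_ratio: float = 0.05) -> np.ndarray:
        """Prepend zeros of 1/20 the utterance length when the silence ratio is below 5%."""
        if sil_ratio < threshold:
            silence = np.zeros(int(len(wav) * pad_ratio), dtype=wav.dtype)
            return np.concatenate([silence, wav])
        return wav                             # utterances with enough silence are left unchanged

Only the waveform fed to the upstream model changes; neither the SSL model nor the downstream classifier is retrained.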
6. CONCLUSION

In this work, we utilize SID/ASV probing tasks and the HuBERT model to understand how the SSL model processes and stores the speaker information in the waveform. We use the upstream-downstream model structure to carry out probing tasks for speaker information. We fix the HuBERT model as our upstream model and use a simple linear layer as our downstream model. We verify through a series of experiments that speaker information is stored in the silence fragment. Besides, the data show that when the waveform's silence ratio is lower than 5%, the accuracy of the SID task decreases by about 30% to 50%. Last but not least, when some silence is added to the original waveform during inference, SID and ASV performance can increase by up to 5% without fine-tuning the upstream model. These findings help us gain more insight into how SSL models process speaker information and can give others new ideas when dealing with speaker-related tasks. In future directions, we will investigate more SSL models with different structures and criteria to know how general the findings in this paper are. Moreover, we will also examine whether other speech attributes are stored according to the position in the input in the same way as speaker information.
7. ACKNOWLEDGEMENTS

We are grateful to Abdelrahman Mohamed for his comments and discussions of this paper. Besides, we would like to thank Cheng-Kuang (CK) Lee of the NVIDIA AI Technology Center Taiwan for his help with the GPU computing resources on NVIDIA GPU Cloud (NGC).

8. REFERENCES

[1] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” CoRR, vol. abs/2106.07447, 2021.

[2] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” CoRR, vol. abs/2006.11477, 2020.

[3] Alexei Baevski, Steffen Schneider, and Michael Auli, “vq-wav2vec: Self-supervised learning of discrete speech representations,” CoRR, vol. abs/1910.05453, 2019.

[4] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli, “wav2vec: Unsupervised pre-training for speech recognition,” CoRR, vol. abs/1904.05862, 2019.

[5] Andy T. Liu, Shu-Wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee, “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,” CoRR, vol. abs/1910.12638, 2019.

[6] Shaoshi Ling and Yuzong Liu, “Decoar 2.0: Deep contextualized acoustic representations with vector quantization,” ArXiv, vol. abs/2012.06659, 2020.

[7] Aäron van den Oord, Yazhe Li, and Oriol Vinyals, “Representation learning with contrastive predictive coding,” CoRR, vol. abs/1807.03748, 2018.

[8] Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James R. Glass, “An unsupervised autoregressive model for speech representation learning,” CoRR, vol. abs/1904.03240, 2019.

[9] Yu-An Chung, Hao Tang, and James Glass, “Vector-quantized autoregressive predictive coding,” arXiv preprint arXiv:2005.08392, 2020.

[10] Po-Han Chi, Pei-Hung Chung, Tsung-Han Wu, Chun-Cheng Hsieh, Yen-Hao Chen, Shang-Wen Li, and Hung-yi Lee, “Audio albert: A lite bert for self-supervised learning of audio representation,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 344–350.

[11] Andy T. Liu, Shang-Wen Li, and Hung-yi Lee, “Tera: Self-supervised learning of transformer encoder representation for speech,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2351–2366, 2021.

[12] Shaoshi Ling, Yuzong Liu, Julian Salazar, and Katrin Kirchhoff, “Deep contextualized acoustic representations for semi-supervised speech recognition,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6429–6433.

[13] Heng-Jui Chang, Shu-Wen Yang, and Hung-yi Lee, “Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit BERT,” CoRR, vol. abs/2110.01900, 2021.

[14] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli, “data2vec: A general framework for self-supervised learning in speech, vision and language,” CoRR, vol. abs/2202.03555, 2022.

[15] Shu-Wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee, “SUPERB: speech processing universal performance benchmark,” CoRR, vol. abs/2105.01051, 2021.

[16] Yu-An Chung, Yonatan Belinkov, and James Glass, “Similarity analysis of self-supervised speech representations,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3040–3044.

[17] Ankita Pasad, Ju-Chieh Chou, and Karen Livescu, “Layer-wise analysis of a self-supervised speech representation model,” CoRR, vol. abs/2107.04734, 2021.

[18] Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, and Xiangzhan Yu, “Unispeech-sat: Universal speech representation learning with speaker aware pre-training,” CoRR, vol. abs/2110.05752, 2021.

[19] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” arXiv preprint arXiv:2110.13900, 2021.
[20] Jesse Vig and Yonatan Belinkov, “Analyzing the structure of attention in a transformer language model,” CoRR, vol. abs/1906.04284, 2019.

[21] Yongjie Lin, Yi Chern Tan, and Robert Frank, “Open sesame: Getting inside bert’s linguistic knowledge,” CoRR, vol. abs/1906.01698, 2019.

[22] Jesse Vig, “A multiscale visualization of attention in the transformer model,” CoRR, vol. abs/1906.05714, 2019.

[23] Yaru Hao, Li Dong, Furu Wei, and Ke Xu, “Visualizing and understanding the effectiveness of BERT,” CoRR, vol. abs/1908.05620, 2019.

[24] Shikhar Vashishth, Shyam Upadhyay, Gaurav Singh Tomar, and Manaal Faruqui, “Attention interpretability across NLP tasks,” CoRR, vol. abs/1909.11218, 2019.

[25] Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky, “Revealing the dark secrets of BERT,” CoRR, vol. abs/1908.08593, 2019.

[26] Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov, “Exploiting redundancy in pre-trained language models for efficient transfer learning,” CoRR, vol. abs/2004.04010, 2020.

[27] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning, “What does BERT look at? an analysis of bert’s attention,” CoRR, vol. abs/1906.04341, 2019.

[28] David Cheng-Han Chiang, Sung-Feng Huang, and Hung-yi Lee, “Pretrained language model embryology: The birth of ALBERT,” CoRR, vol. abs/2010.02480, 2020.

[29] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Voxceleb: a large-scale speaker identification dataset,” CoRR, vol. abs/1706.08612, 2017.

[30] Mitchell McLaren, Luciana Ferrer, Diego Castán, and Aaron D. Lawson, “The speakers in the wild (sitw) speaker recognition database,” in INTERSPEECH, 2016.

[31] Colleen Richey, María Auxiliadora Barrios, Zeb Armstrong, Chris Bartels, Horacio Franco, Martin Graciarena, Aaron Lawson, Mahesh Kumar Nandwana, Allen R. Stauffer, Julien van Hout, Paul Gamble, Jeff Hetherly, Cory Stephenson, and Karl Ni, “Voices obscured in complex environmental settings (VOICES) corpus,” CoRR, vol. abs/1804.05053, 2018.