Small-Text Active Learning
Figure 2: Module architecture of small-text. The core module can optionally be extended with a PyTorch and a transformers integration, which enable the use of GPU-based models and of state-of-the-art transformer-based text classifiers from the Hugging Face transformers library, respectively. The dependencies between the module's packages have been omitted.
Implementations for active learning in general already exist, yet few address text classification, which requires features specific to natural language processing, such as word embeddings (Mikolov et al., 2013) or language models (Devlin et al., 2019). To fill this gap, we introduce small-text, an active learning library that provides tried and tested components for both experiments and applications.

2 Overview of Small-Text

The main goal of small-text is to offer state-of-the-art active learning for text classification in a convenient and robust way for both researchers and practitioners. For this purpose, we implemented a modular pool-based active learning mechanism, illustrated in Figure 2, which exposes interfaces for classifiers, query strategies, and stopping criteria. The core of small-text integrates scikit-learn (Pedregosa et al., 2011), enabling direct use of its classifiers. Overall, the library provides thirteen query strategies, including some that are only usable on text data, five stopping criteria, and two integrations of well-known machine learning libraries, namely PyTorch (Paszke et al., 2019) and transformers (Wolf et al., 2020). The integrations ease the use of CUDA-based GPU computing and transformer models, respectively. The modular architecture renders both integrations completely optional, resulting in a slim core that can also be used in a CPU-only scenario without unnecessary dependencies. Given the ability to combine a considerable variety of classifiers and query strategies, we can easily build a vast number of active learning setups.

The library provides relevant text classification baselines such as SVM (Joachims, 1998) and KimCNN (Kim, 2014), and many more can be used through scikit-learn. Recent transformer models such as BERT (Devlin et al., 2019) are available through the transformers integration. This integration also includes a wrapper that enables the use of the recently published SetFit training paradigm (Tunstall et al., 2022), which uses contrastive learning to fine-tune SBERT embeddings (Reimers and Gurevych, 2019) in a sample-efficient manner.

As the query strategy, which selects the instances to be labeled, is the most salient component of an active learning setup, the range of provided query strategies covers four paradigms at the time of writing: (i) confidence-based strategies: least confidence (Lewis and Gale, 1994; Culotta and McCallum, 2005), prediction entropy (Roy and McCallum, 2001), breaking ties (Luo et al., 2005), BALD (Houlsby et al., 2011), CVIRS (Reyes et al., 2018), and contrastive active learning (Margatina et al., 2021); (ii) embedding-based strategies: BADGE (Ash et al., 2020), BERT k-means (Yuan et al., 2020), discriminative active learning (Gissin and Shalev-Shwartz, 2019), and SEALS (Coleman et al., 2022); (iii) gradient-based strategies: expected gradient length (EGL; Settles et al., 2007), EGL-word (Zhang et al., 2017), and EGL-sm (Zhang et al., 2017); and (iv) coreset strategies: greedy coreset (Sener and Savarese, 2018) and lightweight coreset (Bachem et al., 2018). Since there is an abundance of query strategies, this list will likely never be exhaustive, partly because strategies from other domains, such as computer vision, are not always applicable to the text domain, e.g., when they rely on the geometry of images (Konyushkova et al., 2015), and are therefore disregarded here.
Furthermore, small-text includes five stopping criteria: (i) stabilizing predictions (Bloodgood and Vijay-Shanker, 2009), (ii) overall-uncertainty (Zhu et al., 2008), (iii) classification-change (Zhu et al., 2008), (iv) predicted change of F-measure (Altschuler and Bloodgood, 2019), and (v) a criterion that stops after a fixed number of iterations. Stopping criteria are often neglected in active learning although they exert a strong influence on labeling efficiency.
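As an illustration of what such a criterion computes, the following is a generic sketch of the idea behind (i), stabilizing predictions, written against plain scikit-learn rather than the library's own stopping-criterion interface: the predictions of successive models on a fixed stop set are compared, and labeling stops once their agreement, measured by Cohen's kappa, stabilizes.

import numpy as np
from sklearn.metrics import cohen_kappa_score

def should_stop(prediction_history, window=3, threshold=0.99):
    """Stop once consecutive models agree on a fixed stop set.

    `prediction_history` is a list of label vectors, one per
    iteration, each predicted over the same unlabeled stop set.
    """
    if len(prediction_history) < window + 1:
        return False
    # Kappa between each pair of consecutive prediction vectors
    # within the most recent window of iterations.
    recent = prediction_history[-(window + 1):]
    kappas = [cohen_kappa_score(a, b)
              for a, b in zip(recent, recent[1:])]
    return np.mean(kappas) > threshold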
The library is available via the Python Package Index and can be installed with a single command: pip install small-text. Similarly, the integrations can be enabled using the extra requirements argument of Python's setuptools; e.g., the transformers integration is installed using pip install small-text[transformers]. The robustness of the implementation rests on extensive unit and integration tests. Detailed examples, an API documentation, and common usage patterns are available in the online documentation.1

1 https://small-text.readthedocs.io

3 Library versus Annotation Tool

We designed small-text for two types of settings: (i) experiments, which usually consist of either automated active learning evaluations or short-lived setups with one or more human annotators, and (ii) real-world applications, in which the final model is subsequently applied to unlabeled or unseen data. Both cases benefit from a library which offers a wide range of well-tested functionality.

To clarify the distinction between a library and an annotation tool: small-text is a library, by which we mean a reusable set of functions and classes that can be used and combined within more complex programs. In contrast, annotation tools provide a graphical user interface and focus on the interaction between the user and the system. Obviously, small-text is still intended to be used by annotation tools but remains a standalone library. In this way it can be used (i) in combination with an annotation tool, (ii) within an experiment setting, or (iii) as part of a backend application, e.g., a web API (see the sketch below). As a library it remains compatible with all of these use cases. This flexibility is supported by the library's modular architecture, which is also in concordance with software engineering best practices, where high cohesion and low coupling (Myers, 1975) are known to contribute towards highly reusable software (Müller et al., 1993; Tonella, 2001). As a result, small-text should be compatible with most annotation tools that are extensible and support text classification.
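To make the backend use case (iii) concrete, here is a minimal hypothetical sketch of a web API around an active learner. Flask, the routes, and the payload shapes are our illustration and not part of small-text; `active_learner` is assumed to be set up as in the code example of Section 4.

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/query', methods=['GET'])
def query():
    # Ask the active learner which instances to label next.
    indices = active_learner.query(num_samples=10)
    return jsonify(indices=indices.tolist())

@app.route('/update', methods=['POST'])
def update():
    # Receive labels for the previously queried instances
    # from the annotation frontend.
    y = np.array(request.get_json()['labels'])
    active_learner.update(y)
    return jsonify(status='ok')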
4 Code Example

In this section we show a code example that performs active learning with transformer models.

Dataset: First, we create (for the sake of a simple example) a synthetic two-class spam dataset of 100 instances. The data is given by a list of texts and a list of integer labels. To define the tokenization strategy, we provide a transformers tokenizer. From these individual parts we construct a TransformersDataset object, which is a dataset abstraction that can be used by the interfaces in small-text. This yields a binary text classification dataset containing 50 examples each of the positive class (spam) and the negative class (ham):

import numpy as np
from small_text import TransformersDataset, \
    TransformerModelArguments
from transformers import AutoTokenizer

# Fake data example:
# 50 spam and 50 non-spam examples
text = np.array(['this is ham'] * 50 +
                ['this is spam'] * 50)
labels = np.array([0] * 50 + [1] * 50)

transformer_model = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(
    transformer_model)

train = TransformersDataset.from_arrays(
    text, labels, tokenizer,
    target_labels=np.array([0, 1]),
    max_length=10
)

Active Learning Configuration: Next, we configure the classifier and query strategy. Although the active learner, query strategies, and stopping criteria components are dataset- and classifier-agnostic, classifier and dataset have to match (i.e., TransformerBasedClassification must be used with TransformersDataset) owing to the different underlying data structures:

from small_text import LeastConfidence, \
    TransformerBasedClassificationFactory \
    as TransformerFactory

num_classes = 2
model_args = TransformerModelArguments(
    transformer_model)
clf_factory = TransformerFactory(
    model_args, num_classes,
    kwargs={'device': 'cuda'})
query_strategy = LeastConfidence()

Since the active learner may need to instantiate a new classifier before the training step, a factory (Gamma et al., 1995) is responsible for creating new classifiers. Finally, we set the query strategy to least confidence (Culotta and McCallum, 2005).

Initialization: There is a chicken-and-egg problem in active learning: most query strategies rely on the model, and the model in turn is trained on labeled instances, which are selected by the query strategy. This problem can be solved either by providing an initial model (e.g., through manual labeling) or by using cold start approaches (Yuan et al., 2020). In this example we simulate a user-provided initialization by looking up the respective true labels and providing an initial model:

from small_text import \
    PoolBasedActiveLearner, \
    random_initialization_balanced as init
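To obtain a ready-to-use learner from these imports, the pool-based workflow can be completed as follows. This completion is a sketch based on the library's documented workflow; the exact signatures are assumptions and should be verified against the installed version:

# Sketch (assumed API): create the active learner, draw a
# class-balanced initial sample, and label it with true labels.
active_learner = PoolBasedActiveLearner(
    clf_factory, query_strategy, train)
indices_initial = init(train.y, n_samples=20)
active_learner.initialize_data(
    indices_initial, train.y[indices_initial])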
Finally, we run the active learning loop: in each of five iterations, ten instances are queried, their labels are provided (simulated here by looking up the true labels), and the accuracy on the training set is printed:

from sklearn.metrics import accuracy_score

num_queries = 5
for i in range(num_queries):
    # Query 10 samples per iteration.
    indices_queried = active_learner.query(
        num_samples=10)

    # Simulate user interaction here.
    # Replace this for real-world usage.
    y = train.y[indices_queried]

    # Provide labels for the queried indices.
    active_learner.update(y)

    # Evaluate accuracy on the train set.
    print(f'Iteration {i+1}')
    y_pred = active_learner.classifier \
        .predict(train)
    print('Train accuracy: {:.2f}'.format(
        accuracy_score(train.y, y_pred)))
Table 1: Comparison between small-text and relevant previous active learning libraries. We abbreviate the number of query strategies by "QS", the number of stopping criteria by "SC", and the low-resource-text-classification framework by lrtc. All information except "Publication Year" and "Code Repository" has been extracted from the linked GitHub repository of the respective library on February 24th, 2023. Random baselines were not counted towards the number of query strategies. Publications: ^1 Reyes et al. (2016), ^2 Yang et al. (2017), ^3 Danka and Horvath (2018), ^4 Tang et al. (2019), ^5 Atighehchian et al. (2020), ^6 Ein-Dor et al. (2020), ^7 Kottke et al. (2021), ^8 Tsvigun et al. (2022).
… the experimental active learning setting. Apart from providing 22 query strategies, it supports alternative active learning settings, e.g., active learning with noisy annotators. The low-resource-text-classification-framework (lrtc; Ein-Dor et al., 2020) is an experimentation framework for the low-resource scenario which can be easily extended. It also focuses on text classification and has a number of built-in models, datasets, and query strategies to perform active learning experiments. Another recent library is scikit-activeml, which offers general active learning built around scikit-learn. It comes with 29 query strategies but provides no stopping criteria. GPU-based functionality can be used via skorch,2 a PyTorch wrapper, which is a ready-to-use adapter, as opposed to our own classifier implementations, but is in turn restricted to the scikit-learn interfaces. ALToolbox (Tsvigun et al., 2022) is an active learning framework that provides an annotation interface and a benchmarking mechanism to develop new query strategies. While it has some overlap with small-text, it is not a library, but it also focuses on text data, namely text classification and sequence tagging.

2 We also evaluated the use of skorch, but transformer models were not supported at that time.

In Table 1, we compare small-text to the previously mentioned libraries based on several criteria related to active learning or to the respective code base. While all libraries provide a selection of query strategies, not all libraries offer stopping criteria, which are crucial to reducing the total annotation effort and thus directly influence the efficiency of the active learning process (Vlachos, 2008; Laws and Schütze, 2008; Olsson and Tomanek, 2009). We can also see a difference in the number of provided query strategies. While a higher number of query strategies is certainly not a disadvantage, it is more important to provide the most relevant strategies (either due to recency, domain-specificity, strong general performance, or because they are baselines). Based on these criteria, small-text provides numerous recent strategies such as BADGE (Ash et al., 2020), BERT k-means (Yuan et al., 2020), and contrastive active learning (Margatina et al., 2021), as well as the gradient-based strategies by Zhang et al. (2017), where the latter are unique to active learning for text classification. Selecting a subset of query strategies is especially important since active learning experiments are computationally expensive (Margatina et al., 2021; Schröder et al., 2022), and therefore not every strategy can be tested in the context of an experiment or application. Finally, only small-text, lrtc, and ALToolbox focus on text, and only about half of the libraries offer access to GPU-based deep learning, which has become indispensable for text classification due to the recent advances and ubiquity of transformer-based models (Vaswani et al., 2017; Devlin et al., 2019).

The distinguishing characteristic of small-text is its focus on text classification, paired with a multitude of interchangeable components. It offers the most comprehensive set of features (as shown in Table 1), and through the integrations these components can be mixed and matched to easily build numerous different active learning setups, with or without leveraging the GPU. Finally, it allows the use of concepts from natural language processing (such as transformer models) and provides query strategies unique to text classification.
Dataset Name (ID)          Type  Classes  Training  Test
AG's News^1 (AGN)          N     4        120,000   7,600 ⋆
Customer Reviews^2 (CR)    S     2        3,397     378
Movie Reviews^3 (MR)       S     2        9,596     1,066
Subjectivity^4 (SUBJ)      S     2        9,000     1,000
TREC-6^5 (TREC-6)          Q     6        5,500     500 ⋆

Table 2: Key characteristics of the examined datasets: ^1 Zhang et al. (2015), ^2 Hu and Liu (2004), ^3 Pang and Lee (2005), ^4 Pang and Lee (2004), ^5 Li and Roth (2002). The dataset type is abbreviated by N (News), S (Sentiment), and Q (Questions). ⋆: Predefined test sets were available and adopted.

Model   Strategy  Rank            Result
                  Acc.    AUC     Acc.    AUC
BERT    PE        2.20    2.80    0.917   0.858
        BT        1.40    1.60    0.919   0.868
        LC        3.80    3.20    0.916   0.863
        CA        4.20    5.00    0.915   0.857
        BA        3.00    5.20    0.917   0.855
        BD        6.20    4.60    0.909   0.862
        CS        6.60    7.60    0.910   0.843
        RS        7.80    5.40    0.901   0.856
SetFit  PE        2.80    3.20    0.927   0.906
        BT        2.80    1.60    0.926   0.912
        LC        2.20    2.60    0.927   0.908
        CA        4.80    3.80    0.924   0.907
        BA        5.20    6.20    0.923   0.902
        BD        6.60    5.60    0.915   0.904
        CS        2.80    4.40    0.927   0.909
        RS        6.60    6.80    0.907   0.899

Table 3: The "Rank" columns show the mean rank when ordered by mean accuracy (Acc.) and by area under curve (AUC). The "Result" columns show the mean accuracy and AUC. All values refer to the state after the final iteration. Query strategies are abbreviated as follows: prediction entropy (PE), breaking ties (BT), least confidence (LC), contrastive active learning (CA), BALD (BA), BADGE (BD), greedy coreset (CS), and random sampling (RS).
6 Experiment

We perform an active learning experiment comparing an SBERT model trained with the recent sentence transformers fine-tuning paradigm (SetFit; Tunstall et al., 2022) to a BERT model trained with standard fine-tuning. SetFit is a contrastive learning approach that trains on pairs of (dis)similar instances. Given a fixed amount of differently labeled instances, the number of possible pairs is considerably higher than the size of the original set, making this approach highly sample efficient (Chuang et al., 2020; Hénaff, 2020) and therefore interesting for active learning.
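To see the combinatorial blow-up in concrete numbers, consider the 25 initial samples from our setup below; the snippet is merely illustrative arithmetic:

import math

n = 25  # size of the initial labeled set in our setup
# Number of distinct unordered instance pairs that can be formed:
num_pairs = math.comb(n, 2)  # n * (n - 1) / 2
print(num_pairs)  # 300, an order of magnitude more than n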
Setup: We reproduce the setup of our previous work (Schröder et al., 2022) and evaluate on the datasets shown in Table 2 with an extended set of query strategies. Starting from a pool-based active learning setup with 25 initial samples, we perform 20 queries, during each of which 25 instances are queried and labeled. Since SetFit has only been evaluated for single-label classification (Tunstall et al., 2022), we focus on single-label classification as well. The goal is to compare the following two models: (i) BERT (bert-large-uncased; Devlin et al., 2019) with 336M parameters and (ii) SBERT (paraphrase-mpnet-base-v2; Reimers and Gurevych, 2019) with 110M parameters. The first model is trained via vanilla fine-tuning and the second using SetFit. For the sake of brevity, we refer to the first as "BERT" and to the second as "SetFit". To compare their performance during active learning, we provide an extensive benchmark over multiple computationally inexpensive uncertainty-based query strategies, which were selected due to encouraging results in our previous work. Moreover, we include BALD, BADGE, and greedy coreset, all of which are computationally more expensive but have been increasingly used in recent work (Ein-Dor et al., 2020; Yu et al., 2022).

Results: Table 3 summarizes the classification performance in terms of (i) final accuracy after the last iteration and (ii) area under curve (AUC). We also compare strategies by ranking them from 1 (best) to 8 (worst) per model and dataset by accuracy and AUC. First, we can confirm for SetFit the earlier finding that uncertainty-based strategies perform strongly for BERT (Schröder et al., 2022). Second, SetFit configurations result in between 0.06 and 1.7 percentage points higher mean accuracy, and in between 4.2 and 6.6 points higher AUC, when averaged over model and query strategy. Interestingly, the greedy coreset strategy (CS) is remarkably more successful for the SetFit runs than for the BERT runs. Detailed results per configuration can be found in Appendix C.
[Figure: test accuracy learning curves]

…ated a pool-based binary active learning setup, and their main finding is that, when using active learn…
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12(85):2825–2830.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Oscar Reyes, Carlos Morell, and Sebastián Ventura. 2018. Effective active learning strategy for multi-label learning. Neurocomputing, 273:494–508.

Oscar Reyes, Eduardo Pérez, María del Carmen Rodríguez-Hernández, Habib M. Fardoun, and Sebastián Ventura. 2016. JCLAL: A Java framework for active learning. Journal of Machine Learning Research (JMLR), 17(95):1–5.

Julia Romberg and Tobias Escher. 2022. Automated topic categorisation of citizens' contributions: Reducing manual labelling efforts through active learning. In Electronic Government, pages 369–385, Cham. Springer International Publishing.

Ying-Peng Tang, Guo-Xiang Li, and Sheng-Jun Huang. 2019. ALiPy: Active learning in Python. arXiv preprint arXiv:1901.03802.

Paolo Tonella. 2001. Concept analysis for module restructuring. IEEE Transactions on Software Engineering, 27(4):351–363.

Akim Tsvigun, Leonid Sanochkin, Daniil Larionov, Gleb Kuzmin, Artem Vazhentsev, Ivan Lazichny, Nikita Khromov, Danil Kireev, Aleksandr Rubashevskii, and Olga Shahmatova. 2022. ALToolbox: A set of tools for active learning annotation of natural language texts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 406–434, Abu Dhabi, UAE. Association for Computational Linguistics.

Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, and Oren Pereg. 2022. Efficient few-shot learning without prompts. arXiv preprint arXiv:2209.11055.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NeurIPS), pages 5998–6008.

Andreas Vlachos. 2008. A stopping criterion for active learning. Computer Speech & Language, 22(3):295–312.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), pages 38–45.

Yao-Yuan Yang, Shao-Chuan Lee, Yu-An Chung, Tung-En Wu, Si-An Chen, and Hsuan-Tien Lin. 2017. libact: Pool-based active learning in Python. arXiv preprint arXiv:1710.00379.

Yue Yu, Lingkai Kong, Jieyu Zhang, Rongzhi Zhang, and Chao Zhang. 2022. AcTune: Uncertainty-based active self-training for active fine-tuning of pretrained language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1422–1436, Seattle, United States. Association for Computational Linguistics.

Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. 2020. Cold-start active learning through self-supervised language modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7935–7948. Association for Computational Linguistics.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS), pages 649–657. Curran Associates, Inc., Montreal, Quebec, Canada.

Ye Zhang, Matthew Lease, and Byron C. Wallace. 2017. Active discriminative text representation learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pages 3386–3392.

Jingbo Zhu, Huizhen Wang, and Eduard Hovy. 2008. Multi-criteria-based strategy to stop active learning for data annotation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1129–1136, Manchester, UK. Coling 2008 Organizing Committee.

Supplementary Material

A Technical Environment

All experiments were conducted within a Python 3.8 environment. The system had CUDA 11.1 installed and was equipped with an NVIDIA GeForce RTX 2080 Ti (11GB VRAM).

B Experiments

Each experiment configuration represents a combination of model, dataset, and query strategy, and has been run five times.

B.1 Datasets

We used datasets that are well-known benchmarks in text classification and active learning. All datasets are accessible from the Python ecosystem via libraries that provide fast access to them. We obtained CR and SUBJ using gluonnlp, and AGN, MR, and TREC using huggingface datasets.

B.2 Pre-Trained Models

In the experiments, we fine-tuned (i) a large BERT model (bert-large-uncased) and (ii) an SBERT paraphrase-mpnet-base model (sentence-transformers/paraphrase-mpnet-base-v2). Both are available via the Hugging Face model repository.

B.3 Hyperparameters

Maximum Sequence Length: We set the maximum sequence length to the minimum multiple of ten such that 95% of the given dataset's sentences contain at most that many tokens.
Transformer Models: For BERT, we adopt the hyperparameters from Schröder et al. (2022). For SetFit, we use the same learning rate and optimizer parameters but train for only one epoch.

C Evaluation

In Table 4 and Table 5 we report final accuracy and AUC scores including standard deviations, measured after the last iteration. Note that results obtained through PE, BT, and LC are equivalent for binary datasets.

C.1 Evaluation Metrics

Active learning was evaluated using standard metrics, namely accuracy and area under the learning curve. For both metrics, the respective scikit-learn implementation was used.
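As a sketch of how both metrics can be computed with scikit-learn; the accuracy values and the normalization of the iteration axis are our own choices for illustration:

import numpy as np
from sklearn.metrics import auc

# Hypothetical test accuracies measured after each query step.
accuracies = np.array([0.62, 0.71, 0.78, 0.82, 0.85])

# (i) Final accuracy: the value after the last iteration.
final_accuracy = accuracies[-1]

# (ii) Area under the learning curve: accuracy integrated over
# the iteration axis, normalized to [0, 1] so that a perfect
# learner would approach an area of 1.0.
x = np.linspace(0, 1, num=len(accuracies))
area = auc(x, accuracies)

print(final_accuracy, area)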
Dataset  Model   PE           BT           LC           CA           BA           BD           CS           RS
AGN      BERT    0.898±0.003  0.901±0.004  0.900±0.001  0.889±0.010  0.889±0.008  0.894±0.003  0.881±0.006  0.886±0.004
         SetFit  0.900±0.002  0.902±0.004  0.902±0.002  0.892±0.006  0.887±0.010  0.896±0.003  0.896±0.003  0.877±0.005
CR       BERT    0.920±0.009  0.920±0.009  0.916±0.006  0.917±0.010  0.919±0.010  0.911±0.010  0.915±0.012  0.902±0.014
         SetFit  0.937±0.014  0.937±0.014  0.937±0.014  0.938±0.009  0.934±0.004  0.913±0.011  0.939±0.011  0.912±0.010
MR       BERT    0.850±0.005  0.850±0.005  0.846±0.008  0.844±0.008  0.859±0.003  0.835±0.017  0.843±0.006  0.831±0.020
         SetFit  0.871±0.009  0.871±0.009  0.871±0.009  0.869±0.004  0.867±0.005  0.864±0.008  0.870±0.008  0.871±0.003
SUBJ     BERT    0.959±0.005  0.959±0.005  0.958±0.003  0.958±0.008  0.959±0.003  0.948±0.006  0.957±0.004  0.937±0.006
         SetFit  0.962±0.004  0.962±0.004  0.962±0.004  0.960±0.002  0.966±0.002  0.942±0.002  0.963±0.003  0.932±0.005
TREC-6   BERT    0.960±0.002  0.966±0.003  0.960±0.008  0.965±0.006  0.958±0.007  0.958±0.009  0.952±0.015  0.947±0.009
         SetFit  0.966±0.005  0.961±0.005  0.966±0.005  0.963±0.008  0.961±0.005  0.958±0.005  0.967±0.004  0.946±0.009

Table 4: Final accuracy per dataset, model, and query strategy. We report the mean and standard deviation over five runs. The best result per dataset is printed in bold. Query strategies are abbreviated as follows: prediction entropy (PE), breaking ties (BT), least confidence (LC), contrastive active learning (CA), BALD (BA), BADGE (BD), greedy coreset (CS), and random sampling (RS).
Table 5: Final area under curve (AUC) per dataset, model, and query strategy. We report the mean and standard deviation over five runs. The best result per dataset is printed in bold. Query strategies are abbreviated as follows: prediction entropy (PE), breaking ties (BT), least confidence (LC), contrastive active learning (CA), BALD (BA), BADGE (BD), greedy coreset (CS), and random sampling (RS).
D Library Adoption
As mentioned in Section 7, the experiment code of
previous works documents how small-text was
used and can be found at the following locations: