
Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning

Ming Li1, Yong Zhang2, Shwai He1, Zhitao Li2, Hongyu Zhao1, Jianzong Wang2, Ning Cheng2, Tianyi Zhou1*
1 University of Maryland   2 Ping An Technology (Shenzhen) Co., Ltd.
minglii@umd.edu, jzwang@188.com, tianyi@umd.edu
* Corresponding author

Abstract
Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But it also leads to extra cost and computation due to the involvement of LLMs in this process. To reduce the filtering cost, we study Superfiltering: Can we use a smaller and weaker model to select data for finetuning a larger and stronger model? Despite the performance gap between weak and strong language models, we find their highly consistent capability to perceive instruction difficulty and data selection results. This enables us to use a much smaller and more efficient model to filter the instruction data used to train a larger language model. Not only does it largely speed up the data filtering, but the filtered-data-finetuned LLM achieves even better performance on standard benchmarks. Extensive experiments validate the efficacy and efficiency of our approach. Our codes, models, and data will be released at https://github.com/tianyi-lab/Superfiltering.

Figure 1: Top: Comparison of data filtering for instruction tuning of a student model. (a) The filter model is a strong proprietary LLM, e.g. ChatGPT, which can be time-consuming and expensive but usually performs promisingly. (b) The filter model is the student model itself or a similar-sized open-source LLM, which is still time-consuming but free to use. (c) Weak-to-strong superfiltering proposed by this paper, which utilizes a much smaller filter model, e.g. GPT-2, to train a stronger student LLM. We find it costs much less time but maintains the performance. Bottom: Comparisons of two student models finetuned using 5% data selected by LLaMA2-7B and GPT-2 from the Alpaca dataset. (d) Both models trained on 5% data outperform the baseline model trained on 100% data. (e) GPT-2 as the superfilter speeds up data filtering by ∼20 times.

1 Introduction

The landscape of natural language processing has witnessed a transformative change with the introduction of large language models (LLMs) like GPT-3 (Brown et al., 2020), GPT-4 (OpenAI, 2023), LLaMA (Touvron et al., 2023a,b), Mistral (Jiang et al., 2023), etc. (Penedo et al., 2023; Scao et al., 2022). These models offer unprecedented capabilities in generating contextually rich, coherent, and often creative text. A key advantage of them is the ability to follow instructions, which leads to promising performance on zero-shot (prompting) or few-shot (in-context learning) tasks. To achieve it, a supervised learning technique named instruction tuning (Wei et al., 2022; Longpre et al., 2023a) is commonly utilized, which finetunes an LLM to produce preferred responses on a broad range of tasks described by natural language instructions. Earlier works of instruction tuning focus on creating large, varied, and high-quality datasets of various tasks with responses curated by human experts (Khashabi et al., 2020; Ye et al., 2021; Wei et al., 2022; Wang et al., 2022; Du et al., 2022), which can be bottlenecked by the intensive human labor.
An alternative is to generate the data by a powerful teacher LLM (Wang et al., 2023b; Taori et al., 2023; Xu et al., 2023; Li et al., 2023a), but the quality is hard to control and largely depends on the teacher. LIMA (Zhou et al., 2023) finds that a mere 1,000 human-crafted high-quality data could significantly improve an LLM's instruction-following capability, based on which they posit that LLMs acquire most knowledge during pretraining and thus a few data suffice for instruction tuning.

To further free the human labor in data curation and accelerate the instruction tuning process, a line of recent works apply an extra filter algorithm to select data from the existing dataset. However, the model used in the filtering process usually needs to be as powerful as ChatGPT (Chen et al., 2023b; Lu et al., 2023), or requires additional reward data training (Du et al., 2023; Bukharin and Zhao, 2023), or is the student model itself (Li et al., 2023b), which leads to additional expensive cost and latency due to the inference on these large filter models, especially when the original dataset is large while only a tiny fraction of data needs to be selected. These paradigms are presented in Figure 1 (a) and (b). To reduce the filtering cost, we study Superfiltering: Can we use a smaller and weaker model as a filter to select instruction-tuning data for training a larger and stronger model? This was first studied for training small classification models by Coleman et al. (2020), while its effectiveness on open-domain instruction datasets is unexplored. Recently, Weak-to-Strong Generalization (Burns et al., 2023) proposes to utilize a weaker ChatGPT to generate data used to finetune a stronger GPT-4 model, which shares a similar spirit with our Superfiltering as depicted in Figure 1(c).

In Superfiltering, we find that a smaller and weaker GPT-2 (124M) (Radford et al., 2019) suffices to replace previously used large filter models and select high-quality instruction tuning data used to finetune a much larger LLaMA2 (7B or 13B). This is motivated by our main discovery of filter models' consistency on two data statistical metrics, perplexity and the instruction-following difficulty (IFD) score (Li et al., 2023b). Despite the differences in scales across different filter models, their rankings of the same instruction tuning dataset are surprisingly consistent, as demonstrated by the large rank correlation coefficients evaluated on different models and datasets. Our thorough empirical study implies that weaker language models possess a capability consistent with their stronger counterparts in comprehending and discerning the difficulty of diverse instructions, though they may differ in other skills like reasoning and generalization.

In extensive experiments, our Superfiltering strategy using GPT-2 as the filter, as exemplified on several widely used instruction datasets, brings significant speedups to data filtering for instruction tuning. By utilizing only 5% of the original data volume, Superfiltering allows us to attain LLMs comparable, and in some instances superior, to those achieved by training with full data. Our main contributions can be summarized in three folds:

• Weak-to-Strong Consistency on Data Filtering: We discover a strong consistency between small and large LLMs in perceiving and evaluating the difficulty of instruction tuning data.

• Efficient Superfiltering Strategy: We propose the first method of Superfiltering that utilizes a small LM, e.g., GPT-2 (124M), to select data for instruction tuning, and it brings significant speedups to the LLM finetuning pipeline.

• Efficacy of Selected Training Data: Superfiltering is precise in allocating high-quality and informative data improving LLM instruction tuning.

2 Problem Formulation

2.1 Preliminaries

We define a dataset as D, containing n triplets x = (Instruction, [Input], Response) as the instruction tuning data samples. Earlier instruction tuning samples mostly contain separated instruction and input segments for better control (Wang et al., 2022; Longpre et al., 2023b; Taori et al., 2023), while most of the current datasets directly merge the inputs into instructions (Zhou et al., 2023; Chiang et al., 2023; Xu et al., 2023; Li et al., 2023a). For simplicity, we define x = map(Instruction, [Input]) as the complete instruction and y as the corresponding response. The mapping function could be the simple concatenation with some control tokens. Thus D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} represents a collection of n instruction-response pairs.

Perplexity  In the instruction tuning setting, the model is trained to maximize the likelihood of the response given the corresponding instruction as the condition. Hence, perplexity can be a potential metric to measure the difficulty. Specifically, the perplexity of a given sample (x_i, y_i) is defined as:

PPL(y_i|x_i) = exp( -(1/N) \sum_{j=1}^{N} log p(y_{i,j} | x_i, y_{i,1}, ..., y_{i,j-1}) )    (1)

where N is the length of the response y_i and y_{i,j} represents the j-th token in the response y_i.
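In practice, the conditioned perplexity of Equation (1) can be obtained from a single forward pass of a causal language model by excluding the instruction tokens from the loss. The following is a minimal sketch using the Hugging Face transformers API; the model choice and the plain concatenation of instruction and response are illustrative assumptions, not the exact code released with this paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def conditioned_perplexity(model, tokenizer, instruction: str, response: str) -> float:
    """PPL(y|x): perplexity of the response tokens given the instruction as context."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore instruction tokens in the loss
    with torch.no_grad():
        out = model(full_ids, labels=labels)
    # out.loss is the mean negative log-likelihood over the response tokens only.
    return torch.exp(out.loss).item()

if __name__ == "__main__":
    name = "gpt2"  # any causal LM; the paper's filter models range from GPT-2 to LLaMA2-7B
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name).eval()
    ppl = conditioned_perplexity(lm, tok, "List three primary colors.\n", "Red, blue, and yellow.")
    print(f"PPL(y|x) = {ppl:.2f}")
```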

IFD score  Li et al. (2023b) firstly proposes a self-guided method in which no extra models are utilized but Instruction-Following Difficulty (IFD) scores are calculated based on the pre-experienced LLM or the original pre-trained LLM. The IFD score is a pure statistical metric that compares the losses or perplexities when the model generates a response y_i with and without the instructional context x_i, measuring how much help the instruction provides to the generation of the corresponding response. A higher IFD score, indicating less instructional help, suggests a greater difficulty. On the contrary, a low IFD score indicates that the given instruction can directly benefit the language model largely even without further training, representing the easiness of the sample and the lower necessity of training on it. For a given instruction-following data pair, the IFD score is calculated as follows:

IFD(y_i|x_i) = PPL(y_i|x_i) / PPL(y_i)    (2)

where PPL(y_i|x_i) and PPL(y_i) denote the perplexities of the given model in fitting response y_i with and without the instruction x_i, respectively.
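Since both perplexities in Equation (2) are exponentials of mean token losses over the same response, the IFD score can also be computed directly from the two losses. A small sketch under that observation (the numeric values are illustrative, not results from the paper):

```python
import math

def ifd_from_losses(loss_with_instruction: float, loss_without_instruction: float) -> float:
    """IFD(y|x) = PPL(y|x) / PPL(y) = exp(L(y|x)) / exp(L(y)) = exp(L(y|x) - L(y)).

    Both losses are mean negative log-likelihoods over the *same* response tokens,
    computed by the same filter model with and without the instruction in context.
    """
    return math.exp(loss_with_instruction - loss_without_instruction)

# Illustrative values: a response that becomes much easier once the instruction is
# given yields a low IFD; an IFD close to (or above) 1 means the instruction barely helps.
print(ifd_from_losses(2.1, 3.0))  # ~0.41: instruction helps a lot, easy sample
print(ifd_from_losses(2.9, 3.0))  # ~0.90: harder, more informative sample for tuning
```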
2.2 Formulation and Motivations

Superfiltering aims to find a data filtering score that (1) excels in identifying high-quality and informative training data, and (2) can be computed by a small and low-cost filter model without further training. To this end, we try to find a data evaluation metric consistent between weak and strong language models. Given a candidate score, we investigate whether it is possible to utilize a much weaker language model, e.g. GPT-2, to calculate it for the relatively stronger student model. We hypothesize that, although the intrinsic abilities of weak and strong language models vary dramatically, as indicated by the discrepancies of their perplexities from the pretraining stage, their ability to perceive instruction difficulty could be similar. To verify this hypothesis, experiments are conducted and presented in Section 3.

To verify the hypothesis, we conduct a thorough empirical study of the consistency of perplexities computed by different language models on the same instruction-tuning dataset. In Section 3.1, we focus on verifying the consistency of perplexity across weak-to-strong models by comparing their scales and orderings of samples on each dataset. The results show that though the scales vary drastically, the orderings remain consistent, which verifies our hypothesis. In Section 3.2, we conduct the same study on IFD scores, for which both the scales and the orderings are consistent across weak-to-strong models, indicating the IFD score as a more promising score for Superfiltering than perplexity.

Figure 2: Pairwise comparison between each model finetuned using Superfiltered data (5%, 10%, and 15% of the original dataset) and the full-data (100%) finetuned model. We report results for two base models (LLaMA2-7B/13B) and two datasets (Alpaca and Alpaca-GPT4). The win-tie-lose is judged by GPT-4 given the two models' responses to each instruction from the WizardLM test set.

Figure 2 compares each Superfiltering-selected-data finetuned model and the full-data finetuned model by using GPT-4 as a judge to decide their numbers of wins/ties/losses on a test set of instructions. More details of the evaluation metric can be found in Section 4.3. Superfiltering-trained models always outperform the baseline given different base models, datasets, and selection ratios, demonstrating the effectiveness of our proposed weak-to-strong Superfiltering scheme.

3 Weak-to-Strong Consistency

In this section, we delve into the hypothesis that weak and strong language models share a relatively consistent capability in perceiving the difficulties of instruction tuning samples.

                 Model (weak-to-strong)          Rank Correlation ↑          Overlap Ratio ↑
Dataset          Name               Size          Perplexity   IFD score      5%      10%     15%
GPT-2 124M 0.726 0.679 0.28 0.41 0.49
GPT-2-large 774M 0.790 0.682 0.26 0.40 0.50
Alpaca GPT-2-XL 1.5B 0.802 0.693 0.27 0.40 0.49
GPT-NEO 1.3B 0.846 0.802 0.38 0.51 0.59
LLaMA2-7B 7B 1.000 1.000 1.00 1.00 1.00
GPT-2 124M 0.730 0.788 0.24 0.40 0.51
GPT-2-large 774M 0.795 0.820 0.21 0.36 0.48
Alpaca-GPT4 GPT-2-XL 1.5B 0.800 0.818 0.18 0.33 0.45
GPT-NEO 1.3B 0.842 0.876 0.33 0.52 0.62
LLaMA2-7B 7B 1.000 1.000 1.00 1.00 1.00
GPT-2 124M 0.763 0.802 0.42 0.54 0.61
GPT-2-large 774M 0.809 0.848 0.44 0.58 0.65
WizardLM 70k GPT-2-XL 1.5B 0.821 0.855 0.44 0.57 0.65
GPT-NEO 1.3B 0.857 0.893 0.52 0.63 0.69
LLaMA2-7B 7B 1.000 1.000 1.00 1.00 1.00

Table 1: The rank correlation coefficient (Spearman’s ρ) and overlap ratio (of selected data) between LLaMA2-7B
and smaller language models when applied as filter models on three widely-used instruction tuning datasets. When
calculating Spearman’s ρ, the samples are sorted by perplexity or IFD scores calculated by different filter models.
For the overlap ratio, we consider three data filtering budgets, i.e., when 5%, 10%, or 15% of the dataset are selected.
The large rank coefficient between LLaMA2-7B and other smaller models indicates the consistency of different
models in perceiving the difficulties of instruction tuning data.

3.1 Weak-to-Strong Perplexity Consistency

As mentioned in the previous section, we first need to grasp to what extent language models of different sizes are consistent with each other in understanding instructions and generating corresponding responses. Thus we calculate the perplexity scores of several pretrained language models, including relatively small language models like GPT-2 (124M), GPT-2-large (774M), GPT-2-XL (1.5B) (Radford et al., 2019), GPT-NEO (1.3B) (Black et al., 2021), and the recent, relatively strong LLaMA2-7B (Touvron et al., 2023b), on several instruction-tuning datasets including Alpaca (Taori et al., 2023), Alpaca-GPT4 (Peng et al., 2023), and WizardLM 70k (Xu et al., 2023). The results are shown in Figure 3 (upper), where each box presents a perplexity distribution of a given dataset and language model. A clear tendency can be found that the stronger the language models are, the lower the perplexities are, which is consistent with the common belief for LLM pretraining: the better a language model, the lower its perplexity.

The above experimental results only showcase the perplexity scales of different models and neglect the potential perplexity ordering/ranking of different data samples, which is much more vital for data filtering. Thus, to evaluate the similarity in perplexity ordering on a given dataset between different models, Spearman's rank correlation coefficient (Spearman's ρ) is utilized. Spearman's ρ is a non-parametric measure used to assess the strength and direction of the relationship between two variables that are ranked or ordinal in nature. For two lists containing the same elements but different ordering, this value measures the similarity of the ordering in the range of −1 to 1. The closer the value is to 1, the more consistent the ordering of these two lists.

Specifically, within each dataset D, we sort the data samples based on the perplexity scores calculated by different models, resulting in several lists containing the same data but different orders, noted as D_{PPL,GPT-2}, D_{PPL,LLaMA2-7B}, etc. Since most of the finetuning experiments in our work are implemented on the LLaMA2-7B model, we set D_{PPL,LLaMA2-7B} as our standard sorted list and calculate the Spearman's ρ between the sorted lists of small language models and LLaMA2-7B:

ρ_{PPL,GPT-2} = g(D_{PPL,GPT-2}, D_{PPL,LLaMA2-7B})    (3)

where g is the function calculating this coefficient. All the resulting values on different instruction tuning datasets and different models are presented in Table 1, in the Perplexity column under Rank Correlation.
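Both statistics reported in Table 1 can be computed in a few lines once per-sample scores from two filter models are available. A sketch with placeholder scores standing in for the real per-sample values (not the paper's released code):

```python
import numpy as np
from scipy.stats import spearmanr

def consistency(scores_weak: np.ndarray, scores_strong: np.ndarray, budget: float = 0.05):
    """Spearman's rho between two models' per-sample scores, plus the overlap ratio of
    their top-`budget` selections (scores may be perplexity or IFD; higher = harder)."""
    rho, _ = spearmanr(scores_weak, scores_strong)
    k = int(len(scores_weak) * budget)
    top_weak = set(np.argsort(-scores_weak)[:k])
    top_strong = set(np.argsort(-scores_strong)[:k])
    return rho, len(top_weak & top_strong) / k

# Toy correlated scores standing in for GPT-2 and LLaMA2-7B scores on the same dataset.
rng = np.random.default_rng(0)
strong = rng.normal(size=10_000)
weak = strong + rng.normal(scale=0.5, size=10_000)
print(consistency(weak, strong, budget=0.05))
```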

Figure 3: The distributions of perplexity (top) and IFD score (bottom) computed by five models (left-to-right:
weak-to-strong) on three instruction tuning datasets. Observations: (1) The scale of perplexity varies drastically
across different models, indicating their difference in generation capability; (2) The scale of IFD scores is consistent
across models, indicating their consistency in measuring difficulties.

From the results, we can see that even the lowest coefficient value is still greater than 0.7, calculated between GPT-2 (124M) and LLaMA2-7B, and the highest coefficient value is greater than 0.85, calculated between GPT-NEO (1.3B) and LLaMA2-7B. The values presented in the table are reasonably high, indicating the consistent capability of different models in perceiving instructions. Moreover, there is also a clear tendency that the stronger the language models are, the higher the coefficient values. Comparing the perplexity distributions in Figure 3 (upper) and the coefficient values in Table 1, a clear consistency can be revealed: despite the large variance in the scales of perplexities generated by different language models, representing the intrinsic abilities of different language models, the high consistency in the perplexity ordering indicates their similarity in understanding instructions. That is to say, for a given instruction tuning sample, if the weak language models find it hard to generate based on the corresponding instruction, the strong models will probably feel the same way even though their probability of generating this response is much larger, and vice versa.

This phenomenon directly provides a glance at the weak-to-strong perplexity consistency, which serves as the basis for utilizing weak language models as the proxies for strong language models.

3.2 Weak-to-Strong IFD Consistency

Though a clear consistency in the perplexities of different language models is revealed by the above experiments, the perplexity does not directly represent the difficulty or quality of the instruction tuning sample and thus cannot be used for the data selection. Thus we further extend our findings to the Instruction-Following Difficulty (IFD) score proposed by Cherry LLM (Li et al., 2023b). It is used to select a subset of high-quality samples from the given instruction-tuning dataset to train an LLM with better performance.

Similarly, we calculate the IFD scores on different instruction-tuning datasets with different language models and draw their distributions as shown in Figure 3 (lower). We observe that though the perplexity scales vary noticeably between models, the IFD scales remain similar, indicating its potential to be a general selection metric for different models. Furthermore, the IFD-based Spearman's ρ are also presented in Table 1, in the IFD score column under Rank Correlation. Similar to the perplexity-based coefficient values, the IFD-based values also remain high, indicating a strong consistency of IFD rankings calculated on different models. Such a consistency validates the scalability of weaker models in evaluating instruction difficulty, indicating their adeptness at identifying complex instructions akin to their stronger counterparts. Another interesting phenomenon is that the IFD-based coefficient values are greater than the perplexity-based values on high-quality datasets, e.g. Alpaca-GPT4 and WizardLM 70k, indicating an even higher consistency in IFD scores for these datasets.

To provide an even more apparent glance at this consistency, we calculate the overlap ratio when utilizing IFD scores to select the high-quality subset. The performances of the LLMs can be roughly estimated by the overlap ratio due to the previous success of this metric. As the percentage threshold increases from 5% to 15%, there is a significant and growing overlap in the samples identified by the weaker models and strong models like the LLaMA2-7B model. Although the overlap is not complete, it is substantial, and this increasing overlap with higher thresholds reinforces our hypothesis, affirming a consistent and scalable capability in instruction evaluation across models of varying sizes.

This weak-to-strong IFD consistency directly verifies our hypothesis that language models of different sizes possess similar capabilities in understanding the difficulty of the instructions, even though their intrinsic abilities vary. It means that the difficult instruction tuning samples defined by the IFD scores are probably “generally” difficult no matter what language model is utilized for the calculation. This phenomenon directly makes it possible to utilize weak language models as the proxies for strong language models for calculating the IFD scores, and thus, to select data for instruction tuning.

3.3 Superfiltering

From the above section, we observe that the IFD score is a highly consistent metric when calculated based on different instruction-tuning datasets and varied-size language models. Thus we propose “Superfiltering”, the first approach utilizing only small language models, i.e. GPT-2 (124M) (Radford et al., 2019), to filter data for the instruction tuning of modern LLMs. Superfiltering uses smaller, less resource-intensive models (referred to as “weak” models) as effective substitutes for larger models (referred to as “strong” models) in the data evaluations, for the first time making this process efficient enough for practical usage.

Specifically, following Li et al. (2023b), for the given instruction-tuning dataset, the GPT-2 model is directly used to calculate the IFD score of each sample. Then the top k percent of samples with the highest IFD scores under 1 are selected for faster instruction tuning.
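The whole selection step thus reduces to scoring every sample with GPT-2 and keeping the top k percent by IFD score, restricted to scores below 1. A minimal sketch of this procedure, assuming Alpaca-style JSON fields (instruction/input/output) and a simple newline concatenation for the prompt; the released implementation may differ:

```python
import json
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_loss(model, tok, context: str, response: str) -> float:
    """Mean negative log-likelihood of the response tokens, optionally conditioned on context."""
    full = tok(context + response, return_tensors="pt").input_ids
    labels = full.clone()
    if context:
        labels[:, : tok(context, return_tensors="pt").input_ids.shape[1]] = -100
    with torch.no_grad():
        return model(full, labels=labels).loss.item()

def superfilter(samples, keep_ratio=0.05, model_name="gpt2"):
    """Score every sample with a small filter model and keep the top `keep_ratio`
    fraction by IFD score, discarding samples whose IFD score is >= 1."""
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name).eval()
    scored = []
    for s in samples:
        instruction = s["instruction"] + ("\n" + s["input"] if s.get("input") else "")
        loss_cond = response_loss(lm, tok, instruction + "\n", s["output"])
        loss_uncond = response_loss(lm, tok, "", s["output"])
        ifd = math.exp(loss_cond - loss_uncond)  # = PPL(y|x) / PPL(y)
        if ifd < 1.0:  # IFD >= 1 means the instruction does not help; drop such samples
            scored.append((ifd, s))
    scored.sort(key=lambda t: t[0], reverse=True)  # highest IFD (hardest) first
    return [s for _, s in scored[: int(len(samples) * keep_ratio)]]

if __name__ == "__main__":
    data = json.load(open("alpaca_data.json"))  # placeholder path to an Alpaca-style JSON file
    json.dump(superfilter(data), open("superfiltered_5pct.json", "w"), indent=2)
```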

4 Experimental Setup

4.1 Datasets

The Alpaca dataset (Taori et al., 2023), developed by Stanford University, comprises 52,000 instruction-following samples and was created using the self-instruct paradigm (Wang et al., 2023b). This dataset was generated by leveraging OpenAI's text-davinci-003 model. The Alpaca dataset represents a classical dataset of moderate quality; to further verify our method on an originally high-quality dataset, we also implement our method on the Alpaca-GPT4 dataset (Peng et al., 2023), which contains responses generated by GPT-4.

4.2 Implementation Details

We utilize the prompt and code base from Vicuna (Chiang et al., 2023) and flash attention (Dao et al., 2022), while the overall training arguments are aligned with the common training configuration. The Adam optimizer (Kingma and Ba, 2017), with a 2×10−5 learning rate for the LLaMA2-7B model (Touvron et al., 2023b), a 1×10−5 learning rate for the LLaMA2-13B model, and a batch size of 128, steers the training across three epochs with a max length of 2048. The warmup rate is set to 0.03.

4.3 Evaluation Metrics and Benchmarks

4.3.1 Pair-wise comparison

Evaluating responses generated by Large Language Models (LLMs) like GPT-4 remains a complex and ongoing research area, particularly for open-domain questions where establishing a clear ground truth is challenging. Traditional methods often fall short in assessing the instruction-following ability of these models. Recent trends, however, involve using LLMs themselves, such as GPT-4, as evaluators, a practice that has gained widespread acceptance in the field (Touvron et al., 2023b; Chiang et al., 2023; Dettmers et al., 2023; Liu et al., 2023). Previous studies (Zheng et al., 2023; Li et al., 2023c; Sottana et al., 2023) have shown that GPT-4's evaluations are consistent with human evaluations. We utilize the testing instruction sets from WizardLM (Xu et al., 2023) and Vicuna (Chiang et al., 2023), which contain 218 and 80 diverse human-curated instructions, respectively.

Our study adopts the evaluation strategy outlined by Chen et al. (2023b); Li et al. (2023b,a), involving a detailed rating system for model-generated responses. Each response is scored reflecting various dimensions such as the accuracy and relevance of the response. This method is in line with previous research efforts to assess the effectiveness of language models more accurately. Moreover, to address the issue of positional bias, as discussed in the works of Ko et al. (2020); Wang et al. (2023a), we present the responses generated by the models in two separate orders for evaluation by the LLM judge. This approach aims to ensure a more balanced and unbiased assessment of the models' performance. Then, for each instruction, we compare the responses by "Win-Tie-Loss".

4.3.2 AlpacaEval Leaderboard

The AlpacaEval Leaderboard, utilizing the AlpacaFarm (Dubois et al., 2023; Li et al., 2023c) evaluation dataset, is an automated, efficient, and reliable evaluation tool for LLMs. It benchmarks LLMs' performance in following generic user instructions by comparing their outputs with those from Davinci003, demonstrating high alignment with human expert annotations. AlpacaFarm, underlying AlpacaEval, is a cost-effective simulator for research on learning from human feedback, significantly reducing the time and cost traditionally associated with such studies. While AlpacaEval offers valuable insights, it primarily focuses on simpler instructions and does not encompass safety evaluations or complex tasks, and its evaluation may correlate win rates with response lengths. These tools represent significant advancements in LLM evaluation and development, enabling more accessible and diverse research. Considering our budget, we only run this evaluation on the 5% settings.

4.3.3 Open LLM Leaderboard

The Huggingface Open LLM Leaderboard, incorporating the evaluation method from the Eval Harness (Gao et al., 2021), serves as a comprehensive framework for evaluating generative language model capabilities. It focuses on four critical benchmarks: ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2021), and TruthfulQA (Lin et al., 2022). These benchmarks test the models on various aspects, such as reasoning, common-sense understanding, and factual accuracy. The leaderboard offers an effective platform for comparing different LLMs, providing valuable insights into their performance across these diverse and challenging tasks.

5 Experimental Result

5.1 Main results

In this section, we present the evaluation results of the three different evaluation settings described in the previous section, as shown in Table 2. The pair-wise winning score indicates the result of directly comparing with the corresponding model trained with full data; values greater than 1.0 represent better responses generated by our Superfiltering models than by the full-data models. The detailed win-tie-lose numbers are presented in Figure 2. Moreover, the performance of our models and baseline models on the Huggingface Open LLM Leaderboard and the AlpacaEval Leaderboard is also presented in Table 2, where we can see that our models using 5%, 10%, and 15% of the data outperform the models trained with full data on both benchmarks in both the LLaMA2-7B and LLaMA2-13B settings. These results further showcase the effectiveness of our automatically selected data. Moreover, the usefulness of Superfiltering on the high-quality Alpaca-GPT4 dataset further shows the potential of our method: it is astonishing that a pure statistical metric based on a weak language model like GPT-2 is able to filter the responses generated by GPT-4.

5.2 Ablation Study

In this subsection, extensive ablation experiments are conducted to validate the effectiveness of our Superfiltering. The experiments are performed on the LLaMA2-7B model using the Alpaca dataset. Our focus is on two aspects: the impact of different data selection strategies and the effect of using various language models for data selection. All models are trained under the same settings.

As shown in Table 3, in addition to our method "Superfiltering (GPT-2)", we also try several baseline strategies: "Random" represents the models trained with randomly selected data. "Diversity" represents the models trained with data considering only diversity, by utilizing the k-means algorithm. "Perplexity" represents the models trained with data selected based on the perplexity calculated by GPT-2. Moreover, the lower part of the table lists the models using the IFD score to select the training subset, powered by other language models. The performances of the models are assessed by the pair-wise winning score, which is calculated as (Num(Win)−Num(Lose))/Num(All) + 1, and all the comparisons are performed by GPT-4 on the WizardLM test set.
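The winning score above is a direct function of the judge's verdict counts; for clarity, a tiny illustration (the counts are made up, not results from the paper):

```python
def winning_score(wins: int, ties: int, losses: int) -> float:
    """Pairwise winning score = (wins - losses) / total + 1; a value above 1 means the
    Superfiltering model beats the full-data baseline on balance."""
    total = wins + ties + losses
    return (wins - losses) / total + 1

# Illustrative example on the 218 WizardLM test instructions.
print(round(winning_score(110, 62, 46), 3))  # 1.294
```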

Dataset/ Superfilter Pairwise ↑ Huggingface Open LLM Leaderboard ↑ AlpacaEval ↑
Base Model Ratio(Size) Winning Score Average ARC HellaSwag MMLU TruthfulQA Win Rate
Alpaca/ 100% 1.000 55.25 54.35 78.65 47.02 40.98 27.75
LLaMA2-7B 5%(2,600) 1.133 55.67 56.57 80.15 45.21 40.74 33.04
10%(5,200) 1.101 56.97 58.02 80.57 47.16 42.14 -
15%(7,800) 1.193 56.61 56.23 80.29 46.73 43.21 -
Alpaca/ 100% 1.000 58.78 57.59 81.98 54.05 41.49 35.00
LLaMA2-13B 5%(2,600) 1.174 60.96 61.60 83.84 55.79 42.63 45.71
10%(5,200) 1.069 61.11 62.12 83.74 55.09 43.50 -
15%(7,800) 1.142 60.90 60.92 83.58 55.24 43.86 -
Alpaca-GPT4/ 100% 1.000 58.71 54.69 80.05 47.89 52.21 71.32
LLaMA2-7B 5%(2,600) 1.014 59.66 56.74 81.19 46.80 53.92 72.13
10%(5,200) 1.064 59.80 57.42 81.79 45.67 54.33 -
15%(7,800) 1.078 60.02 57.00 81.21 46.15 55.72 -
Alpaca-GPT4/ 100% 1.000 60.81 57.94 82.22 54.84 48.25 77.86
LLaMA2-13B 5%(2,600) 1.041 63.29 62.29 84.96 55.78 50.13 78.15
10%(5,200) 1.046 63.65 62.63 84.51 55.39 52.06 -
15%(7,800) 1.078 63.65 62.88 84.32 55.35 52.05 -

Table 2: Comparison of Superfiltering with four data selection ratios (5%, 10%, 15%, 100%) when finetuning two LLMs (LLaMA2-7B/13B) on two datasets (Alpaca and Alpaca-GPT4). The finetuned models are evaluated by the pair-wise winning score (comparison to the baseline model finetuned on 100% data), the Open LLM Leaderboard, and AlpacaEval. In the parentheses are the ratio of data being used and its exact number. The winning score is calculated as (Num(Win)−Num(Lose))/Num(All) + 1, where the win-tie-lose numbers are reported in Figure 2. The consistent improvement on all three evaluation benchmarks demonstrates the effectiveness of Superfiltering.

Data Selection                   Pairwise Winning Score ↑
(Budget)                         5%       10%      15%
Strategy: Random                 0.936    0.968    0.977
Strategy: Diversity              0.927    0.977    0.982
Strategy: Perplexity             0.261    0.569    0.610
Filter: GPT-2-large              1.165    1.046    1.193
Filter: GPT-2-XL                 1.064    1.165    1.128
Filter: GPT-NEO                  1.096    1.197    1.156
Filter: LLaMA2-7B                1.303    1.330    1.294
Superfiltering (IFD, GPT-2)      1.133    1.101    1.193

Table 3: Ablation study of data selection strategies and filter models on finetuning LLaMA2-7B using the Alpaca dataset. The pairwise winning score compares each finetuned model with the full-data finetuned model and computes (Num(Win)−Num(Lose))/Num(All) + 1. All the comparisons are performed by GPT-4 on the WizardLM test set.

As shown in Table 3, compared with other strategies, models trained with our method consistently outperform the models trained on the full dataset, indicating the efficacy of our method. Regarding the impact of different language models, whichever language model is utilized to calculate the IFD scores, the corresponding models surpass the baseline model, indicating the strong consistency and transferability of the IFD score as the selection metric. Moreover, the models using LLaMA2-7B reasonably achieve the highest performance, due to the consistency between the model used to calculate the IFD scores and the model to be trained.

6 Further Discussion

6.1 “Plug-and-Play” without Additional Training

In the realm of data selection for language model instruction tuning, our Superfiltering introduces a transformative advantage: no training is needed, even for the weak language models. Traditional proxy-based methods like Coleman et al. (2020) and Nguyen et al. (2022) are required to further train weak models to bridge the performance gap with stronger models. However, our study reveals that pre-trained weak models are naturally effective proxies for strong models when utilizing the IFD for data selection, without requiring any additional training.

Moreover, in the context of instruction tuning data selection, model training is always necessary if no extra strong models like ChatGPT or other trained reward models are utilized. Lu et al. (2023) utilizes ChatGPT to tag the instruction datasets and trains LLMs for tagging instruction samples based on these data.

Wei et al. (2023) first splits a range of subsets from the original data and records each fine-tuned model's performance on the validation set as the labels of data quality; then a self-attention network is trained as the data selector. Cao et al. (2023) utilizes a trained regression model to estimate the inference losses as data qualities on several datasets.

The performances of the resulting efficient selection models are appealing, while a possible concern is the generalizability, since they all need extra performance indicators, i.e. a development set, in existing datasets. On the contrary, our approach directly utilizes established, widely accepted models like GPT-2, which are known for their broad applicability and generalization ability. These models do not necessitate fine-tuning on specific datasets, thereby reducing the risk of out-of-distribution issues. Our method demonstrates that these smaller models can be deployed in a “plug-and-play” manner, achieving commendable performance immediately. This innovative approach not only simplifies the data selection process but also revolutionizes the efficiency and applicability of such methods in large language model instruction tuning.

7 Related Work

7.1 Instruction Tuning

Recent advancements in natural language processing (NLP) have been significantly influenced by instruction tuning, a method that tailors large language models (LLMs) for varied tasks using explicit instructions (Wei et al., 2022; Sanh et al., 2022; Longpre et al., 2023a). This approach has enhanced LLMs' ability to understand and follow instructions in diverse contexts. Initial research in this area focused on expanding dataset sizes to improve instruction-following capabilities (Honovich et al., 2023; Wang et al., 2023b). Additionally, innovative methods, such as using LLMs to generate instructional data, are being explored to streamline and enhance the instruction tuning process (Wang et al., 2023b; Xu et al., 2023; He et al., 2023; Li et al., 2023a).

7.2 Instruction Tuning Data Selection

To further select the data for more efficient instruction tuning, existing automatic data selection methods mainly utilize extra LLMs for the selection. Lu et al. (2023) utilizes proprietary ChatGPT to tag the instruction data to ensure diversity and complexity. Chen et al. (2023b) utilizes the proprietary LLMs ChatGPT and Claude2 to assess the quality of the instruction data, generating both ratings and explanations. Du et al. (2023) and Bukharin and Zhao (2023) utilize an extra reward model to assess the quality of data and utilize these scores as a part of their method. Li et al. (2023b) firstly proposes a self-guided method in which no extra LLMs are utilized but which still needs to calculate Instruction-Following Difficulty (IFD) scores based on the original pre-trained LLM. Though effective, these methods overly rely on large language models and are too time-consuming to put into practical use.

7.3 Small Model Proxies for Large Models

The use of proxy models is increasingly recognized in machine learning, particularly when resources are constrained or there is a limited understanding of the original model's architecture. Chen et al. (2023a) and Hase et al. (2020) demonstrate the utility of lightweight proxy models in evaluating free-text rationales. Similarly, Puigcerver et al. (2021) leverages embeddings from expert models with a k-nearest neighbors classifier to simplify the training of more complex systems. Coleman et al. (2020) and FAMIE (Nguyen et al., 2022) apply downscaled proxy models in fields like image classification and information extraction, utilizing techniques such as layer removal and knowledge distillation for aligning these proxies with larger models. Building on this, Burns et al. (2023) explores the concept of enhancing larger models through weak supervision, training them on labels generated by weaker models. By extending the “Weak to Strong” concept to LLM instruction tuning data selection, our research employs pre-trained smaller models as proxies, demonstrating their effectiveness in assessing instruction complexity, thereby bridging the gap between the comprehensive capabilities of larger models and the agility of smaller ones.

8 Conclusion

This paper presented “Superfiltering”, a novel and efficient approach for data filtering in the instruction tuning of LLMs. By effectively utilizing weaker models as proxies for evaluating instructional data, particularly in the context of IFD scores, we achieved a significant leap in efficiency, largely accelerating the data filtering process. The experimental results affirm that our method considerably reduces computational overhead while maintaining or even improving the instructional capabilities of LLMs. Thus, Superfiltering stands as a testament to our initial hypothesis and objectives, marking a substantial contribution to the field of natural language processing by offering a scalable, resource-efficient, and effective strategy for the advancement of AI technologies.
References

Sid Black, Gao Leo, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow.

Tom Brown et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877-1901.

Alexander Bukharin and Tuo Zhao. 2023. Data diversity matters for robust instruction tuning.

Collin Burns et al. 2023. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision.

Yihan Cao, Yanbin Kang, and Lichao Sun. 2023. Instruction mining: High-quality instruction data selection for large language models.

Hanjie Chen, Faeze Brahman, Xiang Ren, Yangfeng Ji, Yejin Choi, and Swabha Swayamdipta. 2023a. REV: Information-theoretic evaluation of free-text rationales. In Proceedings of ACL 2023 (Volume 1: Long Papers), pages 2007-2030.

Lichang Chen et al. 2023b. AlpaGasus: Training a better Alpaca with fewer data.

Wei-Lin Chiang et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

Peter Clark et al. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge.

Cody Coleman et al. 2020. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations.

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.

Qianlong Du, Chengqing Zong, and Jiajun Zhang. 2023. MoDS: Model-oriented data selection for instruction tuning.

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of ACL 2022 (Volume 1: Long Papers), pages 320-335.

Yann Dubois et al. 2023. AlpacaFarm: A simulation framework for methods that learn from human feedback.

Leo Gao et al. 2021. A framework for few-shot language model evaluation.

Peter Hase, Shiyue Zhang, Harry Xie, and Mohit Bansal. 2020. Leakage-adjusted simulatability: Can models generate non-trivial explanations of their behavior in natural language? In Findings of EMNLP 2020, pages 4351-4367.

Xingwei He et al. 2023. AnnoLLM: Making large language models to be better crowdsourced annotators. arXiv preprint arXiv:2303.16854.

Dan Hendrycks et al. 2021. Measuring massive multitask language understanding. In International Conference on Learning Representations.

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2023. Unnatural instructions: Tuning language models with (almost) no human labor. In Proceedings of ACL 2023 (Volume 1: Long Papers), pages 14409-14428.

Albert Q. Jiang et al. 2023. Mistral 7B.

Daniel Khashabi et al. 2020. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of EMNLP 2020, pages 1896-1907.

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.

Miyoung Ko, Jinhyuk Lee, Hyunjae Kim, Gangwoo Kim, and Jaewoo Kang. 2020. Look at the first sentence: Position bias in question answering. In Proceedings of EMNLP 2020, pages 1109-1121.

Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Heng Huang, Jiuxiang Gu, and Tianyi Zhou. 2023a. Reflection-tuning: Data recycling improves LLM instruction-tuning. ArXiv, abs/2310.11716.

Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. 2023b. From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. ArXiv, abs/2308.12032.

Xuechen Li et al. 2023c. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of ACL 2022 (Volume 1: Long Papers), pages 3214-3252.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment.

Shayne Longpre et al. 2023a. The Flan collection: Designing data and methods for effective instruction tuning. ArXiv, abs/2301.13688.

Shayne Longpre et al. 2023b. The Flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688.

Keming Lu et al. 2023. #InsTag: Instruction tagging for analyzing supervised fine-tuning of large language models.

Minh Van Nguyen, Nghia Ngo, Bonan Min, and Thien Nguyen. 2022. FAMIE: A fast active learning framework for multilingual information extraction. In Proceedings of NAACL 2022: System Demonstrations, pages 131-139.

OpenAI. 2023. GPT-4 technical report.

Guilherme Penedo et al. 2023. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only.

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277.

Joan Puigcerver et al. 2021. Scalable transfer learning with expert models. In International Conference on Learning Representations.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Victor Sanh et al. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Teven Le Scao et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. ArXiv, abs/2211.05100.

Andrea Sottana, Bin Liang, Kai Zou, and Zheng Yuan. 2023. Evaluation metrics in the era of GPT-4: Reliably evaluating large language models on sequence to sequence tasks. In Proceedings of EMNLP 2023, pages 8776-8788.

Rohan Taori et al. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Hugo Touvron et al. 2023a. LLaMA: Open and efficient foundation language models.

Hugo Touvron et al. 2023b. Llama 2: Open foundation and fine-tuned chat models.

Peiyi Wang et al. 2023a. Large language models are not fair evaluators.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of ACL 2023 (Volume 1: Long Papers), pages 13484-13508.

Yizhong Wang et al. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of EMNLP 2022, pages 5085-5109.

Jason Wei et al. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Lai Wei, Zihao Jiang, Weiran Huang, and Lichao Sun. 2023. InstructionGPT-4: A 200-instruction paradigm for fine-tuning MiniGPT-4.

Can Xu et al. 2023. WizardLM: Empowering large language models to follow complex instructions.

Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. 2021. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In Proceedings of EMNLP 2021, pages 7163-7189.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of ACL 2019, pages 4791-4800.

Lianmin Zheng et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.

Chunting Zhou et al. 2023. LIMA: Less is more for alignment.
