Muppet: Massive Multi-Task Representations With Pre-Finetuning
Table 3: We present results for the GLUE benchmark tasks and an MRC dataset. Bolded numbers compare MUPPET against its base model; an underline marks the best number. Unless stated otherwise, results are accuracy on the evaluation set. For the MRC tasks, we report both exact match (EM) and F1, as is standard in the literature. For SQuAD, we reuse the task head from pre-finetuning.
             SP       Commonsense                        Summarization
             BoolQ    CQA     HellaSwag   OpenQA         CNN/DailyMail   Gigaword   Reddit TIFU
RoBERTa-B    82.0     66.2    65.1        63.8           -               -          -
+ MUPPET     83.8     69.4    69.0        64.6           -               -          -
Table 4: We present results for the non-GLUE Sentence Prediction tasks as well as a set of standard Commonsense tasks. Bolded numbers compare MUPPET against its base model; an underline marks the best number. Unless stated otherwise, results are accuracy on the evaluation set. For the commonsense tasks, we re-use the task head from pre-finetuning.
for structured prediction from the Penn Treebank dataset (Marcus et al., 1993). We present these results in Table 5 and Table 6.

We see that the MUPPET variants of our models consistently outperform the baselines across task types and datasets. As a special case, we perform an in-depth analysis of the MUPPET variant of RoBERTa on the notoriously difficult ANLI dataset and observe the same pattern: pre-finetuned models consistently outperform their base counterparts.

5 Understanding Multi-Task at Scale

5.1 Importance of Scale

The first axis we would like to explore is the scale at which multi-task learning is done. Previous work, such as T5 and MT-DNN, focused on an MTL scale of around a dozen datasets. To the best of our knowledge, our paper has the largest MTL setup to date. Accordingly, we are interested in empirically exploring the effect of scaling up the number of datasets on the representations learned during MTL.

We pre-finetune a collection of RoBERTa-Base models with varying numbers of datasets. We train seven models: six with dataset counts chosen uniformly between 10 and 40, ensuring that at each point the selected datasets are a superset of the datasets from prior points, and a final model trained on all datasets. Concretely, given two models trained with different numbers of datasets a and b with a > b, the model trained with a datasets uses all datasets used to train the model with b datasets, and more.
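As a minimal sketch of this superset construction (illustrative only, not the released MUPPET code: the task names, the size of the task pool, and the helper name are hypothetical), taking prefixes of one fixed shuffle guarantees that every larger run contains every smaller one; the six intermediate scales mirror the pre-finetuning scales reported in Figure 3.

import random

def nested_dataset_subsets(all_datasets, sizes, seed=0):
    # Shuffle once, then take prefixes: a prefix of length a always contains
    # every prefix of length b < a, which is exactly the superset property.
    rng = random.Random(seed)
    order = list(all_datasets)
    rng.shuffle(order)
    return {n: order[:n] for n in sorted(sizes)}

# Placeholder task pool; the real pre-finetuning collection differs.
datasets = [f"task_{i}" for i in range(50)]
# Six scales between 10 and 40, plus a final run on the full set.
subsets = nested_dataset_subsets(datasets, sizes=[10, 16, 22, 28, 34, 40])
subsets[len(datasets)] = datasets
assert all(set(subsets[a]) >= set(subsets[b])
           for a in subsets for b in subsets if a > b)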
Table 5: We present results on a large set of tasks across datasets that are not available to the model during the pre-finetuning stage. The column groups are SP, Structured Prediction (Penn), and Summarization. Bolded numbers compare MUPPET against its base model; an underline marks the best number. For chunking and parsing we report F1, while for part-of-speech tagging we report accuracy.

Table 6: We show the performance of the RoBERTa model and the pre-finetuned RoBERTa-MUPPET model on the ANLI benchmark. Bolded numbers compare MUPPET against its base model; an underline marks the best number. 'S' refers to SNLI, 'M' to the MNLI dev sets (-m = matched, -mm = mismatched), and 'F' to FEVER; 'A1–A3' refer to the ANLI rounds and 'ANLI' refers to A1+A2+A3.
For each version of the model, we fine-tune five datasets and plot the results in Figure 1. Specifically, we fine-tune STS-B (Cer et al., 2017), BoolQ (Clark et al., 2019), RACE (Lai et al., 2017), SQuAD (Rajpurkar et al., 2016a), and MNLI (Williams et al., 2018a). We include these five datasets in the first MTL run (10 datasets) to remove any bias from adding them at a later stage.

We see a couple of interesting patterns. First, for individual tasks such as RTE (Bentivogli et al., 2009), increasing the pre-finetuning scale monotonically improves performance. This is aligned with other papers that have seen benefits from first training on MNLI (Williams et al., 2018a) and then fine-tuning on RTE (Liu et al., 2019b). For other datasets, we see that doing MTL in the regime of fewer than 15 datasets is detrimental to end-task fine-tuning. This is also aligned with other empirical observations; for example, T5 (Raffel et al., 2019) reported that MTL did not improve over fine-tuning alone. Nevertheless, it seems that as we increase the number of tasks past some critical point, our pre-trained representations become more generalizable. Furthermore, although it depends on the dataset, this critical point lies roughly between 10 and 25 tasks.

This suggests that previously observed MTL limitations were not fundamental and can instead be attributed to a lack of sufficient scale.

5.2 Importance of Heterogeneous Batches

Another critical factor in getting MTL to learn generalizable representations is the method through which MTL is implemented, specifically the selection of batches. To better quantify this effect, we experimented with three balancing schemes: dataset homogeneous, batch homogeneous, and batch heterogeneous.
[Figure 1 plot: RoBERTa Pre-Finetuning Scale Ablation. Evaluation accuracy (y-axis) versus the number of pre-finetuning datasets (x-axis, 0–40) for RTE, BoolQ, RACE, SQuAD, and MNLI.]
Figure 1: We plot the RoBERTa evaluation accuracy on five datasets (RTE, BoolQ, RACE, SQuAD, and MNLI) across various scales of multi-task learning, measured by the number of datasets. For all but one dataset, performance initially degrades until a critical number of datasets is used by the MTL framework. Past this critical point, our representations improve over the original RoBERTa model.
We refer to dataset homogeneous as selecting batches from datasets sequentially: we first train on dataset A, then train on dataset B, and so on. Batch homogeneous, on the other hand, refers to selecting batches containing only data from the same task, so that all gradients in an update come from the same dataset. This is implemented by selecting all datasets, batching at the dataset level, and selecting those batches in random order during training. Finally, batch heterogeneous refers to a single update containing batches from multiple different datasets spanning different tasks. We implement this by first creating homogeneous sub-batches, calculating the loss per sub-batch per GPU, and then aggregating across GPUs, so that a single gradient update draws on various datasets and, therefore, tasks.
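The following is a schematic PyTorch sketch of one such heterogeneous update, assuming a shared encoder with per-task heads; the module names and toy dimensions are illustrative, not the paper's released implementation, which additionally applies loss scaling and aggregates sub-batch gradients across GPUs.

import torch
from torch import nn

# Toy stand-ins for the shared encoder and per-task heads.
encoder = nn.Linear(16, 8)
heads = nn.ModuleDict({"mnli": nn.Linear(8, 3), "sst2": nn.Linear(8, 2)})
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()))
criterion = nn.CrossEntropyLoss()

def heterogeneous_step(sub_batches):
    # sub_batches: list of (task_name, inputs, labels) homogeneous chunks.
    optimizer.zero_grad()
    total_loss = 0.0
    for task, inputs, labels in sub_batches:
        logits = heads[task](encoder(inputs))   # shared encoder, task-specific head
        total_loss = total_loss + criterion(logits, labels)
    total_loss.backward()                       # one gradient spanning several tasks
    optimizer.step()
    return float(total_loss)

# A single update built from sub-batches of two different tasks.
loss = heterogeneous_step([
    ("mnli", torch.randn(4, 16), torch.randint(0, 3, (4,))),
    ("sst2", torch.randn(4, 16), torch.randint(0, 2, (4,))),
])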
To dissect the importance of heterogeneous batches, we train a RoBERTa-Base model on 35 randomly selected tasks using the three data-selection methodologies outlined above. We then fine-tune these three models on the same five datasets mentioned in the previous section.

We present our results in Figure 2. They show the importance of properly defining a batching strategy for effective multi-task learning. Our findings are also consistent with Aghajanyan et al. (2020), who observed that sequential training on datasets degrades generalizable representations.

5.3 Low Resource Experiments

We noticed in Section §4 that datasets with smaller training sets tended to improve more from MTL training. To strengthen this hypothesis, we look at two factors: the scale of pre-finetuning and the scale of fine-tuning (the size of the fine-tuning dataset). We select three datasets that were not used in pre-finetuning in Section §5.1. For each fine-tuning dataset, we also select nine partitions sampled uniformly between 10% and 100% of the dataset; the low-resource splits are selected through random sampling.
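A minimal sketch of how such low-resource splits could be drawn (illustrative; the evenly spaced fractions and the function name are assumptions chosen to mirror the 10%–100% range described above):

import random

def low_resource_splits(examples, n_splits=9, lo=0.10, hi=1.00, seed=0):
    # Nine fractions spaced uniformly between 10% and 100%; each split is a
    # random subsample of the training examples at that fraction.
    rng = random.Random(seed)
    fractions = [lo + i * (hi - lo) / (n_splits - 1) for i in range(n_splits)]
    return [(f, rng.sample(examples, max(1, round(f * len(examples)))))
            for f in fractions]

# Example: nine splits of a toy 1,000-example training set.
splits = low_resource_splits(list(range(1000)))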
We then fine-tune every low-resource split with every pre-finetuning checkpoint from Section §5.1. We plot the heatmaps generated from these runs in Figure 3.

Multiple patterns emerge. First, we see a clear visualization of the critical point mentioned above for pre-finetuning: as we increase the scale of MTL, better representations become available for downstream fine-tuning. Furthermore, we see that pre-finetuned models at a larger scale are much more data-efficient than standard pre-trained models. Looking specifically at the 34- and 40-dataset pre-finetuning scales in Figure 3, we reach higher evaluation accuracies much sooner than with the base RoBERTa model (row 0).

6 Conclusion

In this work, we propose pre-finetuning, a stage after pre-training that further refines representations before end-task fine-tuning. We show that we can effectively learn more robust representations through multi-task learning (MTL) at scale. Our MTL models outperform their vanilla pre-trained counterparts across several tasks. Our analysis shows that properly scaling MTL with heterogeneous batches and loss scaling is critical to leveraging better representations. We also identify a critical point in the number of tasks used for multi-task learning: with fewer tasks, representations degrade compared to the pre-trained model, while with more tasks they improve.

We discussed a practical setting in which this massive multi-task learning is stable and effective through simple loss scaling and heterogeneous batches. With our method, we improve performance over vanilla pre-trained representations across a broad range of tasks.
[Figure 2 plot: RoBERTa Base MTL Batching Strategy Ablation. Evaluation accuracy (y-axis, roughly 65–90) per dataset (x-axis: RTE, BoolQ, RACE, SQuAD, MNLI) for the three batching strategies: dataset homogeneous, batch homogeneous, and batch heterogeneous.]
Figure 2: We plot the evaluation accuracy of RoBERTa across five datasets (RTE, BoolQ, RACE, SQuAD, and MNLI) using our three multi-task batching strategies: dataset homogeneous, batch homogeneous, and batch heterogeneous. Heterogeneous batches outperform the other batching strategies by a significant margin, highlighting the importance of implementing MTL with the correct batching strategy.
[Figure 3 heatmaps: Pre-Finetuning and Low-Resource Scale Ablation for QNLI (left) and CoLA (right). Each heatmap shows evaluation accuracy as a function of the pre-finetuning scale (y-axis: 0, 10, 16, 22, 28, 34, 40 datasets) and the low-resource scale (x-axis: 10%–100% of the fine-tuning data). The accuracy scale spans roughly 0.75–0.95 for QNLI and 0.72–0.88 for CoLA.]
Figure 3: We fine-tune every low-resource split with every pre-finetuning checkpoint from Section §5.1 for two datasets that are not available in any of the pre-finetuning MTL datasets: QNLI (Rajpurkar et al., 2016b) and CoLA (Warstadt et al., 2018). The pre-finetuning scale is reported as the number of datasets.
References

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. arXiv preprint arXiv:1804.05685.
Marie-Catherine De Marneffe, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. To appear in Proceedings of Sinn und Bedeutung 23. Data can be found at https://github.com/mcdm/CommitmentBank/.
Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Chin-Yew Lin, and Deepak Ravichandran. 2001. Toward semantics-based answer pinpointing. In Proceedings of the First International Conference on Human Language Technology Research.
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. arXiv preprint arXiv:1909.00277.
Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs.
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv e-prints, page arXiv:1705.03551.
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262.
Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system.
Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.
Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. 2019. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association of Computational Linguistics.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.
Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. Citeseer.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.
X. Liu, Hao Cheng, Pengcheng He, W. Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. 2020. Adversarial training for large neural language models. ArXiv, abs/2004.08994.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank.
R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. CoRR, abs/1902.01007.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789.
Marius Mosbach, Maksym Andriushchenko, and Dietrich Klakow. 2020. On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884.
Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL.
Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of NAACL-HLT.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016a. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016b. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. arXiv preprint arXiv:1906.03741.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, Ruoxi Jia, Bo Li, and Jingjing Liu. 2021. InfoBERT: Improving robustness of language models from an information theoretic perspective. In International Conference on Learning Representations.
Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
Johannes Welbl, Nelson F Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018a. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018b. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Yang Yi, Yih Wen-tau, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. Pages 2013–2018.