Professional Documents
Culture Documents
PTR: Prompt Tuning With Rules For Text Classification: Petroni Et Al. 2019
PTR: Prompt Tuning With Rules For Text Classification: Petroni Et Al. 2019
Class Set
…
…
…
MLM Head MLM Head CLS Head
Pre-Training [CLS] These movies are [mask] as they are well [mask] Fine-Tuning [CLS] I like this .
MLM Head
Figure 1: The example of pre-training, fine-tuning, and prompt tuning. The blue rectangles in the figure are special
prompt tokens, whose parameters are randomly initialized and learnable during prompt tuning.
classification tasks such as natural language infer- verification. The expensive extra computation costs
ence (Schick and Schütze, 2021; Liu et al., 2021). make auto-generated prompts more suitable for
For instance, by using “<S1 > . [MASK] , <S2 >” as few-shot learning, rather than those normal learn-
the prompt template, and {“yes”, “maybe”, “no”} ing settings with many instances and classes.
as the set of label words, we can also easily use In this paper, we propose prompt tuning with
PLMs to predict the discourse relation (entailment, rules (PTR) for many-class classification tasks. For
neutral, or contradiction) between <S1 > and <S2 >. a many-class classification task, we manually de-
sign essential sub-prompts and apply logic rules
However, for those tasks with many classes, it to compose sub-prompts into final task-specific
becomes challenging to manually find appropri- prompts. Compared with existing prompt tuning
ate templates and suitable label words to distin- methods, PTR has two advantages:
guish different classes. For example, relation clas-
sification, a typical many-class classification task, (1) Prior Knowledge Encoding. PTR can ap-
requires models to predict semantic relations be- ply logic rules to encode prior knowledge about
tween two marked entities in the text. Given the tasks and classes into prompt tuning. We also take
relation “person:parent” and the relation “organi- relation classification as an example, the predicted
zation:parent”, it is hard to pick label words to dis- results are usually correlated to both relational se-
tinguish them. A straightforward solution is auto- mantics of sentences and types of entities. We can
matically generating prompts: Schick et al. (2020); build the prompt for the relation “person:parent”
Schick and Schütze (2021) automatically identify and the relation“organization:parent” with two sub-
label words from the vocabulary for human-picked prompts, one sub-prompt is to determine whether
templates; Shin et al. (2020) utilize gradient-guided marked entities are human beings or organizations,
search to automatically generate both templates and and the other sub-prompt is to determine whether
label words; Gao et al. (2020) propose to generate sentences express the semantics of the parent-child
all prompt candidates and apply development sets relationship.
to find the most effective ones. Although auto- (2) Efficient Prompt Design. Compared with
generated prompts can avoid intensive labor, it is manually designing templates and individual label
not surprising that most auto-generated prompts words for all classes, it is easier to design several
cannot achieve comparable performance to human- simple sub-prompts and then combine these sub-
picked ones. Meanwhile, auto-generated prompts prompts to form task-specific prompts according
require extra computation costs for generation and to logic rules. Moreover, compared with automati-
cally generating prompts based on machine learn- at least one [MASK] is placed into xprompt for
ing models, using logic rules to generate prompts M to fill label words. For instance, for a bi-
is more efficient and interpretable. nary sentiment classification task, we set a tem-
To verify the effectiveness of PTR, we con- plate T (·) = “ · It was[MASK].”, and map x to
duct experiments on relation classification, a typi- xprompt = “x It was[MASK].”.
cal and complicated many-class classification task. By input xprompt into M, we compute the hidden
Four popular benchmarks of relation classifica- vector h[MASK] of [MASK]. Given v ∈ V, we
tion are used for our experiments, including TA- produce the probability that the token v can fill
CRED (Zhang et al., 2017), TACREV (Alt et al., the masked position p([MASK] = v|xprompt ) =
2020), ReTACRED (Stoica et al., 2021), and Se- P exp(v·h[MASK] ) , where v is the embedding of
ṽ∈V exp(ṽ·h[MASK] )
mEval 2010 Task 8 (Hendrickx et al., 2010). Suffi- the token v in the PLM M.
cient experimental results show that PTR can signif- There also exists an injective mapping function
icantly and consistently outperform existing state- φ : Y → V that bridges the set of classes and the
of-the-art baselines. set of label words. In some papers, the function φ is
named “verbalizer”. With the verbalizer φ, we can
2 Preliminaries formalize the probability distribution over Y with
Before introducing PTR, we first give some essen- the probability distribution over V at the masked
tial preliminaries, especially some details about position, i.e. p(y|x) = p([MASK] = φ(y)|xprompt ).
prompt tuning for classification tasks. Formally, a Here, we also take a binary sentiment classification
classification task can be denoted as T = {X , Y}, task as an example. We map the positive sentiment
where X is the instance set and Y is the class set. to “great” and the negative sentiment to “terrible”.
For each instance x ∈ X , it consists of several According to M fills the masked position of xprompt
tokens x = {w1 , w2 , . . . , w|x| }, and is annotated with “great” or “terrible”, we can know the instance
with a label yx ∈ Y. x is positive or negative.
For prompt tuning, with the template T (·),
2.1 Vanilla Fine-tuning for PLMs the set of label words V, and the verbal-
Given a PLM M, vanilla fine-tuning first converts izerPφ, the learning objective is to maximize
1
the instance x = {w1 , w2 , . . . , w|x| } to the in- |X | x∈X log p([MASK] = φ(yx )|T (x)). In Fig-
put sequence {[CLS], w1 , w2 , . . . , w|x| , [SEP]}, ure 1, we show pre-training, fine-tuning, and
and then uses M to encode all tokens of prompt tuning, which can indicate the connections
the input sequence into corresponding vectors and differences among them clearly.
{h[CLS] , hw1 , hw2 , . . . , hw|x| , h[SEP] }. For a
3 Prompting Tuning with Rules (PTR)
downstream classification task, a task-specific
head is used to compute the probability distribu- As we mentioned before, prompt tuning is well
tion over the class set Y with the softmax func- developed for few-class classification tasks such
tion p(·|x) = Softmax(Wh[CLS] + b), where as sentiment classification and natural language
h[CLS] is the hidden vector of [CLS], b is a inference. Considering the difficulty of design-
learnable bias vector, and W is a learnable ma- ing prompts for many-class classification tasks, we
trix randomly initialized before fine-tuning. The propose PTR to incorporate logic rules to com-
parameters of M, b, and W are tuned to maximize pose task-specific prompts with several simple sub-
1 P
|X | x∈X log p(yx |x). prompts. In this section, we will introduce the
details of PTR, and select relation classification
2.2 Prompt Tuning for PLMs as an example to illustrate how to adapt PTR for
To apply cloze-style tasks to tune PLMs, prompt many-class classification tasks.
tuning has been proposed. Formally, a prompt
consists of a template T (·) and a set of label 3.1 Overall Framework of PTR
words V. For each instance x, the template is PTR is driven by basic human logic inferences. For
first used to map x to the prompt input xprompt = example, in relation classification, if we wonder
T (x). The template defines where each token of whether two marked entities in a sentence have
x is placed and whether to add any additional to- the relation “person:parent”, with prior knowledge,
kens. Besides retaining the original tokens in x, we would check whether the sentence and two
fes (·, ·) feo (·, ·) fes ,eo (·, ·, ·)
Logic Rules
Label Words Label Words Label Words
fes (x, person)^
feo (y, person)^
fes ,eo (x, ’s parent was, y)
! “person:parent”
MLM Head MLM Head MLM Head
[CLS] Mark Twain was the father of Langdon. Langdon Mark Twain
Input Template
Logic Rules
Label Words Label Words Label Words
fes (x, organization)^
feo (y, organization)^
fes ,eo (x, ’s parent was, y)
! “organization:parent”
MLM Head MLM Head MLM Head
Figure 2: Illustration of PTR. By composing the sub-prompts of the conditional functions fes (·, ·), feo (·, ·) and
fes ,eo (·, ·, ·), we could easily get an effective prompt to distinguish “person:parent” and “organization:parent”.
marked entities meet certain conditions: (1) The entities. Similarly, determining the relation “orga-
two marked entities are human beings; (2) the sen- nization:parent” can be formalized as
tence indicates the parental semantics between the
two marked entities. fes (x, organization)
Inspired by this, for any text classification task ∧ fes ,eo (x, ’s parent was, y)
∧ feo (y, organization)
T = {X , Y}, we design a conditional function
→ “organization:parent”.
set F. Each conditional function f ∈ F de-
termines whether the function input meets cer-
Based on the above-mentioned logic rules and
tain conditions. For instance, the conditional
conditional functions, we can compose the sub-
function f (x, person) can determine whether
prompts of rule-related conditional functions to
the input x is a person, the conditional func-
handle the task T . Next, we will introduce how to
tion f (x, ’s parent was, y) can determine
design sub-prompts for conditional functions, as
whether y is the parent of x. Intuitively, these
well as how to aggregate sub-prompts for tasks.
conditional functions are essentially the predicates
of first-order logic. 3.2 Sub-Prompts for Conditional Functions
For each conditional function f ∈ F, PTR set
a template Tf (·) and a set of label words Vf to For each conditional function f ∈ F, we manually
build its sub-prompt. According to the semantics design a sub-prompt for it. Similar to the conven-
of Y, we can use logic rules to transform the clas- tional prompt setting, a sub-prompt also consists
sification task into the calculation of a series of of a template and a set of label words.
conditional functions. As shown in Figure 2, for The basic conditional function is a unary func-
relation classification, determining whether the en- tion. For binary sentiment classification, unary
tities x and y have the relation “person:parent” can functions can determine whether a sentence is pos-
be formalized as itive or negative; For entity typing, unary functions
can indicate entity types; For topic labeling, unary
fes (x, person) ∧ fes ,eo (x, ’s parent was, y)
functions can be used to predict article topics.
∧ feo (y, person) → “person:parent”,
Here we still take relation classification
where fes (·, ·) is the conditional function to de- as an example. Given a sentence x =
termine the subject entity types, feo (·, ·) is the {. . . es . . . eo . . .}, where es and eo are the subject
conditional function to determine the object en- entity and the object entity respectively, we set
tity types, and fes ,eo (·, ·, ·) is the conditional func- fes (·, {person|organization| . . .}) to determine the
tion to determine the semantic connection between type of es . The sub-prompt template and label
word set of fes can be formalized as Dataset #train #dev #test #rel
SemEval 6,507 1,493 2,717 19
Tfes (x) = “x the[MASK]es ”, TACRED 68,124 22,631 15,509 42
(1)
Vfes = {“person”, “organization”, . . .}. TACREV 68,124 22,631 15,509 42
ReTACRED 58,465 19,584 13,418 40
For the object entity eo , the sub-prompt template
and label word set of feo are similar to Eq. (1). Table 1: Statistics of different datasets.
Another important type of conditional functions
is a binary function. For natural language in-
As the aggregated template may contain multi-
ference, binary functions can determine the re-
ple [MASK] tokens, we must consider all masked
lation between two sentences is entailment, neu-
positions to make predictions, i.e.,
tral, or contradiction; For relation classification,
such functions can be used to determine the n
Y
p(y|x) = p([MASK]j = φj (y)|T (x)), (5)
complex connections between two entities. If j=1
we set fes ,eo (·, {’s parent was|was born in| . . .}, ·),
the sub-prompt template and label word set of where n is the number of masked positions in T (x),
fes ,eo can be formalized as and φj (y) is to map the class y to the set of la-
bel words V[MASK]j for the j-th masked position
Tfes ,eo (x) = “ x es [MASK]eo ”,
(2) [MASK]j . Eq. (5) can be used to tune PLMs and
Vfes ,eo = {“’s parent was”, “was born in”, . . .}.
classify classes. The final learning objective of
Normally, unary and binary functions are suffi- PTR is to maximize
cient to compose logic rules for classification tasks. 1 X Yn
For classification tasks with complex semantics, log p [MASK]j = φj (y)|T (x) . (6)
|X | x∈X j=1
these design methods of unary and binary func-
tions can be extended to multi-variable functions,
which can provide more powerful sub-prompts. 4 Experiments
Table 2: For some relations of the datasets TACRED, TACREV, and ReTACRED, the relation-specific label words
for the template: hS1i T he[MASK]1 hE1i [MASK]2 the [MASK]3 hE2i.
Table 3: For some relations of the dataset SemEval, the relation-specific label words for the template:
hS1i T he[MASK]1 hE1i was[MASK]2 to the [MASK]3 hE2i.
More details of these datasets are shown in Ta- prompts after reversing a part of relations. For
ble 1. For all the above datasets, we use F1 scores models tuning with some reversed relations, we
as the main metric for evaluation. name them “. . . (Reversed)”.
Table 4: F1 scores (%) on the TACRED, TACREV, ReTACRED and SEMEVAL. For RO BERTA _ LARGE, we
report the results of the vanilla version RO BERTA _ LARGE without adding any entity markers. More adaptions of
RO BERTA _ LARGE for relation classification will be reported in the next tables. In the “Extra Data” column, “w/o”
means that no additional data is used for pre-training and fine-tuning, yet “w/” means that extra data or knowledge
bases are used for data augmentation. The best results are bold, and the best results under normal settings (without
reversing relations) are underlined.
stand relational semantics, a series of knowledge- the knowledge required for downstream tasks.
enhanced PLMs have been further explored, which (3) Within knowledge-enhanced PLMs, LUKE
use knowledge bases as additional information to and K NOW BERT are better than other models.
enhance PLMs. For knowledge-enhanced PLMs, This is because these two models respectively use
we select S PAN BERT (Joshi et al., 2020), K NOW- knowledge as data augmentation and architecture
BERT (Peters et al., 2019), LUKE (Yamada et al., enhancement in the fine-tuning phase. In contrast,
2020), and MTB (Baldini Soares et al., 2019) as S PAN BERT and MTB inject task-specific knowl-
our baselines. And these four baselines are typical edge into parameters in the pre-training phase and
models that use knowledge to enhance learning ob- then fine-tune their parameters without extra knowl-
jectives, input features, model architectures, and edge. All the above results indicate that even if
pre-training strategies respectively. the task-specific knowledge is already contained in
For the above baselines, we report the results of PLMs, it is difficult for fine-tuning to stimulate the
their papers as well as some results reproduced by knowledge for downstream tasks.
Zhou and Chen (2021). Table 4 shows the overall (4) Either for the normal setting or the setting of
performance on four datasets. From the table, we reversing some relations, PTR achieves significant
can see that: improvements than all baselines. Even compared
(1) Conventional methods design complicated with those knowledge-enhanced models, PTR still
models, including various recurrent neural layers, outperforms these models on TACRED.
attention neural layers, and graph neural layers. Al- In general, all these experimental results show
though learning these complicated models from that the pre-training phase enables PLMs to
scratch have achieved effective results, the vanilla capture sufficient task-specific knowledge, how
PLM RO BERTA _ LARGE can easily outperform to stimulate the knowledge for downstream
them, indicating that the rich knowledge of PLMs tasks is crucial. Compared with typical fine-
captured from large-scale unlabeled data is impor- tuning methods, PTR can better stimulate task-
tant for downstream tasks. specific knowledge.
(2) Within PLMs, those knowledge-enhanced
4.4 Comparison between PTR and Prompt
PLMs defeat the vanilla PLM RO BERTA _ LARGE.
Tuning Methods
On the one hand, this shows that introducing ex-
tra task-specific knowledge to enhance models is In this part, we compare PTR with several recent
effective. On the other hand, this indicates that methods using prompts for relation classification,
simply fine-tuning PLMs cannot completely cover including:
Model Input Example Extra Extra TACRED TACREV ReTACRED
Layer Label
Using randomly initialized markers for prompts
ENT M ARKER [CLS][E1] Mark Twain [/E1] was born in w/o w/o 69.4 79.8 88.4
( CLS ) [E2] Florida [/E2] . [SEP]
ENT M ARKER [CLS][E1] Mark Twain [/E1] was born in w/ w/o 70.7 80.2 90.5
[E2] Florida [/E2] . [SEP]
TYP M ARKER [CLS] [E1:Person] Mark Twain w/ w/ 71.0 80.8 90.5
[/E1:Person] was born in [E2:State]
Florida [/E2:State] . [SEP]
Using the vocabulary of PLMs for prompts
ENT M ARKER [CLS] @ Mark Twain @ was born in # Florida # . w/ w/o 71.4 81.2 90.5
( PUNCT ) [SEP]
TYP M ARKER [CLS] @ * person * Mark Twain @ was born w/ w/ 74.6 83.2 91.1
( PUNCT ) in # ^ state ^ Florida # . [SEP]
A DA P ROMPT [CLS] Mark Twain was born in Florida . Florida w/o w/o - 80.8 -
was [MASK] of Mark Twain [SEP]
PTR [CLS] Mark Twain was born in Florida . the w/o w/o 72.4 81.4 90.9
[MASK] Mark Twain [MASK] the [MASK]
Florida [SEP]
PTR [CLS] Mark Twain was born in Florida . the w/o w/o 75.9 83.9 91.9
(R EVERSED ) [MASK] Florida [MASK] the [MASK] Mark
Twain[SEP]
Table 5: F1 scores (%) of various models using prompts on the datasets TACRED, TACREV, and ReTACRED. All
these models are based on RO BERTA _ LARGE or BERT_ LARGE. For each model, we also provide the processed
input of an example sentence “Mark Twain was born in Florida”. We also provide the information that whether
these models add extra neural layers to generate features or use extra human annotations to provide more informa-
tion. “w/o” indicates that the model does not use additional human annotations or neural layers, yet “w/” indicates
the model requires additional human annotations or neural layers. The best results are bold, and the best results
under normal settings (without reversing relations) are underlined.
(1) Fine-tuning PLMs with entity markers: words selection mechanism for prompt tuning,
for relation classification, some methods add addi- which scatters the relation label into a variable num-
tional entity markers to the input as additional infor- ber of label words to handle the complex multiple
mation, which is similar to prompt tuning. Among label space. The model proposed by Chen et al.
these marker-based models, ENT M ARKER meth- (2021) is named A DA P ROMPT. As A DA P ROMPT
ods inject special symbols to index the positions of has not released its code, we only show its reported
entities and TYP M ARKER methods additionally results on TACREV.
introduce the type information of entities. Intrin- In Table 5, we can find that using the vocabulary
sically, the ENT M ARKER methods are similar of PLMs for prompts can lead to better results. By
to prompting by introducing extra serial informa- designing particular sub-prompts, our PTR method
tion to indicate the position of special tokens, i.e., has the same effect as the TYP M ARKER methods.
named entities. Analogously, the TYP M ARKER Furthermore, PTR establishes the effectiveness in
methods could be regarded as a type of template for experiments without additional human annotations
prompts but require additional annotation of type and neural layers. Note that in the training pro-
information. cedure, TYP M ARKER methods optimize the pa-
“. . . ( PUNCT )” indicates using the vocabulary of rameters of the encoder and additional output layer,
PLMs to build markers. “. . . ( CLS )” indicates us- PTR just optimizes the encoder and several learn-
ing the hidden vector of [CLS] for classification. able tokens in prompts.
Other models use additional neural layers to ag- To further compare the models using the vo-
gregate all entity features for classification. For cabulary of PLMs for prompts, we introduce the
more details of marker-based models, we refer to few-shot learning scenario. Following the few-shot
the paper of Zhou and Chen (2021). setting in Gao et al. (2020), we sample K training
(2) Prompt tuning for relation classification: instances and K validation instances per class from
Chen et al. (2021) introduce an adaptive label the original training set and development set, and
Model TACRED TACREV ReTACRED
8 16 32 Mean 8 16 32 Mean 8 16 32 Mean
ENT M ARKER 27.0 31.3 31.9 30.1 27.4 31.2 32.0 30.2 44.1 53.7 59.9 52.6
( PUNCT )
TYP M ARKER 28.9 32.0 32.4 31.1 27.6 31.2 32.0 30.3 44.8 54.1 60.0 53.0
( PUNCT )
A DA P ROMPT - - - - 25.2 27.3 30.8 27.8 - - - -
PTR 28.1 30.7 32.1 30.3 28.7 31.4 32.4 30.8 51.5 56.2 62.1 56.6
Table 6: F1 scores (%) of various models with different sizes of training instances on the datasets TACRED,
TACREV, and ReTACRED.
Figure 3: Changes in F1 scores (%) with increasing number of training epochs on the dataset ReTACRED.
Ning Ding, Xiaobin Wang, Yao Fu, Guangwei Xu, Rui Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
Wang, Pengjun Xie, Ying Shen, Fei Huang, Hai-Tao dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Zheng, and Rui Zhang. 2021. Prototypical represen- Luke Zettlemoyer, and Veselin Stoyanov. 2019.
tation learning for relation extraction. In Proceed- Roberta: A robustly optimized bert pretraining ap-
ings of ICLR. proach. arXiv preprint arXiv:1907.11692.
Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng
Farhadi, Hannaneh Hajishirzi, and Noah Smith. Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020.
2020. Fine-tuning pretrained language models: Learning from Context or Names? An Empirical
Weight initializations, data orders, and early stop- Study on Neural Relation Extraction. In Proceed-
ping. arXiv preprint arXiv:2002.06305. ings of EMNLP, pages 3661–3672.
Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Matthew E Peters, Mark Neumann, Robert Logan,
Making pre-trained language models better few-shot Roy Schwartz, Vidur Joshi, Sameer Singh, and
learners. arXiv preprint arXiv:2012.15723. Noah A Smith. 2019. Knowledge enhanced con-
textual word representations. In Proceedings of
Xu Han, Tianyu Gao, Yuan Yao, Deming Ye, Zhiyuan EMNLP-IJCNLP, pages 43–54.
Liu, and Maosong Sun. 2019. OpenNRE: An open
and extensible toolkit for neural relation extraction. Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,
In Proceedings of EMNLP-IJCNLP, pages 169–174. Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and
Alexander Miller. 2019. Language models as knowl-
Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, edge bases? In Proceedings of EMNLP, pages
Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian 2463–2473.
Padó, Marco Pennacchiotti, Lorenza Romano, and
Stan Szpakowicz. 2010. SemEval-2010 task 8: Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao,
Multi-way classification of semantic relations be- Ning Dai, and Xuanjing Huang. 2020. Pre-trained
tween pairs of nominals. In Proceedings of SemEval, models for natural language processing: A survey.
pages 33–38. Science China Technological Sciences, pages 1–26.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Yuhao Zhang, Peng Qi, and Christopher D. Manning.
Ilya Sutskever. 2018. Improving language under- 2018. Graph convolution over pruned dependency
standing by generative pre-training. OpenAI. trees improves relation extraction. In Proceedings
of EMNLP, pages 2205–2215.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor An-
Wei Li, and Peter J Liu. 2020. Exploring the limits geli, and Christopher D Manning. 2017. Position-
of transfer learning with a unified text-to-text trans- aware attention and supervised data improve slot fill-
former. JMLR, 21:1–67. ing. In Proceedings of EMNLP, pages 35–45.
Wenxuan Zhou and Muhao Chen. 2021. An improved
Timo Schick, Helmut Schmid, and Hinrich Schütze.
baseline for sentence-level relation extraction. arXiv
2020. Automatically identifying words that can
preprint arXiv:2102.01373.
serve as labels for few-shot text classification. In
Proceedings of COLING, pages 5569–5578.