GLiNER - Generalist Model For NER
[Figure 2 (diagram): a Bidirectional Transformer takes the input "[ENT] Person [ENT] Organization [ENT] Location [SEP] Alain Farley works at McGill University" and outputs token/word representations; the words of the sentence are indexed 0-5.]
Figure 2: Model architecture. GLiNER employs a BiLM and takes as input entity type prompts and a sentence/text.
Each entity is separated by a learned token [ENT]. The BiLM outputs representations for each token. Entity
embeddings are passed into a FeedForward Network, while input word representations are passed into a span
representation layer to compute embeddings for each span. Finally, we compute a matching score between entity
representations and span representations (using dot product and sigmoid activation). For instance, in the figure, the
span representation of (0, 1), corresponding to "Alain Farley," has a high matching score with the entity embeddings
of "Person".
datasets. Their approach enables the replication, and even surpassing, of the original capability of ChatGPT when evaluated in zero-shot on several NER datasets. While these works have achieved remarkable results, they present certain limitations we seek to address: they use autoregressive language models, which can be slow due to token-by-token generation; moreover, they employ large models with several billion parameters, limiting their deployment in compute-limited scenarios. Furthermore, as NER is treated as a text generation problem, the generation of entities is done in several decoding steps, and there is no way to perform the prediction of multiple entity types in parallel.

task. This approach naturally solves the scalability issues of autoregressive models and allows for bidirectional context processing, which enables richer representations. When trained on the dataset released by Zhou et al. (2023), which comprises texts from numerous domains and thousands of entity types, our model demonstrates impressive zero-shot performance. More specifically, it outperforms both ChatGPT and fine-tuned LLMs on zero-shot NER datasets (Table 1). Our model's robustness is further evident in its ability to handle languages not encountered during training. Specifically, our model surpasses ChatGPT in 8 of 10 languages (Table 3) that were not included in its training data.
Table 1: Zero-Shot Scores on Out-of-Domain NER Benchmark. We report the performance of GLiNER with
various DeBERTa-v3 (He et al., 2021) model sizes. Results for Vicuna, ChatGPT, and UniNER are from Zhou et al.
(2023); USM and InstructUIE are from Wang et al. (2023); and GoLLIE is from Sainz et al. (2023).
Per-dataset scores (fragment of Table 1):
AnatEM 88.5 88.5 88.9 88.4
bc2gm 80.7 82.4 83.7 82.0
bc4chemd 87.6 89.2 87.9 86.7
bc5cdr 89.0 89.3 88.7 88.7
Broad Twitter 80.3 81.2 82.5 82.7
CoNLL03 91.5 93.3 92.6 92.5

Per-language zero-shot scores (fragment of Table 3):
English 62.7 37.2 42.4 41.7
Spanish 58.7 34.7 38.7 42.1
Dutch 62.6 35.7 35.6 38.9
Bengali 39.7 23.3 0.89 25.9
Persian 52.3 25.9 14.9 30.2
Figure 4: Zero-shot performance for different backbones. It reports the avg. results on the 20 NER and OOD NER datasets (avg. zero-shot F1 on the OOD NER benchmark).

Figure 5: Supervised performance across different dataset sizes (x-axis: dataset size, 100 to 10,000; y-axis: avg. F1). The evaluation is conducted on the 20 NER datasets (in Table 4).
Results The results of our experiment are reported in Table 2. Firstly, we observe that for the in-domain fine-tuning, our GLiNER model, pretrained on Pile-NER, achieves slightly better results than the non-pretrained variant, with an average difference of 0.8. Moreover, our pretrained GLiNER model outperforms InstructUIE (with an average difference of 0.9) despite being fine-tuned on the same dataset, whereas InstructUIE is significantly larger (approximately 30 times larger). This demonstrates that our proposed architecture is indeed competitive. However, our model falls behind UniNER by almost 3 points. Nevertheless, our model still manages to achieve the best score on 7 out of 20 datasets.

5 Further analysis and ablations

Here we conduct different sets of experiments to better investigate our model.

5.1 Effect of Different Backbones

In our work, we primarily utilize the DeBERTa-v3 model as our backbone due to its strong empirical performance. However, we demonstrate here that our method is adaptable to a wide range of BiLMs.

Setup Specifically, we investigate the performance of our model using other popular BiLMs, including BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), and ELECTRA (Clark et al., 2020). We also conducted experiments with XLNet (Yang et al., 2019) but did not achieve acceptable performance (at most 3 F1 on the OOD benchmark) despite extensive hyperparameter tuning. For a fair comparison, we employed the base size (GLiNER-M) and tuned the learning rate for each model. We report the zero-shot results on both the OOD benchmark and the 20 NER benchmark in Figure 4.

Results The results of our experiment, as shown in Figure 4, clearly demonstrate the superiority of DeBERTa-v3 over other pretrained BiLMs. It achieves the highest performance on both benchmarks by a clear margin. ELECTRA and ALBERT also show notable performance, albeit slightly lower, while BERT and RoBERTa lag behind with similar scores. However, it should be noted that all of the backbones we tested demonstrate strong performance compared to existing models. More specifically, even BERT-base, which ranks among the lower performers, achieves around 49 F1 on the OOD benchmark. This score is still 2 F1 points higher than the average for models like ChatGPT and InstructUIE.
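To illustrate how little changes when swapping the backbone, the sketch below loads alternative BiLMs through Hugging Face's AutoModel/AutoTokenizer. The hub identifiers and per-backbone learning rates are placeholder assumptions; only the fact that the learning rate was tuned per model comes from the text above.

```python
from transformers import AutoModel, AutoTokenizer

# Candidate backbones (Hugging Face hub identifiers). The learning rates are
# hypothetical placeholders -- the paper only states that the LR was tuned per model.
BACKBONES = {
    "microsoft/deberta-v3-base": 1e-5,
    "bert-base-cased": 3e-5,
    "roberta-base": 3e-5,
    "albert-base-v2": 2e-5,
    "google/electra-base-discriminator": 2e-5,
}

def load_backbone(name: str):
    """Return the tokenizer and encoder; the span/entity head on top stays unchanged."""
    tokenizer = AutoTokenizer.from_pretrained(name)
    encoder = AutoModel.from_pretrained(name)
    return tokenizer, encoder

for name, lr in BACKBONES.items():
    tokenizer, encoder = load_backbone(name)
    hidden = encoder.config.hidden_size  # feeds the span/entity projection layers
    print(f"{name}: hidden={hidden}, lr={lr}")
```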
5.2 Effect of Pretraining on In-domain Performance

In this section, we investigate the impact of pretraining on the Pile-NER dataset for supervised in-domain training on the 20 NER datasets, across various data sizes. The experiments range from 100 samples per dataset to 10,000 (the full training setup). We use the same hyperparameters for all configurations. The results are reported in Figure 5.

Results As shown in the figure, models pretrained on Pile-NER consistently outperform their counterparts that are only trained on supervised data, indicating successful positive transfer. We further observe that the gain is larger when supervised data is limited. For instance, the difference in performance is 5.6 when employing 100 samples per dataset, and the gap becomes smaller as the size of the dataset increases.
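The following sketch shows one way such a data-size ablation can be set up. The dataset container and the training function are hypothetical placeholders; the only points taken from the text above are the per-dataset budgets and the hyperparameters kept fixed across budgets.

```python
import random

# Per-dataset training budgets used in the ablation (10,000 = full training setup).
BUDGETS = [100, 500, 1000, 5000, 10000]

def subsample(examples, k, seed=42):
    """Draw k training examples per dataset with a fixed seed, so that smaller
    budgets are comparable across the pretrained and non-pretrained variants."""
    rng = random.Random(seed)
    if k >= len(examples):
        return list(examples)
    return rng.sample(list(examples), k)

def run_ablation(datasets, train_fn):
    # `datasets` maps dataset name -> list of training examples (hypothetical format);
    # `train_fn` trains a model with the same hyperparameters regardless of budget
    # and returns a score, e.g. avg. F1 over the 20 NER datasets.
    scores = {}
    for k in BUDGETS:
        subsets = {name: subsample(ex, k) for name, ex in datasets.items()}
        scores[k] = train_fn(subsets)
    return scores
```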
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. ArXiv, abs/2005.14165.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

Leo Gao, Stella Rose Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling. ArXiv, abs/2101.00027.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. ArXiv, abs/2111.09543.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging.

John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning.