Report - Text Paraphrase Detection

Topic 8: Text

Paraphrase Detection
Group 6
21C11030 Lê Trung Thành
21C11001 Lại Việt Anh
21C11009 Nguyễn Lê Quang Hùng
21C11029 Hoàng Minh Thanh
Contents
1. Introduction
2. Related work
3. Sentence BERT (SBERT)
4. Optimization techniques
5. Demo
6. Q & A
1. Introduction
Introduction
❑ Paraphrase Detection
❑ Common Evaluation Metrics
❑ Common Corpora
❑ Text Paraphrase Detection Challenge
What is Paraphrase Detection?
❑ Given two sentences, determine whether they
roughly have the same meaning [1].
❑ Usually formalized as a binary classification
problem [1].

Examples:
❑ Mary gave birth to a son in 2000 [1].
❑ He is 14 years old, and his mother is Mary [1].
Common Evaluation Metrics

❑ Accuracy
❑ F1 score
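
As a minimal sketch of the two metrics above, assuming binary labels where 1 means "paraphrase" and 0 means "not a paraphrase" (the function names and the toy labels are illustrative, not taken from the slides):

```python
def accuracy(y_true, y_pred):
    """Fraction of sentence pairs whose predicted label matches the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (paraphrase) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy gold labels and predictions (made up for illustration).
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
print(accuracy(gold, pred))
print(f1_score(gold, pred))
```

F1 is often preferred when paraphrase pairs are rare, because accuracy can look high for a model that predicts "not a paraphrase" almost everywhere.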
Common Corpora

❑ MRPC [2]
❑ Quora Question Pairs [3]
❑ GLUE [4]

[Figure: number of papers mentioning each dataset (GLUE, MRPC, Quora Question Pairs) per year, 2018 to 2022; horizontal axis from 0 to 700.]
Text Paraphrase Detection Challenge

❑ Plagiarism is a serious problem in science.


❑ However, paraphrase plagiarism has not yet been extensively explored. Text
paraphrase detection is a preliminary step before detecting paraphrase plagiarism.
❑ The purpose of this competition is to invite researchers to contribute new
methods to solve our proposed problem and text paraphrase detection in
general.
❑ The completion of this task promises to advance techniques for paraphrase
plagiarism detection.
Text Paraphrase Detection Challenge

❑ Evaluation metric:
▪ F1 score
❑ Baseline:
▪ sent_tokenize from nltk.tokenize
▪ SBERT embeddings
▪ PyNNDescent for fast Approximate Nearest
Neighbors
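
The three baseline stages (sentence splitting, embedding, nearest-neighbor search) can be sketched with standard-library stand-ins. This is only an illustration of the pipeline's shape, under loud assumptions: a regex replaces `sent_tokenize`, a bag-of-words count vector replaces the SBERT embedding, and exact brute-force search replaces PyNNDescent. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Stand-in for nltk.tokenize.sent_tokenize: split after '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence):
    # Stand-in for an SBERT embedding: a sparse bag-of-words count vector.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(query_text, corpus_text):
    # Stand-in for PyNNDescent: exact search over every corpus sentence.
    corpus = split_sentences(corpus_text)
    corpus_vecs = [embed(s) for s in corpus]
    matches = []
    for sent in split_sentences(query_text):
        scores = [cosine(embed(sent), cv) for cv in corpus_vecs]
        best = max(range(len(corpus)), key=scores.__getitem__)
        matches.append((sent, corpus[best], scores[best]))
    return matches


corpus_text = "The cat sat on the mat. Stock prices fell sharply on Monday."
query_text = "On Monday stock prices fell. A cat sat on a mat."
matches = nearest_neighbors(query_text, corpus_text)
for query, neighbor, score in matches:
    print(f"{query!r} -> {neighbor!r} ({score:.2f})")
```

The stand-ins only match on shared words; the point of the real baseline is that SBERT embeddings capture meaning and an approximate index avoids scanning every sentence.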
2. Related work
Methods
❑ Rule Based [5]
❑ Machine Learning Based [6]
❑ Deep Learning Based

Rule Based
Deep Learning Based
❑ Treat sentences as sequences of characters or terms
❑ Represent the given sentences in a vector space
▪ Lexical
▪ Syntactic
▪ Semantic
❑ Compare similarity between vectors
▪ Euclidean distance
▪ Cosine distance
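
The two vector comparisons named above can be written in a few lines. This is a minimal sketch on plain Python lists; the example vectors are made up, and cosine distance is taken here as one minus cosine similarity.

```python
import math


def euclidean_distance(u, v):
    # Straight-line distance between the two vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    # 1 - cosine similarity: depends only on the angle, not on vector length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the length
print(euclidean_distance(u, v))
print(cosine_distance(u, v))
```

The example shows why the choice matters: `u` and `v` point in the same direction, so their cosine distance is (numerically) zero even though their Euclidean distance is not.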
Deep Learning Based
❑ CNN
❑ RNN-based
▪ LSTM
❑ Transformer-based
3. Sentence BERT (SBERT)
BERT in Paraphrase Detection
❑ BERT uses token embeddings.
❑ BERT represents each token as an embedding vector.
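Why Sentence BERT?

❑ BERT is slow for paraphrase detection: every sentence pair has to be fed through the model together.

Dataset of 10,000 sentences, all pairs compared

=> about 50,000,000 pairwise computations

=> about 65 hours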
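
The figure of about 50 million is the number of unordered sentence pairs among 10,000 sentences, n(n-1)/2. A short check, where the per-pair cost is derived only from the 65-hour figure on the slide and is not a measured value:

```python
def count_pairs(n):
    # Number of unordered pairs among n sentences: n * (n - 1) / 2.
    return n * (n - 1) // 2


n_sentences = 10_000
pairs = count_pairs(n_sentences)

# Implied cost per pair if all pairs together take 65 hours.
seconds_per_pair = 65 * 3600 / pairs

print(pairs)
print(f"{seconds_per_pair * 1000:.2f} ms per pair")
```

Because the pair count grows quadratically, doubling the number of sentences roughly quadruples the work for a model that must process each pair jointly.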
Why Sentence BERT?

❑ SBERT?

https://github.com/UKPLab/sentence-transformers/issues/924
Compare SBERT vs BERT
❑ SBERT (Bi-Encoder) vs BERT (Cross-Encoder)
❑ BERT uses token embeddings; SBERT uses sentence embeddings.
❑ BERT scores a pair with a classifier; SBERT compares embeddings with cosine similarity.
Sentence BERT vs BERT
• SBERT uses sentence embeddings.
Sentence BERT vs BERT
Retrieve & Re-Rank
❑ For complex semantic search scenarios, a retrieve & re-rank pipeline is advisable.
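
A retrieve & re-rank pipeline can be sketched as two sorting passes: a cheap scorer ranks the whole corpus, and an expensive scorer re-orders only the top candidates. In the SBERT setting the cheap scorer would typically be a bi-encoder and the expensive one a cross-encoder; the word-overlap and Jaccard scorers below are toy stand-ins, and all names are hypothetical.

```python
def retrieve_and_rerank(query, corpus, fast_score, slow_score, top_k=3):
    # Stage 1 (retrieve): rank the whole corpus with the cheap scorer, keep top_k.
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)[:top_k]
    # Stage 2 (re-rank): re-order only those candidates with the expensive scorer.
    return sorted(candidates, key=lambda doc: slow_score(query, doc), reverse=True)


def word_overlap(a, b):
    # Toy "fast" scorer: number of shared words.
    return len(set(a.lower().split()) & set(b.lower().split()))


def jaccard(a, b):
    # Toy "slow" scorer: shared words divided by all distinct words.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)


corpus = [
    "how to learn python online",
    "how to learn python on the web",
    "what is python",
    "how to cook rice",
]
query = "how to learn python on the internet"
ranked = retrieve_and_rerank(query, corpus, word_overlap, jaccard, top_k=2)
print(ranked)
```

The expensive scorer is called only `top_k` times per query, which is what keeps the second stage affordable.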
Semantic Search
❑ Embed all data in your corpus.
❑ Example:
▪ How to learn Python online?
▪ How to learn Python on the web?
▪ What is Python?
❑ Types:
▪ Symmetric semantic search: SBERT
▪ Asymmetric semantic search: models trained on MS MARCO
❑ Methods:
▪ Elasticsearch
▪ Approximate Nearest Neighbor
4. Optimization techniques
Optimization techniques
❑ Cross-encoder
❑ Clustering
❑ Embedding in Knowledge Graph
❑ Concurrent Paraphrase Mining
❑ Model Distillation
❑ Augmented SBERT (Domain-transfer)
Cross-encoder
❑ SBERT (Bi-Encoder) vs BERT (Cross-Encoder)

[Figure: bi-encoder diagram. Two SBERT encoders (labelled with weights) each output a vector; the two vectors are compared by cosine similarity.]
Clustering and BERTopic
❑ Clustering and BERTopic
❑ Paraphrase detection on new topics
Embedding in Knowledge Graph
❑ PyNNDescent: fast Approximate Nearest Neighbors
❑ Building neighbor graphs
❑ Searching a nearest neighbor graph
Concurrent Paraphrase Mining
❑ top_k: for each sentence, we retrieve up to top_k other sentences.

❑ 20k sentences: split into 20 chunks of 1,000 sentences each.
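
The chunking and `top_k` ideas can be sketched as follows. This is a sequential, standard-library illustration under stated assumptions: `vectors` stands for precomputed sentence embeddings, `score` is any similarity function, and the function names are hypothetical. Each query chunk is processed independently, which is what would allow the chunks to be handed to separate workers.

```python
import heapq


def chunks(items, size):
    # Yield (start_index, slice) for consecutive chunks of at most `size` items.
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


def mine_pairs(vectors, score, chunk_size=1000, top_k=5):
    """Return (score, i, j) candidates, at most top_k per query sentence i."""
    results = []
    for q_start, q_chunk in chunks(vectors, chunk_size):
        best = [[] for _ in q_chunk]  # one min-heap per query sentence
        for c_start, c_chunk in chunks(vectors, chunk_size):
            for qi, q in enumerate(q_chunk):
                for ci, c in enumerate(c_chunk):
                    i, j = q_start + qi, c_start + ci
                    if i == j:
                        continue  # never pair a sentence with itself
                    item = (score(q, c), j)
                    if len(best[qi]) < top_k:
                        heapq.heappush(best[qi], item)
                    else:
                        heapq.heappushpop(best[qi], item)  # drop the weakest
        for qi, heap in enumerate(best):
            results.extend((s, q_start + qi, j) for s, j in heap)
    return sorted(results, reverse=True)


# 20,000 items in chunks of 1,000 gives 20 chunks, as on the slide.
n_chunks = len(list(chunks(list(range(20_000)), 1000)))

# Toy 1-D "embeddings"; similarity is negative absolute difference.
toy = mine_pairs([1, 2, 10, 12], lambda a, b: -abs(a - b), chunk_size=2, top_k=1)
print(n_chunks, toy)
```

Keeping only `top_k` candidates per sentence bounds memory at roughly `top_k` entries per sentence instead of one entry per pair.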


Distillation in Paraphrase Detection
❑ Knowledge Distillation
❑ Dimensionality Reduction
❑ Quantization
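
As an illustration of quantization only, here is a minimal sketch of symmetric 8-bit scalar quantization of one embedding vector. It is one common scheme, not necessarily the one used in the demo; the embedding values and function names are made up.

```python
def quantize_int8(vector):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(x / scale) for x in vector], scale


def dequantize(codes, scale):
    # Approximate reconstruction of the original floats.
    return [c * scale for c in codes]


embedding = [0.12, -0.53, 0.98, 0.0, -0.07]
codes, scale = quantize_int8(embedding)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(embedding, restored))
print(codes, max_error)
```

Each value then needs one byte instead of four (for 32-bit floats), at the cost of a rounding error of at most half the scale per component.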
5. Demo
SentenceBERT
❑ SentenceBERT
❑ SentenceBERT
https://colab.research.google.com/drive/1JiiMKFIsnRmESeS3GWLR3nf8J5iI-_1b?usp=sharing
❑ SentenceBERT in Publications Paper
https://colab.research.google.com/drive/1zyjffqQZVViCH79RPUZGEK-slN3LX3A9?usp=sharing
❑ VietnameseBERT
❑ Paraphrase detection in Vietnamese using SBERT
https://colab.research.google.com/drive/1vff8gXZufZ70_GF2XrM1f1GzNXVkmWyT?usp=sharing
Q&A
References
1. Yin and Schütze. 2015. Convolutional Neural Network for Paraphrase Identification. NAACL 2015.
2. William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
3. First Quora Dataset Release: Question Pairs. Data @ Quora.
4. Wang et al. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.
5. Rahul Bhagat and Eduard Hovy. 2013. What Is a Paraphrase? Computational Linguistics 39 (3): 463–472. https://doi.org/10.1162/COLI_a_00166
6. Vrbanec, T. and Meštrović, A. 2020. Corpus-Based Paraphrase Detection Experiments and Review. Information 11, 241. https://doi.org/10.3390/info11050241
Thanks
