Report - Text Paraphrase Detection
Paraphrase Detection
Group 6
21C11030 Lê Trung Thành
21C11001 Lại Việt Anh
21C11009 Nguyễn Lê Quang Hùng
21C11029 Hoàng Minh Thanh
1. Introduction
2. Related work
3. Sentence BERT (SBERT)
4. Optimization techniques
5. Demo
6. Q & A
1. Introduction
Introduction
❑ Paraphrase Detection
❑ Common Evaluation Metrics
❑ Common Corpora
❑ Text Paraphrase Detection Challenge
What is Paraphrase Detection?
❑ Given two sentences, determine whether they
roughly have the same meaning [1].
❑ Usually formalized as a binary classification
problem [1].
Examples:
❑ Mary gave birth to a son in 2000 [1].
❑ He is 14 years old, and his mother is Mary [1].
Common Evaluation Metrics
❑ Accuracy
❑ F1 score
Common Corpora
❑ Microsoft Research Paraphrase Corpus (MRPC) [2]
❑ Quora Question Pairs [3]
❑ GLUE benchmark (2018) [4]
Text Paraphrase Detection Challenge
❑ Evaluation metric:
▪ F1 score
❑ Baseline:
▪ sent_tokenize from nltk.tokenize
▪ SBERT embeddings
▪ PyNNDescent for fast Approximate Nearest Neighbors
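The baseline above can be sketched end to end. The snippet below is a minimal stand-in: deterministic random unit vectors replace real SBERT embeddings, and a brute-force cosine search replaces PyNNDescent, so it illustrates only the pipeline's shape, not its quality.

```python
import numpy as np

def embed(sentences, dim=384, seed=0):
    # Stand-in for SBERT's model.encode(): one random unit vector per
    # sentence. MiniLM-style SBERT models also emit 384-dim vectors,
    # but these values carry no semantic meaning.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((len(sentences), dim))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def nearest(query_vec, corpus_vecs, k=2):
    # Brute-force cosine search; at scale, PyNNDescent would replace
    # this step with an approximate nearest-neighbor graph.
    sims = corpus_vecs @ query_vec
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

corpus = ["Mary gave birth to a son in 2000.",
          "He is 14 years old, and his mother is Mary.",
          "Mary had a son in 2000."]
X = embed(corpus)
idx, sims = nearest(X[0], X, k=2)   # query with the first sentence
```

With real embeddings, the top hits for a sentence would be its closest paraphrase candidates; here the first hit is simply the query itself (cosine similarity 1.0).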
2. Related work
Methods
❑ Rule Based [5]
❑ Machine Learning Based [6]
❑ Deep Learning Based
Rule Based
Machine Learning Based
❑ Consider sentences as sequences of characters or terms
❑ Represent the given sentences in a vector space
▪ Lexical
▪ Syntactic
▪ Semantic
❑ Compare the similarity between vectors
▪ Euclidean distance
▪ Cosine distance
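The two distance choices above behave differently: cosine distance ignores vector magnitude, Euclidean distance does not. A minimal illustration with plain NumPy (the vectors are made up for the example):

```python
import numpy as np

def euclidean(u, v):
    # Straight-line distance between the two vectors.
    return float(np.linalg.norm(u - v))

def cosine_distance(u, v):
    # 1 - cosine similarity: 0 means identical direction.
    return float(1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude
# cosine_distance(u, v) is 0 (identical direction),
# while euclidean(u, v) is sqrt(14), since the magnitudes differ.
```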
Deep Learning Based
❑ CNN
❑ RNN-based
▪ LSTM
❑ Transformer-based
3. Sentence BERT (SBERT)
BERT in Paraphrase Detection
BERT uses token embeddings: each token is represented as an embedding vector.
Why Sentence BERT?
Finding the most similar pair among 10,000 sentences requires about 50 million pairwise BERT inference computations, roughly 65 hours; SBERT cuts this to a few seconds of encoding plus fast cosine-similarity comparisons.
Why Sentence BERT?
https://github.com/UKPLab/sentence-transformers/issues/924
SBERT (Bi-Encoder) vs BERT (Cross-Encoder)
❑ BERT uses token embeddings; SBERT uses sentence embeddings.
❑ BERT uses a classifier over sentence pairs; SBERT compares embeddings with cosine similarity.
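The practical consequence of this split is the number of expensive transformer forward passes needed to find the most similar pair in a collection. A small worked example in pure Python (no model required):

```python
def forward_passes(n):
    # Cross-encoder: one transformer pass per sentence *pair*.
    cross = n * (n - 1) // 2
    # Bi-encoder (SBERT): one pass per sentence; the pairs are then
    # compared with cheap cosine similarity, not with the network.
    bi = n
    return cross, bi

cross, bi = forward_passes(10_000)
# cross == 49_995_000 (~50 million passes), bi == 10_000
```

This is exactly the gap behind the "65 hours" figure: the pairwise cost grows quadratically, while the encoding cost grows linearly.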
Sentence BERT vs BERT
• SBERT uses sentence embeddings.
Retrieve & Re-Rank
For complex semantic search scenarios, a retrieve & re-rank pipeline is advisable:
Semantic Search
Embed all data in your corpus.
Example queries:
❑ How to learn Python online?
❑ How to learn Python on the web?
❑ What is Python?
Types:
▪ Symmetric semantic search: SBERT models
▪ Asymmetric semantic search: MS MARCO models
Methods:
▪ Elasticsearch
▪ Approximate Nearest Neighbors
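Symmetric semantic search can be sketched with the example queries above. The snippet uses a toy bag-of-words vector as a stand-in for an SBERT embedding; a real system would call model.encode() on each sentence instead, but the ranking logic (cosine similarity against every corpus entry) is the same.

```python
import math

def bow(sentence):
    # Toy bag-of-words vector standing in for a sentence embedding.
    counts = {}
    for token in sentence.lower().replace("?", "").split():
        counts[token] = counts.get(token, 0) + 1
    return counts

def cosine(u, v):
    # Cosine similarity over sparse dict vectors.
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv)

corpus = ["How to learn Python on the web?", "What is Python?"]
query = "How to learn Python online?"
best = max(corpus, key=lambda s: cosine(bow(query), bow(s)))
# best == "How to learn Python on the web?" (highest token overlap)
```

A real SBERT model would also rank the paraphrased question first, but for semantic rather than lexical reasons.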
4. Optimization techniques
Optimization techniques
Cross-encoder
Clustering
Embedding in Knowledge Graph
Concurrent Paraphrase Mining
Model Distillation
Augmented SBERT (Domain-transfer)
Cross-encoder
SBERT (Bi-Encoder) vs BERT (Cross-Encoder)
[Diagram: each sentence passes through SBERT with shared weights to produce a vector; the two vectors are compared with cosine similarity.]
Clustering and BERTopic
Paraphrase detection on new topics
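BERTopic combines a clustering step over sentence embeddings (HDBSCAN by default) with topic keyword extraction. The sketch below shows only the clustering idea, using a minimal k-means over toy 2-D points standing in for embeddings; all numbers are made up.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Minimal k-means: assign each point to its nearest center,
    # then move each center to the mean of its assigned points.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Two well-separated groups standing in for embeddings of two topics.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = kmeans(X, k=2)
# points 0 and 1 share a label; points 2 and 3 share the other
```

Once sentences are grouped by topic, paraphrase candidates only need to be compared within a cluster, which shrinks the search space.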
Embedding in Knowledge Graph
PyNNDescent: for fast Approximate Nearest Neighbors
Building neighbor graphs
Searching a nearest neighbor graph
Concurrent Paraphrase Mining
top_k – for each sentence, we retrieve up to the top_k most similar other sentences
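Paraphrase mining with a top_k cutoff can be sketched as below. This is a brute-force stand-in over toy 2-D vectors; sentence-transformers provides util.paraphrase_mining to do the equivalent over real SBERT embeddings in chunks.

```python
import numpy as np

def mine_paraphrases(embeddings, top_k=1):
    # For each sentence i, keep its top_k most similar other sentences,
    # then return all candidate pairs sorted by similarity.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)      # exclude self-pairs
    pairs = []
    for i in range(len(X)):
        for j in np.argsort(-sims[i])[:top_k]:
            pairs.append((float(sims[i, j]), i, int(j)))
    return sorted(pairs, reverse=True)

# Toy 2-D "embeddings": sentences 0 and 1 point the same way.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pairs = mine_paraphrases(emb, top_k=1)
# the highest-scoring pair links sentences 0 and 1
```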
Dimensionality Reduction
Quantization
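Both optimizations can be sketched with plain NumPy: a PCA-style projection to fewer dimensions, and scalar int8 quantization to shrink each stored value from 4 bytes to 1. The data here is random and the sizes (384 down to 64 dimensions) are illustrative only.

```python
import numpy as np

def pca_reduce(X, dim):
    # Project embeddings onto their top principal components via SVD.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

def quantize_int8(X):
    # Scalar quantization: map float values onto the int8 range,
    # keeping a single scale factor for dequantization.
    scale = np.abs(X).max() / 127.0
    return np.round(X / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 384)).astype(np.float32)
X_small = pca_reduce(X, 64)          # 384 -> 64 dimensions
Xq, scale = quantize_int8(X_small)   # float32 -> int8 + scale
```

Both steps trade a small loss in similarity accuracy for large savings in index memory and search time.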
5. Demo
SentenceBERT
https://colab.research.google.com/drive/1JiiMKFIsnRmESeS3GWLR3nf8J5iI-_1b?usp=sharing
SentenceBERT on Publication Papers
https://colab.research.google.com/drive/1zyjffqQZVViCH79RPUZGEK-slN3LX3A9?usp=sharing
VietnameseBERT
Paraphrase detection in Vietnamese using SBERT
https://colab.research.google.com/drive/1vff8gXZufZ70_GF2XrM1f1GzNXVkmWyT?usp=sharing
Q&A
References
1. Convolutional Neural Network for Paraphrase Identification (Yin &
Schütze, NAACL 2015)
2. William B. Dolan and Chris Brockett. 2005. Automatically Constructing
a Corpus of Sentential Paraphrases. In Proceedings of the Third
International Workshop on Paraphrasing (IWP2005).
3. First Quora Dataset Release: Question Pairs - Data @ Quora
4. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural
Language Understanding (Wang et al., 2018)
5. Rahul Bhagat, Eduard Hovy; What Is a Paraphrase?. Computational
Linguistics 2013; 39 (3): 463–472.
doi: https://doi.org/10.1162/COLI_a_00166
6. Vrbanec, T.; Meštrović, A. Corpus-Based Paraphrase Detection
Experiments and Review. Information 2020, 11, 241.
https://doi.org/10.3390/info11050241
Thanks