
Sentence embeddings
Applications in scientific paper classification and clustering

IT 21
111608023 Devashish Gaikwad
111608031 Atharva Jadhav
111608077 Venkatesh Yelnoorkar
Contents
01 Embeddings in NLP
02 Word Embeddings (review of embedding methods)
03 Sentence embeddings
04 Critical analysis of Sentence embeddings
05 Applications of Sentence embeddings
06 Proposed Application and POC
01
Embeddings in NLP.
What are they?
Embeddings in NLP

● A text embedding model transforms text into a numerical representation (an embedding) of the text’s semantic meaning.
● If two words or documents have similar embeddings, they are semantically similar.
● For example, “anchor” and “boat” have close embeddings, while “anchor” and “koala” do not.
● Similarly, the same word in different languages, such as “amore” and “love”, has close embeddings.
Types of Embeddings in NLP

Word Embeddings: words are represented as vectors. E.g. Word2Vec, GloVe, FastText
Sentence Embeddings: sentences are represented as vectors. E.g. BoW, USE, InferSent
Document Embeddings: the entire document is represented as a vector. E.g. Doc2Vec, LSI, LDA
02
Word Embeddings
KING - MAN + WOMAN = QUEEN
Word Embeddings

● Word embeddings are vector representations of individual words.
● They capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
● Word embeddings are one of the most popular representations of a document's vocabulary.
● Vectors of semantically similar words are close to each other.
● This closeness can be measured with various distance measures such as cosine distance, Manhattan distance, etc.
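As a minimal sketch of measuring this closeness, cosine similarity between two hypothetical, hand-made word vectors can be computed with NumPy (in practice the vectors would come from Word2Vec, GloVe, etc.):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for similar directions, lower for unrelated vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical pre-computed embeddings, for illustration only
anchor = np.array([0.8, 0.1, 0.3])
boat   = np.array([0.7, 0.2, 0.4])
koala  = np.array([-0.2, 0.9, -0.5])

print(cosine_similarity(anchor, boat))   # relatively high
print(cosine_similarity(anchor, koala))  # relatively low
```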
Word Embedding
Algorithms for word embeddings

Jeffrey Pennington, Richard Socher,


Christopher D. Manning @ Stanford
NLP group.

Word2Vec Glove FastText


Mikolov et al. @ Google research Piotr Bojanowski
and Edouard Grave
and Armand Joulin and Tomas
Mikolov @ facebook Research
Word2Vec
● Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network.
● It was developed by Tomas Mikolov at Google in 2013. Embeddings can be obtained using two methods (both involving neural networks): Skip-Gram and Continuous Bag of Words (CBOW).
● In both cases, the network learns via backpropagation.
● The weights of the hidden layer are taken as the word embedding of the target word.

(Figure: CBOW and Skip-Gram architectures)
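A minimal sketch of training Word2Vec with the gensim library (gensim 4.x assumed; the tiny corpus and hyperparameters are purely illustrative):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: each sentence is a list of tokens
sentences = [
    ["the", "anchor", "holds", "the", "boat"],
    ["the", "koala", "climbs", "the", "tree"],
    ["the", "boat", "sails", "across", "the", "bay"],
]

# sg=0 selects CBOW, sg=1 selects Skip-Gram
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

print(model.wv["boat"][:5])                   # first few dimensions of the embedding
print(model.wv.similarity("anchor", "boat"))  # cosine similarity between two words
```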


GloVe
● GloVe stands for “Global Vectors”.
● GloVe captures both the global and local statistics of a corpus in order to derive word vectors.
● GloVe is built on the idea that semantic relationships between words can be derived from the co-occurrence matrix, which records how many times each word co-occurs with every other word.
● A cost function defined over these co-occurrence counts is then optimized to obtain the word embeddings.
● GloVe does not use a neural network model; it relies on stochastic gradient descent to optimize the cost function.
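A minimal sketch of loading pre-trained GloVe vectors from their standard text format (the file name is illustrative; pre-trained files are distributed by the Stanford NLP group):

```python
import numpy as np

def load_glove(path: str) -> dict[str, np.ndarray]:
    """Parse a GloVe .txt file: each line holds a word followed by its vector components."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Illustrative path, e.g. the 100-dimensional vectors from the glove.6B archive
glove = load_glove("glove.6B.100d.txt")
print(glove["boat"].shape)  # (100,)
```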
FastText
● FastText is an extension of Word2Vec proposed by Facebook in 2016.
● Instead of feeding individual words into the neural network, FastText breaks words into several character n-grams (sub-words).
● The embedding vector of a word is the sum of the vectors of its n-grams. After training the neural network, we obtain embeddings for all n-grams present in the training data.
● Rare words can now be represented properly, since it is highly likely that some of their n-grams also appear in other words.
● Although a FastText model takes longer to train (the number of n-grams is larger than the number of words), it performs better than Word2Vec and represents rare words appropriately.
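A minimal sketch of training FastText with gensim (gensim 4.x assumed; the corpus and the n-gram range are illustrative):

```python
from gensim.models import FastText

sentences = [
    ["embedding", "methods", "represent", "words", "as", "vectors"],
    ["fasttext", "splits", "words", "into", "character", "ngrams"],
]

# min_n/max_n control the lengths of the character n-grams (sub-words)
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=5, epochs=50)

# Even a word not seen during training gets a vector, built from its n-grams
print(model.wv["embeddings"][:5])
```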
03
Sentence Embeddings
Sentence Embedding

● Embed a full sentence into a vector space.
● Sentence embeddings inherit features from the underlying word embeddings.
● Used to capture similarity between sentences, predict text, and classify text.
● Sentence embeddings are created by different methods such as Bag of Words, Power Mean, SIF, etc.
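A minimal sketch of the simplest of these methods, a bag-of-words sentence embedding obtained by averaging word vectors (the word-vector dictionary is assumed to come from any of the models above):

```python
import numpy as np

def sentence_embedding(tokens: list[str],
                       word_vectors: dict[str, np.ndarray],
                       dim: int) -> np.ndarray:
    """Average the word vectors of the tokens; unknown words are skipped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# `glove` could be the dictionary loaded in the GloVe example above
# emb = sentence_embedding("the boat sails across the bay".split(), glove, dim=100)
```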
Sentence Embedding Algorithms

BoW + Power Mean: Andreas Rücklé, Steffen Eger, Maxime Peyrard, Iryna Gurevych @ Technical University of Darmstadt
Universal Sentence Encoder: Daniel Cer, Yinfei Yang, Sheng-yi Kong et al. @ Google Research
InferSent: Alexis Conneau, Douwe Kiela et al. @ Facebook Research
Bag of Words + Power Mean

● A novel approach from the Technische Universität Darmstadt.
● The power mean recovers many well-known means such as the arithmetic (AM), geometric (GM), and harmonic (HM) means.
● In the extreme cases p → −∞ and p → +∞, the power mean yields the minimum and maximum of the sequence.
● Concatenating several power means considerably closes the gap to state-of-the-art methods.
● Outperforms recently proposed baselines such as SIF and Sent2Vec by a solid margin.
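A minimal sketch of a concatenated power-mean sentence embedding (the chosen p values and the random word vectors are illustrative):

```python
import numpy as np

def power_mean(vectors: np.ndarray, p: float) -> np.ndarray:
    """Element-wise power mean over word vectors; p=1 is the arithmetic mean,
    p=+inf the maximum and p=-inf the minimum."""
    if p == np.inf:
        return vectors.max(axis=0)
    if p == -np.inf:
        return vectors.min(axis=0)
    return ((vectors ** p).mean(axis=0)) ** (1.0 / p)

def pmean_sentence_embedding(word_vecs: np.ndarray,
                             ps=(1.0, -np.inf, np.inf)) -> np.ndarray:
    """Concatenate several power means into one sentence vector."""
    return np.concatenate([power_mean(word_vecs, p) for p in ps])

# word_vecs: one row per word in the sentence, e.g. shape (n_words, 100)
word_vecs = np.random.rand(5, 100)
print(pmean_sentence_embedding(word_vecs).shape)  # (300,)
```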
Universal Sentence Encoder

● Proposed by Google in 2018.
● Encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.
● Trained on a range of supervised and unsupervised tasks in order to capture the most universal semantic information.
● Two variations: one trained with a Transformer encoder and the other with a Deep Averaging Network (DAN).
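A minimal sketch of encoding sentences with a published Universal Sentence Encoder module via TensorFlow Hub (TensorFlow 2.x and tensorflow_hub are assumed to be installed; the URL points to the released module, version 4):

```python
import tensorflow_hub as hub

# Load the published USE module from TensorFlow Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "Sentence embeddings map text to vectors.",
    "We cluster scientific papers by their abstracts.",
]
embeddings = embed(sentences)   # shape: (2, 512)
print(embeddings.shape)
```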
InferSent

● Proposed by Facebook in 2018.
● Provides semantic representations for English sentences.
● Trained on natural language inference (NLI) data and generalizes well to many different tasks.
● Generates sentence vectors using a sentence-encoder architecture with word vectors as input.
● A classifier takes the encoded sentences as input; training it on the NLI task is what trains the sentence encoder.
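A minimal sketch of encoding sentences with the released InferSent code (this assumes models.py and a pre-trained checkpoint from the facebookresearch/InferSent repository; all file paths are illustrative):

```python
import torch
from models import InferSent  # models.py from the InferSent repository

# Hyperparameters matching the released version-2 checkpoint
params = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
          'pool_type': 'max', 'dpout_model': 0.0, 'version': 2}
model = InferSent(params)
model.load_state_dict(torch.load('infersent2.pkl'))      # illustrative checkpoint path
model.set_w2v_path('crawl-300d-2M.vec')                   # fastText word vectors

sentences = ["Sentence embeddings map text to vectors."]
model.build_vocab(sentences, tokenize=True)
embeddings = model.encode(sentences, tokenize=True)        # shape: (1, 4096)
```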
04
Critical analysis of
Sentence Embeddings
Critical Analysis of Sentence Embeddings
| Parameters | Word2vec (BOW) | p-Mean | USE (DAN) | USE (Transformer) | InferSent |
|---|---|---|---|---|---|
| Learning Method | Unsupervised | Unsupervised | Unsupervised augmented to Supervised | Unsupervised augmented to Supervised | Supervised |
| Order of Words | NOT considered | NOT considered | Considered | Considered | Considered |
| Word Frequency | NOT considered | NOT considered | NOT considered | NOT considered | NOT considered |
| Semantic relation between text | NOT considered | NOT considered | Considered | Considered | Considered |
| Needs Training | No | No | Yes | Yes | Yes |
| Performance Ranking (1 = best) | 5 | 4 | 3 | 2 | 1 |
Accuracy Comparison
05
Applications of
Sentence embeddings
Applications

● Capturing similarity between phrases:
Based on how phrases are used, we determine how similar they are.
● Predicting upcoming text:
Given a phrase, a model trained on sentence vectors can predict likely continuations, completing the incomplete text.
● Text classification:
Classifying text based only on the similarity of individual words falls short of consistent accuracy, so expanding the scope to sentences improves classification.
● Summarization:
Text can be summarized by clustering sentence vectors, computing each cluster's representative (centroid) vector, and choosing the sentences whose vectors are closest to the centroids, as sketched below.
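A minimal sketch of the summarization idea with scikit-learn, where the `embed` argument stands in for any of the sentence-embedding methods above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def summarize(sentences, embed, n_clusters=3):
    """Pick one representative sentence per cluster of sentence vectors."""
    vectors = np.vstack([embed(s) for s in sentences])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
    # Index of the sentence closest to each cluster centroid
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, vectors)
    return [sentences[i] for i in sorted(set(closest))]
```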
Objectives of the Project

1. To create sentence embeddings for scientific papers
2. To implement a neural-net-based similarity operator for sentence embeddings
3. To create a sentence embeddings classifier
4. To create a scientific paper classifier using sentence embeddings
5. To create scientific paper clusterings using sentence embeddings
6. To create a POC application demonstrating the use of the above objectives along with metadata retrieval
06
Proposed Application
and POC
Current Scientific Paper Lookup System

● Systems to search research papers already exist:
○ ArXiv-Sanity Checker
○ Google Scholar
● They offer no classification by paper content:
○ Classification is by topic modeling (LDA, LSI) or by keywords
○ These approaches are error-prone
○ They carry no semantic meaning
○ There is no equivalence between semantically equivalent terms
Proposed Scientific Paper Classification

● A classification model built on sentence embeddings
● Sentence embeddings:
○ Trained on a corpus of scientific papers
○ Computed on the input abstracts
● Papers are compared via their sentence vectors
● Output: the paper's field/stream
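A minimal sketch of such a classifier with scikit-learn (the embedding file and field labels are illustrative placeholders, e.g. produced by the USE example above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: one abstract embedding per paper; y: its field/stream label (illustrative files)
X = np.load("abstract_embeddings.npy")
y = np.load("paper_fields.npy", allow_pickle=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("field-prediction accuracy:", clf.score(X_test, y_test))
```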
Proposed Scientific Paper Clustering

● Unsupervised clustering of papers
● Papers compared by their abstracts
● Groups papers that are semantically similar:
○ Not just keyword tagging
○ Similar in application and content
● Separates different subtopics under the same topic
● Metadata extraction:
○ Lookup on the extracted metadata
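A minimal sketch of the clustering step with scikit-learn (the embedding file and the number of clusters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Abstract embeddings produced by any of the methods above (illustrative file)
X = np.load("abstract_embeddings.npy")

# L2-normalize so that Euclidean KMeans behaves like clustering by cosine similarity
X = normalize(X)

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# Papers sharing a label form one (sub)topic cluster
for cluster_id in range(km.n_clusters):
    size = int(np.sum(km.labels_ == cluster_id))
    print(f"cluster {cluster_id}: {size} papers")
```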
Questions?
