Semantic Text Similarity


Introduction

The machine learning model built here predicts a score reflecting the semantic relatedness of two sentences rather than their surface appearance. In other words, it measures how similar two texts are in terms of the concepts, ideas, or information they convey. Comparing texts for semantic similarity involves understanding the context and meaning of the words and sentences rather than just looking for exact word matches or character-level similarities. It requires a deeper understanding of the content, including synonyms, paraphrases, and related concepts.

Problem Statement:

Develop an algorithm/model to measure the Semantic Textual Similarity (STS) between two given
sentences and provide a similarity score ranging from 0 (highly dissimilar) to 1 (highly similar). The STS
model should assess the degree of semantic equivalence between the sentences, allowing for more
accurate comparisons of their meaning and context, regardless of surface-level variations in wording or
structure. The objective is to enable applications to quantify the level of similarity between pairs of
sentences for various natural language processing (NLP) tasks, such as information retrieval, paraphrase
identification, and question answering.

Core Approach:

Considering the complexity of the problem statement, BERT offers efficient pre-trained transformers that make it straightforward to build our own model. We chose 'bert-base-uncased' for its ability to capture complex contextual dependencies and semantic meaning within sentences.

Before this step, we applied several pre-processing techniques using regular expressions and the string replace method. We then used the WordNet Lemmatizer from the NLTK module, which maps inflected forms back to a common base form and thereby reduced the burden on our model.
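A minimal sketch of this cleaning step is shown below; the exact regex patterns used in the project are not given in the text, so the rules here (lowercasing, stripping punctuation, collapsing whitespace) are illustrative assumptions:

```python
import re

def clean(text):
    """Normalize a sentence with regular expressions before tokenization.

    Illustrative patterns -- the project's actual rules are not specified.
    """
    text = text.lower()                        # case-fold
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

# Lemmatization with NLTK (requires the WordNet data to be downloaded):
# from nltk.stem import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()
# tokens = [lemmatizer.lemmatize(w) for w in clean(text).split()]

print(clean("The cats, they were RUNNING!"))  # → the cats they were running
```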

Using the BERT model specified above, we tokenized the texts into 3,000 parts with the BERT tokenizer and converted them into PyTorch tensors.
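The tokenization-and-embedding step can be sketched as below. This is an illustrative sketch, not the project's exact code: it assumes `torch` and `transformers` are installed, and the mean-pooling of token vectors into a sentence embedding is an assumption, since the write-up does not say which BERT output was used.

```python
def embed(sentences):
    """Return one embedding per sentence from 'bert-base-uncased'.

    Sketch under assumptions: `torch` and `transformers` installed, model
    weights downloaded on first use. Mean pooling is an assumed choice.
    """
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()  # inference only: disable dropout

    # Tokenize into padded PyTorch tensors, as described above.
    batch = tokenizer(list(sentences), padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)

    # Mean-pool the token embeddings, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    summed = (out.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)
```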

Finally, we used the cosine similarity function from scikit-learn to measure the semantic similarity between the two sentences.
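With embeddings in hand, the scoring step looks like the sketch below; the 3-dimensional vectors are stand-ins for the 768-dimensional BERT outputs:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in embeddings; in the project these come from BERT.
emb_a = np.array([[0.2, 0.8, 0.1]])
emb_b = np.array([[0.25, 0.75, 0.05]])

score = float(cosine_similarity(emb_a, emb_b)[0, 0])
print(score)  # close to 1.0 for near-parallel vectors
```

Note that cosine similarity ranges over [-1, 1] in general, so scores may need to be clipped or rescaled to the 0-to-1 range the problem statement asks for.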

Deployment Journey:
To be transparent, I had never deployed an API on the cloud before, so I spent a full day researching free deployment options and narrowed them down to AWS Lambda and Streamlit (Heroku and Azure require a credit card, and my CIBIL score is low).

After this, the heavy dependencies in our project caused issues on AWS Lambda, which left me with Streamlit. I modified my API as required and deployed it on Streamlit via GitHub.
