
Next-Word Prediction

Contents
● Problem Statement
● Motivation
● Implementation
● Data Preprocessing
● Techniques/Algorithms used
● References
Problem Statement
To help users write better and faster by predicting the next word, maintaining the context of the statement while keeping the text professional.
Motivation
● Many youngsters and teenagers nowadays spend much of their time completing unfinished mails,
essays, assignments, projects, etc.
● This project could help reduce the time required by these tasks by recommending text as the user
writes, while making it more professional and hence giving a unique touch to the user’s text.
Implementation

The pipeline proceeds through the following stages:

● Preprocessing: remove stop words and other irrelevant words to make the sentence simpler
● Tokenizer: the word tokenizer splits the sentences into relevant words
● Named Entity Recognition: recognizes the parts of speech of the split words for better prediction
● Stemming/Lemmatization: these processes return the root word of each word in the list
● Vectorization: use a TF-IDF vectorizer to find the term frequency for a better understanding of repetitive words
● Sentiment Analysis: knowing the sentiments of the previous and current sentence can help in finding the perfect sentimental match for the words needed
● Prediction
Data Preprocessing
Text preprocessing is necessary to clean the text data and make it ready to feed to the model. Text data
contains noise in various forms, such as emoticons and punctuation.

Techniques for preprocessing may include:

● Expanding contractions: changing “don’t” to “do not”
● Removing digits
● Removing extra spaces
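The cleaning steps above can be sketched with Python’s standard library. The contraction map below is a tiny illustrative sample, not a complete list:

```python
import re

# Small illustrative contraction map; a real project would use a fuller list
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not"}

def preprocess(text: str) -> str:
    # Expand contractions
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove digits
    text = re.sub(r"\d+", "", text)
    # Collapse extra whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess("I don't have  3 apples"))  # -> "I do not have apples"
```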
Word Tokenization
The word tokenization algorithm splits sentences into words, or paragraphs into sentences, based on the need. It can also be
used for preprocessing, cleaning the sentences by removing unnecessary words (e.g., stop words).

E.g.: Consider the statement, “The monkeys are eating bananas on the tree!”

The output will be an array of the words in the input statement: [“The”, “monkeys”, “are”, “eating”, “bananas”, “on”, “the”, “tree”, “!”].
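In practice this is usually done with a library such as NLTK’s word_tokenize; a minimal regex-based sketch reproduces the example above:

```python
import re

def word_tokenize(sentence: str) -> list[str]:
    # Match runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = word_tokenize("The monkeys are eating bananas on the tree!")
print(tokens)
# -> ['The', 'monkeys', 'are', 'eating', 'bananas', 'on', 'the', 'tree', '!']
```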
Named Entity Recognition (NER)
After the text has been split by tokenization, NER is applied. NER identifies the entity of every word sent to it, tagging each token with its part-of-speech category.

where NNS = plural noun,

VBP = verb, non-3rd person singular present,

VBG = verb, gerund/present participle,

and so on.
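Tags like these are normally produced by NLTK’s pos_tag. Purely as an illustration of the tag scheme, here is a toy suffix-based tagger; the heuristics are deliberately crude and not a real tagging method:

```python
def simple_pos_tag(tokens: list[str]) -> list[tuple[str, str]]:
    # Toy heuristic tagger; a real project would use nltk.pos_tag
    tags = []
    for tok in tokens:
        low = tok.lower()
        if low.endswith("ing"):
            tags.append((tok, "VBG"))  # gerund / present-participle verb
        elif low.endswith("s") and low not in {"is", "as"}:
            tags.append((tok, "NNS"))  # plural noun (crude guess)
        else:
            tags.append((tok, "NN"))   # default: noun
    return tags

print(simple_pos_tag(["monkeys", "are", "eating", "bananas"]))
# -> [('monkeys', 'NNS'), ('are', 'NN'), ('eating', 'VBG'), ('bananas', 'NNS')]
```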
Stemming and Lemmatization
Stemming and lemmatization reduce a word to its root form. The two processes are similar, but their
results differ slightly: stemming chops suffixes mechanically, while lemmatization maps the word to a valid dictionary form.

E.g.: For the word “studies”

Stemming returns “studi” as the root form, whereas

Lemmatization returns “study” as the root form.
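The difference can be shown with toy versions of both. A real project would use NLTK’s PorterStemmer and WordNetLemmatizer; the suffix rules and mini-lexicon below are simplified stand-ins:

```python
def toy_stem(word: str) -> str:
    # Stemming strips suffixes mechanically, so the result may not be a word
    if word.endswith("ies"):
        return word[:-3] + "i"  # "studies" -> "studi"
    if word.endswith("s"):
        return word[:-1]
    return word

# Hypothetical mini-lexicon; a real lemmatizer consults a full dictionary
LEMMAS = {"studies": "study", "ate": "eat", "better": "good"}

def toy_lemmatize(word: str) -> str:
    # Lemmatization returns a valid dictionary root form
    return LEMMAS.get(word, word)

print(toy_stem("studies"))       # -> "studi"
print(toy_lemmatize("studies"))  # -> "study"
```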


Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF creates a data frame with features of tokenized words (similar to Bag of Words (BoW)), but it
scales up the rare terms and scales down the frequent terms.

E.g.: Consider the statement, “The monkeys are small. The ducks are also small.”

Here, the words “the” and “are” each have a frequency of 2, so TF-IDF scales such words down while scaling
up lower-frequency words like “monkeys” or “ducks”.
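In practice this is usually scikit-learn’s TfidfVectorizer; the weighting itself can be sketched from scratch with the classic tf × log(N / df) formula:

```python
import math
from collections import Counter

def tfidf(docs: list[str]) -> list[dict[str, float]]:
    # Naive whitespace tokenization, lowercased
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents each term appears in
    df = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        # tf-idf weight = (term frequency in doc) * log(N / document frequency)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights

w = tfidf(["The monkeys are small", "The ducks are also small"])
# Words in every document, like "the" and "are", get weight 0 (log(2/2) == 0),
# while rarer words like "monkeys" get a positive weight
print(w[0]["the"], w[0]["monkeys"])
```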
Sentiment Analysis
Sentiment analysis can be used to select the word that fits best with the context of the statement. It can be
run using TextBlob or by training a machine learning model. TextBlob does not require training: it reports the
polarity of the text, ranging from -1 (negative) to 1 (positive), along with a subjectivity score.
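With TextBlob this is simply TextBlob(text).sentiment.polarity. A toy lexicon-based sketch of the same idea, where the word scores are made up for illustration:

```python
# Hypothetical mini-lexicon; TextBlob ships a much larger scored wordlist
POLARITY = {"good": 0.7, "great": 0.8, "bad": -0.7, "terrible": -1.0}

def polarity(sentence: str) -> float:
    # Average the polarity of known words; roughly -1 (negative) to 1 (positive)
    scores = [POLARITY[w] for w in sentence.lower().split() if w in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("the movie was great"))   # -> 0.8
print(polarity("a terrible bad draft"))  # negative overall
```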
References
● analyticsvidhya.com/blog/2021/06/must-known-techniques-for-text-preprocessing-in-nlp/
● E. Chan, J. Ginsburg, B. Ten Eyck, J. Rozenblit and M. Dameron, "Text analysis and entity extraction in asymmetric threat response and
prediction," 2010 IEEE International Conference on Intelligence and Security Informatics, 2010, pp. 202-207, doi:
10.1109/ISI.2010.5484737.
● C. -z. Liu, Y. -x. Sheng, Z. -q. Wei and Y. -Q. Yang, "Research of Text Classification Based on Improved TF-IDF Algorithm," 2018 IEEE
International Conference of Intelligent Robotic and Control Engineering (IRCE), 2018, pp. 218-222, doi: 10.1109/IRCE.2018.8492945.
