Keyword Extraction Methods From Documents in NLP
Pradeep T — Published On March 22, 2022 and Last Modified On August 24th, 2022
Intermediate Libraries NLP Python
Introduction
Keyword extraction is a text analysis method that automatically pulls the most relevant words and phrases out of a text,
whether that is a single page, a series of paragraphs, or a batch of documents. It helps summarize a text’s content and
identify the key issues being discussed, for example in meeting minutes (MoM).
Source:https://towardsdatascience.com/textrank-for-keyword-extraction-by-python-c0bae21bcec0
Suppose you have collected a large number of product reviews from the internet (perhaps hundreds of thousands). Keyword
extraction can go through all of that data and find the terms that best describe each review. You’ll be able to see which
topics your customers discuss most, and automating the process will save your staff a lot of time. In this blog, I’m going
to show you how to extract keywords from documents using natural language processing. The methods covered are:
Rake_NLTK
Spacy
Textrank
Word cloud
KeyBert
Yake
MonkeyLearn API
Textrazor API
Rake_NLTK
RAKE (Rapid Automatic Keyword Extraction) is a well-known keyword extraction method that finds the most relevant words or
phrases in a piece of text using a set of stopwords and phrase delimiters. Rake_NLTK is a Python implementation of RAKE
built on top of NLTK. The steps for Rapid Automatic Keyword Extraction are as follows:
For installation
Please read this official document to learn more about the RAKE algorithm.
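The way RAKE uses stopwords and phrase delimiters can be illustrated with a minimal plain-Python sketch. This is not the rake_nltk implementation: the tiny STOPWORDS set, the sample sentence, and the helper names candidate_phrases and rake_scores are assumptions made for illustration. The text is split into candidate phrases at stopwords and punctuation, each word is scored by its degree divided by its frequency, and a phrase's score is the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real use needs a fuller list).
STOPWORDS = {"a", "an", "and", "the", "of", "is", "are", "in", "to", "for", "on", "with", "that", "it"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in STOPWORDS:
                # a stopword ends the current phrase
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each candidate phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # co-occurrences inside the phrase, counting the word itself
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    # highest score first; ties broken alphabetically
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


sample = (
    "Keyword extraction is an automated method. "
    "Rapid automatic keyword extraction finds relevant phrases in text."
)
ranked = rake_scores(sample)
for phrase, score in ranked:
    print(f"{score:5.1f}  {phrase}")
```

Long phrases tend to score highest under this scheme because every word in them gets a high degree, which is why a good stopword list matters so much in practice.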
Spacy
For installation
import spacy
from collections import Counter
from string import punctuation

nlp = spacy.load("en_core_web_sm")

def get_hotwords(text):
    # collect proper nouns, adjectives, and nouns, skipping stop words and punctuation
    result = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN']
    doc = nlp(text.lower())
    for token in doc:
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue
        if token.pos_ in pos_tag:
            result.append(token.text)
    return result

new_text = """
When it comes to evaluating the performance of keyword extractors, you can use some of the standard
metrics in machine learning: accuracy, precision, recall, and F1 score. However, these metrics don’t
reflect partial matches; they only consider the perfect match between an extracted segment and the
correct prediction for that tag.
Fortunately, there are some other metrics capable of capturing partial matches. An example of this is
ROUGE.
"""
# set() removes duplicates, so every count below is 1 and the printed order is not a frequency ranking;
# pass the list directly to Counter if you want the words ranked by frequency
output = set(get_hotwords(new_text))
most_common_list = Counter(output).most_common(10)
for item in most_common_list:
    print(item[0])
Output
accuracy
precision
capable
partial
prediction
score
correct
extractors
matches
perfect
Textrank
TextRank is a graph-based algorithm for extracting keywords and summarising text. It determines how closely words are
related by looking at whether they follow one another, and then ranks the most important terms in the text using the
PageRank algorithm. The implementation used here plugs into the spaCy pipeline. These are the main steps TextRank performs
when extracting keywords from a document.
Step – 1: To find relevant terms, the TextRank algorithm builds a word network (word graph) by looking at which words are
linked to one another. If two words appear next to each other in the text, a link is established between them, and the link
is given more weight the more often the two words appear next to each other.
Step – 2: To identify the relevance of each word, the PageRank algorithm is applied to the resulting network. The top third
of the ranked words is kept and considered important. Then, if relevant words appear one after another in the text, they
are grouped together to build a keywords table.
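The two steps above can be sketched in plain Python. This is a simplified illustration, not PyTextRank's implementation: the small STOPWORDS set, the sample text, the damping factor of 0.85, the fixed 50 iterations, and the helper names are all assumptions, and words are linked only when they are directly adjacent after stopword removal.

```python
import re
from collections import defaultdict

# Small illustrative stopword list (an assumption, not a real linguistic resource).
STOPWORDS = {"a", "an", "and", "the", "of", "have", "is", "are", "in", "to"}


def build_graph(words):
    """Step 1: link adjacent words; repeated adjacency increases the link weight."""
    graph = defaultdict(lambda: defaultdict(float))
    for left, right in zip(words, words[1:]):
        if left != right:
            graph[left][right] += 1.0
            graph[right][left] += 1.0
    return {node: dict(neighbours) for node, neighbours in graph.items()}


def pagerank(graph, damping=0.85, iterations=50):
    """Step 2: weighted PageRank computed by simple power iteration."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            incoming = sum(
                rank[nb] * graph[nb][node] / out_weight[nb] for nb in graph[node]
            )
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def group_keywords(words, keywords):
    """Join keywords that appear one after another into phrases."""
    phrases, current = set(), []
    for word in words + [None]:  # the None sentinel flushes the last run
        if word in keywords:
            current.append(word)
        else:
            if current:
                phrases.add(" ".join(current))
            current = []
    return phrases


text = (
    "Linear constraints and linear equations define systems. "
    "Systems of linear equations have solutions."
)
words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
rank = pagerank(build_graph(words))
ranked_words = sorted(rank, key=rank.get, reverse=True)
top_third = ranked_words[: max(1, len(ranked_words) // 3)]  # keep the top third
phrases = group_keywords(words, set(top_third))
print(top_third)
print(sorted(phrases))
```

On this sample the two best-connected words, "linear" and "equations", survive the cut, and because they occur next to each other in the text they are merged into the phrase "linear equations".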
The tool we’ll use for keyword extraction is PyTextRank, a Python implementation of TextRank packaged as a spaCy pipeline
plugin. It provides fast and accurate phrase extraction as well as extractive summarization for use in spaCy workflows. The
graph method isn’t reliant on any specific natural language and doesn’t require domain knowledge. Please see the base paper
here to learn more about TextRank.
For installation
import spacy
import pytextrank  # importing pytextrank registers the "textrank" pipeline component

# example text (adjacent string literals are joined into one string)
text = (
    "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of "
    "compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict "
    "inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms "
    "of construction of minimal generating sets of solutions for all types of systems are given. These "
    "criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can "
    "be used in solving all the considered types systems and systems of mixed types."
)
# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")
# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(text)
# examine the top-ranked phrases in the document
for phrase in doc._.phrases[:10]:
    print(phrase.text)
Output
mixed types
minimal generating sets
systems
nonstrict inequations
strict inequations
natural numbers
linear Diophantine equations
solutions
linear constraints
a minimal supporting set
Word Cloud
A word cloud (also known as a tag cloud or a text cloud) is a data visualization tool for text data: a collection of words
shown in different sizes, where the size of each word represents its frequency or relevance. The more often a term appears
in a source of textual data (such as a speech, blog post, or database), the larger and bolder it appears in the word cloud.
Word clouds can be used to emphasise important textual data points, and data from social networking websites is frequently
analyzed with them. They are a quick way to surface the most prominent terms in textual data such as blog posts and
databases.
For installation
For extracting the keywords and showing their relevancy using Wordcloud
import collections
import numpy as np
import pandas as pd
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from matplotlib import rcParams
from wordcloud import WordCloud, STOPWORDS
all_headlines = """
When it comes to evaluating the performance of keyword extractors, you can use some of the standard
metrics in machine learning: accuracy, precision, recall, and F1 score. However, these metrics don’t
reflect partial matches; they only consider the perfect match between an extracted segment and the
correct prediction for that tag.
Fortunately, there are some other metrics capable of capturing partial matches. An example of this is
ROUGE.
"""
stopwords = STOPWORDS
wordcloud = WordCloud(stopwords=stopwords, background_color="white",
max_words=1000).generate(all_headlines)
rcParams['figure.figsize'] = 10, 20
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
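# STOPWORDS entries are lowercase, so compare each word in lowercase
filtered_words = [word for word in all_headlines.split() if word.lower() not in stopwords]
counted_words = collections.Counter(filtered_words)
words = []
counts = []
# keep the ten most frequent words and their counts for the bar chart
for word, count in counted_words.most_common(10):
    words.append(word)
    counts.append(count)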
colors = cm.rainbow(np.linspace(0, 1, 10))
rcParams['figure.figsize'] = 20, 10
plt.title('Top words in the headlines vs their count')
plt.xlabel('Count')
plt.ylabel('Words')
plt.barh(words, counts, color=colors)
plt.show()
Output
Source: Author
Source: Author
KeyBert
KeyBERT is a minimal, easy-to-use keyword extraction technique that uses BERT embeddings to find the keywords and keyphrases
most similar to a given document. It relies on BERT embeddings and simple cosine similarity to locate the sub-phrases in a
document that are most similar to the document itself. First, BERT is used to extract a document embedding, which gives a
document-level representation. Next, embeddings are extracted for N-gram words/phrases. Finally, cosine similarity is used
to find the words/phrases that are most similar to the document. These can then be taken as the terms that best describe the
entire document.
KeyBERT generates embeddings with pre-trained transformer models from Hugging Face. The all-MiniLM-L6-v2 model is used by
default for embedding.
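The ranking step described above (embed the document, embed the candidate n-grams, sort by cosine similarity) can be sketched without any model. In this sketch the embed function is a toy bag-of-words count vector standing in for a BERT embedding, which is an assumption made purely so the example runs with the standard library; it is not what KeyBERT does internally, and the STOPWORDS set, sample sentence, and helper names are likewise hypothetical.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption).
STOPWORDS = {"is", "the", "of", "a", "from", "an", "to", "and"}


def tokens(text):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def embed(text):
    """Toy stand-in for a BERT embedding: a sparse bag-of-words count vector."""
    return Counter(tokens(text))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def candidate_ngrams(text, max_n=2):
    """Candidate keywords: every 1..max_n word n-gram of the filtered tokens."""
    words = tokens(text)
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def rank_keywords(text, top_n=5):
    """Rank candidates by cosine similarity to the whole-document vector."""
    doc_vector = embed(text)
    scored = [(cand, cosine(doc_vector, embed(cand))) for cand in candidate_ngrams(text)]
    # highest similarity first; ties broken alphabetically
    return sorted(scored, key=lambda item: (-item[1], item[0]))[:top_n]


doc = (
    "Supervised learning is the machine learning task of learning "
    "a function from labeled training data."
)
ranked = rank_keywords(doc)
for phrase, score in ranked:
    print(f"{score:.4f}  {phrase}")
```

With count vectors the ranking mostly follows word frequency. Contextual embeddings from a transformer model instead place semantically related phrases close together, so the candidates they rank highest can differ substantially from this toy version.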
For installation
For extracting the keywords and showing their relevancy using KeyBert
Output
[('supervised', 0.6676), ('labeled', 0.4896), ('learning', 0.4813), ('training', 0.4134), ('labels', 0.3947)]
Yake
Unsupervised approach
Corpus-Independent
Domain and Language Independent
Single-Document
Python implementation of keyword extraction using Yake
For installation
For extracting the keywords and showing their relevancy using Yake
import yake
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias)."""
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(doc)
for kw in keywords:
    print(kw)
Output
MonkeyLearn API
MonkeyLearn is a user-friendly text analysis tool with a pre-trained keyword extractor that you can call through
MonkeyLearn’s API to extract important phrases from your data. The API can be used from all major programming languages, so
developers can extract keywords with just a few lines of code and receive the extracted keywords as JSON. MonkeyLearn also
has a free word cloud generator that works as a simple ‘keyword extractor,’ allowing you to build tag clouds of your most
important terms. Once you’ve created a MonkeyLearn account, you’ll be given an API key and a Model ID for extracting
keywords from text.
Check out the official Monkeylearn API docs for additional information.
Extract keywords from product descriptions, customer feedback, and other sources.
Determine which terms customers use most frequently.
Monitor brand, product, and service mentions in real time.
Automate and speed up data extraction and entry.
Python implementation of keyword extraction using MonkeyLearn API
For installation
Output
performance of keyword
standard metric
f1 score
partial match
correct prediction
extracted segment
machine learning
keyword extractor
perfect match
metric
Textrazor API
Textrazor is another API for extracting keywords and other useful elements from unstructured text. It can be accessed from
a variety of programming languages, including Python, Java, and PHP. Once you have created a Textrazor account, you will
receive an API key for extracting keywords from text. Visit the official website for additional information.
Textrazor is a good choice for developers who need fast extraction tools with comprehensive customization options. It’s a
keyword extraction service that may be used locally or in the cloud. The TextRazor API extracts meaning from text and is
easy to integrate with the programming language of your choice. In addition to extracting keywords and entities in 12
different languages, it lets you design custom extractors and extract synonyms and relationships between entities.
For installation
For extracting the keywords with relevance_score and confidence_score from a webpage using Textrazor API
import textrazor
textrazor.api_key = "your_api_key"
client = textrazor.TextRazor(extractors=["entities", "topics"])
response = client.analyze_url("https://www.textrazor.com/docs/python")
for entity in response.entities():
    print(entity.id, entity.relevance_score, entity.confidence_score)
Output
Conclusion
Keyword extraction is an automated method of extracting the most relevant words and phrases from text input. The important
points to remember are given below.
Keyword extraction is commonly used when we need to extract key information from a batch of documents.
In this article, I have introduced some of the most popular tools for automatic keyword extraction in NLP.
Rake NLTK, Spacy, Textrank, Word cloud, KeyBert, and Yake are the tools, and MonkeyLearn and Textrazor are the APIs,
that I covered here.
Each of these tools has its own advantages and specific use cases.
These are among the most widely used keyword extraction techniques in the data science field.
Endnotes
The goal of keyword extraction is to automatically find the phrases that best describe the content of a document. Key
phrases, key terms, key segments, or simply keywords are the names used for the terms that carry the most relevant
information in the document. My Github page contains the entire codebase for these keyword extraction methods. If you have
any problems when using these tools, please let us know in the comments section below.
Happy coding
The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.
Nitin says:
September 01, 2022 at 3:15 pm
Which of the keyword extraction techniques works the best for extracting the product type from product title in ecommerce data?
eg. "Adidas womens Hoops 2.0 Basketball Shoe" should return "shoe" or even better "Basketball Shoe".
John says:
September 13, 2022 at 3:37 pm
Nice post! Thanks for sharing the post about the keywords extraction method using documents in NLP. Is there any other method?
Shine says:
December 19, 2022 at 3:04 pm
Nice Blog. Could you please share github link. Though its mentioned in blog, hyperlink is missing.