
Introduction to Text Mining

CONTENTS

∙ What Is Text Mining?
∙ How Does Text Mining Work?
∙ What Problems Does Text Mining Solve?
∙ How Can It Differentiate Products and Services?
∙ Getting Started
∙ Development Environment
∙ Key Methods and Techniques
∙ Conclusion

CHRIS LAMB
PROFESSOR AND PRINCIPAL SCIENTIST

WHAT IS TEXT MINING?

Text mining is an ambiguous term for extracting useful information from otherwise unstructured text. There are two particular terms we need to pay close attention to when defining "text mining" — extracting useful information and unstructured text.

Useful information, in this context, could be anything from basic facts expressed by the text to advanced sentiment analysis indicating the state of mind of the author at the time the text was created.

Unstructured text means that the information is not stored in a structured format like XML or a database table. The text is still structured in some way, usually dictated by the language in which it's written and the custom of the medium. For example, you may be analyzing English text, but that text may be a series of tweets, so it's not grammatically correct and contains abbreviations and emojis. Nevertheless, while there is some underlying structure, the document isn't formally structured.

Many of these examples draw heavily from a variety of sources that address text mining and natural language processing. One of the most influential is Loper's Natural Language Processing with Python¹, which is still considered the canonical reference for text analysis in Python today.

HOW DOES TEXT MINING WORK?

Essentially, text mining allows an analyst to extract meaningful information from documents filled with unstructured data. Keep in mind, most information that we deal with day-to-day is completely unstructured and, as a result, not accessible to modern data analysis techniques. Mining text allows you to extract underlying information from these unstructured data sources that you can then structure, analyze, and process.

WHAT PROBLEM DOES TEXT MINING SOLVE?

With text mining, you can extract information from written text. This is something we do, naturally, every day, in conversations or when we read. Like driving a car, once we learn how to do it, we take it for granted. Also like driving a car, it has been resistant to automation, only recently becoming tractable and automatable thanks to the explosion of computational power, additional algorithm development, and machine learning techniques. New advances in this area promise to revolutionize customer service, business intelligence, and a myriad of other fields.

1. Loper, Edward, Klein, Ewan, and Bird, Steven. Natural Language Processing with Python. O'Reilly, 2009.

HOW CAN IT DIFFERENTIATE PRODUCTS AND SERVICES?

Effective text mining opens up new application areas while improving the quality of existing ones. Customer service systems with integrated text analysis and effective voice-to-text capabilities can build analytic pipelines supporting real-time sentiment analysis, allowing representatives to engage with customers knowing their emotional state before a word is said. Internet analysis tools can plumb the web for information, visiting competitors and extracting information regarding current capabilities to feed business intelligence analysis systems. Overall, text mining tools can make business systems more robust, insightful, and powerful.

GETTING STARTED

Text mining is a complex area moving quickly into machine learning and artificial intelligence. This Refcard will walk you through the text-specific topics you need to know to start moving into more complex areas like semantic analysis and meaning extraction by showing you the underlying techniques specific to unstructured text analysis.

A typical text analysis workflow mirrors the outline in Figure 1. Here, we have two groups of inter-related activities: traditional text analysis and AI-enabled text analysis. With tools like TensorFlow and PyTorch, AI software development is more straightforward than ever, but it's still far from simple. Due to that complexity, these topics are out of the scope of this Refcard; instead, we will focus on traditional text analytic techniques.

Figure 1: A typical text analysis workflow

DEVELOPMENT ENVIRONMENT

In this Refcard, we're going to focus on using Python 3 and the Natural Language Toolkit², for the most part. R is another popular platform for text processing, but I prefer using Python because of its extensive collection of libraries. I suggest using Anaconda³ for this kind of work as well, as it will allow you to create custom isolated Python environments that you can use for a variety of things.

Setting up your environment is well documented on the Anaconda site. Follow those instructions⁴, and then create a new environment — you can name it whatever you'd like. I'm going to name mine text_analysis.
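If you haven't created Anaconda environments before, two commands cover this step: conda create --name text_analysis creates the environment, and conda activate text_analysis switches into it (substitute whatever name you chose).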
I use a variety of other tools too, like iPython and Jupyter. I suggest you install those as well (conda install python jupyter should do the trick). We'll use Jupyter notebooks to track our examples, and I'll make copies available on GitHub for you to work through as well.

In order to practice, the NLTK includes a downloader that will install a variety of texts from Project Gutenberg⁵, an excellent source of material for practicing text analysis techniques. When we demonstrate such techniques, we download and load a given book using the following code snippet:

import nltk

nltk.download('gutenberg')
paradise_lost = nltk.corpus.gutenberg.words('milton-paradise.txt')

2. Natural Language Toolkit, retrieved from https://www.nltk.org/ on April 5, 2020.
3. Anaconda, retrieved from https://www.anaconda.com/ on April 5, 2020.
4. Anaconda Distributions, retrieved from https://www.anaconda.com/distribution on April 16, 2020.
5. Project Gutenberg, retrieved from http://www.gutenberg.org/ on April 5, 2020.


I’ll omit this in the following examples, but you can insert this You can derive N-grams using native Python tools. However, this can
wherever needed. The downloader won’t download the books if be time-consuming, and the resulting list of N-grams needs to be
you’ve already done so, so you can include this at the top of any code processed into something useful.
you write.
Here, we generate an initial collection of Trigrams, and then we count
the most common ones, sorting the resulting list in descending
KEY METHODS AND TECHNIQUES
order. Note that we’ve converted all words into lower case prior to
WORD FREQUENCY processing to avoid differences between ‘Word’ and ‘word’.
Word frequency measures a given text and provides insight into the
COLLOCATION
topics discussed and key concepts.
A collocation is a sequence of words that occur more frequently than
from nltk.probability import FreqDist
you’d expect. They can provide insight into common terminology,
distribution = FreqDist(paradise_lost)
print(distribution.most_common(50))
overall sentiment, and the primary theme of a given corpus. Bigrams
distribution.plot(50, cumulative=False) and trigrams are examples of collocations. When we covered
N-grams, we did things the hard way. Now that we understand more
Here, we’re extracting the tokens from the text and graphing them.
clearly what they are, we can lean on NLTK to find these for us:
This allows you to see the most common tokens.
import nltk
We can also examine the characteristics of words, like so: fromnltk.collocationsimportBigramAssocMeasures,TrigramAssocMeasures,
BigramCollocationFinder, TrigramCollocationFinder
long_words = [w for w in paradise_lost if len(w) > 10] nltk.download('gutenberg')
distribution = FreqDist(long_words) paradise_lost=nltk.corpus.gutenberg.words('milton-paradise.txt')
print(distribution.most_common(50))

bigram_measures = BigramAssocMeasures()
The group of words in paradise_lost can be treated as a Python list, trigram_measures = TrigramAssocMeasures()
and then used as an argument to the various distribution tools, like
FreqDist or ConditionalFreqDist . bigram_finder=BigramCollocationFinder.from_words(paradise_lost)
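The listings above stop at FreqDist. As a minimal sketch of ConditionalFreqDist (my illustration, not from the original, with word length as an arbitrary condition), it expects (condition, sample) pairs and keeps a separate frequency distribution per condition:

from nltk.probability import ConditionalFreqDist

# One FreqDist per condition; here, the condition is the word's length.
conditional = ConditionalFreqDist((len(w), w.lower()) for w in paradise_lost)

# The most common ten-letter words in the corpus:
print(conditional[10].most_common(10))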
N-GRAMS

N-grams are essentially contiguous sequences of N tokens drawn from a corpus. You can derive them using native Python tools, as we do here. However, this can be time-consuming, and the resulting list of N-grams needs to be processed into something useful:

paradise_lost = [w.lower() for w in paradise_lost]

def generate_ngrams(n=1, corpus=[]):
    # A corpus of length L contains L - n + 1 contiguous n-grams.
    return [tuple(corpus[i:i+n]) for i in range(len(corpus) - n + 1)]

trigrams = generate_ngrams(n=3, corpus=paradise_lost)

# Pair each trigram with a counter, then count occurrences pairwise.
distribution = [[list(t), 0] for t in trigrams]

for lhs in distribution:
    for rhs in distribution:
        if lhs[0] == rhs[0]:
            lhs[1] = lhs[1] + 1

# Deduplicate, sort by count in descending order, and keep only repeats.
results = set([(tuple(t[0]), t[1]) for t in distribution])
results = list(results)
results.sort(key=lambda v: v[1], reverse=True)
results = list(filter(lambda v: v[1] > 1, results))

Here, we generate an initial collection of trigrams, and then we count the most common ones, sorting the resulting list in descending order. Note that we've converted all words into lower case prior to processing to avoid differences between 'Word' and 'word'.
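For comparison, and as my own aside rather than part of the original Refcard, NLTK ships an ngrams helper that yields the same contiguous tuples, and FreqDist can count them in a single pass instead of the quadratic loop above:

from nltk import ngrams
from nltk.probability import FreqDist

# Count trigrams directly; ngrams() yields contiguous 3-tuples lazily.
trigram_counts = FreqDist(ngrams(paradise_lost, 3))
print(trigram_counts.most_common(10))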

COLLOCATION

A collocation is a sequence of words that occurs more frequently than you'd expect. Collocations can provide insight into common terminology, overall sentiment, and the primary theme of a given corpus. Bigrams and trigrams are examples of collocations. When we covered N-grams, we did things the hard way. Now that we understand more clearly what they are, we can lean on NLTK to find these for us:

import nltk
from nltk.collocations import BigramAssocMeasures, TrigramAssocMeasures, \
    BigramCollocationFinder, TrigramCollocationFinder

nltk.download('gutenberg')
paradise_lost = nltk.corpus.gutenberg.words('milton-paradise.txt')

bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

# Rank bigrams and trigrams by raw frequency.
bigram_finder = BigramCollocationFinder.from_words(paradise_lost)
bigrams = bigram_finder.nbest(bigram_measures.raw_freq, 10)

trigram_finder = TrigramCollocationFinder.from_words(paradise_lost)
trigrams = trigram_finder.nbest(trigram_measures.raw_freq, 10)

print('--- By Frequency ---')
print(bigrams)
print(trigrams)

# Rank the same collocations by pointwise mutual information (PMI).
bigram_finder = BigramCollocationFinder.from_words(paradise_lost)
bigrams = bigram_finder.nbest(bigram_measures.pmi, 10)

trigram_finder = TrigramCollocationFinder.from_words(paradise_lost)
trigrams = trigram_finder.nbest(trigram_measures.pmi, 10)

print('--- By Pointwise Mutual Information ---')
print(bigrams)
print(trigrams)

Here, we are ranking our collocations by two different algorithms. One is raw frequency, which is what we did in the N-gram section. The other is pointwise mutual information. There's a slew of other ranking measures available in the *AssocMeasures classes as well.
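One refinement worth knowing about (my addition, not from the original text): PMI strongly rewards word pairs that appear only once or twice, so it's common to filter rare N-grams out of a finder before ranking. A minimal sketch, reusing bigram_finder from above; the threshold of 3 is an arbitrary illustrative choice:

# Drop bigrams observed fewer than three times, then re-rank by PMI.
bigram_finder.apply_freq_filter(3)
print(bigram_finder.nbest(bigram_measures.pmi, 10))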


TEXT CLASSIFICATION

Text classification is the process of labeling various tokens with particular types. These types could be relatively simple, like nouns or verbs, or they can be much more abstract. Nouns and other grammatical types are well understood linguistically. Semantic types are a bit more difficult to handle, due to additional complexity and the sheer scale of language semantics.

Classifying text is similar to using typical machine learning classifiers. You need a corpus, you need a training set to generate a model, and then you need to run that model against a test set, the set of words you're classifying.

import nltk
import random

from nltk.corpus import names
from sklearn.svm import SVC

nltk.download('names')

# Use the last letter of a name as its single feature.
feature_extractor = lambda name: {'feature': name[-1]}

# Rebind 'names' from the corpus reader to a list of labeled examples.
names = (
    [(name, 'male') for name in names.words('male.txt')]
    + [(name, 'female') for name in names.words('female.txt')]
)
random.shuffle(names)

data = [(feature_extractor(n), g) for (n, g) in names]

# Hold out the first 10% of the shuffled data as the test set.
demark = int(len(data) * 0.1)
train_data, test_data = data[demark:], data[:demark]

nb_classifier = nltk.NaiveBayesClassifier.train(train_data)
dt_classifier = nltk.DecisionTreeClassifier.train(train_data)
me_classifier = nltk.MaxentClassifier.train(train_data)
sk_classifier = nltk.SklearnClassifier(SVC(), sparse=False).train(train_data)

print(nltk.classify.accuracy(nb_classifier, test_data))
print(nltk.classify.accuracy(dt_classifier, test_data))
print(nltk.classify.accuracy(me_classifier, test_data))
print(nltk.classify.accuracy(sk_classifier, test_data))

This simple example shows how you can use classifiers to classify names by gender. Here, we're using the names dataset available through NLTK, and four different classifiers.

Essentially, this follows the pattern you'd expect in any supervised machine learning example: we start with data, partition it into test and training sets based on identified features, train the models, and then evaluate the models.
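Once trained, any of these classifiers can label new inputs one at a time. A minimal usage sketch (my addition; 'Chris' is just an illustrative input, and results will vary with the shuffle):

# Classify a previously unseen name with the trained Naive Bayes model.
print(nb_classifier.classify(feature_extractor('Chris')))

# Naive Bayes can also report the most informative feature values.
nb_classifier.show_most_informative_features(5)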
REAL-WORLD APPLICATIONS

HEALTHCARE

Healthcare, especially tele-health, is a field ripe for disruption via text analysis. Anything that can be spoken can be transcribed into text and then analyzed. Furthermore, any correspondence between a physician and patient can be examined as well. This information can be used in any number of domains.

Fraud is always an issue, whether it be in a hospital or private practice. By evaluating and extracting information from medical records, doctors' notes, and correspondence, systems can help locate instances of fraudulent prescriptions, for example, or more egregious fraud.

Information extracted from written records can be fed into expert systems or neural networks as well. This kind of system could help healthcare providers diagnose rare disorders in patients that otherwise may not be diagnosed properly.

BANKING

Banking is a particular area where text analysis and information extraction can be key to providing strong, accurate customer service. Financial firms are notorious for leaving detailed and complete paper trails for any and all transactions. Often, these records are very detailed, a potential treasure trove of untapped information.

Particular areas of interest include sentiment analysis and information extraction. Information extracted from financial records can be saved for later access in alternate, more heavily structured forms. This makes identifying links between seemingly unrelated pieces of information much easier and makes the data much more tractable.

Financial companies are more susceptible to fraud than medical practices. The ability to discover when employees may be vulnerable to temptation is key to preventing financial losses to the firm and significant professional harm to the employee. Early intervention on the part of an organization in cases like this doesn't need to be heavy-handed if the potential fraud is discovered early enough.

MANUFACTURING

Manufacturing environments typically have many, many manuals covering equipment and overall system technical details, standard operating procedures, and the like. This information usually sits on a shelf, out of reach, and is infrequently referenced once staff is trained. Extracting this information from PDFs, MS Word documents, or RTF files can make it much more accessible.


Once extracted, this information can be structured into more formalized representations, transformed into estimation or work validation systems, or used to enhance overall project management. Many of the small details inherent in using industrial equipment can be more closely monitored and checked to ensure process compliance, overall cost adherence, and personal safety.

INSURANCE

The insurance industry has long been focused on two conflicting objectives — customer service and fraud detection. Luckily, text mining can help with both.

Most interactions with insurance companies are recorded and can be mined for information both online and offline. Online analysis can help customer service agents understand the mindset of the customers with whom they interact, improving overall customer intimacy and rapport. Offline systems can extract information from the recorded conversations, structure the data, and identify possible fraud via sentiment analysis or potential sales via information correlation with prior customers.

CONCLUSION

Text mining has more potential to impact the day-to-day operations of enterprises than ever before. The kinds of techniques we've covered in this Refcard are as applicable to insurance as they are to social media analysis.

The ability to extract structured, meaningful information from otherwise opaque unstructured text opens new capabilities for customer interaction, operational refinement, or crime prevention. The ability to review customer service interactions with call-center personnel provides new ways to increase consumer satisfaction and identify new sales relationships that had been hidden in the mess of unstructured data that organizations collect every day.

Finally, since the vast majority of data out there today is unstructured and unapproachable, text mining can supply ways to finally examine all of the data organizations have available.

Whether to make us safer, improve our customer experiences, or make our workplaces run more smoothly, text mining is here to stay.

Written by Chris Lamb, Professor and Principal Scientist, Sandia National Laboratories

Dr. Lamb currently serves as a cyber-security research scientist with Sandia National Laboratories. He is also a Research Assistant Professor affiliated with the Electrical and Computer Engineering department at the University of New Mexico. His current research interests center around industrial control system cybersecurity, particularly in reference to nuclear power plants, machine learning, artificial intelligence, and their intersections.

He has extensive experience designing and developing mission-critical distributed systems for a wide range of government departments and agencies. Prior to joining Sandia National Laboratories and working with the University of New Mexico, Dr. Lamb served in executive roles and as a principal consultant for a variety of technology companies in the southwest. Dr. Lamb has a B.S. in Mechanical Engineering from New Mexico State University, an M.S. in Computer Science from the University of New Mexico, as well as a Ph.D. in Computer Engineering with a focus on Computational Intelligence from the University of New Mexico.
