Introduction to Text Mining

What Is Text Mining?
What Problems Does Text Mining Solve?
How Can It Differentiate Products and Services?
Getting Started
Development Environment
Key Methods and Techniques
Conclusion

WHAT IS TEXT MINING?

Text mining is an ambiguous term for extracting useful information from otherwise unstructured text. There are two particular terms we need to pay close attention to when defining “text mining”: extracting useful information and unstructured text.

Useful information, in this context, could be anything from basic facts expressed by the text to advanced sentiment analysis indicating the state of mind of the author at the time the text was created.

Unstructured text means that the information is not stored in a structured format like XML or a database table. The text is still structured in some way, usually dictated by the language in which it’s written and the custom of the medium.

For example, you may be analyzing English text, but that text may be a series of tweets, so it’s not grammatically correct and contains abbreviations and emojis. Nevertheless, while there is some underlying structure, the document isn’t formally structured.

[...] unstructured, and not accessible to modern data analysis techniques as a result. Mining text allows you to extract underlying information from these unstructured data sources that you can then structure, analyze, and process.

WHAT PROBLEMS DOES TEXT MINING SOLVE?

With text mining, you can extract information from written text. This is something we do, naturally, every day, in conversations or when we read. Like driving a car, once we learn how to do it, we take it for granted.

Like driving a car, it has been resistant to automation, and has only recently become more tractable with the explosion of computation power, additional algorithm development, and machine learning techniques. New advances in this area promise to revolutionize customer service, business intelligence, and a myriad of other fields.

HOW CAN IT DIFFERENTIATE PRODUCTS AND SERVICES?

Effective text mining opens up new application areas while improving the quality of existing ones. Customer service systems with integrated text analysis and effective voice-to-text capabilities can build analytic pipelines supporting real-time sentiment analysis, allowing representatives to engage with customers knowing their emotional state prior to saying a word. Internet analysis tools can plumb the web for information, visiting competitors and extracting information regarding current capabilities to feed business intelligence analysis systems. Overall, text mining tools can make business systems more robust, insightful, and powerful.

GETTING STARTED

Text mining is a complex area moving quickly into machine learning and artificial intelligence. This Refcard will walk you through the text-specific topics you need to know to start to move into more complex areas like semantic analysis and meaning extraction by showing you the underlying techniques specific to unstructured text analysis.

A typical text analysis workflow mirrors the outline in Figure 1. Here, we have two groups of inter-related activities: traditional text analysis and AI-enabled text analysis. With tools like TensorFlow and PyTorch, AI software development is more straightforward than ever, but it’s still far from simple. Due to that complexity, these topics are out of the scope of this Refcard, and instead, we will focus on traditional text analytic techniques.

Figure 1: A typical text analysis workflow

DEVELOPMENT ENVIRONMENT

In this Refcard, we’re going to focus on using Python 3 and the Natural Language Toolkit [2], for the most part. R is another popular platform for text processing, but I prefer using Python because of its extensive collection of libraries. I suggest using Anaconda [3] for this kind of work as well, as it will allow you to create custom isolated Python environments that you can use for a variety of things.

Setting up your environment is well documented on the Anaconda site. Follow those instructions [4], and then create a new environment; you can name it whatever you’d like. I’m going to name mine text_analysis.

I use a variety of other tools too, like IPython and Jupyter. I suggest you install those too (conda install python jupyter should do the trick). We’ll use Jupyter notebooks to track our examples, and I’ll make copies available for you to work through as well on GitHub.

In order to practice, the NLTK includes a downloader that will install a variety of texts from the Gutenberg project [5], an excellent source of material for practicing text analysis techniques. When we demonstrate such techniques, we download and load a given book using the following code snippet:

    import nltk

    # Fetch the Gutenberg corpus (skipped if it is already present locally)
    nltk.download('gutenberg')
    # Load Milton's Paradise Lost as a sequence of word tokens
    paradise_lost = nltk.corpus.gutenberg.words('milton-paradise.txt')

[2] Natural Language Toolkit, retrieved April 5, 2020.
[3] Anaconda, retrieved April 5, 2020.
[4] Anaconda Distributions, retrieved April 16, 2020.
[5] Project Gutenberg, retrieved April 5, 2020.



I’ll omit this in the following examples, but you can insert this wherever needed. The downloader won’t download the books if you’ve already done so, so you can include this at the top of any code you write.

KEY METHODS AND TECHNIQUES

WORD FREQUENCY

Word frequency measures how often each token appears in a given text and provides insight into the topics discussed and key concepts.

    from nltk.probability import FreqDist

    # Count every token, then plot the 50 most common ones
    distribution = FreqDist(paradise_lost)
    distribution.plot(50, cumulative=False)

Here, we’re extracting the tokens from the text and graphing them. This allows you to see the most common tokens.

We can also examine the characteristics of words, like so:

    # Restrict the distribution to words longer than 10 characters
    long_words = [w for w in paradise_lost if len(w) > 10]
    distribution = FreqDist(long_words)

The group of words in paradise_lost can be treated as a Python list, and then used as an argument to the various distribution tools, like FreqDist or ConditionalFreqDist.

N-GRAMS

N-grams are essentially lists of N contiguous tokens in a corpus.

    # Lowercase every token so that 'Word' and 'word' are counted together
    paradise_lost = [w.lower() for w in paradise_lost]

    def generate_ngrams(n=1, corpus=()):
        # Slide a window of size n over the corpus; the last window starts at len(corpus) - n
        return [tuple(corpus[i:i+n]) for i in range(len(corpus) - n + 1)]

    trigrams = generate_ngrams(n=3, corpus=paradise_lost)

    # Pair each trigram with a counter
    distribution = [[list(t), 0] for t in trigrams]

    # Count occurrences by comparing every trigram with every other one (quadratic, so slow)
    for lhs in distribution:
        for rhs in distribution:
            if lhs[0] == rhs[0]:
                lhs[1] = lhs[1] + 1

    # Collapse duplicates, sort by count in descending order, keep trigrams seen more than once
    results = set([(tuple(t[0]), t[1]) for t in distribution])
    results = list(results)
    results.sort(key=lambda v: v[1], reverse=True)
    results = list(filter(lambda v: v[1] > 1, results))

You can derive N-grams using native Python tools. However, this can be time-consuming, and the resulting list of N-grams needs to be processed into something useful.

Here, we generate an initial collection of trigrams, and then we count the most common ones, sorting the resulting list in descending order. Note that we’ve converted all words into lower case prior to processing to avoid differences between ‘Word’ and ‘word’.

COLLOCATIONS

A collocation is a sequence of words that occurs more frequently than you’d expect. Collocations can provide insight into common terminology, overall sentiment, and the primary theme of a given corpus. Bigrams and trigrams are examples of collocations. When we covered N-grams, we did things the hard way. Now that we understand more clearly what they are, we can lean on NLTK to find these for us:

    import nltk
    from nltk.collocations import (
        BigramAssocMeasures, TrigramAssocMeasures,
        BigramCollocationFinder, TrigramCollocationFinder,
    )

    nltk.download('gutenberg')
    paradise_lost = nltk.corpus.gutenberg.words('milton-paradise.txt')

    bigram_measures = BigramAssocMeasures()
    trigram_measures = TrigramAssocMeasures()

    bigram_finder = BigramCollocationFinder.from_words(paradise_lost)
    trigram_finder = TrigramCollocationFinder.from_words(paradise_lost)

    # Ten best collocations ranked by raw frequency
    bigrams = bigram_finder.nbest(bigram_measures.raw_freq, 10)
    trigrams = trigram_finder.nbest(trigram_measures.raw_freq, 10)
    print('---By Frequency ---')
    print(bigrams)
    print(trigrams)

    # Ten best collocations ranked by pointwise mutual information
    bigrams = bigram_finder.nbest(bigram_measures.pmi, 10)
    trigrams = trigram_finder.nbest(trigram_measures.pmi, 10)
    print('---By Pointwise Mutual Information ---')
    print(bigrams)
    print(trigrams)

Here, we are sorting our collocations by two different algorithms. One is by raw frequency, which is what we did in the N-gram section. We are also sorting them by pointwise mutual information. There’s a slew of other sorting techniques available in the *AssocMeasures classes as well.
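To make the difference between the two rankings concrete, here is a minimal sketch in plain Python that counts bigrams in a single pass with collections.Counter and then scores each one by pointwise mutual information. It is an illustration under stated assumptions, not NLTK’s implementation: the helper names ngrams and bigram_pmi are hypothetical, the token list is a toy sentence rather than Paradise Lost, and probabilities are estimated simply as a count divided by the number of tokens.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n contiguous tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_pmi(tokens):
    """Return (bigram counts, PMI score for each distinct bigram)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    total = len(tokens)  # assumption: probabilities are estimated as count / total
    scores = {}
    for (first, second), joint in bigram_counts.items():
        # Count we would expect if the two words occurred independently
        expected = unigram_counts[first] * unigram_counts[second] / total
        # PMI = log2(P(x, y) / (P(x) * P(y))), written here in terms of counts
        scores[(first, second)] = math.log2(joint / expected)
    return bigram_counts, scores


# Toy input (hypothetical); any list of lowercase tokens works the same way
tokens = "the cat sat on the mat the cat ran".split()
bigram_counts, scores = bigram_pmi(tokens)

print(bigram_counts.most_common(3))
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(pair, round(score, 2))
```

On this toy input the most frequent bigram is ('the', 'cat'), yet pairs made of words that each appear only once receive a higher PMI score. That is the usual trade-off between the two measures: raw frequency favors common words, while PMI favors rare words that always appear together. When working with NLTK’s finders, a frequency filter such as apply_freq_filter can be applied before ranking by PMI to keep one-off pairs from dominating.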




TEXT CLASSIFICATION

Text classification is the process of labeling tokens with particular types. These types could be relatively simple, like nouns or verbs. Or, they can be much more abstract. Nouns and other grammatical types are well understood linguistically. Semantic types are a bit more difficult to handle, due to additional complexity and the sheer scale of language semantics.

Classifying text is similar to using typical machine learning classifiers. You need a corpus, you need a training set to generate a model, and then you need to run that model against a test set, the set of words you’re classifying.

    import random

    import nltk
    from nltk.classify.scikitlearn import SklearnClassifier
    from nltk.corpus import names
    from sklearn.svm import SVC

    nltk.download('names')

    # Use the last letter of each name as its only feature
    def feature_extractor(name):
        return {'feature': name[-1]}

    # Label every name with the gender of the file it came from
    labeled_names = (
        [(name, 'male') for name in names.words('male.txt')]
        + [(name, 'female') for name in names.words('female.txt')]
    )
    random.shuffle(labeled_names)

    data = [(feature_extractor(n), g) for (n, g) in labeled_names]

    # Hold out the first 10% as the test set and train on the rest
    demark = int(len(data) * 0.1)
    train_data, test_data = data[demark:], data[:demark]

    nb_classifier = nltk.NaiveBayesClassifier.train(train_data)
    dt_classifier = nltk.DecisionTreeClassifier.train(train_data)
    me_classifier = nltk.MaxentClassifier.train(train_data)
    sk_classifier = SklearnClassifier(SVC(), sparse=False).train(train_data)

    print(nltk.classify.accuracy(nb_classifier, test_data))
    print(nltk.classify.accuracy(dt_classifier, test_data))
    print(nltk.classify.accuracy(me_classifier, test_data))
    print(nltk.classify.accuracy(sk_classifier, test_data))

This simple example shows how you can use classifiers to classify names by gender. Here, we’re using the names dataset available through NLTK, and four different classifiers.

Essentially, this follows the pattern you’d expect in any supervised machine learning example in that we start with data, partition into test and training sets based on identified features, train the models, and then evaluate the models.

HEALTHCARE

Healthcare, especially tele-health, is a field ripe for disruption via text analysis. Anything that can be spoken can be transcribed into text and then analyzed. Furthermore, any correspondence between a physician and patient can be examined as well. This information can be used in any number of domains.

Fraud is always an issue, whether it be in a hospital or private practice. By evaluating and extracting information from medical records, doctor’s notes, and correspondence, systems can help locate instances of fraudulent prescriptions, for example, or more egregious fraud.

Information extracted from written records can be fed into expert systems or neural networks as well. This kind of system could help healthcare providers diagnose rare disorders in patients that otherwise may not be diagnosed properly.

BANKING

Banking is a particular area where text analysis and information extraction can be key to providing strong, accurate customer service. Financial firms are notorious for leaving detailed and complete paper trails for any and all transactions. Often, these records are very detailed, a potential treasure trove of untapped information.

Particular areas of interest include sentiment analysis and information extraction. Information extracted from financial records can be saved for later access in alternate, more heavily structured forms. This makes identifying links between seemingly unrelated pieces of information much easier, and makes the data much more tractable.

Financial companies are more susceptible to fraud than medical practices. The ability to discover when employees may be vulnerable to temptation is key to preventing financial losses to the firm and significant professional harm to the employee. Early intervention on the part of an organization in cases like this doesn’t need to be heavy-handed if the potential fraud is discovered early enough.

MANUFACTURING

Manufacturing environments typically have many, many manuals covering equipment and overall system technical details, standard operating procedures, and the like. This information is usually sitting on a shelf, out of reach, and infrequently referenced once staff is trained. Extracting this information from PDFs, MS Word documents, or RTF files can make this information much more accessible.



Once extracted, it can be structured into more formalized representations, transformed into estimation systems or work validation systems, or used to enhance overall project management. Many of the small details inherent in using industrial equipment can be more closely monitored and checked to ensure process compliance, overall cost adherence, and personal safety.

INSURANCE

The insurance industry has long been focused on two conflicting objectives: customer service and fraud detection. Luckily, text mining can help with both.

Most interactions with insurance companies are recorded and can be mined for information both online and offline. Online analysis can help customer service agents understand the mindset of the customers with whom they interact, improving overall customer intimacy and rapport. Offline systems can extract information from the recorded conversations, structure the data, and identify either possible fraud via sentiment analysis or potential sales via information correlation with prior customers.

CONCLUSION

Text mining has more potential to impact day-to-day operations for enterprises now than ever before. The kinds of techniques we’ve covered in this Refcard are as applicable to insurance as they are to social media analysis.

The ability to extract structured, meaningful information from otherwise opaque unstructured text opens new capabilities for customer interaction, operational refinement, or crime prevention. The ability to review customer service interactions with call-center personnel provides new ways to increase consumer satisfaction and identify new sales relationships that had been hidden in the mess of unstructured data that organizations collect every day.

Finally, since the vast majority of data out there today is unstructured and unapproachable, text mining can supply ways to finally examine all of the data organizations have available.

Whether to make us safer, improve our customer experiences, or make our workplaces run more smoothly, text mining is here to stay.

Written by Chris Lamb,

Professor and Principal Scientist
Sandia National Laboratories

Dr. Lamb currently serves as a cyber-security

