Text As Data: Computational Methods of Understanding Written Expression Using SAS (Wiley and SAS Business Series) 1st Edition Deville
Text as Data
Wiley and SAS
Business Series
The Wiley and SAS Business Series presents books that help senior-level
managers with their critical management decisions.
Titles in the Wiley and SAS Business Series include:
For more information on any of the above titles, please visit www.wiley.com.
Text as Data
Computational Methods
of Understanding Written Expression
Using SAS
By
Barry deVille and
Gurpreet Singh Bawa
Copyright © 2022 by John Wiley & Sons, Inc. All rights reserved.
For general information on our other products and services or for technical support,
please contact our Customer Care Department within the United States at (800) 762-2974,
outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that
appears in print may not be available in electronic formats. For more information
about Wiley products, visit our website at www.wiley.com.
9781119487128 (hardback)
9781119487173 (ePDF)
9781119487159 (ePub)
Preface
Acknowledgments
About the Authors
Introduction
Chapter 1 Text Mining and Text Analytics
Chapter 2 Text Analytics Process Overview
Chapter 3 Text Data Source Capture
Chapter 4 Document Content and Characterization
Chapter 5 Textual Abstraction: Latent Structure, Dimension Reduction
Chapter 6 Classification and Prediction
Chapter 7 Boolean Methods of Classification and Prediction
Chapter 8 Speech to Text
Preface
Acknowledgments
About the Authors
Introduction
This chapter describes some of the background and recent history
of text analytics and provides real-world examples of how text
analytics works and solves business problems. This treatment
provides examples of common forms of text analytics and examples
of solution approaches. The discussion ranges from a history
of the analytical treatment of text expression up to the most recent
developments and applications.
Figure 1.2 Example of cuneiform recording the distribution of beer in southern Iraq,
3100–3000 BCE.
Source: BabelStone, Licensed under CC BY-SA 3.0.
Figure 1.3 Shang oracle bone script for character “Eye.” Modern character is 目.
Source: Tomchen1989. Public Domain.
Mù
Writing systems of the world that have evolved from ancient times
to the present day can be organized into five categories (note iii): alphabets,
abjads, abugidas, syllabaries, and logo-syllabaries.
Component words/tokens
Document Class wheels rudder diesel sail track sea land water
1 TRAIN x x x x x x
2 BOAT x x
3 TRAIN x x x x x
4 BOAT x x x x
5 TRAIN x
6 BOAT x x x x
7 TRAIN x x
8 BOAT x x x
9 TRAIN x
10 BOAT x x x
11 TRAIN x x x x x
12 BOAT x x x x x
$H(X) = -p \log_2 p - q \log_2 q$
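As a minimal sketch, assuming the formula above is the two-outcome Shannon entropy with q = 1 - p, it can be evaluated in Python as follows. The function name `binary_entropy` is illustrative and does not come from the book. The 6-to-6 split of TRAIN and BOAT labels is read from the class column of the document table above.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-outcome source with probabilities p and q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    q = 1.0 - p
    # By convention 0 * log2(0) is taken as 0, so zero-probability terms are skipped.
    return -sum(x * math.log2(x) for x in (p, q) if x > 0.0)

# The document table lists 6 TRAIN and 6 BOAT documents out of 12.
p_train = 6 / 12
class_entropy = binary_entropy(p_train)
```

An even split gives the maximum uncertainty of one bit, and any uneven split gives less.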
NOTES
i. J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding, Google AI Language (Ithaca, NY:
Cornell University, 2019). https://arxiv.org/abs/1810.04805v2.
ii. H. T.O. Dong, A History of the Chinese Language (London and New York:
Routledge, 2014).
iii. F. Coulmas, The Writing Systems of the World (Hoboken, NJ: Wiley-Blackwell, 1989).
iv. J. J. Soni and R. Goodman, A Mind at Play: How Claude Shannon Invented the
Information Age (New York: Simon & Schuster, 2017).
CHAPTER 2
Text Analytics Process Overview
TEXT ANALYTICS PROCESSING
Figure 2.1 describes the life cycle of text analytics from capture to
deployment in six major processes. We can map document capture,
text-to-data transfer, and characterization to the preparation phase.
Figure 2.1 (diagram not reproduced). Labels in the diagram: Preparation/Engineering (Capture Documents, Text Data, Characterization); Utilization/Discovery (Latent Structure, Classification, Prediction, Composite Document Scoring Model).
PROCESS DESCRIPTION
There are many kinds of text data repositories and many kinds of
original textual data sources. For applied business and industrial
applications, the data source may often be a host website or perhaps a
social media data selection. The text mining and text analytics site at
UC Berkeley provides an example of the many different data sources
available: https://guides.lib.berkeley.edu/text-mining. This site links
to many different data sources, including books (among them over 50,000
volumes available on Project Gutenberg), newspapers and magazines,
scholarly journals, government documents, linguistic corpora, literature,
social media archives, and historical collections. GitHub also contains
hundreds of text databases:
https://github.com/awesomedata/awesome-public-datasets.
Figure (diagram not reproduced). Labels in the diagram: Capture; Category Folder Structure; Business, Sports, Music.
to sports would be in the Sports folder, and so on. Later, when the
text-learning system needs to find linguistic rules that characterize
and distinguish Business documents from Music documents, these
folders and associated documents will be used as training data to
learn the rules.
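The folder-as-label arrangement described above can be sketched as follows. This is a minimal Python illustration, not the book's SAS workflow. It assumes one subfolder per category (for example Business, Sports, Music) holding plain-text `.txt` files encoded as UTF-8. The function name `load_labeled_documents` is hypothetical.

```python
from pathlib import Path

def load_labeled_documents(root: str) -> list[tuple[str, str]]:
    """Return (category, text) pairs, one per .txt file under root/<category>/."""
    pairs = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # ignore loose files at the top level
        for doc in sorted(folder.glob("*.txt")):
            # The folder name (e.g. "Business") serves as the class label.
            pairs.append((folder.name, doc.read_text(encoding="utf-8")))
    return pairs
```

Each returned pair couples a document's text with the category it was filed under, which is the form a supervised text-learning step typically expects as training data.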
LINGUISTIC PROCESSING
Once we have stored and identified the text that we want to work with, we
are ready to rework the qualitative, textual data into quantitative data
products that support more robust computation. The general term for
this stage of the text analytic process is linguistic processing.
Linguistic processing is the text analytic ability to perform detailed
linguistic operations on a term-by-term basis as the linguistic
processor moves through the document in line-by-line and term-by-term
sequence. Although it is only the first step in creating the
term-by-document matrix that is the basis for the higher-dimension, numeric
linear algebra approaches that are the hallmark of advanced text analytics,
linguistic processing is a key enabler and also a significant approach
in its own right.
Once the text has been assembled, it is viewed by the parse engine
as a sequence of characters that are encoded in some text or image
representation.
An overview of the text treatments, transformations, derivations,
and extractions is shown in Figure 2.3; each is briefly described
as follows:
• Tokenization. Here we use punctuation and character encod-
Figure 2.3 (diagram not reproduced). Labels in the diagram: Tokenization, Apply Taxonomies, Consolidation, Filter, Weight and Normalize, Disambiguate.
Text: The packaging was not good.
Unigram: The, packaging, was, not, good
Bigram: The packaging, packaging was, was not, not good
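The unigram and bigram breakdown above can be reproduced with a short Python sketch. The regular-expression tokenizer here is a simplifying assumption (it keeps only runs of letters, digits, and apostrophes) and is not the parsing engine the book describes. The function names are illustrative.

```python
import re

def tokenize(text: str) -> list[str]:
    """Return runs of letters, digits, and apostrophes; punctuation is dropped."""
    return re.findall(r"[A-Za-z0-9']+", text)

def ngrams(tokens: list[str], n: int) -> list[str]:
    """Join each run of n consecutive tokens into one n-gram string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The packaging was not good.")
unigrams = ngrams(tokens, 1)  # ['The', 'packaging', 'was', 'not', 'good']
bigrams = ngrams(tokens, 2)   # ['The packaging', 'packaging was', 'was not', 'not good']
```

Note that the bigram "not good" keeps the negation attached to the adjective, which the unigrams alone would lose.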
where $n_{ij}$ represents the number of times the $i$th word occurs in the
$j$th document.
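A minimal sketch of how such counts might be tabulated, assuming documents have already been tokenized: the three short documents below are made up for illustration (their words are borrowed from the TRAIN/BOAT token list earlier in the text), and `term_document_counts` is a hypothetical helper name.

```python
from collections import Counter

def term_document_counts(documents: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Return (vocabulary, matrix) where matrix[i][j] is the number of
    times vocabulary[i] occurs in documents[j]."""
    counts = [Counter(doc) for doc in documents]
    vocabulary = sorted(set().union(*documents))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

# Illustrative, made-up tokenized documents.
docs = [["sail", "sea", "sea"], ["track", "land"], ["sea", "water"]]
vocab, n = term_document_counts(docs)
```

Rows index terms and columns index documents, matching the $i$, $j$ ordering of the definition; here the row for "sea" is `[2, 0, 1]`.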