Text As Data: Computational Methods of Understanding Written Expression Using SAS (Wiley and SAS Business Series) 1st Edition Deville

Text as Data
Wiley and SAS
Business Series
The Wiley and SAS Business Series presents books that help senior
level managers with their critical management decisions.
Titles in the Wiley and SAS Business Series include:
The Analytic Hospitality Executive: Implementing Data Analytics in Hotels
and Casinos by Kelly A. McGuire
Analytics: The Agile Way by Phil Simon
The Analytics Lifecycle Toolkit: A Practical Guide for an Effective Analytics
Capability by Gregory S. Nelson
Anti-Money Laundering Transaction Monitoring Systems Implementation:
Finding Anomalies by Derek Chau and Maarten van Dijck Nemcsik
Artificial Intelligence for Marketing: Practical Applications by Jim Sterne
Business Analytics for Managers: Taking Business Intelligence Beyond
Reporting (Second Edition) by Gert H. N. Laursen and Jesper Thorlund
Business Forecasting: The Emerging Role of Artificial Intelligence and
Machine Learning by Michael Gilliland, Len Tashman, and Udo
Sglavo
The Cloud-Based Demand-Driven Supply Chain by Vinit Sharma
Consumption-Based Forecasting and Planning: Predicting Changing
Demand Patterns in the New Digital Economy by Charles W. Chase
Credit Risk Analytics: Measurement Techniques, Applications, and
Examples in SAS by Bart Baesens, Daniel Roesch, and Harald Scheule
Demand-Driven Inventory Optimization and Replenishment: Creating a
More Efficient Supply Chain (Second Edition) by Robert A. Davis
Economic Modeling in the Post Great Recession Era: Incomplete Data,
Imperfect Markets by John Silvia, Azhar Iqbal, and Sarah Watt House
Enhance Oil & Gas Exploration with Data-Driven Geophysical and
Petrophysical Models by Keith Holdaway and Duncan Irving
Fraud Analytics Using Descriptive, Predictive, and Social Network
Techniques: A Guide to Data Science for Fraud Detection by Bart Baesens,
Veronique Van Vlasselaer, and Wouter Verbeke
Intelligent Credit Scoring: Building and Implementing Better Credit Risk
Scorecards (Second Edition) by Naeem Siddiqi
JMP Connections: The Art of Utilizing Connections in Your Data by John
Wubbel
Leaders and Innovators: How Data-Driven Organizations Are Winning
with Analytics by Tho H. Nguyen
On-Camera Coach: Tools and Techniques for Business Professionals in a
Video-Driven World by Karin Reed
Next Generation Demand Management: People, Process, Analytics, and
Technology by Charles W. Chase
A Practical Guide to Analytics for Governments: Using Big Data for Good
by Marie Lowman
Profit from Your Forecasting Software: A Best Practice Guide for Sales
Forecasters by Paul Goodwin
Project Finance for Business Development by John E. Triantis
Smart Cities, Smart Future: Showcasing Tomorrow by Mike Barlow and
Cornelia Levy-Bencheton
Statistical Thinking: Improving Business Performance (Third Edition) by
Roger W. Hoerl and Ronald D. Snee
Strategies in Biomedical Data Science: Driving Force for Innovation by
Jay Etchings
Style and Statistics: The Art of Retail Analytics by Brittany Bullard
Text as Data: Computational Methods of Understanding Written Expression
Using SAS by Barry deVille and Gurpreet Singh Bawa
Transforming Healthcare Analytics: The Quest for Healthy Intelligence by
Michael N. Lewis and Tho H. Nguyen
Visual Six Sigma: Making Data Analysis Lean (Second Edition) by
Ian Cox, Marie A. Gaudard, and Mia L. Stephens
Warranty Fraud Management: Reducing Fraud and Other Excess Costs in
Warranty and Service Operations by Matti Kurvinen, Ilkka Töyrylä,
and D. N. Prabhakar Murthy
For more information on any of the above titles, please visit
www.wiley.com.
Text as Data
Computational Methods
of Understanding Written Expression
Using SAS
By
Barry deVille and
Gurpreet Singh Bawa
Copyright © 2022 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, electronic, mechanical, photocopying,
recording, scanning, or otherwise, except as permitted under Section 107 or 108 of
the 1976 United States Copyright Act, without either the prior written permission
of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978)
750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the
Publisher for permission should be addressed to the Permissions Department, John
Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201)
748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have
used their best efforts in preparing this book, they make no representations or
warranties with respect to the accuracy or completeness of the contents of this book
and specifically disclaim any implied warranties of merchantability or fitness for a
particular purpose. No warranty may be created or extended by sales representatives
or written sales materials. The advice and strategies contained herein may not be
suitable for your situation. You should consult with a professional where appropriate.
Neither the publisher nor author shall be liable for any loss of profit or any other
commercial damages, including but not limited to special, incidental, consequential, or
other damages.
For general information on our other products and services or for technical support,
please contact our Customer Care Department within the United States at (800) 762-
2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that
appears in print may not be available in electronic formats. For more information
about Wiley products, visit our website at www.wiley.com.
Library of Congress Cataloging-in-Publication Data is Available:
9781119487128 (hardback)
9781119487173 (ePDF)
9781119487159 (ePub)
Cover Design: Wiley

To all those who unconditionally love and support authors
and their writing processes – especially our life partners, Maya
McNeilly and Dilpreet Kaur, who go above and beyond.
Contents
Preface
Acknowledgments
About the Authors
Introduction
Chapter 1 Text Mining and Text Analytics
Chapter 2 Text Analytics Process Overview
Chapter 3 Text Data Source Capture
Chapter 4 Document Content and Characterization
Chapter 5 Textual Abstraction: Latent Structure, Dimension Reduction
Chapter 6 Classification and Prediction
Chapter 7 Boolean Methods of Classification and Prediction
Chapter 8 Speech to Text
Appendix A Mood State Identification in Text
Appendix B A Design Approach to Characterizing Users Based on Audio Interactions on a Conversational AI Platform
Appendix C SAS Patents in Text Analytics
Glossary
Index
Preface
This book provides an end-to-end description of the text analytics
process with examples drawn from a range of case studies using
various capabilities of SAS text analytics and the associated SAS computing
environment. Qualitative and quantitative approaches within
the SAS environment are covered across the entire text analytics life
cycle from document capture, document characterization, document
understanding, through operational deployments.
We cover procedure-based, engineering approaches to text
analytics, as well as more discovery-based quantitative approaches.
Since much of the text analytics process depends on the text capture
and text preprocessing environment, these aspects of text analytics are
covered as well.
Acknowledgments
This work was initiated and promoted by Julie Palmieri, serving
as editor-in-chief of SAS Press. James Allen Cox has consistently
offered advice and review throughout and gave a detailed review
of early versions of the draft. Tom Sabo gave advice and review and
made significant contributions to the chapter on Boolean rules. Our
colleagues Saratendu Sethi, Terry Woodfield, and Sanford Gayle
have provided decades of advice on text analytics in general. Elisha
Benjamin of John Wiley & Sons was a great source of advice and
assistance throughout the project. Wiley executive editor Sheck Cho
is the consummate professional and both a rock and a beacon for us
aspiring authors.
As authors, we acknowledge their invaluable advice, assistance,
encouragement, and also humbly acknowledge that any remaining
faults are ours alone.
About the Authors
Barry deVille is a practitioner, developer, and author in the fields
of statistics, data science, and text analytics. During a decades-long
career at SAS, he collaborated extensively with the text analytic R&D
development team, deploying text mining solutions to a variety of
global clients in various industrial, financial, health, and social media
applications. This work resulted in the award of numerous US pat-
ents on decision tree induction algorithms and multidimensional text
analytics. Prior to joining SAS, he worked with the National Research
Council and other government and commercial entities in Canada in
the development and commercialization of statistical and machine
learning algorithms.
Gurpreet Singh Bawa has practiced internationally in the areas
of statistics with an emphasis on artificial intelligence (AI) and
machine learning (ML). He was awarded a PhD at Panjab University,
Chandigarh, India, in the fields of AI and ML. He has authored
numerous publications in national and international journals.
His research in the areas of unstructured data analysis has led to
numerous patent applications and awards (including one with co-
author deVille on social community identification and automatic
document classification). He also works in breakeven analysis and
portfolio optimization. He is currently authoring a book on advanced
mathematics.
Text as Data
Introduction
Text analytics is a collection of computer methods that use semantic
and numerical processing to convert collections of text into identified
components that carry meaning and function and can be manipulated
quantitatively. Meaning assignment is a semantic process that leads
to greater understanding of the text. Numerical manipulation leads
to a range of data summarization approaches that typically reduce
complexity, capture multiple relationships, and highlight tendencies.
Text analytics incorporates semantic and numerical text processing in
a synergistic process that leads to greater understanding of various
collections of text.
In this treatment we also touch on speech applications so we can
see how spoken words, like written words, can be transformed into rep-
resentations that can be manipulated and summarized quantitatively.
Chapter 1 expands our definition of text analytics and provides
some background on the development of written language and sys-
tems of writing that are used to capture and communicate meaning.
Chapter 2 provides an overview of the end-to-end process of
text analytics. A generic template is described that can enhance our
understanding of the various aspects of text analytics and that can also
serve as an organizing framework for discussing text analytics. These
processes are further described in Chapter 3.
Linguistic processing and associated forms of document character-
ization are discussed in Chapter 4. Linguistic processing is the front-
end text analytics intake process to read and parse the incoming text
stream to identify useful and interesting textual components such as
parts of speech, phrases, expressions, and special terms.
Chapter 5 shows how numerical approaches to data, including the
production of dimensional summaries and data reduction approaches,
can be productively applied to creating meaningful textual summaries
and dimensional products, like text topics, that help us understand the
content of text collections.
In Chapter 6 we provide examples of how quantitative text products
can be used for classification and prediction tasks. A real-world
industrial use case is discussed.
Chapter 7 discusses the architecture within SAS that unifies
linguistic and quantitative processing and so blends the strengths of
these two approaches. We show how Boolean rules are constructed,
how these are derived from quantitative operations, and how they
serve a linguistic purpose.
Chapter 8 provides a case study in speech processing and shows
how audio signals can be analyzed and manipulated much like text
products to create analytical reports.
There is also a glossary of specialized terms and three appendices.
Appendix A expands on the discussion of text characterization and
provides an example of how mood state extracted from text can be
used in text analytics. Appendix B provides a discussion and archi-
tectural approach to using audio processing to infer end user persona
characteristics in the construction of artificial intelligence computer-
user interaction interfaces. Appendix C provides an annotated
summary description of critical patents that have been assigned to
SAS. A range of important patents are covered, including an initial
patent awarded to extract dimensional products from text and some of
the more recent patents that address the unified approach to linguistic
and numerical processing.
CHAPTER 1
Text Mining and
Text Analytics
This chapter describes some of the background and recent history
of text analytics and provides real-world examples of how text
analytics works and solves business problems. This treatment
provides examples of common forms of text analytics and examples
of solution approaches. The discussion ranges from a history
of the analytical treatment of text expression up to the most recent
developments and applications.
BACKGROUND AND TERMINOLOGY
The analysis of written and spoken expression has been developing
as a computer application over several decades. Some of the earliest
research in machine learning and artificial intelligence dealt with the
problem of reading and interpreting text as well as with text translation
(machine translation). These early activities gave rise to a field
of computer science known as natural language processing (NLP). The
recent rapid development of computer power – including processing
power, large data, high bandwidth communication, and cloud-based,
high-capacity computer memory – has provided a major new (and
considerably broadened) emphasis on computerized text processing
and text analysis.
TEXT ANALYTICS: WHAT IS IT?
Text processing and text analysis are components of the developing
area of understanding written and spoken expression. Commonly
occurring text documents – such as traditional newspapers, journals
and periodicals, and, more recently, electronic documents, such as
social media posts and emails – are forms of written expression. This
active, multilayered area in current computer applications joins well-
established, traditional fields such as linguistics and literary analysis to
form the outline of the emerging field we call text analytics.
Current approaches to text analytics operate in two reinforcing
directions that incorporate traditional forms of linguistic and literary
analysis with a wide range of statistical, artificial intelligence (AI),
and cognitive computing techniques to effectively process written
and spoken expressions. The decoded expressions are used to drive
a wide range of computer-mediated inference tasks that includes
artificial intelligence, cognitive computing, and statistical inference.
An everyday example is when we speak or type in a destination in
order to receive an optimal driving route. Similarly, a call center
agent might decipher multiple forms of common requests in order to
construct the most effective solution approach.
Our treatment throughout the chapters to come includes exam-
ples of common forms of text analytics and examples of solution
approaches. The discussion ranges from a history of the analytical
treatment of text expression up to the most recent developments and
applications. Since speech is quickly becoming an important form
of unstructured data, a final chapter takes up the topic of rendering
speech to text.
Computer science and AI emerged as formal disciplines in the
aftermath of World War II. An early application of computers to the
analysis of written expression, natural language processing, took a
universal approach, designed to apply regardless of what language
the text was written in – English, Spanish, or Chinese. The tech-
niques that have been developed also apply regardless of the source
of the text to be analyzed. With the widespread availability of speech-
to-text engines, it is also possible to consider a wide variety of spoken
documents as potential sources for text analytics.
An important goal of NLP is to decompose text constructs (sen-
tences, paragraphs, articles, chapters) into various kinds of entities,
verbs, semantic constructs (like articles and conjunctions), and so
on. The sentence “See Spot run” may be processed and encoded into
an NLP representation as: declarative sentence (intransitive); Spot –
Subject (Animal/Dog); run – Verb (motion).
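The encoding of “See Spot run” can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-word lexicon, the `Token` class, and the `encode` function are hypothetical names invented here, the label for “see” is our own guess, and a real NLP engine derives such roles from grammar rules and statistical models instead of a lookup table.

```python
from dataclasses import dataclass

# Hypothetical mini-lexicon. The entries for "spot" and "run" mirror the
# example in the text; the entry for "see" is an illustrative assumption.
LEXICON = {
    "see": ("Verb", "perception"),
    "spot": ("Subject", "Animal/Dog"),
    "run": ("Verb", "motion"),
}

@dataclass
class Token:
    text: str
    role: str
    category: str

def encode(sentence: str) -> list[Token]:
    """Split a sentence on whitespace and look each word up in the lexicon."""
    tokens = []
    for word in sentence.split():
        key = word.strip(".,!?").lower()
        role, category = LEXICON.get(key, ("Unknown", "unknown"))
        tokens.append(Token(word, role, category))
    return tokens

encoded = encode("See Spot run")
for tok in encoded:
    print(f"{tok.text} - {tok.role} ({tok.category})")
```

Words missing from the lexicon are tagged “Unknown,” which is where a production system would fall back on more general linguistic or statistical processing.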
Historically, NLP relied on various linguistic analysis capabilities,
including extensive logical processing and reasoning capabilities.
As computing capabilities have expanded, NLP has increasingly relied
on a range of computational approaches to enhance the range of NLP
results. An emerging area of NLP includes statistical natural language
processing (SNLP). This form of NLP can be used to craft high-level
representations of textual documents so that relationships between
and among the documents can be computed statistically. The statistical
capability also improves the accuracy of the NLP processing itself.
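One simple way to compute a statistical relationship between documents is to count the terms in each document and compare the resulting count vectors. The Python sketch below uses cosine similarity for the comparison. The measure, the three sample documents, and the function names are our own assumptions for illustration; they are not drawn from SAS or from any particular SNLP product.

```python
import math
from collections import Counter

def term_counts(text: str) -> Counter:
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents, invented for this example.
docs = {
    "d1": "the train runs on the track",
    "d2": "the diesel train left the track",
    "d3": "the boat has a sail and a rudder",
}
vectors = {name: term_counts(text) for name, text in docs.items()}
sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])
print(f"d1 vs d2: {sim_12:.3f}")
print(f"d1 vs d3: {sim_13:.3f}")
```

The two train documents score higher against each other than either does against the boat document, which is the kind of document-to-document relationship the text describes.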
One recent area of written language processing includes statistical
document analysis (SDA). Like SNLP, SDA enables us to show the
statistical relationships between and among the various components
of a textual document. Further, it enables us to summarize the docu-
ment using multivariate statistical techniques like cluster analysis and
latent class analysis. Predictive analytics such as regression analysis,
decision trees, and neural networks can also be used.
As computer processing and storage have continued to grow, so too
have a variety of deep learning applications. One such application is
the Bidirectional Encoder Representations from Transformers (BERT),
a deep learning language model from research at Google AI Language.i
BERT can be leveraged for tasks such as categorization, entity
extraction, and natural language generation. Deep learning approaches
require significant computing power and training. As the area of text
analytics continues to unfold, we will likely see how deep learning
approaches complement the capabilities offered in traditional text
analytics, which are less computationally intensive and more than
adequate for a wide range of tasks.
The fields of text mining and text analytics are recent applied areas of
SDA used in a variety of general-purpose social and economic settings.
Text mining often refers to the construction of statistical or numerical
models or predictions. Common sources of data include customer
service logs and emails, customer use records for warranty issue anal-
ysis and defect detection. Text analytics often refers to semantically
based applications – for example, customer analytics (who talks to
whom and what do they say?), competitive analysis (brand metrics,
mentions), and content management (the creation of taxonomies,
web page characterization).
Brief History of Text
Language is a form of communication, and text is a written form of
language. Text comes in a variety of symbolic forms. In addition to
the alphabetic representation we see capturing the written expression
in this text, there are other encoding systems such as syllabaries that
capture spoken syllables and logograms that capture pictographic representations.
Linguistics distinguishes between phonograms – which
capture parts of words like syllables in written expression – and
logograms – which capture entire concepts.
Figure 1.1 Traffic sign in Cherokee syllabary, Tahlequah, Oklahoma.
Source: Shot November 11, 2007. By Uyvsdi. License: Public Domain.
Figure 1.1 shows an example of a pictographic representation –
the STOP sign itself – an alphabetic representation (in Latin script)
that spells the word “STOP” and a syllabary – in this case, one used to
record the Cherokee language.
One of the earliest true writing systems, dating to the third millen-
nium bce, was cuneiform, originally a pictographic writing system that
eventually evolved into a variety of alphabetic representations. One
intermediate form of simplified cuneiform was Old Persian. It included
a semi-alphabetic syllabary, using far fewer wedge strokes than earlier
Assyrian versions of cuneiform. It included a handful of logograms for
frequently occurring words such as “god” and “king” (see Figure 1.2).
Figure 1.2 Example of cuneiform recording the distribution of beer in southern Iraq,
3100–3000 bce.
Source: BabelStone, Licensed under CC BY-SA 3.0.
Chinese characters evolved in the second millennium bce and,
according to sources such as Dong,ii were first organized into a comprehensive
writing system during the Qin dynasty (221–206 bce).
These characters eventually gave rise to the widespread use of the
characteristic logograms of Chinese in Asia (see Figure 1.3).
The representation of different writing systems is important for
mapping language meanings between languages. Figure 1.4 shows a
modern representation of the Chinese character for eye and the asso-
ciated Latin script representation to show the translation between a
pictograph (logogram) and syllabary.
Figure 1.3 Shang oracle bone script for character “Eye.” Modern character is 目.
Source: Tomchen1989. Public Domain.
Figure 1.4 Modern Chinese representation of “eye” (mù).
Source: B. deVille.
Writing Systems of the World
Writing systems of the world that have evolved from ancient times
to the present day can be organized into five categoriesiii: alphabets,
abjads, abugidas, syllabaries, and logo-syllabaries.
1. Alphabets. Each letter represents a sound, which can be either
a consonant or a vowel. English uses an alphabet, as do such
related languages as French, German, and Spanish.
2. Abjads. Similar to alphabets except that they are made up primarily
of consonants. Vowel markings are optional and may be partial or
absent altogether. Hebrew and Arabic are the two main
abjads in use today.
3. Abugidas. These are writing systems where consonant-vowel
sequences are written as a unit. Consonants form the main
units in the system and may stand alone or carry vowel nota-
tions with them. Abugidas evolved from a pre-Common Era
Indian script called Brahmi and are prevalent in Southeast Asia.
4. Syllabaries. Here each character represents an entire syllable.
A syllable is normally one consonant and one vowel. Japanese
is an example.
5. Logo-syllabary. Each character can stand for a unique symbol
or an entire word or idea. Chinese is an example.
Meaning and Ambiguity
Much of the work that we do in text mining – both hidden in the
various text analytics engines we use as well as in the explicit user
interventions we employ – will be directed at getting the best, most
unambiguous meaning from the words or terms we use in the analysis.
Numbers have relatively unambiguous properties and this facilitates
their use in analytics. When we use text, however, it is normal to have
a certain level of ambiguity in meaning. One of the main reasons for
this is that textual terms are polysemous – one term may have multiple
meanings. As an example of polysemy, think about the question, “Did
you get it?” The question could be asking about understanding (“Yes,
I understood!”), fetching an object (“I picked up the ladder this morn-
ing”), or receiving goods or services (“I got the vaccine last Tuesday”).
Spoken and written forms of communication are prone to other
breakdowns in communication. Figure 1.5 provides a rough illustration
of the key features that are part of communication. One of the
earliest approaches to capturing and quantifying the information loss
or gain contained in communication was the concept of entropy, for-
mulated by Claude Shannon.iv Shannon borrowed the concept from
thermodynamics and used it to rigorously engineer the communica-
tions properties of a range of communications methods and devices
while working for Bell Labs. His contributions have placed him in
the ranks of major figures in the establishment of the “computer age”
along with such figures as John von Neumann, Alan Turing, Robert Noyce,
Norbert Wiener, and Geoffrey Moore. As shown in the example in
Figure 1.5, this approach is still used today.
The send–receive communications model reflects the notion of
communication as capturing some kind of representation of an object
that has been identified and passing the representation through var-
ious processing stages until the object representation has been received
and decoded. In each of the processing stages, there is an opportunity
for representation error to creep in so the message can degrade.
We can all informally observe the operation of entropy as we play
the parlor game of passing a message from ear-to-ear in a circle of
people. Words perfectly communicate when all the elements of the
sender’s message are completely and accurately received and interpreted
by the receiver.
SEND: The train arrives at 16:00 hours
RECEIVE: The train arrives at 16:00 hours
Score (tokens received / tokens sent): 6 sent, 6 received; 6/6 = 1 = 100%
SEND: The train arrives at 16:00 hours
RECEIVE: The train arrives at xx:mm PM
Score (tokens received / tokens sent): 6 sent, 4 received; 4/6 = 2/3 ≈ 66%
Figure 1.5 Encode–decode send–receive communications model.
Source: B. deVille.
Figure 1.5 provides an illustration of how entropy is calculated.
When the full sentence (six tokens) is sent and received correctly,
there is 100 percent communication (zero entropy). When parts
of the sentence are miscommunicated (for example, when only four tokens
are received), communication drops to 66 percent. In this simple
example, the first communication is better than the second communication
(and there is an associated information gain of one-third, or
over 33 percent).
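The score in Figure 1.5 can be reproduced with a short Python sketch. It assumes that a token counts as “received” only when it arrives unchanged in the same position as it was sent; that matching rule and the function name `communication_score` are our own simplifications, not a definition given in the figure.

```python
def communication_score(sent: str, received: str) -> float:
    """Fraction of sent tokens that arrive unchanged at the same position."""
    sent_tokens = sent.split()
    received_tokens = received.split()
    if not sent_tokens:
        return 0.0
    # Assumption: compare tokens position by position.
    matches = sum(1 for s, r in zip(sent_tokens, received_tokens) if s == r)
    return matches / len(sent_tokens)

message = "The train arrives at 16:00 hours"
perfect = communication_score(message, message)
degraded = communication_score(message, "The train arrives at xx:mm PM")
print(f"perfect: {perfect:.0%}, degraded: {degraded:.0%}")
```

The difference between the two scores is the one-third information gain mentioned above.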
Since text is a form of communication, we can gauge the accu-
racy and interpretation of the meaning of text using notions and
measures of information entropy and information gain. Upon closer
examination, we can also see that entropy and the statistical notion
of correlation or association are related. The lower the entropy, the
higher the correlation. As we move more deeply into text analytics,
the precision of textual meaning – sometimes reflected by low entropy
and high association measures – becomes important, especially when
we use text analytics to analyze large volumes of data. As with all data
analysis tasks, the greater the accuracy of the analysis, the more useful
the insights.
As a simple example of how we might calculate entropy, let’s say
we have documents about trains and boats. For purposes of illustration,
we will use a radically simplified example in Table 1.1.
The “train” documents have the following words or tokens:
train wheels diesel track land
The “boat” documents have the following words or tokens:
boat rudder sail sea water
At this point, we can reframe the collection of text documents
into a classification scheme that provides us with the ability to explore
the communicative properties of words in a structured, reproducible
fashion.
Table 1.1 Trains and Boats Example: Document Collection
Component words/tokens
Document Class wheels rudder diesel sail track sea land water
1 TRAIN x x x x x x
2 BOAT x x
3 TRAIN x x x x x
4 BOAT x x x x
5 TRAIN x
6 BOAT x x x x
7 TRAIN x x
8 BOAT x x x
9 TRAIN x
10 BOAT x x x
11 TRAIN x x x x x
12 BOAT x x x x x
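A document-by-term layout like Table 1.1 can be produced mechanically from raw text. The Python sketch below marks each of the eight component words as present (1) or absent (0) in a document. The two sample documents are hypothetical and do not correspond to any row of Table 1.1; `incidence_row` is a name chosen here for illustration.

```python
# The eight component words/tokens used as column headings in Table 1.1.
VOCABULARY = ["wheels", "rudder", "diesel", "sail", "track", "sea", "land", "water"]

# Hypothetical documents as (class label, text). NOT the rows of Table 1.1.
documents = [
    ("TRAIN", "diesel train on wheels follows the track over land"),
    ("BOAT", "the boat raised its sail and turned the rudder at sea"),
]

def incidence_row(text: str) -> list[int]:
    """Return 1 for each vocabulary term present in the text, else 0."""
    present = set(text.lower().split())
    return [1 if term in present else 0 for term in VOCABULARY]

rows = [(label, incidence_row(text)) for label, text in documents]
for label, row in rows:
    print(label, row)
```

Each 1 plays the role of an “x” in the table, so the class label plus the row of indicators is all that later calculations need.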
We can see that some terms/tokens appear in both kinds of documents:
trains and boats. We can use entropy calculations to tell us
which terms have the least entropy and are therefore most useful in
classifying a document.
Shannon’s formula for entropy (information theory) is
$$H(X) = -p \log_2 p - q \log_2 q$$
where H(X) is the expected value of the entropy calculation. It measures
the uncertainty between two outcomes with probabilities p and q,
where q = 1 − p.
In our example, p and q are the proportions of the documents containing
a given feature that belong to the vehicle class “train” or “boat.”
Logarithms have many useful properties, and the base 2 is used to
support binary outcomes. This formula tells us that the entropy of
each of the features wheels through water in the above table is formed
by weighting the base-2 logarithm of the proportion associated with
one class – train – by that proportion, doing the same for the
alternative class – boat – and negating the sum. A class with a
proportion of zero contributes nothing to the sum.
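As a worked check, take the feature “wheels” from Table 1.2, with p = 0.75 (train) and q = 0.25 (boat). The intermediate values are rounded to three decimal places:

$$
\begin{aligned}
H(X) &= -0.75 \log_2 0.75 - 0.25 \log_2 0.25 \\
     &\approx 0.311 + 0.500 \\
     &= 0.811
\end{aligned}
$$

For “sail,” where p = 0 and q = 1, we assume the usual convention that a zero proportion contributes zero, and since \log_2 1 = 0 the entropy is 0.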
If we calculate the entropy of the various terms in the example
document collection (Table 1.2), we will see that the term most useful
for unambiguously classifying a document is the one with the lowest
entropy. This term is “sail,” which appears only in boat documents.
A high-entropy term like “diesel” is highly ambiguous, since it
is applied to trains and boats with equal frequency.
Table 1.2 Entropy Calculation for Trains and Boats Example
Feature   Proportion (train)   Proportion (boat)   pr(train)   pr(boat)   Entropy
wheels    6/8                  2/8                 0.75        0.25       0.811
rudder    1/4                  3/4                 0.25        0.75       0.811
diesel    2/4                  2/4                 0.50        0.50       1
sail      0/3                  3/3                 0           1          0
track     3/5                  2/5                 0.60        0.40       0.971
sea       1/6                  5/6                 0.17        0.83       0.65
land      4/6                  2/6                 0.67        0.33       0.918
water     2/5                  3/5                 0.40        0.60       0.971
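The entropy column of Table 1.2 can be reproduced with a few lines of code. The following is a minimal sketch in Python rather than SAS; the function name `binary_entropy` and the `counts` dictionary are illustrative names, not part of the original text, and the counts are simply the numerators of the two proportion columns.

```python
from math import log2

def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a two-outcome distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log2(0) is taken to be 0
    q = 1.0 - p
    return -(p * log2(p) + q * log2(q))

# (train documents, boat documents) containing each term, read from Table 1.2
counts = {
    "wheels": (6, 2), "rudder": (1, 3), "diesel": (2, 2), "sail": (0, 3),
    "track": (3, 2), "sea": (1, 5), "land": (4, 2), "water": (2, 3),
}

entropy = {
    term: binary_entropy(train / (train + boat))
    for term, (train, boat) in counts.items()
}

# List the terms from most to least useful for classification.
for term, h in sorted(entropy.items(), key=lambda item: item[1]):
    print(f"{term:<7}{h:.3f}")
```

Sorting by entropy puts “sail” first and “diesel” last, which matches the ranking discussed in the text.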
NOTES
i. J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding, Google AI Language (Ithaca, NY: Cornell
University, 2019). https://arxiv.org/abs/1810.04805v2.
ii. H. T.O. Dong, A History of the Chinese Language (London and New York: Routledge, 2014).
iii. F. Coulmas, The Writing Systems of the World (Hoboken, NJ: Wiley-Blackwell, 1919).
iv. J. J. Soni and R. Goodman, A Mind at Play: How Claude Shannon Invented the Information
Age (New York: Simon & Schuster, 2017).
CHAPTER 2
Text Analytics Process Overview
TEXT ANALYTICS PROCESSING
In this chapter, we identify a number of best practices in the areas
of machine learning, data and text mining, and analytics processing.
A few processing templates have evolved for data mining and machine
learning.i The cloud-enabled approach adopted by SAS is summarized
in SAS Institute Inc.ii This is a fast-moving area where new practices
evolve constantly.
PROCESS BUILDING BLOCKS
A high-level view of processing for text analytics resembles many
solution approaches in information technology. This section looks at
the primary building blocks often used in text analytics:
• Preparation. Getting the text ready for analysis (data capture,
text decomposition, mapping to a data representation).
• Utilization. Interpretation and deployment.
Figure 2.1 describes the life cycle of text analytics from capture to
deployment in six major processes. We can map document capture,
text-to-data transfer, and characterization in the preparation phase.
[Figure 2.1 Main stages of the text-mining process. Preparation/Engineering:
Capture Documents, Text → Data, Characterization. Utilization/Discovery:
Latent Structure, Composite Document, Prediction/Classification, Scoring Model.]
Source: B. deVille.
We can map latent structure development, composite document
assembly, and prediction/classification in the utilization phase.
Preparation
• Capture documents. First, assemble the documents. Usually,
text documents require some kind of preprocessing to bring
them into the analysis environment. For example, articles may
be scraped from the web (or from blogs), document repositories
may be exported, and a range of formats such as .pdf, .txt, and
.doc may need to be imported. The inclusion of written text symbols
in standard alphabetic and often pictographic form presents an
additional layer of complexity over and above the collection of
metric or numerical data. Audio input, in the form of speech,
can also be preprocessed to create phonetic data that can be
converted into a number of text products.
• Text-to-data. Once the document is placed in text format,
it may be parsed using a variety of techniques. Traditional
methods include natural language processing (NLP). The most
common form of NLP tags the terms in text fragments with a
part-of-speech identifier such as noun, verb, adjective, adverb, and
so on. NLP also tags the document with entities, so that names,
brands, addresses, quantities, and even specialized descriptors
such as machine parts are identified. Other common steps
include stemming, assigning synonyms, correcting spelling, and
such macro-document processes as identifying topics, subtopics,
and other parts of documents.
• Characterization. This step includes the procedural
construction of text products based primarily on engineered
NLP processes. It also includes, in a feedback fashion, the text
products of dimensional reduction carried out in downstream
processes. Mood or sentiment scores are often calculated in this
step. When mood or sentiment scores are calculated for the
document collection, a high-level summary of mood or
sentiment can also be produced. Dimensional products, such
as clusters and topics, can also be inserted into the document
collection; this can facilitate global collection summarization by
incorporating topics or clusters in corpus summaries. It is also
possible to use various network representations to surface the
linkages between terms and clusters. Cluster and topic products
can also form the basis for the production of an abstract
in standardized form (e.g., with all the entities identified,
misspellings corrected, and the synonyms applied). Entities can be
identified through built-in named-entity recognition facilities
or through rules-based approaches such as those provided by
the SAS language interpretation for textual information (LITI).
Another rich set of operations in this area includes the identification
of sequences of terms, words, or tokens taken in combination with
one another. One promising area of research is reported by Cox and
Allbright, who attempt to construct unambiguous combinations of
terms that have greater meaning and predictive power than simple
words or word combinations taken in literal sequence (without
contextual processing).iii
Utilization
• Latent structure/dimension reduction. This step incorporates
statistical algorithms to form high-level, compressed
representations of the text as factor scores or clusters. This is
typically a statistical approach that takes the parsed representation
of the document in order to represent the meaning in
a summarized form. This process is called dimension reduction
because the numerical approaches take many objects and attributes
and represent them as a smaller number of objects with
a smaller number of shared attributes. A common example is
a representation of the full continuous spectrum of light as the
categories red, yellow, green, and blue. Typical dimensional reductions
employed here are clusters, factorizations (which result
in factor scores and latent attributes), and roll-up terms. Other
possibilities include topics and term groupings, which combine
multiple terms together as “n-grams.”
• Composite document. The composite document contains
the original text, the preprocessed text, and the results of the
dimensional reduction. We can also provide for the introduction
of other data sources and contextual information. Typical
dimension reduction products include singular value decomposition
(SVD) scores (which may be treated as latent semantic
constructs or can be transformed into topics), roll-up terms, and
document-clustering products.
• Prediction and classification. This step employs text-mined
results in the generation of a predictive score or
classification. Customer intent may be estimated (likelihood to
buy or defect), or document type may be assigned (warranty
classification, defect type).
• Scoring model. Many scoring models are available and may be
in functional form (e.g., equations) or rules form (if . . . then . . .
else). The scoring models can be deployed in a variety of environments
and are capable of processing and displaying the contents
of new, previously unprocessed text documents.
PROCESS DESCRIPTION
Text Mining Data Sources
There are many kinds of text data repositories and many kinds of
original textual data sources. For applied business and industrial
applications, the data source may often be a host website or perhaps a
social media data selection. The text mining and text analytics site at
UC Berkeley provides an example of the many different data sources
available: https://guides.lib.berkeley.edu/text-mining. Links to many
different data sources are provided on this web location, including
books (including over 50,000 volumes available on Project Gutenberg),
newspapers and magazines, scholarly journals, government
documents, linguistic corpora, literature, social media archives, and
historical collections. GitHub also contains hundreds of text databases:
https://github.com/awesomedata/awesome-public-datasets.
Capture
Regardless of the data source, a number of data on-boarding tasks are
required to turn the raw textual data into useful sources of analytical
insight:
• Ensure code compatibility. Increasingly, in an era of global
formats, it is important to ensure compatibility of all types of
textual data, including logograms and even emoticons. Usually
this means switching the input encoding from standard American
ASCII to a more robust format such as UTF-8.
• Determine access method/structure. There are many utilities,
such as SAS’s tmfilter, that provide transparent access to
various data sources and websites. Some locations provide their
own API (application programming interface), such as
https://dev.elsevier.com/tecdoc_text_mining.html.
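The encoding point can be made concrete with a short example. This is a minimal Python sketch (not SAS), assuming a Unicode encoding such as UTF-8 is the intended “more robust format”; the sample string is invented for illustration.

```python
# Sample text mixing accented Latin letters, logograms, and an emoji.
text = "naïve café 文本 🙂"

# UTF-8 has a byte sequence for every Unicode character, so the
# encode/decode round trip is lossless.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# ASCII covers only 128 characters, so encoding the same text fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(round_trip == text, ascii_ok)
```

Reading and writing files with an explicit encoding, rather than relying on a platform default, avoids silent corruption of such characters during capture.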
Another common method uses the computer’s folder or directory
structure to store text and examples. This is useful for training
classification tasks, where the folder structure supplies the target class
and the text documents in each location are used as training instances.
As shown in Figure 2.2, the top-level folder/directory name is Category
Folder Structure. It contains three subfolders in this example:
Business, Sports, and Music. In typical applications, documents most
related to business would be in the Business folder, those relating
to sports would be in the Sports folder, and so on. Later, when the
text-learning system needs to find linguistic rules that characterize
and distinguish Business documents from Music documents, these
folders and associated documents will be used as training data to
learn the rules.
[Figure 2.2 Category-oriented folder structure: a top-level folder, Category
Folder Structure, containing the subfolders Business, Sports, and Music.]
Source: B. deVille.
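The folder convention described above translates directly into labeled training pairs. The following Python sketch is an illustration only: the helper `load_labeled_documents`, the file name `doc1.txt`, and the sample sentences are hypothetical, and the temporary directory simply recreates the Figure 2.2 layout so the example runs on its own.

```python
from pathlib import Path
import tempfile

def load_labeled_documents(root: Path) -> list[tuple[str, str]]:
    """Return (label, text) pairs; each label is the name of a subfolder."""
    pairs = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        for doc in sorted(folder.glob("*.txt")):
            pairs.append((folder.name, doc.read_text(encoding="utf-8")))
    return pairs

# Build a throwaway copy of the folder layout with one document per class.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "Category Folder Structure"
    samples = {
        "Business": "quarterly earnings rose",
        "Sports": "the home team won",
        "Music": "the quartet played",
    }
    for label, text in samples.items():
        (root / label).mkdir(parents=True)
        (root / label / "doc1.txt").write_text(text, encoding="utf-8")
    training = load_labeled_documents(root)

print(training)
```

Each pair can then be passed to whatever learner is in use, with the folder name serving as the target class.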
LINGUISTIC PROCESSING
Once we have stored and identified the text that we want to work with,
we are ready to rework the qualitative, textual data into quantitative
data products that support more robust computation. The general term
for this stage of the text analytic process is linguistic processing.
Linguistic processing is the text analytic ability to perform detailed
linguistic operations on a term-by-term basis as the linguistic processor
moves through the document in line-by-line and term-by-term
sequence. Although it is only the first step in creating the
term-by-document matrix that underlies the higher-dimension, numeric
linear algebra approaches that are the hallmark of advanced text analytics,
linguistic processing is a key enabler and also a significant approach
in its own right.
Once the text has been assembled, it is viewed by the parse engine
as a sequence of characters that are encoded in some text or image
representation.
An overview of text treatments, transformations, derivations, and
extractions is shown in Figure 2.3; the steps are briefly described
as follows:
• Tokenization. Here we use punctuation and character encoding
to identify document sections, words, terms, and images
(where appropriate). Figure 2.4 provides an example of text
that is used to illustrate tokenization in the context of document
parsing.
• Consolidation. Spell checking is usually performed as part of
this step. Here we stem words (look for the root word) and
expand or lemmatize words (map terms to a common root).
The main goal of stemming and lemmatization is to create a
common representation of various forms of the same word.
[Figure 2.3 Text treatments, transformations, derivations, and extractions.
Raw Text Data passes through Tokenization, Consolidation, Identify POS,
Disambiguate, Transform and Map, Apply Taxonomies, Filter, Abstract/Infer,
Weight and Normalize, and Identify Factors, Themes, Topics to yield
Quantified Text Data.]
Source: B. deVille.
• Identify parts of speech and named entities. Isolate and
identify parts of speech (POS) and associate them with words and
terms. Isolate special words and named entities (NE) like dates,
proper names, locations, and noun phrases.
• Disambiguate. Word sense disambiguation is used to infer
the semantic meaning of a word. The text segment “bank,” for
example, can have many meanings; e.g., it could be a financial
institution, the physical building that houses a financial institution,
a ridge on a pool table, or the edge of a river. Text terms
are disambiguated based on the context found in the source
document, sometimes called a semantic field. For example, the
phrase “you can bank on that” contains a verb that probably
relates to financial institutions, while “he banked it off the left
bunker” is more likely related to the game of pool.
• Transform. The most common transformations are stemming and
lemmatization. Other transformations include correcting spelling
errors, resolving upper- and lowercase variations, and
applying synonyms and acronyms.
• Apply mappings. Extract common term expansions from standard
tables such as metropolitan area, geography (country,
state, and so on), and business codes such as Standard Industrial
Classification (SIC).
• Apply taxonomy, typology, or ontology. A wide range of
taxonomies, typologies, ontologies, and classification systems
may also be used in a more general application of the mapping
and transformation process.
• Filter. Most frequently, this includes the use of start lists
and stop lists. A start list names the only terms that will be used;
a stop list names the words that will be dropped or hidden.
Particular terms may also be hidden based on various
criteria such as word frequency.
• Abstract, infer. Abstractions take word or token sequences and
build them into special-purpose indicators. Common abstractions
are n-grams, which are concatenated words or terms that
are typically adjacent to one another in the text stream. Inferences
are used to identify special text products, like mood,
sentiment, or author inferences (e.g., gender).
• N-grams. In cases where the individual documents are lengthy
and consist of lots of content, such as in the case of books, it
makes sense to analyze at a paragraph or sentence level, since
the latent idea most likely remains stable across sentences and
paragraphs. However, in our example, we have short, terse
feedback for products consumed by users; thus, a word or
n-gram level tokenization is suitable to capture this finer level
of detail.
N-grams are contiguous sequences of n objects from a piece of
text. These objects can be anything, from words to phonemes, or any
aggregated text element depending on the task requirement. Table 2.1
illustrates n-grams with words as objects.
Table 2.1 N-gram Illustration
Sentence: “A swimmer likes swimming, thus he swims.”
Unigram (1-gram)   A, swimmer, likes, swimming, thus, he, swims, . . .
Bigram (2-gram)    A swimmer, swimmer likes, likes swimming, swimming thus, . . .
Trigram (3-gram)   A swimmer likes, swimmer likes swimming, likes swimming thus, . . .
Table 2.2 Advantage of Using n-grams vs. Unigrams
Sentence: “The packaging was not good.”
Unigram            The, packaging, was, not, good.
Bigram             The packaging, packaging was, was not, not good.
The simplest method of n-gram tokenization is to use unigrams.
Here, individual words are considered as features
that might explain some consumer behavior. Although this is
an easy approach, one downside is that it loses the ability
to distinguish “not good” from “not” and “good” occurring separately.
Table 2.2 compares unigrams and bigrams and demonstrates
the extra information that is encoded in the bigram representation.
Since n-grams consist of contiguous co-occurring words, they
maintain a level of context from the original document text, such as the
capability to distinguish between “white house” in a context of home
paints or politics, or to keep “not” together with “good.”
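The n-gram construction illustrated in Tables 2.1 and 2.2 can be sketched as a sliding window over the word sequence. This Python example is illustrative only; the function `ngrams` and its simple regular-expression tokenizer are assumptions made for the sketch, not the tokenizer used by any particular product.

```python
import re

def ngrams(text: str, n: int) -> list[str]:
    """Return the contiguous word n-grams of `text`, in document order."""
    # Crude tokenizer: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The packaging was not good."
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)

print(unigrams)
print(bigrams)
```

The bigram list contains “not good” as a single feature, whereas the unigram list only records that “not” and “good” each occurred.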
• Weight and normalize. Upweight and downweight terms for
emphasis, for example, to highlight rare terms or deemphasize
frequent terms. Terms may also be standardized or normalized
so that, for example, their frequency in a document is converted
to a standard score (z-score). High-frequency or low-frequency
terms may also be weight-adjusted. These adjustments affect the
computation of various text products such as clusters and topics.
• Term frequency inverse document frequency (TFIDF).
One of the most useful weighting approaches is TFIDF, because
the frequency of a term within a document is adjusted by how
widely the term occurs across the entire collection. This rewards
specificity: rare terms are upweighted, while common terms
that would otherwise swamp the analysis are downweighted.
The TFIDF weight comprises two parts:
1. Normalized term frequency (TF). The number of times
the token occurs within a document, divided by
the token count of the document.
2. Inverse document frequency (IDF). The logarithm
of the total count of documents in the corpus
divided by the count of documents in which the
token in question occurs.
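To summarize:
$$\mathrm{TF}(t) = \frac{\text{number of times token } t \text{ occurs in a document}}{\text{number of tokens in the document}}$$
$$\mathrm{TF}_{ij} = \log(tf_{ij} + 1), \qquad tf_{ij} = n_{ij} / N_j$$
where $n_{ij}$ represents the count of times the ith word is present in the
jth document and $N_j$ is the token count of the jth document.
$$\mathrm{IDF}(t) = \log_e\left(\frac{\text{number of documents}}{\text{number of documents containing token } t}\right)$$
$$\mathrm{IDF}_i = \log\left(|D| / n_i\right)$$
Here $|D|$ is the total number of documents in the corpus and
$n_i$ represents the count of documents in which the ith word
is present.
Thus, the measure $\mathrm{TFIDF}(t) = \mathrm{TF}(t) \times \mathrm{IDF}(t)$.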
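A small worked example may help fix the TFIDF definitions. The Python sketch below uses the plain normalized term frequency and the natural logarithm for IDF; it does not apply the log-dampened TF variant. The three-document corpus and the function names `tf`, `idf`, and `tfidf` are invented for illustration.

```python
from math import log

# Hypothetical toy corpus: document id -> list of tokens.
docs = {
    "d1": "train wheels diesel track".split(),
    "d2": "boat rudder sail sea".split(),
    "d3": "boat diesel sea water".split(),
}

def tf(token: str, doc_tokens: list[str]) -> float:
    """Occurrences of the token divided by the document's token count."""
    return doc_tokens.count(token) / len(doc_tokens)

def idf(token: str, corpus: dict[str, list[str]]) -> float:
    """Natural log of (documents in corpus / documents containing token).

    Assumes the token occurs in at least one document.
    """
    containing = sum(1 for tokens in corpus.values() if token in tokens)
    return log(len(corpus) / containing)

def tfidf(token: str, doc_id: str, corpus: dict[str, list[str]]) -> float:
    return tf(token, corpus[doc_id]) * idf(token, corpus)

for token in ("train", "diesel"):
    print(token, round(tfidf(token, "d1", docs), 4))
```

Both tokens occur once in d1, but “train” appears in only one document while “diesel” appears in two, so “train” receives the larger weight.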
• Identify factors, themes, topics. Many data summarization
techniques are designed to compress multiple data dimensions
into a lower number of dimensions to make the data
more tractable and understandable. Many of these techniques
are appropriate for text once the textual data has been parsed,
transformed, and assigned to a data structure with associated
metadata. These techniques are discussed later, primarily in
Chapter 5.
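Several of the linguistic processing steps listed above (tokenization, case resolution, stop-list filtering, and assembly of a term-by-document matrix) can be chained in a few lines. This Python sketch is a simplified illustration, not a description of any SAS procedure; the stop list, the sample documents, and the function `preprocess` are hypothetical, and stemming and lemmatization are deliberately left out.

```python
import re
from collections import Counter

STOP_LIST = {"the", "was", "a", "is", "and"}  # hypothetical stop list

def preprocess(text: str) -> list[str]:
    """Lowercase the text, tokenize on letters, and drop stop-list terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_LIST]

documents = [
    "The packaging was not good.",
    "The product is good and the price is good.",
]

# One term-frequency counter per document.
counts = [Counter(preprocess(d)) for d in documents]

# Term-by-document matrix: one row per document, one column per term.
vocabulary = sorted(set().union(*counts))
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
print(matrix)
```

A matrix of this kind is the usual input to the weighting and dimension reduction steps, such as TFIDF, clustering, or SVD.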
The Project Gutenberg eBook of Il tramonto di
una civiltà, vol. 2 (di 2)
This ebook is for the use of anyone anywhere in the United States
and most other parts of the world at no cost and with almost no
restrictions whatsoever. You may copy it, give it away or re-use it
under the terms of the Project Gutenberg License included with this
ebook or online at www.gutenberg.org. If you are not located in the
United States, you will have to check the laws of the country where
you are located before using this eBook.
Title: Il tramonto di una civiltà, vol. 2 (di 2)

O la fine della Grecia antica
Author: Corrado Barbagallo
Release date: May 4, 2024 [eBook #73535]
Language: Italian
Original publication: Firenze: Le Monnier, 1923
Credits: Barbara Magni and the Online Distributed Proofreading

Team at http://www.pgdp.net (This file was produced from
images made available by the HathiTrust Digital Library)
*** START OF THE PROJECT GUTENBERG EBOOK IL

TRAMONTO DI UNA CIVILTÀ, VOL. 2 (DI 2) ***
IL TRAMONTO DI UNA CIVILTÀ
VOLUME SECONDO
CORRADO BARBAGALLO
IL TRAMONTO DI UNA
CIVILTÀ
O
LA FINE DELLA GRECIA ANTICA

“..... We shall investigate and come to
understand together the reasons why Sparta
and Athens, from the height of glory to
which, among the Greeks, they had risen
from nothing, later came close to falling
into servitude; the reasons why the
Thessalians, who had grown extraordinarily
in wealth and power, are now reduced to
the extremity of despair.” “To that end
we must go back to the first causes,
rather than recall the events that
proceeded from them: to the first causes
of the evils that have brought us to our
present ruin.”
(Isocr., On the Peace, 116-17; 101).
VOLUME TWO
FLORENCE
FELICE LE MONNIER
PUBLISHER
LITERARY PROPERTY RESERVED
49-1924, Florence, Stab. Tip. E. Ariani, Via S. Gallo, 38
CONTENTS
CHAPTER ONE.
WAR
The wars of ancient Greece.
Ancient Greece was, throughout its historical existence, which was not
very long but not very short either, deeply afflicted by the endemic and
incurable disease of war. In this respect its glorious republics have
only one term of comparison (and the recurrence of the phenomenon was no
accident): the Italian communes of the Middle Ages. Its life was one
unbroken series of long hostilities and brief armistices, an incessant
sharpening, brandishing, crossing, and clashing of arms.
The age of ancient Greece that we might call prehistoric opens to our
minds with the evocation of two great series of wars: the Trojan War
and the countless others that go under the name of the “Dorian
migration.” Then, in a chronologically more secure age, we find, in the
seventh and sixth centuries, the incessant wars against Messenians,
Argives, Arcadians, and others, through which Sparta wins high
sovereignty over the Peloponnese. Then, from 500 to 494, comes the
insurrection and war of the Greek colonies of Asia, aided by Athens and
Eretria, against Persia; from 492 to 479, the first fearsome Persian
invasions; from 478 to 449, the Greek counteroffensive against Persia,
while at the same time, in Sicily and in Greek Italy (Magna Graecia),
long and bloody struggles unfold between one Hellenic colony and another
(Crotoniates against Sybarites, Syracusans against Agrigentines,
Syracusans against Crotoniates), as well as between Greeks and
Carthaginians, Greeks and Etruscans, Greeks and Italic peoples.... In
466 or 471, in Greece proper, comes the insurrection of Naxos against
Athens, fiercely put down; from 466 to 464, the war of Athens against
Thasos; from 459 to 451, while the war of the Athenian League against
Persia continues, a double series of Athenian-Corinthian-Spartan
hostilities follows; from 449 to 446, a Boeotian-Athenian-Spartan war;
in 440-39, the rebellions of Samos and Byzantium against Athens; in
437-34, an Athenian expedition against the Greek cities of Thrace and
the Pontus; from 435 to 433, the Corinthian-Corcyraean-Athenian war;
after which, the following year, the insurrection of Potidaea and part
of Chalcidice against Athens breaks out, inaugurating the almost
uninterrupted twenty-seven years of the terrible Peloponnesian War
(431-04), which wrapped the entire Hellenic world in its flames. Between
404 and 403 follows the first recovery of Athens and the campaign of
Thrasybulus against “the Thirty”; between 400 and 387, a new
Spartan-Persian war, interspersed with hostilities of Sparta against
Elis and against Thebes, while in the West the conquests of the elder
Dionysius on Sicilian and Italic territory unfold, along with a long war
of Syracuse against the Carthaginians. In 394 there begins, and
continues until 387, the great so-called Corinthian-Boeotian War, which
likewise swept into its whirlwind Thebes, Athens, Corinth, Argos,
Sparta, Euboea, central Greece, and Chalcidice, while in the West
Dionysius the Great resumed the war against Carthage (383?) and his
attempts at expansion in Italy, which now bring him into conflict even
with the Etruscans. From 386 to 380, Spartan-Mantinean wars,
Olynthian-Chalcidian wars, and Spartan-Olynthian wars press upon and
intertwine with one another. From 377 to 362 stretches the epic era of
the great Theban-Spartan-Athenian-Thessalian-Epirote conflicts, and in
the West a new offensive of Syracuse is unleashed against its eternal
enemy, Carthage. From 362 to 357, while arms do not rest in the
Peloponnese, Sparta and Athens return to war against Persia, and then,
the enterprise having failed, Athens again attempts the violent
annexation to its empire of Euboea, Chalcidice, and the Thracian
Chersonese, only recently lost, coming into conflict with Macedonia, of
which Philip II has now become king. From 357 to 355 the so-called War
of the Allies against Athens takes place; from 355 to 346, the
exceedingly bloody First Sacred War; in 353, the first wars of Philip II
for the conquest of Thessaly; from 346 to 340, an almost uninterrupted
series of hostilities between Athens and Macedonia; from 339 to 338, the
Second Sacred War, which seals the end of Greek independence under
Macedonian hegemony. At the same time, in the thirty years running from
367 to 337, all of Greek Sicily burns with a blaze of civil wars between
city and city, interrupted only by attempts, now fortunate, now
unfortunate, of Carthaginians against Greeks and of Greeks against
Carthaginians. Between 336 and 335 two new Macedonian invasions of
Greece follow one another, culminating in the catastrophe of Thebes. In
334 opens the great Macedonian-Greek enterprise for the decisive
conquest of Persia, which will last until 326. Meanwhile, from 333 to
330, during the absence of Alexander the Great, engaged in the East,
Sparta wages war against Macedonia, and between 323 and 322, Alexander
being dead, the disastrous Lamian War is fought between Greeks and
Macedonians. Then follow, until 239, the endless contests among the
successors of Alexander, disputed and settled in large part on Greek
soil, and, from 239 to 146, the most intricate tangle of
Macedonian-Achaean-Aetolian-Spartan-Roman wars. The year 146 records the
mournful epilogue of the destruction of Corinth and the end of the
independence of Greece under the Roman heel.
In seven centuries of history, then, some seven hundred uninterrupted
years of wars, each of which did not merely convulse a small region but
drew into its vortex almost all the Hellenic states of the peninsula and
the others of Italy and Asia, which, moreover, customarily figured
either in the more or less regular Athenian, Spartan, and Italic
symmachies or in the temporary alliances that the Greek nations
concluded and dissolved with astonishing ease.
Nor did the century and a half of Rome's republican empire prove the
harbinger of better fortune. From the political end of Greece to the
twilight of the Roman Republic, the greatest military operations of the
time take place on Greek soil: the Mithridatic Wars, continued, apart
from brief armistices, from 88 to 66; the campaign against the pirates
(67); the civil wars of Sulla against Fimbria (85-84), of Pompey against
Caesar (49-48), and of Antony and Octavian against Caesar's assassins
and against Sextus Pompeius (42-35); and the last, of Octavian against
Antony (32-30). Only then, at last, did peace, of which Rome, in the
manner of the Second Napoleonic Empire, had occasion to boast of being
the dispenser, spread its restorative leisure over ill-starred Hellas.
This phenomenon of perennial war, wholly Greek, was neither arbitrary
nor accidental. Its deep causes lay in Greece itself, that is, in the
essentially municipal nature of its political organization. It is a very
well-known truth that Greece knew no form of state other than the
municipality, whose borders as a rule did not extend beyond the
territory of a single city. But little attention has been paid to the
enormous consequences, beneficial and harmful, that such a situation
brought with it. The nobility and greatness of the Greek spirit, like
that of the medieval Italian communes, was born precisely of the
much-deplored phenomenon of municipalism, which exalted all the moral
powers of the citizens, enclosed within narrow bounds, for whom their
city was the whole fatherland, the whole world. From this state of
affairs were born the ardent patriotism and the varied, marvelous
multiplicity of cultural, artistic, and spiritual developments that
characterize Greek history. But from it was also born the endemic evil
of permanent war. Every large state possesses means sufficient, or
nearly so, for its prosperity; it has seaports, fertile lands, plains,
mountains, a variety of crops, river outlets, centers naturally suited
to industry and centers naturally suited to agriculture. Each of its
districts can help the others and receive help from them in turn. An
isolated city, a municipal state, cannot. As a rule they lack one or
more of these goods. The one that possesses timber has no port in which
to unload it; the one that manufactures goods has no free passage to the
centers from which raw materials are imported; the one that holds the
mountain does not command the plain; the municipality blessed with
abundant vineyards does not have a population sufficient to consume its
wine. The municipal state is consequently, by its nature, maimed and
paralyzed. Hence the continual need, which is the reason and the means
of its existence, to annex, to subjugate, to wrest from others the goods
it requires. In this deep soil lies the root of the continual, furious
warfare of the Greek republics, as of the medieval Italian communes.
Then fortunate success or the irritation of an ill-borne defeat, glory,
ambition, burning grief, and losses suffered or feared complicated the
problem, inflamed it, and poisoned it. War, therefore, in ancient Greece
was, like civic imperialism, a vital and fatal element of its
centuries-long existence. Without it, history would know only an obscure
Greece, vegetating in mediocrity and silence. Without it, the splendor
and glory of Athens and Sparta would never have been. This did not
prevent the effects of continual war from turning terribly against those
who had unleashed them, nor, once generated, from assuming an
incalculable power of destruction.
Greece, which could not live without war, was condemned to perish by its
perpetual war. Here, where sovereign states were countless, the
conflicts between states had to be innumerable every day. Here the
living had to gnaw at one another, from wall to wall, from ditch to
ditch. She too, this ancient ship without a helmsman in a great storm,
was destined to break apart among the gigantic breakers that her own
violent course was raising. And just as the medieval communes ended by
calling for a lord who would at last give them peace, so ancient Hellas
ended by preferring a lordship, that of the Roman Empire, to its savage
liberty, drenched in tears and blood. Unfortunately, the heroic remedy
came, this time, too late!
The demographic effort.
To bring out fully the harmful effects that war caused in the Greek
world, we ought strictly to examine the individual repercussions of the
phenomenon in all the states that made up the Hellenic world.
Unfortunately, this is absolutely barred to us by the scarcity and the
enormous obscurity of the information concerning their internal life. We
can, however, choose the typical example of one of the many Greek states
about which we are better informed, the Athenian state, for example, and
from this analysis draw all the analogies that we shall see emerging
spontaneously one by one, and arrange around them all the other lesser,
much sparser pieces of information that come to us from other states.
Such is the procedure we are compelled to follow. But from it we flatter
ourselves that we can draw suggestions sufficient to form an exact idea
of what, for ancient Greece, the infinite evils brought by war were.
As must happen to every small state that aspires to great ends, Athens
was forced to wage war with the greatest sacrifice of the men at its
disposal. The free population of Attica, in its best period, was around
250,000 souls. Yet we find that at the battle of Marathon, in 490 B.C.,
Athens took part with 9,000-10,000 hoplites, and probably with as many
light-armed troops (gymnetes) [1]; at Plataea (479), with 8,000 hoplites
and as many gymnetes [2], while at least 25,000 Athenians were embarked
on the fleet [3]. We find that the Athenians, at the battle of Tanagra
(457 B.C.), fielded about 14,000 hoplites and as many gymnetes, while
other contingents had been sent to Aegina and to Egypt [4]; that, during
the Peloponnesian War, Athens in 431 mobilized for the defense of Attica
more than 30,000 hoplites and cavalry [5] and a certainly no smaller
number of gymnetes [6], and that in 424 it invaded Boeotia with about
20,000 men [7]. This was in the fifth century, that is, in the age of
Attica's greatest demographic flourishing. In the fourth century Athens
takes part in the first invasion of the Peloponnese by Epaminondas (370
or 369) with 12,000 men [8]; the following year, the Athenians fight
against the Boeotian league, about 10,000 strong [9]; finally, on the
occasion of the Second Sacred War (339-38), the heroic city mobilized
all its men up to the age of 50, arming 9,000 to 10,000 hoplites [10].
Now these figures, none of which can be said to exhaust the whole
mobilization effort of ancient Attica, and from which the crews and
sailors of the great Athenian fleets are as a rule excluded, by
themselves point to a mobilization of 10%, 12%, at times even 24% of
the total population: proportions utterly unheard of, which,
repeated and prolonged over centuries, were bound to exhaust the
vitality of any people [11].
War, stock-raising, and agriculture.
A system of continual war of this kind devastated in equal measure
the population of the State and the whole economy of the nation. It
devastated and stifled at birth the agriculture and stock-raising of
Attica. On the day the Athenian Empire was established, and it
became plain that many men, that is, an abundant human machinery,
were needed to defend it, the government of the Republic had to try
to persuade the population of Attica to leave the fields and come
into the city, where all, so it was said, would find a living in
military service and in the exercise of public office [12]. Perhaps
mistaking a responsibility of circumstances for a responsibility of
persons, some of the ancients attributed this advice outright to
Aristides, the head of the agrarian party, who had become, by a
singular irony of fate, the first founder of the Athenian Empire
[13]. But whether it was Aristides or others, it is certain that
from this moment begins the exodus of the peasants of Attica from
the countryside into the city; that is, the urbanization of so large
a part of the population, which was fated to bring with it the
arrest and decline of stock-raising and agriculture in the region.
Then war, which will not be slow to flare up time and again, will do
the rest.
The value of the livestock that grazed in Attica was far from
insignificant. Attica fed sheep, goats, asses, and mules in great
abundance, and even oxen and horses, scarce at first, later became
numerous there, thanks especially to the pastures of Euboea [14].
Now the coming of war brought the announcement of the end of so much
wealth, just as, over a field rich with crops, the raging of the
wind comes before the hail even breaks. For this it was not
necessary that the enemy should invade the country. "When the enemy
is near," an ancient writer will say on another occasion, "the fact
that the invasion has not taken place does not prevent the livestock
from being left to chance" [15]. But far worse, naturally, followed
when the invasion actually took place. Most grievous then were the
effects of the ancient (or the eternal?) manner of waging war. More
often than not it came down to raids, plunder, and brigand-like
depredations [16]. "To use the interests of the landowners as a
lever, to devastate their lands systematically, to destroy their
crops, to carry off slaves and livestock as booty: here is an almost
infallible expedient for wresting advantageous terms" [17]. Nor was
there any means of preventing this sinking of the enemy's claw into
the living flesh of the country. The narrowness of the territory of
each little Greek state brought the invader straight to the heart of
the State and led him quickly to destroy, at a single stroke, all
its agricultural prosperity. A successful invasion was therefore a
deep injury, which at times could not be repaired. It is thus easy
to gauge how, and how much, the presence of the Spartans in Attica
during the Peloponnesian War harmed the economic existence of the
country.
The cultivation of the olive, the vine, and the various fruit trees
that had formed the wealth of the Athenian countryside was either
wholly ruined or never again restored. The herds, raised, tended,
and improved over long years, became the invader's prey and
slaughter; and whoever had spent his wealth and committed his energy
to such work saw the labors of long years destroyed in a single hour
[18].
But all this, besides harming the farmers, did incalculable damage
to the crowd of consumers, who, as always, made up the great mass of
the population. Under the competition of foreign wines, the happy
supremacy of Attica began, after the Peloponnesian War, to decline
little by little, and ended by giving way to that of all the rival
nations. Prices rose to dizzying levels. While the population's
resources were shrinking, the cost of wine went from 10 to 35 lire
per hectoliter [19], with an average, perhaps even a minimum, of
about 25 lire [20]. The choice varieties were now imported at a high
price from abroad, and the wine of Chios was paid for, on the
markets of Athens, at about 300 lire per hectoliter [21].
The disastrous effects of such a war were not merely temporary. On
the one hand, because of the constant imminence of danger, the rural
population came to prefer, over villages scattered or dispersed into
farmsteads, concentration in fortified small towns, amid the high
cost of food and the inconvenient distance from the natural centers
of work [22]; on the other, it ended by preferring inferior crops to
superior ones. What use, indeed, in lingering over long, costly,
difficult crops, however remunerative, which absolutely require a
secure and tranquil peace, since the permanent condition is war and
an instant of hatred is enough to destroy the patient work of years?
Better, then, to abandon the cultivations that require long and
intensive labor, and to let the arid earth produce of itself what
little it pleases.
Worse could happen, and in fact did. The yearly uncertainty of the
harvest, of which no one ever knew whether it would fall to the
citizens or to the invaders, at times ended by bringing about,
literally, the abandonment of agriculture. "One sows," writes a
modern economist [23], "only in the hope of reaping. One clears,
plants, and builds only on condition of not having to fear every day
the loss of one's capital. The most prosperous agriculture would
soon decay if the ground were to give way beneath the feet of those
who own it....; the decline would be the more rapid the more
imminent and grave the danger. Certainly security of possession is
not always sufficient to give agricultural work a singular impulse,
but it is without example that such work has prospered without
it....".
It was then that, all hope lost, the ruined farmers poured in crowds
(spontaneously, no longer needing to be urged) within the city
walls, to raise another wave of competition to the detriment of the
working population or, worse still, to demand that the State provide
the indigent and the unemployed with the means to live. For this
very reason, with the end of the fifth century B.C., the great curée
of public allowances truly begins. Now it is no longer possible to
restore to the countryside so large a part of the suddenly urbanized
population. Now, according to the reckoning of an ancient writer,
more than 20,000 citizens suck daily at the breasts of the Athenian
State [24], and the public allowance, from a natural sanction of
democracy, becomes a "gnawing cancer" [25] of the Republic.
But we moderns are no longer able to form an adequate idea of the
full disaster that the decline or ruin of agriculture meant for the
ancients. Among us it may turn out to be one of the secondary
elements of social life. For some modern States, society rests less
on the development of agriculture than on the progress of its
industry and, especially, on the reach of its commerce. In the
ancient world precisely the opposite was the case [26]. When,
therefore, there is talk there of the ruin of the cultivation of the
land and of the farmers, one may swear that one is facing the
collapse of the larger and better part of the economic edifice. "A
solemn truth is stated," wrote Xenophon, "when it is affirmed that
agriculture is the mother and nurse of all the arts. When it
prospers, they too prosper; when the soil must lie uncultivated, it
may be said that every other activity, practiced on land and on sea,
dies out" [27]. There remained the rivulets of industry and
commerce, but on these too war, permanent and devastating, did not
fail to exert its baleful consequences.
War and commerce.
The consequences of interrupted sea communications were naturally
bound to be graver for countries that were forced to import grain,
that is, that lacked the fundamental element of daily nourishment.
Ancient Attica produced cereals in very small quantity, so that it
could hardly do without imports, just as ancient Italy later could
not, or as England, for example, cannot today. Athens drew from its
countryside barely 20,000 hl. of wheat [28], and needed a quantity
at least twenty times greater [29]. A large part of the draconian
legislation of the city was aimed precisely at securing this supply.
Hence the Athenian State, while on the one hand it strictly forbade
the export of grain, required that at least two thirds of the
foreign cereals landed at the Piraeus be assigned to the city's
consumption, and that no resident of Attica unload them anywhere but
in the port of Athens; it restricted their export not only from
Athens but also from the Pontus and from Byzantium; it exempted, it
seems, grain merchants from certain burdens; and it strictly
prevented hoarding [30]. But all this was not enough; to procure
grain, it was necessary to have easy passage to the various centers
of import: the Black Sea, Egypt, Phoenicia, Thrace, Macedonia,
Thessaly, Syria, etc. [31].
In such conditions the free use of the sea was, for Athens, not only
the necessary premise of any further greatness, but a matter of life
and death. It is evident, therefore, to what blows the state of war
and its uncertain fortunes must repeatedly have exposed the
country's economy. If swings in the prices of goods were, in ancient
Attica, far graver and more frequent than in any other region of the
contemporary world, this must have been so in particular for the
commodity most necessary of all to life: cereals [32].
Then one could witness this singular and tormenting economic
tragedy: while permanent war daily placed the countryside of Attica
in serious danger; while it was no longer possible to tend the
products best suited to the country and most remunerative
(viticulture and olive-growing), it was necessary to persist, to the
furthest possible limit, in the cultivation of cereals, to which the
nature of the soil was averse; which meant sowing lands that could
leave no margin of profit.... In this way the fatal decline of the
agriculture of Attica was the work not only of the dangers of
imminent invasions, but also of the mere terror of restricted or
hampered trade.
All this, one might think, could have served to raise prices at home
and to bring prosperity to a good part of the agricultural
population. A mere illusion! Whoever was forced to sell the products
of his land to the first bidder, or had already burdened himself
with debts, could not withstand the competition of the large
estates. The rise in prices therefore benefited only a small part of
the landowners, and at the same time harmed the great mass of the
producing and consuming population. Famine and what today is called
the high cost of living took shape as an everyday economic fact,
bringing with them, infallibly, the starving fever of hoarding and
speculation [33].
More on war and commerce.
But Athenian commerce did not concern only the food supply of the
country. The unknown author of The Athenian Republic insists at
every turn on what we might call the mercantile cosmopolitanism of
the economic life of Attica. "Whatever is choice in Sicily, in
Italy, in Cyprus, in Egypt, in Libya, on the shores of the Black
Sea, in the Peloponnese, or in any other region, all of it, thanks
to the maritime empire that we hold, flows to us" [34]. And
Thucydides had written: "The power of our city is such that we
readily enjoy not only our own products, but those of every part of
the world" [35]. Raw materials of every kind and finished goods of
every refinement did in fact flow there: timber for ships and for
building, wool, linen, pitch, leather, papyrus, hides, wax, honey,
metals, salted fish and meats, cheese, lard, tallow, livestock,
fruit, ivory, incense, unguents, drugs, silphium, wines, fabrics,
carpets, cloths of silk, wool, and purple [36], objects of



Applied Regression and ANOVA Using SAS 1st Edition

Mining Author Cocitation Data with SAS Enterprise Guide

Structural Equation Modeling Using R/SAS: A Step-by-

The Analytic Hospitality Executive: Implementing Data Analytics in Hotels

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

No part of this publication may be reproduced, stored in a retrieval system, or

Limit of Liability/Disclaimer of Warranty: While the publisher and author have

Library of Congress Cataloging-in-Publication Data is Available:

Cover Design: Wiley

Appendix A Mood State Identification in Text 157

This book provides an end-to-end description of the text analytics

This work was initiated and promoted by Julie Palmieri, serving

Barry deVille is a practitioner, developer, and author in the fields

Gurpreet Singh Bawa has practiced internationally in the areas

Text analytics are a collection of computer methods that use semantic

In Chapter 6 we provide examples of how quantitative text prod-

BACKGROUND AND TERMINOLOGY

The analysis of written and spoken expression has been developing

TEXT ANALYTICS: WHAT IS IT?

Text processing and text analysis are components of the developing

a wide range of computer-mediated inference tasks that includes

One recent area of written language processing includes statistical

Brief History of Text

Language is a form of communication, and text is a written form of

Figure 1.1 Traffic sign in Cherokee syllabary, Tahlequah, Oklahoma.

capture parts of words like syllables in written expression – and

These characters eventually gave rise to the widespread use of the

Figure 1.4 Modern Chinese representation of “eye” (mù).

Writing Systems of the World

1. Alphabets. Each letter represents a sound which can be either

Meaning and Ambiguity

Much of the work that we do in text mining – both hidden in the

The train arrives The train arrives

Score: Tokens Sent / Tokens Received


Figure 1.5 Encode–decode send–receive communications model.

sender’s message are completely and accurately received and inter-

higher the correlation. As we move more deeply into text analytics,

train wheels diesel track land

The “boat” documents have the following words or tokens:

boat rudder sail sea water

At this point, we can reframe the collection of text documents

Table 1.1 Trains and Boats Example: Document Collection

We can see that some terms/tokens appear in both kinds of doc-

where H(X) is the expected value of the entropy calculation. It mea-

Table 1.2 Entropy Calculation for Trains and Boats Example

Feature Proportion Proportion pr(train) pr(boat) Entropy
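
As a rough illustration of the kind of per-term entropy calculation that Table 1.2 refers to, the sketch below computes the standard Shannon entropy, H(X) = -sum(p * log2(p)), of each term's distribution across the "train" and "boat" document classes. The table's actual values are not reproduced in this excerpt, so the toy documents, the function names `entropy` and `term_entropy`, and the choice of base-2 logarithms are all assumptions made for illustration; the book's own calculation may differ in detail.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def term_entropy(docs_by_class):
    """Entropy of each term's occurrence distribution across classes."""
    # Count whitespace-separated tokens per class (hypothetical tokenizer).
    counts = {
        label: Counter(tok for doc in docs for tok in doc.split())
        for label, docs in docs_by_class.items()
    }
    vocab = set().union(*counts.values())
    result = {}
    for term in vocab:
        per_class = [counts[label][term] for label in docs_by_class]
        total = sum(per_class)
        result[term] = entropy([c / total for c in per_class])
    return result


# Hypothetical toy corpus, not the book's data.
corpus = {
    "train": ["train wheels diesel track land", "train track land"],
    "boat": ["boat rudder sail sea water", "boat sail water land"],
}
scores = term_entropy(corpus)
```

Under these assumptions, a term that occurs in only one class (such as "train" or "rudder") scores zero, while a term shared by both classes (here "land") scores above zero, so low entropy marks terms that discriminate between the two kinds of document.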

In this chapter, we identify a number of best practices in the areas

PROCESS BUILDING BLOCKS

A high-level view of processing for text analytics resembles many

- Preparation. Getting the text ready for analysis (data capture,

Figure 2.1 Main stages of the text-mining process.

We can map latent structure development, composite document

- Capture documents. First, assemble the documents. Usually,

incorporating topics or clusters in corpus summaries. It is also


Regardless of the data source, a number of data on-boarding tasks are

Figure 2.2 Category-oriented folder structure.

N-grams are contiguous sequences of n objects from a piece of

Table 2.1 N-gram Illustration

Table 2.2 Advantage of Using n-grams vs. Unigrams
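
The idea of n-grams as contiguous sequences of n items can be sketched in a few lines. The example below is only a minimal illustration: it assumes whitespace tokenization and word-level n-grams, and the function name `ngrams` is hypothetical rather than taken from the book or from SAS.

```python
def ngrams(text, n):
    """Return the contiguous word n-grams of `text` as a list of tuples."""
    tokens = text.split()  # assumption: tokens are whitespace-separated
    # Slide a window of width n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


unigrams = ngrams("the train arrives on time", 1)
bigrams = ngrams("the train arrives on time", 2)
```

With n = 1 this reduces to ordinary unigram tokenization; a text with fewer than n tokens yields an empty list.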

The simplest and easiest method of n-gram tokenization is