
Text and Web Mining

INTRODUCTION TO TEXT ANALYTICS

Text analytics, sometimes referred to as text data mining or text mining,
is the process of deriving high-quality information from text.


TEXT MINING CONCEPTS
85-90 percent of all corporate data is in some kind
of unstructured form (e.g., text)
Unstructured corporate data is doubling in size
Tapping into these information sources is not an option,
but a need to stay competitive
Answer: text mining
- A semi-automated process of extracting knowledge from
  unstructured data sources
- a.k.a. text data mining or knowledge discovery in textual
  databases
TEXT MINING CONCEPTS
The benefits of text mining are obvious, especially in text-rich
data environments
- e.g., law (court orders), academic research (research articles),
  finance (quarterly reports), medicine (discharge summaries),
  biology (molecular interactions), technology (patent files),
  marketing (customer comments), etc.
Electronic communication records (e.g., email)
- Spam filtering
- Email prioritization and categorization
- Automatic response generation
In the 1970s and early 1980s, text analytics started with Bag of Words extraction. For example, consider
the following sentence:
Cstmr not happy with his bank account - Customer wants to switch to Yes Bank.
Text analytics tools would extract the following words:
Cstmr
Customer
Yes
Bank
happy
not
switch
bank
account
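
The extraction above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the tokenizer (runs of letters) and the stop-word list are assumptions chosen so that the output matches the word list shown above.

```python
import re

# Assumed stop-word list, chosen only to reproduce the slide's word list.
STOP_WORDS = {"with", "his", "wants", "to"}

def bag_of_words(text):
    """Return the distinct, case-sensitive word tokens in order of first appearance."""
    seen = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token not in seen:
            seen.append(token)
    return seen

sentence = ("Cstmr not happy with his bank account - "
            "Customer wants to switch to Yes Bank.")

# Drop stop words; word order and context are deliberately ignored.
words = [w for w in bag_of_words(sentence) if w.lower() not in STOP_WORDS]
print(words)
```

Note that "not" and "happy" come out as unrelated words, and "bank" and "Bank" are treated as different tokens. That loss of context is the main limitation of the bag-of-words approach.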
Sentiment Analysis

Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics, and text
analytics to identify and extract subjective information in source materials. Generally speaking, sentiment analysis aims to determine
the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be
his or her judgment or evaluation, affective state (that is to say, the emotional state of the author when writing), or the intended
emotional communication (that is to say, the emotional effect the author wishes to have on the reader) [Source: Wikipedia].

Sentiment analysis answers the question: is what is being said "positive" or "negative"?

A sophisticated text analytics tool can identify the sentiments associated with the named entities, concepts as well as themes being
discussed in the text data. Examining our example once again, we note the following sentiments associated with named entities,
concepts and themes:

(Reuters) - Research In Motion Ltd said on Tuesday its subscriber base has risen to 80 million from the 78 million it reported earlier this
year, surprising many on Wall Street and sending its shares up more than 3 percent.
Most analysts had expected RIM, for the first time in its history, to begin losing subscribers in the recently completed quarter as it has
rapidly lost market share in North America to Apple's snazzier iPhone and Samsung's Galaxy devices.
Document Summarization

Document summarization is the creation of a shortened version of a text by a computer
program. The product of this procedure still contains the most important points of the original
text. Once again, let us take the two paragraphs from Reuters as an example:

(Reuters) - Research In Motion Ltd said on Tuesday its subscriber base has risen to 80 million
from the 78 million it reported earlier this year, surprising many on Wall Street and sending its
shares up more than 3 percent.
Most analysts had expected RIM, for the first time in its history, to begin losing subscribers in
the recently completed quarter as it has rapidly lost market share in North America to Apple's
snazzier iPhone and Samsung's Galaxy devices.

A summary of these paragraphs is as follows:

Research In Motion subscriber base has risen to 80 million sending its shares up more than 3
percent. Most analysts had expected RIM, for the first time in its history, to begin losing
subscribers.

As can be seen, the summary captures the gist of the article. While this may not be
impressive in the case of a two-paragraph article, the ability to rapidly summarize large
volumes of text data is a very useful output from sophisticated text mining applications.
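
One simple way to approximate this is extractive summarization: score each sentence by how frequent its words are in the whole text and keep the top-scoring sentences. The sketch below is an assumption-laden illustration (the sentence splitter, the "longer than three letters" filter, and the scoring rule are all choices made here), and it selects whole sentences, so it will not reproduce the trimmed summary shown above.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def summarize(text, n_sentences=1):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    words = re.findall(r"[a-z']+", text.lower())
    # Assumed heuristic: ignore very short words as a crude stop-word filter.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)

text = (
    "(Reuters) - Research In Motion Ltd said on Tuesday its subscriber base "
    "has risen to 80 million from the 78 million it reported earlier this "
    "year, surprising many on Wall Street and sending its shares up more "
    "than 3 percent. "
    "Most analysts had expected RIM, for the first time in its history, to "
    "begin losing subscribers in the recently completed quarter as it has "
    "rapidly lost market share in North America to Apple's snazzier iPhone "
    "and Samsung's Galaxy devices."
)
print(summarize(text, n_sentences=1))
```

Because the score is a plain sum, longer sentences are favored; dividing by sentence length is a common adjustment.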
From the bag of words alone, we are able to gather that the sentence relates
to a bank account customer, but not much else.
Once multi-word expressions are extracted, the same sentence yields the following expressions:

Cstmr
Customer
Yes
Bank
not happy
switch
bank account

As you will appreciate, the expression "not happy" conveys a very different meaning
than the word "happy"!
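
A minimal sketch of how such expressions might be extracted: attach a negator to the word that follows it, and join word pairs found in a phrase list. The stop-word list and the phrase list are assumptions made for this example, not part of any specific tool.

```python
import re

STOP_WORDS = {"with", "his", "wants", "to"}   # assumed stop-word list
KNOWN_PHRASES = {("bank", "account")}         # assumed multi-word terms

def extract_expressions(text):
    """Return tokens, merging 'not X' and known two-word phrases into one item."""
    tokens = re.findall(r"[A-Za-z]+", text)
    expressions = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        is_negation = token.lower() == "not"
        is_phrase = (token, following) in KNOWN_PHRASES
        if following is not None and (is_negation or is_phrase):
            expressions.append(f"{token} {following}")
            i += 2   # both words were consumed
        elif token.lower() in STOP_WORDS:
            i += 1   # skip stop words
        else:
            expressions.append(token)
            i += 1
    return expressions

sentence = ("Cstmr not happy with his bank account - "
            "Customer wants to switch to Yes Bank.")
print(extract_expressions(sentence))
```

"Yes" and "Bank" remain separate here because recognizing "Yes Bank" as a single name is a named-entity problem rather than a phrase-list lookup.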
Another breakthrough in text analytics came with the ability to extract
named entities. This helped identify what was being discussed, as can
be seen below:

customer --> CRM term

Yes Bank --> Bank (not the affirmative)
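
The mapping above can be illustrated with a gazetteer, that is, a lookup table from known names to entity types. This is a minimal sketch: the table contents come from the two mappings above, while the function name and the case-insensitive substring matching are assumptions made for the example.

```python
# Gazetteer built from the two mappings shown above.
GAZETTEER = {
    "customer": "CRM term",
    "yes bank": "Bank",
}

def tag_entities(text):
    """Return (surface form, entity type) for each gazetteer entry found in text."""
    lowered = text.lower()
    found = []
    for phrase, label in GAZETTEER.items():
        pos = lowered.find(phrase)   # first occurrence only
        if pos != -1:
            found.append((text[pos:pos + len(phrase)], label))
    return found

sentence = ("Cstmr not happy with his bank account - "
            "Customer wants to switch to Yes Bank.")
print(tag_entities(sentence))
```

Because "Yes Bank" is matched as a unit, the word "Yes" is not mistaken for an affirmative. A plain substring match is crude (it would also match "customer" inside "customers"), and it does not recognize the abbreviation "Cstmr" without an extra alias entry.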
SENTIMENT ANALYSIS

Sentiment Analysis helps us identify subjective information in textual
data. We are now able to gather the following information:

Customer (cstmr) --> bank account --> unhappy (Negative)

Switch to (negative) --> Yes Bank (competition)
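
A lexicon-based scorer is one simple way to arrive at a polarity like this. The sketch below is illustrative only: the tiny lexicon, the negator list, and the rule that a negator flips the next word are assumptions, and production tools rely on far larger lexicons or trained models.

```python
import re

# Assumed toy lexicon: +1 for positive words, -1 for negative ones.
# "switch" is scored negative because, in this CRM context, it signals churn.
LEXICON = {"happy": 1, "unhappy": -1, "switch": -1}
NEGATORS = {"not", "no", "never"}

def polarity(text):
    """Sum word scores, flipping the sign of a word directly preceded by a negator."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score

def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

sentence = ("Cstmr not happy with his bank account - "
            "Customer wants to switch to Yes Bank.")
print(polarity(sentence), label(polarity(sentence)))
```

Here "not happy" contributes -1 and "switch" contributes -1, so the sentence is labeled negative overall.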
CATEGORIES OF TEXT MINING

What can a text mining application extract from unstructured text data?
The categories are as follows:
Named Entities Extraction
Document Summarization
Theme Extraction
Concept Extraction
Sentiment Analysis
NAMED ENTITIES EXTRACTION
Named Entities Extraction helps answer the question
"who, what and where" is being discussed. Let us
take the following paragraphs from a recent Reuters
news article as an example:

(Reuters) - Research In Motion Ltd said on
Tuesday its subscriber base has risen to 80 million
from the 78 million it reported earlier this year,
surprising many on Wall Street and sending its
shares up more than 3 percent.

Most analysts had expected RIM, for the first time in
its history, to begin losing subscribers in the recently
completed quarter as it has rapidly lost market share
in North America to Apple's snazzier iPhone and
Samsung's Galaxy devices.
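
As a rough illustration of how candidate entities could be pulled from such a passage, the sketch below collects runs of capitalized words. This is an assumed heuristic, not how any particular named-entity tool works: it cannot assign types (who, what, where) and it produces false positives such as sentence-initial words.

```python
import re

# A run of one or more capitalized words separated by whitespace.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*\b")

def candidate_entities(text):
    """Return distinct capitalized word runs in order of first appearance."""
    candidates = []
    for match in CAPITALIZED_RUN.findall(text):
        if match not in candidates:
            candidates.append(match)
    return candidates

article = (
    "(Reuters) - Research In Motion Ltd said on Tuesday its subscriber base "
    "has risen to 80 million from the 78 million it reported earlier this "
    "year, surprising many on Wall Street and sending its shares up more "
    "than 3 percent. "
    "Most analysts had expected RIM, for the first time in its history, to "
    "begin losing subscribers in the recently completed quarter as it has "
    "rapidly lost market share in North America to Apple's snazzier iPhone "
    "and Samsung's Galaxy devices."
)
print(candidate_entities(article))
```

The heuristic finds "Research In Motion Ltd", "Wall Street" and "North America", but it also picks up "Most" and "Tuesday" and misses "iPhone", which is why practical systems combine gazetteers, context rules, or statistical models.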
TEXT MINING APPLICATION AREA

Information extraction
Topic tracking
Summarization
Categorization
Clustering
Concept linking
Question answering
TEXT MINING APPLICATIONS
Marketing applications
- Enables better CRM

Security applications
- ECHELON, OASIS
- Deception detection (…)

Medicine and biology
- Literature-based gene identification (…)

Academic applications
- Research stream analysis
TEXT MINING APPLICATION
(RESEARCH TREND IDENTIFICATION IN LITERATURE)

Mining the published IS literature
- MIS Quarterly (MISQ)
- Journal of MIS (JMIS)
- Information Systems Research (ISR)

- Covers 12-year period (1994-2005)
- 901 papers are included in the study
- Only the paper abstracts are used
- 9 clusters are generated for further analysis
TEXT MINING APPLICATION
(RESEARCH TREND IDENTIFICATION IN LITERATURE)
Journal:   MISQ
Year:      2005
Author(s): A. Malhotra, S. Gosain and O. A. El Sawy
Title:     Absorptive capacity configurations in supply chains: Gearing for
           partner-enabled market knowledge creation
Vol/No:    29/1
Pages:     145-187
Keywords:  knowledge management; supply chain; absorptive capacity;
           interorganizational information systems; configuration approaches
Abstract:  The need for continual value innovation is driving supply chains to
           evolve from a pure transactional focus to leveraging
           interorganizational partnerships for sharing …

Journal:   ISR
Year:      1999
Author(s): D. Robey and M. C. Boudreau
Title:     Accounting for the contradictory organizational consequences of
           information technology: Theoretical directions and methodological
           implications
Vol/No:    10/2
Pages:     167-185
Keywords:  organizational transformation; impacts of technology; organization
           theory; research methodology; intraorganizational power; electronic
           communication; mis implementation; culture; systems
Abstract:  Although much contemporary thought considers advanced information
           technologies as either determinants or enablers of radical
           organizational change, empirical studies have revealed inconsistent
           findings to support the deterministic logic implicit in such
           arguments. This paper reviews the contradictory …

Journal:   JMIS
Year:      2001
Author(s): R. Aron and E. K. Clemons
Title:     Achieving the optimal balance between investment in quality and
           investment in self-promotion for information products
Vol/No:    18/2
Pages:     65-88
Keywords:  information products; internet advertising; product positioning;
           signaling; signaling games
Abstract:  When producers of goods (or services) are confronted by a situation
           in which their offerings no longer perfectly match consumer
           preferences, they must determine the extent to which the advertised
           features of …

…
TEXT MINING TOOLS

Commercial Software Tools
- SPSS PASW Text Miner
- SAS Enterprise Miner
- Statistica Data Miner
- ClearForest, …

Free Software Tools
- RapidMiner
- GATE
- Spy-EM, …