Sharda Bi3 PPT 05
A Managerial Perspective on
Analytics (3rd Edition)
Chapter 5:
Text and Web Analytics
Learning Objectives
Describe text mining and understand the need
for text mining
Differentiate between text mining, Web mining,
and data mining
Understand the different application areas for
text mining
Know the process of carrying out a text mining
project
Understand the different methods to introduce
structure to text-based data
(Continued…)
Copyright © 2014 Pearson Education Limited Slide 5- 2
Learning Objectives
Describe Web mining, its objectives, and its
benefits
Understand the three different branches of Web
mining
Web content mining
Web structure mining
Web usage mining
Understand the applications of these three
mining paradigms
[Figure: diagram with components labeled "Answer sources", "Evidence sources", and "Trained models"; steps numbered 1 to 5]
Dream of AI community
to have algorithms that are capable of automatically
reading and obtaining knowledge from text
Natural Language Processing (NLP)
WordNet
A laboriously hand-coded database of English words,
their definitions, sets of synonyms, and various semantic
relations between synonym sets.
A major resource for NLP.
Needs automation to be completed.
Sentiment Analysis
A technique used to detect favorable and unfavorable
opinions toward specific products and services
SentiWordNet
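As a minimal sketch of the idea behind sentiment analysis, the snippet below scores a sentence by summing per-word polarity values. The lexicon `TOY_LEXICON` and the functions `sentiment_score` and `label_opinion` are hypothetical names invented for this illustration; a resource such as SentiWordNet would supply real scores, and this sketch ignores negation and context.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only);
# positive values mark favorable words, negative values unfavorable ones.
TOY_LEXICON = {
    "great": 1.0,
    "love": 1.0,
    "reliable": 0.5,
    "slow": -0.5,
    "broken": -1.0,
    "terrible": -1.0,
}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the polarity of every known word; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0.0) for word in words)


def label_opinion(text):
    """Map the numeric score to a coarse opinion label."""
    score = sentiment_score(text)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"


print(label_opinion("I love this phone, the camera is great"))
print(label_opinion("The app is slow and the update is broken"))
```

A word-counting approach like this is only a starting point; it cannot tell "not great" from "great".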
[Figure: process diagram with boxes "Statements Labeled as Truthful or Deceptive By Law Enforcement", "Cues Extracted & Selected", and "Text Processing Software Generated Quantified Cues"]
[Figure: a sentence annotated at four levels]
Sentence: ...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53.
Ontology: D007962, D016923, D001773
Word: 185 8 51112 9 23017 27 5874 2791 8952 1623 5632 17 8252 8 2523
POS: NN IN NN IN VBZ IN JJ JJ NN NN NN CC NN IN NN
Shallow Parse: NP PP NP NP PP NP NP PP NP
[Figure: text mining process with Task 1, Task 2, and Task 3; labels: "Domain expertise", "Tools and techniques", "Feedback"]
The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
The output of Task 1 is a collection of documents in some digitized format for computer processing.
The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies.
The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations.
[Table: term-document matrix. Columns (terms): investment risk, project management, software engineering, development, SAP, ... Rows: Document 1 through Document 6, ... Most cells are empty.]
Nonzero term frequencies by row: Document 1: 1, 1; Document 2: 1; Document 3: 3, 1; Document 4: 1; Document 5: 2, 1; Document 6: 1, 1
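The term-document matrix produced by Task 2 can be sketched in a few lines. This is an illustration under stated assumptions, not the method used in the slides: the function name `build_term_document_matrix` and the three sample documents are hypothetical, tokenization is a plain lowercase whitespace split, and two-word phrases are counted so that terms such as "investment risk" can match.

```python
from collections import Counter


def build_term_document_matrix(documents, terms):
    """Return one row of term frequencies per document, ordered like `terms`."""
    matrix = []
    for text in documents:
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        tokens = text.lower().split()
        single_words = Counter(tokens)
        # Adjacent word pairs, so two-word terms can be counted as well.
        word_pairs = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
        counts = single_words + word_pairs
        # A Counter returns 0 for terms that never occur in the document.
        matrix.append([counts[term] for term in terms])
    return matrix


terms = ["investment risk", "project management",
         "software engineering", "development", "sap"]

# Hypothetical sample documents, invented for this sketch.
documents = [
    "investment risk rises when project management is weak",
    "software engineering and software engineering tools",
    "development of sap modules needs project management",
]

tdm = build_term_document_matrix(documents, terms)
for number, row in enumerate(tdm, start=1):
    print(f"Document {number}: {row}")
```

Real projects usually add stop-word removal, stemming, and weighting of the raw frequencies before modeling; those steps are left out here to keep the matrix construction visible.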
[Figure: one panel per cluster plotted by year; x-axis: YEAR, 1994 to 2005; y-axis ticks 0 to 35; panel titles visible: CLUSTER 1, CLUSTER 2, CLUSTER 4, CLUSTER 5, CLUSTER 7, CLUSTER 8]
Text Mining Application
(Research Trend Identification in Literature)
[Figure: number of articles (y-axis: No of Articles, 0 to 100) by journal (x-axis: JOURNAL; ISR, JMIS, MISQ), one panel per cluster, CLUSTER 1 through CLUSTER 9]
Web Mining
Web Analytics
Voice of Customer
Customer Experience Management
Questions, comments