
Knowledge Discovery and Data Mining (KDD)

Knowledge Discovery & Data Mining

KDD: the process of extracting previously unknown, valid, and actionable (understandable) information from large databases.
Data mining is a step in the KDD process: applying data analysis and discovery algorithms.
It draws on machine learning, pattern recognition, statistics, databases, and data visualization.
Traditional techniques may be inadequate for large data.

Why Mine Data?


Huge amounts of data being collected and
warehoused
Walmart: about 20 million transactions per day
health care transactions: multi-gigabyte databases
Mobil Oil: geological data of over 100 terabytes

Affordable computing
Competitive pressure
gain an edge by providing improved, customized services
information as a product in its own right

Knowledge Discovery Process

Data mining is the core of the knowledge discovery process.

[Figure: KDD process flow: Databases → Selection → Data Cleaning → Data Integration → Preprocessed Data → Data Transformation → Task-relevant Data → Data Mining → Interpretation → Knowledge]

Knowledge Discovery Process


Goal
  understanding the application domain and the goals of the KDD effort
Data selection, acquisition, integration
Data cleaning
  noise, missing data, outliers, etc.
Exploratory data analysis
  dimensionality reduction, transformations
  selection of an appropriate model for analysis, hypotheses to test
Data mining
  selecting a method that matches the set goals (classification, regression, clustering, etc.)
  selecting an algorithm
Testing and verification
Interpretation
Consolidation and use
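The middle steps of this process map loosely onto a standard modeling pipeline. Below is a minimal sketch, assuming scikit-learn is available; the toy array stands in for an already selected, task-relevant dataset, and the particular steps (mean imputation, scaling, logistic regression) are illustrative choices, not prescribed by these slides.

```python
# A sketch mapping some KDD steps onto a scikit-learn pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy, task-relevant data with one missing value to clean.
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 240.0], [4.0, 260.0]])
y = np.array([0, 0, 1, 1])

pipeline = Pipeline([
    ("cleaning", SimpleImputer(strategy="mean")),   # data cleaning: fill missing values
    ("transform", StandardScaler()),                # data transformation
    ("mining", LogisticRegression()),               # data mining: the chosen model
])
pipeline.fit(X, y)
print(pipeline.predict([[2.5, 230.0]]))
```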

[Figure: Effort for each data-mining process step: Business Objective Determination, Data Preparation, Data Mining, Analysis of Results and Knowledge Assimilation]

Issues and challenges


large data
  number of variables (features), number of cases (examples)
  multi-gigabyte, terabyte databases
  efficient algorithms, parallel processing
high dimensionality
  large number of features: exponential increase in search space
  potential for spurious patterns
  dimensionality reduction
overfitting
  the model fits noise in the training data, rather than just the general patterns
changing data, missing and noisy data
use of domain knowledge
  utilizing knowledge of complex data relationships, known facts
understandability of patterns

Data Mining
Prediction Methods
using some variables to predict unknown or future values of
other variables

Descriptive Methods
finding human-interpretable patterns describing the data

Data Mining Tasks

Classification
Clustering
Association Rule Discovery
Sequential Pattern Discovery
Regression
Deviation Detection

Classification
Data is defined in terms of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other (predictor) attributes, such that previously unseen records can be assigned a class as accurately as possible.
  Training data: used to build the model
  Test data: used to validate the model (determine its accuracy)
  The given data is usually divided into training and test sets.
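A minimal sketch of the train/test workflow just described, assuming scikit-learn is installed; the feature values and {buy, not buy} labels below are made-up toy data, not taken from the slides.

```python
# Train/test classification sketch with a decision tree learner.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 40000], [47, 92000], [33, 61000], [52, 120000], [29, 35000], [41, 78000]]
y = ["not buy", "buy", "not buy", "buy", "not buy", "buy"]  # class attribute

# Divide the given data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = DecisionTreeClassifier()        # build the model from training data
model.fit(X_train, y_train)

y_pred = model.predict(X_test)          # assign classes to unseen records
print("accuracy:", accuracy_score(y_test, y_pred))
```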

Classification Example: Direct Marketing

Goal: Reduce the cost of a mailing campaign by targeting the set of consumers most likely to buy a new product.
Data
  from a similar product introduced earlier
  we know which customers decided to buy and which did not: the {buy, not buy} class attribute
  collect various demographic, lifestyle, and company-related information about all such customers as possible predictor variables
Learn a classifier model.

Classification: Fraud Detection

Goal: Predict fraudulent cases in credit card transactions.
Data
  use credit card transactions and information on the account holder as input variables
  label past transactions as fraud or fair
Learn a model for the class of transactions.
Use the model to detect fraud by observing credit card transactions on a given account.

Clustering
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  data points in one cluster are more similar to one another
  data points in separate clusters are less similar to one another
Similarity measures
  Euclidean distance if attributes are continuous
  problem-specific measures otherwise
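For continuous attributes, Euclidean distance is the usual choice. A minimal sketch using only the Python standard library:

```python
# Euclidean distance as a (dis)similarity measure for continuous attributes.
import math

def euclidean(p, q):
    """Distance between two points given as equal-length attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean([1.0, 2.0], [4.0, 6.0]))  # 5.0
```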

Clustering: Market Segmentation


Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
  collect different attributes of customers based on geographic and lifestyle-related information
  identify clusters of similar customers
  measure clustering quality by comparing the buying patterns of customers in the same cluster vs. those in different clusters

Association Rule Discovery


Given a set of records, each of which contains some number of items from a given collection,
produce dependency rules that predict the occurrence of an item based on the occurrences of other items.

Association Rules: Application
Marketing and sales promotion. Consider the discovered rule
  {Bagels, } --> {Potato Chips}
  Potato Chips as consequent: can be used to determine what may be done to boost its sales
  Bagels as antecedent: can be used to see which products may be affected if bagels are discontinued
  Bagels as antecedent and Potato Chips as consequent: can be used to see which products should be sold with Bagels to promote the sale of Potato Chips

Association Rules: Application


Supermarket shelf management
Goal: identify items that are bought together (by sufficiently many customers)
Approach: process point-of-sale data (collected with barcode scanners) to find dependencies among items
Example
  If a customer buys Diapers and Milk, then he is very likely to buy Beer.
  So stack six-packs next to the diapers?

Visualization
a complement to other DM techniques like segmentation, etc.

Data Mining in CRM:


Customer Life Cycle
The stages in the relationship between a customer and a business.

Key stages in the customer life cycle
  Prospects: people who are not yet customers but are in the target market
  Responders: prospects who show an interest in a product or service
  Active customers: people who are currently using the product or service
  Former customers: may be "bad" customers who did not pay their bills or who incurred high costs
It is important to know life-cycle events (e.g. retirement).

Data Mining in CRM:


Customer Life Cycle
What marketers want: increasing customer revenue and customer profitability
  up-sell
  cross-sell
  keeping customers for a longer period of time
Solution: applying data mining

Data Mining in CRM


DM helps to
  determine the behavior surrounding a particular life-cycle event
  find other people in similar life stages and determine which customers are following similar behavior patterns

Data Mining in CRM (cont.)

[Figure: Data Warehouse (customer profile) → Data Mining → Customer Life Cycle Info. → Campaign Management]

Data Mining Techniques

Descriptive
  Clustering
  Association
  Sequential Analysis
Predictive
  Classification
    Decision Tree
    Rule Induction
    Neural Networks
    Nearest Neighbor Classification
  Regression

Predictive Data Mining

[Figure: labeled training examples: cartoon faces (Tridas, Vickie, Mike, Wally, Waldo, Barney) classified as Honest or Crooked]

Prediction

[Figure: new, unlabeled faces (Tridas, Vickie, Mike) to be classified]
Learned concept: Honest = has round eyes and a smile

Decision Trees
Data

  height  hair   eyes   class
  ------  -----  -----  -----
  short   blond  blue   A
  tall    blond  brown  B
  tall    red    blue   A
  short   dark   blue   B
  tall    dark   blue   B
  tall    blond  blue   A
  tall    dark   brown  B
  short   blond  brown  B
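The next slides build this tree by hand. As a cross-check, the same table can be fed to a library tree learner; this sketch assumes pandas and scikit-learn are installed, and the one-hot encoding of the categorical attributes is an implementation choice, not part of the slides.

```python
# Learning a decision tree from the height/hair/eyes table above.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "height": ["short", "tall", "tall", "short", "tall", "tall", "tall", "short"],
    "hair":   ["blond", "blond", "red", "dark", "dark", "blond", "dark", "blond"],
    "eyes":   ["blue", "brown", "blue", "blue", "blue", "blue", "brown", "brown"],
    "class":  ["A", "B", "A", "B", "B", "A", "B", "B"],
})

X = pd.get_dummies(data[["height", "hair", "eyes"]])  # one-hot encode categorical attributes
y = data["class"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # inspect the learned splits
```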

Decision Trees (cont.)

Split on hair:
  dark:  {short, blue = B}, {tall, blue = B}, {tall, brown = B}
  red:   {tall, blue = A}
  blond: {short, blue = A}, {tall, brown = B}, {tall, blue = A}, {short, brown = B}

The split completely classifies dark-haired and red-haired people.
It does not completely classify blond-haired people; more work is required.

Decision Trees (cont.)

Split on hair:
  dark:  {short, blue = B}, {tall, blue = B}, {tall, brown = B}
  red:   {tall, blue = A}
  blond: split further on eyes
    blue:  {short = A}, {tall = A}
    brown: {tall = B}, {short = B}

The decision tree is now complete because
  1. all 8 cases appear at the leaf nodes, and
  2. at each leaf node, all cases are in the same class (A or B).

Decision Trees: Learned Predictive Rules

[Figure: final tree with root split on hair (dark / red / blond); the blond branch splits on eyes (blue / brown)]
Equivalent rules: hair = dark → B; hair = red → A; hair = blond and eyes = blue → A; hair = blond and eyes = brown → B

Decision Trees: Another Example

[Figure: membership-response decision tree: root "Total list: 50% member", with splits on number of children (0-1, 2-3, 4+), income ($20-50k, $50-75k, $75k+), and age (20-40, 40-60); leaf membership rates range from 15% to 85%]

Rule Induction
Try to find rules of the form
  IF <left-hand-side> THEN <right-hand-side>
This is the reverse of a rule-based agent, where the rules are given and the agent must act. Here the actions are given and we have to discover the rules!

Prevalence = probability that LHS and RHS occur together (sometimes called support factor, leverage or lift)
Predictability = probability of RHS given LHS (sometimes called confidence or strength)
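A minimal sketch of these two measures computed over a list of market baskets, using only the standard library; the basket contents are illustrative.

```python
# Prevalence (support) and predictability (confidence) for a rule LHS -> RHS.
def prevalence(baskets, lhs, rhs):
    """Fraction of baskets containing all items of LHS and RHS together."""
    both = sum(1 for b in baskets if lhs <= b and rhs <= b)
    return both / len(baskets)

def predictability(baskets, lhs, rhs):
    """Of the baskets containing LHS, the fraction that also contain RHS."""
    with_lhs = [b for b in baskets if lhs <= b]
    return sum(1 for b in with_lhs if rhs <= b) / len(with_lhs)

baskets = [{"bagels", "potato chips"}, {"bagels"},
           {"milk", "bagels", "potato chips"}, {"milk"}]
print(prevalence(baskets, {"bagels"}, {"potato chips"}))      # 0.5
print(predictability(baskets, {"bagels"}, {"potato chips"}))  # 0.666...
```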

Association Rules from Market Basket Analysis

<Dairy-Milk-Refrigerated> ⇒ <Soft Drinks Carbonated>
  prevalence = 1.36%, predictability = 41.02%
<Dry Dinners - Pasta> ⇒ <Soup-Canned>
  prevalence = 0.94%, predictability = 28.14%
<Dry Dinners - Pasta> ⇒ <Cereal - Ready to Eat>
  prevalence = 4.99%, predictability = 22.89%
<Cheese Slices> ⇒ <Cereal - Ready to Eat>
  prevalence = 1.16%, predictability = 38.01%

Use of Rule Associations

Coupons, discounts
  Don't give discounts on 2 items that are frequently bought together; use the discount on one to pull the other.
Product placement
  Offer correlated products to the customer at the same time; this increases sales.
Timing of cross-marketing
  Send a camcorder offer to VCR purchasers 2-3 months after the VCR purchase.
Discovery of patterns
  People who bought X, Y and Z (but not any pair) bought W over half the time.

Finding Rule Associations

Algorithm (example: grocery shopping)
  For each item, count its number of occurrences (say, out of 100,000 baskets):
    apples 1891, caviar 3, ice cream 1088, ...
  Drop the items that are below a minimum support level:
    apples 1891, ice cream 1088, pet food 2451, ...
  Make a table of each remaining item against each other item:

               apples   ice cream   pet food
    apples       1891         685         24
    ice cream    ----        1088        322
    pet food     ----        ----       2451

  Discard cells below the support threshold. Now make a cube for triples, etc., adding one dimension for each product on the LHS.
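A sketch of the counting steps above in Python, using only the standard library; the baskets and the support threshold are illustrative.

```python
# Count items, drop infrequent ones, then count pairs of the survivors.
from collections import Counter
from itertools import combinations

baskets = [
    {"apples", "ice cream"},
    {"apples", "pet food"},
    {"ice cream", "pet food"},
    {"apples", "ice cream", "pet food"},
    {"caviar"},
]
min_support = 2  # minimum number of occurrences to keep an item or pair

# 1. Count single items and drop those below the support level.
item_counts = Counter(item for b in baskets for item in b)
frequent_items = {i for i, c in item_counts.items() if c >= min_support}

# 2. Count each pair of surviving items (the "table of each item against each other").
pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b & frequent_items), 2):
        pair_counts[pair] += 1

# 3. Discard pairs below the threshold; triples etc. would follow the same pattern.
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)
```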

Clustering
The art of finding groups in data
Objective: gather items from a database into sets
according to (unknown) common characteristics
Much more difficult than classification since the
classes are not known in advance (no training)
Technique: unsupervised learning


The K-Means Clustering Method

[Figure: K-means iterations on a 2-D point set with K = 2]
  1. Arbitrarily choose K objects as the initial cluster centers.
  2. Assign each object to the most similar (nearest) center.
  3. Update the cluster means.
  4. Reassign each object to the nearest updated center.
  5. Repeat steps 3-4 until the assignments no longer change.
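A compact sketch of these steps, using only the standard library; the 2-D points and K = 2 are illustrative.

```python
# Minimal K-means: choose centers, assign points, update means, repeat.
import math, random

def kmeans(points, k, iters=100):
    centers = random.sample(points, k)                 # arbitrarily choose K initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                               # assign each point to the nearest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]                                              # update the cluster means
        if new_centers == centers:                     # stop when the means stabilize
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(points, k=2)
print(centers)
```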

Opinion Analysis
Word-of-mouth on the Web
The Web has dramatically changed the way that
consumers express their opinions.
One can post reviews of products at merchant
sites, Web forums, discussion groups, blogs
Techniques are being developed to exploit these
sources.
Benefits of Review Analysis
Potential Customer: No need to read many reviews
Product manufacturer: market intelligence, product
benchmarking

Feature-Based Analysis & Summarization

Extracting product features (called opinion features) that have been commented on by customers.
Identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative.
Summarizing and comparing results.

Sentiment Analysis and Opinion Mining

Introduction
Two main types of textual information: facts and opinions.
  Note: factual statements can imply opinions too.
Most current text information processing methods (e.g., web search, text mining) work with factual information.
Sentiment analysis or opinion mining
  the computational study of opinions, sentiments and emotions expressed in text
Why opinion mining now? Mainly because of the Web: huge volumes of opinionated text.

Introduction: User-Generated Media

Importance of opinions
  Opinions are important because whenever we need to make a decision, we want to hear others' opinions.
  In the past,
    individuals: opinions from friends and family
    businesses: surveys, focus groups, consultants
Word-of-mouth on the Web
  User-generated media: one can express opinions on anything in reviews, forums, discussion groups, blogs, ...
  Opinions at a global scale: no longer limited to
    individuals: one's circle of friends
    businesses: small-scale surveys, tiny focus groups, etc.

A Fascinating Problem!

Intellectually challenging, with major applications
  a popular research topic in recent years in NLP and Web data mining
  20-60 companies in the USA alone
It touches every aspect of NLP and yet is restricted and confined
  little research in NLP/linguistics in the past
Potentially a major technology from NLP
  but not yet, and not easy!
  data sourcing and data integration are hard too!

An Example Review
I bought an iPhone a few days ago. It was such a
nice phone. The touch screen was really cool. The
voice quality was clear too. Although the battery life
was not long, that is ok for me. However, my mother
was mad with me as I did not tell her before I bought
the phone. She also thought the phone was too
expensive, and wanted me to return it to the shop.
What do we see?
Opinions, targets of opinions, and opinion holders


Target Object (Liu, Web Data Mining book, 2006)

Definition (object): An object o is a product, person, event, organization, or topic. o is represented as a hierarchy of components, sub-components, and so on. Each node represents a component and is associated with a set of attributes of the component.
An opinion can be expressed on any node or attribute of the node.
To simplify the discussion, we use the term "features" to represent both components and attributes.

What is an Opinion? (Liu, chapter in the NLP handbook)

An opinion is a quintuple
  (o_j, f_jk, so_ijkl, h_i, t_l),
where
  o_j is a target object,
  f_jk is a feature of the object o_j,
  so_ijkl is the sentiment value of the opinion of the opinion holder h_i on feature f_jk of object o_j at time t_l; so_ijkl is positive, negative, or neutral, or a more granular rating,
  h_i is an opinion holder,
  t_l is the time when the opinion is expressed.
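One way to hold such quintuples in code is a small record type. This is only a sketch mirroring the notation above, not an API from Liu's work, and the example values are made up.

```python
# The opinion quintuple (o_j, f_jk, so_ijkl, h_i, t_l) as a data structure.
from dataclasses import dataclass
from datetime import date

@dataclass
class Opinion:
    obj: str        # o_j  : target object
    feature: str    # f_jk : feature of the object
    sentiment: str  # so_ijkl : "+", "-", "neutral", or a more granular rating
    holder: str     # h_i  : opinion holder
    time: date      # t_l  : when the opinion was expressed

op = Opinion(obj="iPhone", feature="touch screen", sentiment="+",
             holder="reviewer_01", time=date(2010, 5, 1))
print(op)
```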

Objective: Structure the Unstructured

Objective: given an opinionated document,
  discover all quintuples (o_j, f_jk, so_ijkl, h_i, t_l), i.e., mine the five corresponding pieces of information in each quintuple,
  or solve some simpler problems.
With the quintuples, unstructured text becomes structured data:
  traditional data and visualization tools can be used to slice, dice and visualize the results in all kinds of ways
  enabling both qualitative and quantitative analysis.

Sentiment Classification: Document Level (Pang and Lee et al. 2002; Turney 2002)

Classify a document (e.g., a review) based on the overall sentiment expressed by the opinion holder.
  Classes: positive or negative (and neutral)
In terms of the model (o_j, f_jk, so_ijkl, h_i, t_l), it assumes that
  each document focuses on a single object and contains opinions from a single opinion holder
  the opinion is on the object as a whole, o_j (or o_j = f_jk).
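As a toy illustration of document-level classification, the sketch below scores a review against tiny hand-made positive/negative word lists. The published approaches cited above work differently (Pang & Lee 2002 train supervised classifiers; Turney 2002 uses phrase-level PMI), so this lexicon counter is only a stand-in.

```python
# Naive document-level sentiment: count positive vs. negative lexicon words.
POSITIVE = {"nice", "cool", "clear", "great", "amazing"}   # illustrative lexicons
NEGATIVE = {"bad", "expensive", "mad", "poor", "scratched"}

def classify_document(text):
    words = text.lower().split()
    score = sum(w.strip(".,!?") in POSITIVE for w in words) - \
            sum(w.strip(".,!?") in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

review = "It was such a nice phone. The touch screen was really cool."
print(classify_document(review))  # positive
```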

Subjectivity Analysis (Wiebe et al. 2004)

Sentence-level sentiment analysis has two tasks:
  Subjectivity classification: subjective or objective.
    Objective: e.g., "I bought an iPhone a few days ago."
    Subjective: e.g., "It is such a nice phone."
  Sentiment classification: for subjective sentences or clauses, classify positive or negative.
    Positive: "It is such a nice phone."
However (Liu, chapter in the NLP handbook),
  subjective sentences ≠ +ve or -ve opinions
    e.g., "I think he came yesterday."
  objective sentences ≠ no opinion
    e.g., "My phone broke in the second day." implies a -ve opinion

Feature-Based Sentiment Analysis

Sentiment classification at the document and sentence (or clause) levels is not sufficient:
  it does not tell what people like and/or dislike
  a positive opinion on an object does not mean that the opinion holder likes everything about it
  a negative opinion on an object does not mean that the holder dislikes everything about it
Objective: discover all quintuples (o_j, f_jk, so_ijkl, h_i, t_l)
With all quintuples, all kinds of analyses become possible.

Feature-Based Opinion Summary (Hu & Liu, KDD-2004)

Review: "I bought an iPhone a few days ago. It was such a nice phone. The touch screen was really cool. The voice quality was clear too. Although the battery life was not long, that is ok for me. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, and wanted me to return it to the shop."

Feature-based summary:
  Feature 1: touch screen
    Positive: 212
      "The touch screen was really cool."
      "The touch screen was so easy to use and can do amazing things."
    Negative: 6
      "The screen is easily scratched."
      "I have a lot of difficulty in removing finger marks from the touch screen."
  Feature 2: battery life
    ...
  (Opinion holders are omitted.)
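Once feature-level opinions have been extracted, producing such a summary is mostly aggregation. A sketch, assuming the (feature, polarity, sentence) triples have already been mined; the data below is illustrative.

```python
# Group extracted opinions by feature and polarity to build a summary.
from collections import defaultdict

opinions = [
    ("touch screen", "+", "The touch screen was really cool."),
    ("touch screen", "-", "The screen is easily scratched."),
    ("voice quality", "+", "The voice quality was clear too."),
    ("battery life", "-", "Although the battery life was not long, that is ok for me."),
]

summary = defaultdict(lambda: {"+": [], "-": []})
for feature, polarity, sentence in opinions:
    summary[feature][polarity].append(sentence)

for feature, polar in summary.items():
    print(f"Feature: {feature}")
    for sign, label in (("+", "Positive"), ("-", "Negative")):
        print(f"  {label}: {len(polar[sign])}")
        for sentence in polar[sign]:
            print(f"    {sentence}")
```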

Visual Comparison (Liu et al., WWW-2005)

[Figure: summary of reviews of Cell Phone 1, and comparison of reviews of Cell Phone 1 vs. Cell Phone 2, by feature: Voice, Screen, Battery, Size, Weight]
