
Knowledge Discovery and Data Mining (KDD)

Knowledge Discovery & Data Mining

KDD: the process of extracting previously unknown, valid, and actionable (understandable) information from large databases.
Data mining is a step in the KDD process: applying data analysis and discovery algorithms.
It draws on machine learning, pattern recognition, statistics, databases, and data visualization.
Traditional techniques may be inadequate for large data.

Why Mine Data?


Huge amounts of data being collected and
warehoused
Walmart: about 20 million transactions per day
health care transactions: multi-gigabyte databases
Mobil Oil: geological data of over 100 terabytes

Affordable computing
Competitive pressure
gain an edge by providing improved, customized services
information as a product in its own right

Knowledge Discovery Process

Data mining is the core of the knowledge discovery process.

[Figure: KDD process flow: Databases → Selection → Data Cleaning → Data Integration → Preprocessed Data → Data Transformation → Task-relevant Data → Data Mining → Interpretation → Knowledge]

Knowledge Discovery Process


Goal
  understanding the application domain and the goals of the KDD effort
Data selection, acquisition, integration
Data cleaning
  noise, missing data, outliers, etc.
Exploratory data analysis
  dimensionality reduction, transformations
  selection of an appropriate model for analysis, hypotheses to test
Data mining
  selecting a method that matches the set goals (classification, regression, clustering, etc.)
  selecting an algorithm
Testing and verification
Interpretation
Consolidation and use
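The middle steps of this process map loosely onto a standard modeling pipeline. Below is a minimal sketch, assuming scikit-learn is available; the toy array stands in for an already selected, task-relevant dataset, and the particular steps (mean imputation, scaling, logistic regression) are illustrative choices, not prescribed by these slides.

```python
# A sketch mapping some KDD steps onto a scikit-learn pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy, task-relevant data with one missing value to clean.
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 240.0], [4.0, 260.0]])
y = np.array([0, 0, 1, 1])

pipeline = Pipeline([
    ("cleaning", SimpleImputer(strategy="mean")),   # data cleaning: fill missing values
    ("transform", StandardScaler()),                # data transformation
    ("mining", LogisticRegression()),               # data mining: the chosen model
])
pipeline.fit(X, y)
print(pipeline.predict([[2.5, 230.0]]))
```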

[Figure: Effort for each data-mining process step: Business Objective Determination, Data Preparation, Data Mining, Analysis of Results and Knowledge Assimilation]

Issues and challenges


large data
  number of variables (features), number of cases (examples)
  multi-gigabyte, terabyte databases
  efficient algorithms, parallel processing
high dimensionality
  large number of features: exponential increase in search space
  potential for spurious patterns
  dimensionality reduction
overfitting
  the model fits noise in the training data, rather than just the general patterns
changing data, missing and noisy data
use of domain knowledge
  utilizing knowledge of complex data relationships, known facts
understandability of patterns

Data Mining
Prediction Methods
using some variables to predict unknown or future values of
other variables

Descriptive Methods
finding human-interpretable patterns describing the data

Data Mining Tasks

Classification
Clustering
Association Rule Discovery
Sequential Pattern Discovery
Regression
Deviation Detection

Classification
Data is defined in terms of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other (predictor) attributes, such that previously unseen records can be assigned a class as accurately as possible.
  Training data: used to build the model
  Test data: used to validate the model (determine its accuracy)
  The given data is usually divided into training and test sets.
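A minimal sketch of the train/test workflow just described, assuming scikit-learn is installed; the feature values and {buy, not buy} labels below are made-up toy data, not taken from the slides.

```python
# Train/test classification sketch with a decision tree learner.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 40000], [47, 92000], [33, 61000], [52, 120000], [29, 35000], [41, 78000]]
y = ["not buy", "buy", "not buy", "buy", "not buy", "buy"]  # class attribute

# Divide the given data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = DecisionTreeClassifier()        # build the model from training data
model.fit(X_train, y_train)

y_pred = model.predict(X_test)          # assign classes to unseen records
print("accuracy:", accuracy_score(y_test, y_pred))
```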

Classification Example: Direct Marketing

Goal: Reduce the cost of a mailing campaign by targeting the set of consumers most likely to buy a new product.
Data
  from a similar product introduced earlier
  we know which customers decided to buy and which did not: the {buy, not buy} class attribute
  collect various demographic, lifestyle, and company-related information about all such customers as possible predictor variables
Learn a classifier model.

Classification: Fraud Detection

Goal: Predict fraudulent cases in credit card transactions.
Data
  use credit card transactions and information on the account holder as input variables
  label past transactions as fraud or fair
Learn a model for the class of transactions.
Use the model to detect fraud by observing credit card transactions on a given account.

Clustering
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  data points in one cluster are more similar to one another
  data points in separate clusters are less similar to one another
Similarity measures
  Euclidean distance if attributes are continuous
  problem-specific measures otherwise
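For continuous attributes, Euclidean distance is the usual choice. A minimal sketch using only the Python standard library:

```python
# Euclidean distance as a (dis)similarity measure for continuous attributes.
import math

def euclidean(p, q):
    """Distance between two points given as equal-length attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean([1.0, 2.0], [4.0, 6.0]))  # 5.0
```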

Clustering: Market Segmentation


Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
  collect different attributes of customers based on geographic and lifestyle-related information
  identify clusters of similar customers
  measure clustering quality by comparing the buying patterns of customers in the same cluster vs. those in different clusters

Association Rule Discovery


Given a set of records, each of which contains some number of items from a given collection,
produce dependency rules that predict the occurrence of an item based on the occurrences of other items.

Association Rules: Application
Marketing and sales promotion. Consider the discovered rule
  {Bagels, } --> {Potato Chips}
  Potato Chips as consequent: can be used to determine what may be done to boost its sales
  Bagels as antecedent: can be used to see which products may be affected if bagels are discontinued
  Bagels as antecedent and Potato Chips as consequent: can be used to see which products should be sold with Bagels to promote the sale of Potato Chips

Association Rules: Application


Supermarket shelf management
Goal: identify items that are bought together (by sufficiently many customers)
Approach: process point-of-sale data (collected with barcode scanners) to find dependencies among items
Example
  If a customer buys Diapers and Milk, then he is very likely to buy Beer.
  So stack six-packs next to the diapers?

Visualization
a complement to other DM techniques like segmentation, etc.

Data Mining in CRM:


Customer Life Cycle
The stages in the relationship between a customer and a business.

Key stages in the customer life cycle
  Prospects: people who are not yet customers but are in the target market
  Responders: prospects who show an interest in a product or service
  Active customers: people who are currently using the product or service
  Former customers: may be "bad" customers who did not pay their bills or who incurred high costs
It is important to know life-cycle events (e.g. retirement).

Data Mining in CRM:


Customer Life Cycle
What marketers want: increasing customer revenue and customer profitability
  up-sell
  cross-sell
  keeping customers for a longer period of time
Solution: applying data mining

Data Mining in CRM


DM helps to
  determine the behavior surrounding a particular life-cycle event
  find other people in similar life stages and determine which customers are following similar behavior patterns

Data Mining in CRM (cont.)

[Figure: Data Warehouse (customer profile) → Data Mining → Customer Life Cycle Info. → Campaign Management]

Data Mining Techniques

Descriptive
  Clustering
  Association
  Sequential Analysis
Predictive
  Classification
    Decision Tree
    Rule Induction
    Neural Networks
    Nearest Neighbor Classification
  Regression

Predictive Data Mining

[Figure: labeled training examples: cartoon faces (Tridas, Vickie, Mike, Wally, Waldo, Barney) classified as Honest or Crooked]

Prediction

[Figure: new, unlabeled faces (Tridas, Vickie, Mike) to be classified]
Learned concept: Honest = has round eyes and a smile

Decision Trees
Data

  height  hair   eyes   class
  ------  -----  -----  -----
  short   blond  blue   A
  tall    blond  brown  B
  tall    red    blue   A
  short   dark   blue   B
  tall    dark   blue   B
  tall    blond  blue   A
  tall    dark   brown  B
  short   blond  brown  B
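The next slides build this tree by hand. As a cross-check, the same table can be fed to a library tree learner; this sketch assumes pandas and scikit-learn are installed, and the one-hot encoding of the categorical attributes is an implementation choice, not part of the slides.

```python
# Learning a decision tree from the height/hair/eyes table above.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "height": ["short", "tall", "tall", "short", "tall", "tall", "tall", "short"],
    "hair":   ["blond", "blond", "red", "dark", "dark", "blond", "dark", "blond"],
    "eyes":   ["blue", "brown", "blue", "blue", "blue", "blue", "brown", "brown"],
    "class":  ["A", "B", "A", "B", "B", "A", "B", "B"],
})

X = pd.get_dummies(data[["height", "hair", "eyes"]])  # one-hot encode categorical attributes
y = data["class"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # inspect the learned splits
```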

Decision Trees (cont.)

Split on hair:
  dark:  {short, blue = B}, {tall, blue = B}, {tall, brown = B}
  red:   {tall, blue = A}
  blond: {short, blue = A}, {tall, brown = B}, {tall, blue = A}, {short, brown = B}

The split completely classifies dark-haired and red-haired people.
It does not completely classify blond-haired people; more work is required.

Decision Trees (cont.)

Split on hair:
  dark:  {short, blue = B}, {tall, blue = B}, {tall, brown = B}
  red:   {tall, blue = A}
  blond: split further on eyes
    blue:  {short = A}, {tall = A}
    brown: {tall = B}, {short = B}

The decision tree is now complete because
  1. all 8 cases appear at the leaf nodes, and
  2. at each leaf node, all cases are in the same class (A or B).

Decision Trees: Learned Predictive Rules

[Figure: final tree with root split on hair (dark / red / blond); the blond branch splits on eyes (blue / brown)]
Equivalent rules: hair = dark → B; hair = red → A; hair = blond and eyes = blue → A; hair = blond and eyes = brown → B

Decision Trees: Another Example

[Figure: membership-response decision tree: root "Total list: 50% member", with splits on number of children (0-1, 2-3, 4+), income ($20-50k, $50-75k, $75k+), and age (20-40, 40-60); leaf membership rates range from 15% to 85%]

Rule Induction
Try to find rules of the form
  IF <left-hand-side> THEN <right-hand-side>
This is the reverse of a rule-based agent, where the rules are given and the agent must act. Here the actions are given and we have to discover the rules!

Prevalence = probability that LHS and RHS occur together (sometimes called support factor, leverage or lift)
Predictability = probability of RHS given LHS (sometimes called confidence or strength)
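A minimal sketch of these two measures computed over a list of market baskets, using only the standard library; the basket contents are illustrative.

```python
# Prevalence (support) and predictability (confidence) for a rule LHS -> RHS.
def prevalence(baskets, lhs, rhs):
    """Fraction of baskets containing all items of LHS and RHS together."""
    both = sum(1 for b in baskets if lhs <= b and rhs <= b)
    return both / len(baskets)

def predictability(baskets, lhs, rhs):
    """Of the baskets containing LHS, the fraction that also contain RHS."""
    with_lhs = [b for b in baskets if lhs <= b]
    return sum(1 for b in with_lhs if rhs <= b) / len(with_lhs)

baskets = [{"bagels", "potato chips"}, {"bagels"},
           {"milk", "bagels", "potato chips"}, {"milk"}]
print(prevalence(baskets, {"bagels"}, {"potato chips"}))      # 0.5
print(predictability(baskets, {"bagels"}, {"potato chips"}))  # 0.666...
```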

Association Rules from Market Basket Analysis

<Dairy-Milk-Refrigerated> ⇒ <Soft Drinks Carbonated>
  prevalence = 1.36%, predictability = 41.02%
<Dry Dinners - Pasta> ⇒ <Soup-Canned>
  prevalence = 0.94%, predictability = 28.14%
<Dry Dinners - Pasta> ⇒ <Cereal - Ready to Eat>
  prevalence = 4.99%, predictability = 22.89%
<Cheese Slices> ⇒ <Cereal - Ready to Eat>
  prevalence = 1.16%, predictability = 38.01%

Use of Rule Associations

Coupons, discounts
  Don't give discounts on 2 items that are frequently bought together; use the discount on one to pull the other.
Product placement
  Offer correlated products to the customer at the same time; this increases sales.
Timing of cross-marketing
  Send a camcorder offer to VCR purchasers 2-3 months after the VCR purchase.
Discovery of patterns
  People who bought X, Y and Z (but not any pair) bought W over half the time.

Finding Rule Associations

Algorithm (example: grocery shopping)
  For each item, count its number of occurrences (say, out of 100,000 baskets):
    apples 1891, caviar 3, ice cream 1088, ...
  Drop the items that are below a minimum support level:
    apples 1891, ice cream 1088, pet food 2451, ...
  Make a table of each remaining item against each other item:

               apples   ice cream   pet food
    apples       1891         685         24
    ice cream    ----        1088        322
    pet food     ----        ----       2451

  Discard cells below the support threshold. Now make a cube for triples, etc., adding one dimension for each product on the LHS.
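A sketch of the counting steps above in Python, using only the standard library; the baskets and the support threshold are illustrative.

```python
# Count items, drop infrequent ones, then count pairs of the survivors.
from collections import Counter
from itertools import combinations

baskets = [
    {"apples", "ice cream"},
    {"apples", "pet food"},
    {"ice cream", "pet food"},
    {"apples", "ice cream", "pet food"},
    {"caviar"},
]
min_support = 2  # minimum number of occurrences to keep an item or pair

# 1. Count single items and drop those below the support level.
item_counts = Counter(item for b in baskets for item in b)
frequent_items = {i for i, c in item_counts.items() if c >= min_support}

# 2. Count each pair of surviving items (the "table of each item against each other").
pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b & frequent_items), 2):
        pair_counts[pair] += 1

# 3. Discard pairs below the threshold; triples etc. would follow the same pattern.
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)
```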

Clustering
The art of finding groups in data
Objective: gather items from a database into sets
according to (unknown) common characteristics
Much more difficult than classification since the
classes are not known in advance (no training)
Technique: unsupervised learning


The K-Means Clustering Method

[Figure: K-means iterations on a 2-D point set with K = 2]
  1. Arbitrarily choose K objects as the initial cluster centers.
  2. Assign each object to the most similar (nearest) center.
  3. Update the cluster means.
  4. Reassign each object to the nearest updated center.
  5. Repeat steps 3-4 until the assignments no longer change.
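A compact sketch of these steps, using only the standard library; the 2-D points and K = 2 are illustrative.

```python
# Minimal K-means: choose centers, assign points, update means, repeat.
import math, random

def kmeans(points, k, iters=100):
    centers = random.sample(points, k)                 # arbitrarily choose K initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                               # assign each point to the nearest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]                                              # update the cluster means
        if new_centers == centers:                     # stop when the means stabilize
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(points, k=2)
print(centers)
```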

Opinion Analysis
Word-of-mouth on the Web
The Web has dramatically changed the way that
consumers express their opinions.
One can post reviews of products at merchant
sites, Web forums, discussion groups, blogs
Techniques are being developed to exploit these
sources.
Benefits of Review Analysis
Potential Customer: No need to read many reviews
Product manufacturer: market intelligence, product
benchmarking

Feature-Based Analysis & Summarization

Extracting product features (called opinion features) that have been commented on by customers.
Identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative.
Summarizing and comparing results.

Sentiment Analysis and Opinion Mining

Introduction
Two main types of textual information: facts and opinions.
  Note: factual statements can imply opinions too.
Most current text information processing methods (e.g., web search, text mining) work with factual information.
Sentiment analysis or opinion mining
  the computational study of opinions, sentiments and emotions expressed in text
Why opinion mining now? Mainly because of the Web: huge volumes of opinionated text.

Introduction: User-Generated Media

Importance of opinions
  Opinions are important because whenever we need to make a decision, we want to hear others' opinions.
  In the past,
    individuals: opinions from friends and family
    businesses: surveys, focus groups, consultants
Word-of-mouth on the Web
  User-generated media: one can express opinions on anything in reviews, forums, discussion groups, blogs, ...
  Opinions at a global scale: no longer limited to
    individuals: one's circle of friends
    businesses: small-scale surveys, tiny focus groups, etc.

A Fascinating Problem!

Intellectually challenging, with major applications
  a popular research topic in recent years in NLP and Web data mining
  20-60 companies in the USA alone
It touches every aspect of NLP and yet is restricted and confined
  little research in NLP/linguistics in the past
Potentially a major technology from NLP
  but not yet, and not easy!
  data sourcing and data integration are hard too!

An Example Review
I bought an iPhone a few days ago. It was such a
nice phone. The touch screen was really cool. The
voice quality was clear too. Although the battery life
was not long, that is ok for me. However, my mother
was mad with me as I did not tell her before I bought
the phone. She also thought the phone was too
expensive, and wanted me to return it to the shop.
What do we see?
Opinions, targets of opinions, and opinion holders


Target Object (Liu, Web Data Mining book, 2006)

Definition (object): An object o is a product, person, event, organization, or topic. o is represented as a hierarchy of components, sub-components, and so on. Each node represents a component and is associated with a set of attributes of the component.
An opinion can be expressed on any node or attribute of the node.
To simplify the discussion, we use the term "features" to represent both components and attributes.

What is an Opinion? (Liu, chapter in the NLP handbook)

An opinion is a quintuple
  (o_j, f_jk, so_ijkl, h_i, t_l),
where
  o_j is a target object,
  f_jk is a feature of the object o_j,
  so_ijkl is the sentiment value of the opinion of the opinion holder h_i on feature f_jk of object o_j at time t_l; so_ijkl is positive, negative, or neutral, or a more granular rating,
  h_i is an opinion holder,
  t_l is the time when the opinion is expressed.
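One way to hold such quintuples in code is a small record type. This is only a sketch mirroring the notation above, not an API from Liu's work, and the example values are made up.

```python
# The opinion quintuple (o_j, f_jk, so_ijkl, h_i, t_l) as a data structure.
from dataclasses import dataclass
from datetime import date

@dataclass
class Opinion:
    obj: str        # o_j  : target object
    feature: str    # f_jk : feature of the object
    sentiment: str  # so_ijkl : "+", "-", "neutral", or a more granular rating
    holder: str     # h_i  : opinion holder
    time: date      # t_l  : when the opinion was expressed

op = Opinion(obj="iPhone", feature="touch screen", sentiment="+",
             holder="reviewer_01", time=date(2010, 5, 1))
print(op)
```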

Objective: Structure the Unstructured

Objective: given an opinionated document,
  discover all quintuples (o_j, f_jk, so_ijkl, h_i, t_l), i.e., mine the five corresponding pieces of information in each quintuple,
  or solve some simpler problems.
With the quintuples, unstructured text becomes structured data:
  traditional data and visualization tools can be used to slice, dice and visualize the results in all kinds of ways
  enabling both qualitative and quantitative analysis.

Sentiment Classification: Document Level (Pang and Lee et al. 2002; Turney 2002)

Classify a document (e.g., a review) based on the overall sentiment expressed by the opinion holder.
  Classes: positive or negative (and neutral)
In terms of the model (o_j, f_jk, so_ijkl, h_i, t_l), it assumes that
  each document focuses on a single object and contains opinions from a single opinion holder
  the opinion is on the object as a whole, o_j (or o_j = f_jk).
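As a toy illustration of document-level classification, the sketch below scores a review against tiny hand-made positive/negative word lists. The published approaches cited above work differently (Pang & Lee 2002 train supervised classifiers; Turney 2002 uses phrase-level PMI), so this lexicon counter is only a stand-in.

```python
# Naive document-level sentiment: count positive vs. negative lexicon words.
POSITIVE = {"nice", "cool", "clear", "great", "amazing"}   # illustrative lexicons
NEGATIVE = {"bad", "expensive", "mad", "poor", "scratched"}

def classify_document(text):
    words = text.lower().split()
    score = sum(w.strip(".,!?") in POSITIVE for w in words) - \
            sum(w.strip(".,!?") in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

review = "It was such a nice phone. The touch screen was really cool."
print(classify_document(review))  # positive
```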

Subjectivity Analysis (Wiebe et al. 2004)

Sentence-level sentiment analysis has two tasks:
  Subjectivity classification: subjective or objective.
    Objective: e.g., "I bought an iPhone a few days ago."
    Subjective: e.g., "It is such a nice phone."
  Sentiment classification: for subjective sentences or clauses, classify positive or negative.
    Positive: "It is such a nice phone."
However (Liu, chapter in the NLP handbook),
  subjective sentences ≠ +ve or -ve opinions
    e.g., "I think he came yesterday."
  objective sentences ≠ no opinion
    e.g., "My phone broke in the second day." implies a -ve opinion

Feature-Based Sentiment Analysis

Sentiment classification at the document and sentence (or clause) levels is not sufficient:
  it does not tell what people like and/or dislike
  a positive opinion on an object does not mean that the opinion holder likes everything about it
  a negative opinion on an object does not mean that the holder dislikes everything about it
Objective: discover all quintuples (o_j, f_jk, so_ijkl, h_i, t_l)
With all quintuples, all kinds of analyses become possible.

Feature-Based Opinion Summary (Hu & Liu, KDD-2004)

Review: "I bought an iPhone a few days ago. It was such a nice phone. The touch screen was really cool. The voice quality was clear too. Although the battery life was not long, that is ok for me. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, and wanted me to return it to the shop."

Feature-based summary:
  Feature 1: touch screen
    Positive: 212
      "The touch screen was really cool."
      "The touch screen was so easy to use and can do amazing things."
    Negative: 6
      "The screen is easily scratched."
      "I have a lot of difficulty in removing finger marks from the touch screen."
  Feature 2: battery life
    ...
  (Opinion holders are omitted.)
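Once feature-level opinions have been extracted, producing such a summary is mostly aggregation. A sketch, assuming the (feature, polarity, sentence) triples have already been mined; the data below is illustrative.

```python
# Group extracted opinions by feature and polarity to build a summary.
from collections import defaultdict

opinions = [
    ("touch screen", "+", "The touch screen was really cool."),
    ("touch screen", "-", "The screen is easily scratched."),
    ("voice quality", "+", "The voice quality was clear too."),
    ("battery life", "-", "Although the battery life was not long, that is ok for me."),
]

summary = defaultdict(lambda: {"+": [], "-": []})
for feature, polarity, sentence in opinions:
    summary[feature][polarity].append(sentence)

for feature, polar in summary.items():
    print(f"Feature: {feature}")
    for sign, label in (("+", "Positive"), ("-", "Negative")):
        print(f"  {label}: {len(polar[sign])}")
        for sentence in polar[sign]:
            print(f"    {sentence}")
```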

Visual Comparison (Liu et al., WWW-2005)

[Figure: summary of reviews of Cell Phone 1, and comparison of reviews of Cell Phone 1 vs. Cell Phone 2, by feature: Voice, Screen, Battery, Size, Weight]
