Lecture Notes For Chapter 1 Introduction To Data Mining: by Tan, Steinbach, Kumar
Tan,Steinbach, Kumar
4/18/2004
[Figure: chart of "Number of analysts"; vertical axis 0 to 3,000,000, horizontal axis 1995 to 1999]
From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"
Definitions
Look up phone number in phone directory
Query a Web search engine for information about Amazon
Prediction Methods
Use some variables to predict unknown or
future values of other variables.
Description Methods
Find human-interpretable patterns that
describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Classification: Definition
Classification Example
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Refund  Marital Status  Taxable Income  Cheat
Yes     Single          125K            No
No      Married         100K            No
No      Single          70K             No
Yes     Married         120K            No
No      Divorced        95K             Yes
No      Married         60K             No
Yes     Divorced        220K            No
No      Single          85K             Yes
No      Married         75K             No
No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K
Yes     Married         50K
No      Married         150K
Yes     Divorced        90K
No      Single          40K
No      Married         80K

Training Set --> Learn Classifier --> Model
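The training-set-to-model flow can be sketched in a few lines of Python. This is only an illustration: the records are transcribed from the Classification Example (Refund, Marital Status, Taxable Income, Cheat), while the choice of a small Gini-based decision tree and every function name below are assumptions, not a method these notes prescribe.

```python
from collections import Counter

# Training records: (Refund, Marital Status, Taxable Income in K, Cheat).
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
# Test records carry no Cheat label; the learned model supplies one.
TEST = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    counts = Counter(r[-1] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def candidate_tests(rows):
    """Yield (column, value) tests: equality for strings, thresholds for numbers."""
    for col in range(len(rows[0]) - 1):
        values = sorted({r[col] for r in rows})
        if isinstance(values[0], str):
            for v in values:
                yield col, v
        else:
            for lo, hi in zip(values, values[1:]):
                yield col, (lo + hi) / 2

def passes(row, col, value):
    """True if row satisfies the test: equal (categorical) or <= threshold (numeric)."""
    if isinstance(value, str):
        return row[col] == value
    return row[col] <= value

def build(rows):
    """Grow a tree recursively; a leaf is a class label, a node is a 4-tuple."""
    labels = Counter(r[-1] for r in rows)
    if len(labels) == 1:
        return next(iter(labels))
    best = None
    for col, value in candidate_tests(rows):
        left = [r for r in rows if passes(r, col, value)]
        right = [r for r in rows if not passes(r, col, value)]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, col, value, left, right)
    if best is None:  # no test separates the rows: fall back to the majority class
        return labels.most_common(1)[0][0]
    _, col, value, left, right = best
    return (col, value, build(left), build(right))

def predict(tree, row):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        col, value, yes_branch, no_branch = tree
        tree = yes_branch if passes(row, col, value) else no_branch
    return tree

tree = build(TRAIN)
predictions = [predict(tree, row) for row in TEST]
for row, label in zip(TEST, predictions):
    print(row, "->", label)
```

Because the tree is grown until every leaf is pure, it reproduces the training labels exactly; that says nothing about how well it generalizes to the test set.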
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
Approach:
Use
We
Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes.
When a customer buys, what they buy, how often they pay on time, etc.
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to
be lost to a competitor.
Approach:
Use
Label
Classification: Application 4
Approach:
Segment the image.
Measure
Model
Success
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Courtesy: http://aps.umn.edu
Class: Stages of Formation (Early, Intermediate, Late)
Attributes: Image features, characteristics of light waves received, etc.
Data Size:
Clustering Definition
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
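A minimal sketch of Euclidean-distance clustering in 3-D, to make the intracluster/intercluster contrast concrete. The notes do not name an algorithm here; plain k-means is used as an assumed stand-in, and the points, the starting centroids, and the function name `kmeans` are all hypothetical.

```python
import math
from itertools import combinations

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = per-axis mean of the members; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical 3-D points forming two visibly separate groups.
points = [(1, 1, 1), (1, 2, 1), (2, 1, 1), (2, 2, 2),
          (8, 8, 8), (9, 8, 8), (8, 9, 9), (9, 9, 9)]
clusters, centroids = kmeans(points, centroids=[points[0], points[4]])

# Largest distance inside any cluster vs. smallest distance between the two clusters.
max_intra = max(math.dist(a, b) for c in clusters for a, b in combinations(c, 2))
min_inter = min(math.dist(a, b) for a in clusters[0] for b in clusters[1])
print(max_intra, min_inter)
```

On this toy data every within-cluster distance is smaller than every between-cluster distance, which is the situation the illustration describes. Real data rarely separates this cleanly.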
Clustering: Application 1
Market Segmentation:
Goal: Subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
Approach:
Collect
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
Approach: Identify frequently occurring
terms in each document. Form a similarity
measure based on the frequencies of different
terms. Use it to cluster.
Gain: Information Retrieval can utilize the
clusters to relate a new document or search
term to clustered documents.
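One possible similarity measure built from term frequencies is cosine similarity, sketched below. The notes only call for "a similarity measure based on the frequencies of different terms"; choosing cosine, the whitespace tokenizer, and the three sample documents are assumptions made for illustration.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical documents: two about finance, one about sports.
docs = [
    "stocks fell as the bank raised interest rates",
    "the bank cut interest rates and stocks rose",
    "the team won the match in the final minute",
]
tfs = [term_frequencies(d) for d in docs]

sim_finance = cosine_similarity(tfs[0], tfs[1])  # the two finance documents
sim_mixed = cosine_similarity(tfs[0], tfs[2])    # finance vs. sports
print(sim_finance, sim_mixed)
```

The two finance documents score higher than the finance/sports pair, so a clustering step driven by this measure would tend to group them together. The only term shared with the sports document is "the", which is why practical systems usually drop or down-weight such common words.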
Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
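The document-clustering counts listed above are easier to compare as fractions. The short sketch below does that arithmetic; the variable names are illustrative only.

```python
# (total articles, correctly placed) per category, copied from the counts above.
results = {
    "Financial": (555, 364),
    "Foreign": (341, 260),
    "National": (273, 36),
    "Metro": (943, 746),
    "Sports": (738, 573),
    "Entertainment": (354, 278),
}

for category, (total, correct) in results.items():
    print(f"{category:<14}{correct / total:.1%}")

total_articles = sum(total for total, _ in results.values())
total_correct = sum(correct for _, correct in results.values())
overall = total_correct / total_articles
print(f"{'Overall':<14}{overall:.1%}")
```

Overall, 2257 of 3204 articles (about 70%) are correctly placed. National stands out at 36 of 273 (about 13%).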
Cluster 1: Technology1-DOWN
Cluster 2: Technology2-DOWN
Cluster 3: Financial-DOWN (Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN)
Cluster 4: Oil-UP (Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP)
TID
Items
1
2
3
4
5
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
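How strongly a rule such as {Milk} --> {Coke} holds is usually measured with support and confidence. The sketch below computes both. The five transactions are hypothetical (the item lists did not survive in these notes), and the helper names `support` and `confidence` are chosen for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)

# Hypothetical market-basket data, one set of items per transaction.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    s = support(transactions, lhs | rhs)
    c = confidence(transactions, lhs, rhs)
    print(sorted(lhs), "-->", sorted(rhs), f"support={s:.2f} confidence={c:.2f}")
```

On this made-up data, {Milk} --> {Coke} has support 0.60 and confidence 0.75, and {Diaper, Milk} --> {Beer} has support 0.40 and confidence 0.67.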
Inventory Management:
Goal: A consumer appliance repair company wants to
anticipate the nature of repairs on its consumer
products and keep the service vehicles equipped with
the right parts to reduce the number of visits to consumer
households.
Approach: Process the data on tools and parts
required in previous repairs at different consumer
locations and discover the co-occurrence patterns.
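Discovering co-occurrence patterns can start with something as simple as counting which parts were used together. A rough sketch follows; the repair records and part names are invented for illustration and are not from the notes.

```python
from collections import Counter
from itertools import combinations

# Hypothetical repair records: the parts used on each past visit.
repairs = [
    {"belt", "pump", "hose"},
    {"belt", "pump"},
    {"thermostat", "fuse"},
    {"belt", "pump", "fuse"},
    {"hose", "pump"},
]

# Count every unordered pair of parts that appears in the same visit.
pair_counts = Counter()
for parts in repairs:
    pair_counts.update(combinations(sorted(parts), 2))

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Pairs that co-occur often (here, belt and pump in three of five visits) are candidates for stocking together on a service vehicle.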
Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
[Figure: pattern (A B) (C) (D E) with timing constraints <= xg, > ng, <= ws, <= ms]
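To make the idea of timing constraints concrete, here is a small sketch that checks whether a pattern such as (A B) (C) (D E) occurs in one object's timeline. It reads xg as a maximum gap and ng as a minimum gap between consecutive pattern elements, which is an assumption since the notes do not define the symbols. The ws and ms constraints are not modeled, and the function name `occurs` is hypothetical.

```python
def occurs(timeline, pattern, min_gap=0, max_gap=float("inf"), start=0, prev_time=None):
    """Return True if pattern (a list of event sets) occurs in order in timeline.

    timeline is a list of (time, set_of_events) sorted by time. The gap between
    consecutive matched elements must satisfy min_gap < gap <= max_gap.
    """
    if not pattern:
        return True
    for i in range(start, len(timeline)):
        time, events = timeline[i]
        if prev_time is not None:
            gap = time - prev_time
            if gap <= min_gap:
                continue
            if gap > max_gap:
                break  # timeline is sorted, so later gaps only grow
        # Try matching the first pattern element here, then the rest afterwards.
        if pattern[0] <= events and occurs(
            timeline, pattern[1:], min_gap, max_gap, i + 1, time
        ):
            return True
    return False

# Hypothetical timeline for one object.
timeline = [(1, {"A", "B"}), (2, {"C"}), (5, {"D", "E"})]
pattern = [{"A", "B"}, {"C"}, {"D", "E"}]

print(occurs(timeline, pattern, max_gap=3))  # gaps are 1 and 3
print(occurs(timeline, pattern, max_gap=2))  # the gap of 3 violates the limit
```

The recursion backtracks over earlier matches, so a pattern is not missed just because the first candidate for an element happens to violate a gap.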
Regression
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
Typical network traffic at University level may reach over 100 million connections per day
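One simple way to flag a "significant deviation from normal behavior" is to compare new observations with the mean and standard deviation of a baseline. This z-score rule is an assumed example, not a technique named in these notes, and the connection counts (in millions per day) are made up.

```python
import statistics

def flag_deviations(baseline, observations, threshold=3.0):
    """Return observations lying more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return [x for x in observations if abs(x - mean) / spread > threshold]

# Hypothetical daily connection counts, in millions.
baseline = [100, 102, 98, 101, 99, 103, 97]
new_days = [101, 99, 140, 96]

flagged = flag_deviations(baseline, new_days)
print(flagged)
```

Only the 140-million day is flagged. Estimating "normal" from a separate baseline matters: if the outlier were included in the statistics, it would inflate the standard deviation and could mask itself.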
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data