Lecture Notes for Chapter 1, Introduction to Data Mining, 2nd Edition
Improving health care and reducing costs
Predicting the impact of climate change
Prediction Methods
– Use some variables to predict unknown or
future values of other variables.
Description Methods
– Find human-interpretable patterns that
describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
[Figure: a data table at the center, linked to four data mining tasks: Clustering, Predictive Modeling, Anomaly Detection, and Association Rules.]

Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
[Figure: decision tree model for the class attribute Credit Worthy. The root tests Employed: No leads to the leaf No, Yes leads to a test on Education ({High school, Undergrad} vs. Graduate). Each Education branch then tests Number of years (> 3 yr / < 3 yr on one branch, > 7 yrs / < 7 yrs on the other), with leaves Yes and No.]

Training Set
(Employed and Level of Education are categorical, # years at present address is quantitative, Credit Worthy is the class.)

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Unlabeled records (class to be predicted)

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

Training Set --> Learn Classifier --> Model
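The train-then-apply workflow in the figure can be sketched in Python. This is only a minimal sketch, not the tree shown in the slide: it assumes a one-split decision stump on the "# years at present address" attribute, uses only the four training rows that are visible (the elided rows are left out), and the names `learn_stump` and `predict` are hypothetical.

```python
# Visible training rows: (employed, education, years_at_address, credit_worthy)
TRAINING_SET = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]

# Visible unlabeled rows; the class is "?" in the slide.
UNLABELED = [
    ("Yes", "Undergrad", 7),
    ("No", "Graduate", 3),
    ("Yes", "High School", 2),
]


def learn_stump(rows):
    """Pick the threshold on years_at_address with the fewest training errors.

    The learned model is the rule: years > threshold -> "Yes", else "No".
    """
    years = sorted({row[2] for row in rows})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(years, years[1:])]

    def errors(threshold):
        return sum(
            ("Yes" if row[2] > threshold else "No") != row[3] for row in rows
        )

    return min(candidates, key=errors)


def predict(threshold, row):
    """Apply the learned rule to one record."""
    return "Yes" if row[2] > threshold else "No"


threshold = learn_stump(TRAINING_SET)
predictions = [predict(threshold, row) for row in UNLABELED]
```

On these four rows the sketch settles on a split at 3.5 years, which labels the three unlabeled records Yes, No, No. A full decision tree would also split on Employed and Level of Education.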
Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
Use credit card transactions and information
about the account holder as attributes.
– When does a customer buy, what does the customer buy,
how often are payments made on time, etc.
Label past transactions as fraudulent or fair.
This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit
card transactions on an account.
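The label, learn, and detect steps above can be sketched as follows. This is an illustration under stated assumptions, not the method from the slides: the model is a nearest-centroid classifier, the two attributes (transaction amount and hour of day) are hypothetical, and the labeled records are made-up toy values.

```python
from math import dist

# Made-up labeled history: ((amount, hour_of_day), class label).
LABELED_TRANSACTIONS = [
    ((20.0, 12), "fair"),
    ((35.0, 14), "fair"),
    ((25.0, 18), "fair"),
    ((900.0, 3), "fraud"),
    ((750.0, 2), "fraud"),
]


def learn_centroids(labeled):
    """Learn the model: one mean attribute vector per class."""
    groups = {}
    for features, label in labeled:
        groups.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in groups.items()
    }


def classify(model, features):
    """Assign the class whose centroid is closest to the new transaction."""
    return min(model, key=lambda label: dist(model[label], features))


model = learn_centroids(LABELED_TRANSACTIONS)
new_labels = [classify(model, t) for t in [(820.0, 4), (30.0, 13)]]
```

Because the attributes are not rescaled, the amount dominates the distance here; a practical model would normalize attributes and use many more of them.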
Introduction to Data Mining, 2nd Edition
09/03/2019 15
Tan, Steinbach, Karpatne, Kumar
Classification: Application 2
Data Size:
72 million stars, 20 million galaxies
Object Catalog: 9 GB
Image Database: 150 GB
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.
[Figure: cluster map with longitude from -180 to 180 on the horizontal axis and vertical-axis ticks from 60 to -90. Legend: Land Cluster 2, Ice or No NPP, Sea Cluster 2, Sea Cluster 1.]
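K-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The sketch below shows that loop in plain Python; the SST and NPP data are not reproduced here, so the six two-dimensional points and the starting centroids are made-up toy values, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iterations=100):
    """Run the assign/update loop until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(
                range(len(centroids)),
                key=lambda j: squared_distance(point, centroids[j]),
            )
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(old)  # an empty cluster keeps its centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Toy data: two visibly separated groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(POINTS, initial_centroids=[POINTS[0], POINTS[3]])
```

With these starting centroids the first three points end up in cluster 0 and the last three in cluster 1. Results in general depend on the initial centroids, which is why practical implementations restart from several initializations.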
Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
Collect different attributes of customers based on
their geographical and lifestyle-related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those
from different clusters.
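One way to make the quality check concrete is to compare how far apart the buying patterns are within a cluster versus across clusters. The sketch below assumes each customer is summarized by a numeric purchase vector and uses mean pairwise Euclidean distance; the vectors, the segment labels, and the function name are hypothetical, and other quality measures would serve equally well.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distances(vectors, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    within, between = [], []
    for i, j in combinations(range(len(vectors)), 2):
        d = dist(vectors[i], vectors[j])
        (within if labels[i] == labels[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)


# Made-up purchase counts per customer in two product categories.
PURCHASES = [(5, 0), (4, 1), (0, 5), (1, 4)]
SEGMENTS = [0, 0, 1, 1]

within, between = mean_pairwise_distances(PURCHASES, SEGMENTS)
```

A good segmentation shows a within-cluster distance well below the between-cluster distance, as it does for this toy data.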
Document Clustering:
– Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
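Similarity "based on the terms appearing in them" is often computed as the cosine between term-count vectors. The following is a minimal sketch under that assumption: it counts every word equally, whereas a practical pipeline would weight the important terms and drop stop words. The three example documents and the function names are made up for illustration.

```python
from collections import Counter
from math import sqrt


def term_vector(text):
    """Count how often each lowercase whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


DOCUMENTS = [
    "stock market trading",
    "market trading shares",
    "football match score",
]
vectors = [term_vector(doc) for doc in DOCUMENTS]

finance_pair = cosine_similarity(vectors[0], vectors[1])
mixed_pair = cosine_similarity(vectors[0], vectors[2])
```

The two finance documents share two of their three terms and score about 0.67, while the finance and football documents share none and score 0.0. A clustering algorithm would group documents using such pairwise similarities.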
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
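The strength of such rules is commonly summarized by support (the fraction of transactions containing all items in the rule) and confidence (the fraction of transactions containing the left-hand side that also contain the right-hand side). These measures are not named in the text above, so treat this as a supplementary sketch; the function names are hypothetical, and the transactions are the five rows of the table.

```python
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= transaction for transaction in transactions)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, fraction that also have the consequent."""
    return count(antecedent | consequent, transactions) / count(
        antecedent, transactions
    )


milk_coke = (
    support({"Milk"}, {"Coke"}, TRANSACTIONS),
    confidence({"Milk"}, {"Coke"}, TRANSACTIONS),
)
diaper_milk_beer = (
    support({"Diaper", "Milk"}, {"Beer"}, TRANSACTIONS),
    confidence({"Diaper", "Milk"}, {"Beer"}, TRANSACTIONS),
)
```

For this table, {Milk} --> {Coke} has support 3/5 and confidence 3/4, and {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3.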
Association Analysis: Applications
Market-basket analysis
– Rules are used for sales promotion, shelf
management, and inventory management
Medical Informatics
– Rules are used to find combinations of patient
symptoms and test results associated with certain
diseases
Scalability
High Dimensionality
Non-traditional Analysis