Romi DM 01 Introduction 1juli2011

Data Mining
Romi Satria Wahono

romi@romisatriawahono.net
http://romisatriawahono.net
0878-804804-85
Romi Satria Wahono
SD Sompok Semarang (1987)

SMPN 8 Semarang (1990)
SMA Taruna Nusantara, Magelang (1993)
Bachelor's, Master's and Doctoral (on-leave) degrees
Department of Computer Sciences
Saitama University, Japan (1994-2004)
Research Interests: Software Engineering, Intelligent Systems
Founder and Coordinator of IlmuKomputer.Com
Researcher at LIPI (2004-2009)
Founder and CEO of PT Brainmatics Cipta Informatika
Learning Methods
1. Lecture
2. Discussion
3. Case Study
4. Practice
Textbooks
Course Outline
1. Introduction to Data Mining
2. Input - Concept, Instance and Attributes
3. Output - Knowledge Representation
4. Methods and Algorithm
5. Evaluation and Validation
6. Data Mining Research
Introduction to Data Mining
Contents
1. What is Data Mining
2. Main Task of Data Mining
3. Data Mining Standard Process
4. Data Mining Applications
5. Data Mining and Ethics
What is Data Mining
Why Data Mining?

Society produces huge amounts of data
Sources: business, science, medicine, economics,
geography, environment, sports,
Potentially valuable resource

Raw data is useless: we need techniques to automatically extract information (recognize patterns) from it
Data: recorded facts
Information: patterns underlying the data
Knowledge Discovery in Database (KDD)
Definition of Data Mining

Extracting implicit, previously unknown, potentially useful information from data (Witten, 2011)
The process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques (Gartner Group)

The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner (Hand et al., 2001)
An activity that covers collecting and using historical data to discover regularities, patterns and relationships in large data sets (Santosa, 2007)

An interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases (Cabena et al., 1998)
Data Mining as an Intersection of Disciplines

Statistics:
More theory-based
Focuses on hypothesis testing
Machine Learning:
More heuristic
Focuses on improving the performance of a learning technique
Data Mining:
Combines theory and heuristics
Focuses on the whole process of discovering knowledge and patterns
Includes data cleaning, learning and visualization of results
Data Mining Tools
WEKA
RapidMiner
Clementine
Matlab
R
Learning Methods - 1
1. Unsupervised Learning:
the data mining algorithm searches for patterns and structure among all the variables
no target variable is identified as such
clustering is an unsupervised learning method
2. Supervised Learning:
most data mining methods (classification and prediction) are supervised methods
the algorithm is given many examples where the value of the target variable is provided
the algorithm may learn which values of the target variable are associated with which values of the predictor variables
Learning Methods - 2
Another data mining method, which may be supervised or unsupervised, is association rule mining
In market basket analysis, for example, one may simply be interested in which items are purchased together, in which case no target variable would be identified
The problem here is that there are so many items for sale that searching for all possible associations may present a daunting task, due to the resulting combinatorial explosion
The a priori algorithm attacks this problem cleverly

Main Task of Data Mining
1. Description
2. Estimation
3. Prediction
4. Classification
5. Clustering
6. Association
Description
Researchers and analysts are simply trying to find ways to describe patterns and trends lying within data
A data mining model should describe clear patterns that are amenable to intuitive interpretation and explanation. Some data mining methods are more suited than others to transparent interpretation
decision trees provide an intuitive and human-friendly explanation of their results
neural networks are comparatively opaque to nonspecialists, due to the nonlinearity and complexity of the model
Description Techniques
Graphical Description
Dot Plot
Histogram
Description of Location
Mean (Average)
Median (Middle Value)
Mode (Most Frequent Value)
Quartiles (Values at Each Quarter of the Data)
Percentiles
Description of Variability
Range
Variance and Standard Deviation
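The location and variability measures listed on this slide can be computed with Python's standard `statistics` module. This is only a minimal sketch: the `ages` sample is made-up data used for illustration, not part of the course material.

```python
import statistics

# Hypothetical sample, used only to illustrate the measures
ages = [21, 22, 22, 23, 25, 27, 30, 35]

# Description of location
mean = statistics.mean(ages)
median = statistics.median(ages)
mode = statistics.mode(ages)
quartiles = statistics.quantiles(ages, n=4)  # the three quartile cut points

# Description of variability
value_range = max(ages) - min(ages)
variance = statistics.variance(ages)  # sample variance
std_dev = statistics.stdev(ages)      # sample standard deviation

print(mean, median, mode, quartiles, value_range, variance, std_dev)
```

Note that `statistics.variance` and `statistics.stdev` use the sample (n - 1) formulas; `pvariance` and `pstdev` give the population versions.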
Estimation
Estimation is similar to classification except that the target variable is numerical rather than categorical
Models are built using complete records, which provide the value of the target variable as well as the predictors
Then, for new observations, estimates of the value of the target variable are made, based on the values of the predictors
Estimation Techniques
The field of statistical analysis supplies several venerable and widely used estimation methods
These include point estimation and confidence interval estimations, simple linear regression and correlation, and multiple regression
Neural networks may also be used for estimation
Estimation - Examples
Estimating the amount of money a randomly chosen family of four will spend for back-to-school shopping this fall
Estimating the percentage decrease in rotary movement sustained by a National Football League running back with a knee injury
Estimating the number of points per game that Patrick Ewing will score when double-teamed in the playoffs
Estimating the grade-point average (GPA) of a graduate student, based on that student's undergraduate GPA
Regression estimates lie on the regression line
Estimating CPU Performance

Example: 209 different computer configurations

       Cycle time (ns)  Main memory (Kb)  Cache (Kb)  Channels      Performance
       MYCT             MMIN     MMAX     CACH        CHMIN  CHMAX  PRP
1      125              256      6000     256         16     128    198
2      29               8000     32000    32          8      32     269
...
208    480              512      8000     32          0      0      67
209    480              1000     4000     0           0      0      45
Linear regression function

PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
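As a small illustration, the regression function above can be written as a Python function and applied to the first configuration in the example (MYCT 125, MMIN 256, MMAX 6000, CACH 256, CHMIN 16, CHMAX 128, actual PRP 198). The function name `predict_prp` is chosen here for illustration only.

```python
def predict_prp(myct, mmin, mmax, cach, chmin, chmax):
    """Estimate PRP with the linear regression function from the slide."""
    return (-55.9
            + 0.0489 * myct
            + 0.0153 * mmin
            + 0.0056 * mmax
            + 0.6410 * cach
            - 0.2700 * chmin
            + 1.480 * chmax)

# First configuration from the example; its recorded PRP is 198
estimate = predict_prp(125, 256, 6000, 256, 16, 128)
residual = 198 - estimate  # actual minus estimated performance

print(round(estimate, 2), round(residual, 2))
```

For this one configuration the estimate is far from the recorded value, a reminder that a regression line fits the data set as a whole rather than each record.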
Prediction
Prediction is similar to classification and estimation, except that for prediction, the results lie in the future
Prediction Techniques
Any of the methods and techniques used for classification and estimation may also be used, under appropriate circumstances, for prediction
Statistical methods: point estimation and confidence interval estimations, simple linear regression and correlation, and multiple regression
Data mining methods: neural network, decision tree, and k-nearest neighbor
Prediction - Examples
Predicting the price of a stock three months into the future
Predicting the percentage increase in traffic deaths next year if the speed limit is increased
Predicting the winner of this fall's baseball World Series, based on a comparison of team statistics
Predicting whether a particular molecule in drug discovery will lead to a profitable new drug for a pharmaceutical company
Predicting the price of a stock
Classification
In classification, there is a target categorical variable, such as income bracket, which, for example, could be partitioned into three classes or categories:
1. high income
2. middle income
3. low income
The data mining model examines a large set of records, each record containing information on the target variable as well as a set of input or predictor variables
Classification Techniques
neural network
decision tree
k-nearest neighbor
naive bayes
Classification - Examples
Determining whether a particular credit card transaction is fraudulent
Placing a new student into a particular track with regard to special needs
Assessing whether a mortgage application is a good or bad credit risk
Diagnosing whether a particular disease is present
Identifying whether or not certain financial or personal behavior indicates a possible terrorist threat
The Contact Lenses Data
A Complete and Correct Rule Set
A Decision Tree for This Problem
The Weather Problem

Example: Conditions for playing a certain game
Rules:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
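The rule set above can be sketched as a small Python function, assuming the rules are read in order as a decision list, so that the first matching rule decides. The function name `play` and the string encoding of the attribute values are choices made for this illustration.

```python
def play(outlook, humidity, windy):
    """Apply the weather rules in order; the first matching rule decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above

print(play("sunny", "high", False))   # matched by the first rule
print(play("overcast", "high", True)) # matched by the third rule
```

The order matters: a sunny, high-humidity day is caught by the first rule before the default at the end can apply.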
Weather Data with Mixed Attributes

Example: Some attributes have numeric values
Rules:
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
Classifying Iris Flowers
A Complete and Correct Rule Set
Data from Labor Negotiations

Attribute                        Type                        1     2     3     ...  40
Duration                         (Number of years)           1     2     3          2
Wage increase first year         Percentage                  2%    4%    4.3%       4.5
Wage increase second year        Percentage                  ?     5%    4.4%       4.0
Wage increase third year         Percentage                  ?     ?     ?          ?
Cost of living adjustment        {none,tcf,tc}               none  tcf   ?          none
Working hours per week           (Number of hours)           28    35    38         40
Pension                          {none,ret-allw,empl-cntr}   none  ?     ?          ?
Standby pay                      Percentage                  ?     13%   ?          ?
Shift-work supplement            Percentage                  ?     5%    4%         4
Education allowance              {yes,no}                    yes   ?     ?          ?
Statutory holidays               (Number of days)            11    15    12         12
Vacation                         {below-avg,avg,gen}         avg   gen   gen        avg
Long-term disability assistance  {yes,no}                    no    ?     ?          yes
Dental plan contribution         {none,half,full}            none  ?     full       full
Bereavement assistance           {yes,no}                    no    ?     ?          yes
Health plan contribution         {none,half,full}            none  ?     full       half
Acceptability of contract        {good,bad}                  bad   good  good       good
Decision Trees for the Labor Data
Clustering
Clustering refers to the grouping of records, observations, or cases into classes of similar objects
A cluster is a collection of records that are similar to one another, and dissimilar to records in other clusters
Clustering differs from classification in that there is no target variable for clustering (unsupervised learning)
The clustering task does not try to classify, estimate, or predict the value of a target variable
Clustering is often performed as a preliminary step in a data mining process, with the resulting clusters being used as further inputs into a different technique downstream
Clustering Techniques
Hierarchical clustering
K-means clustering
Self Organizing Map (SOM)
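To make one of the listed techniques concrete, here is a minimal sketch of k-means on one-dimensional data. The data values, the starting centers and the function name `kmeans_1d` are all assumptions made for illustration; real tools such as those named in this course handle many attributes and choose starting centers more carefully.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data."""
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # centers stopped moving, so the clustering is stable
        centers = new_centers
    return centers, clusters

# Hypothetical data with two visible groups and assumed starting centers
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, clusters)
```

No target variable appears anywhere in the code: the groups come only from how similar the values are to one another.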
Clustering - Examples
Target marketing of a niche product for a small-capitalization business that does not have a large marketing budget
For accounting auditing purposes, to segmentize financial behavior into benign and suspicious categories
As a dimension-reduction tool when the data set has hundreds of attributes
For gene expression clustering, where very large quantities of genes may exhibit similar behavior
Clustering the Lifestyle Types

Claritas, Inc. provides a demographic profile of each of the geographic areas in the country, as defined by zip code. One of the clustering mechanisms they use is the PRIZM segmentation system, which describes every U.S. zip code area in terms of distinct lifestyle types. Just go to the company's Web site, enter a particular zip code, and you are shown the most common PRIZM clusters for that zip code.
What do these clusters mean? For illustration, let's look up the clusters for zip code 90210, Beverly Hills, California. The resulting clusters for zip code 90210 are:
1. Cluster 01: Blue Blood Estates
2. Cluster 10: Bohemian Mix
3. Cluster 02: Winner's Circle
4. Cluster 07: Money and Brains
Association
The association task for data mining is the job of finding which attributes go together
Most prevalent in the business world, where it is known as affinity analysis or market basket analysis, the task of association seeks to uncover rules for quantifying the relationship between two or more attributes
Association rules are of the form "If antecedent, then consequent", together with a measure of the support and confidence associated with the rule
Association
For example, a particular supermarket may find that of the 1000 customers shopping on a Thursday night:
200 bought diapers
of those 200 who bought diapers, 50 bought beer
Thus, the association rule would be "If buy diapers, then buy beer", with a support of 200/1000 = 20% and a confidence of 50/200 = 25%
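The arithmetic behind this example can be checked with a few lines of Python. The variable names are chosen for illustration. The sketch also computes the support as it is more commonly defined elsewhere, the share of all transactions containing both items, which gives a different figure from the one quoted on the slide.

```python
customers = 1000
bought_diapers = 200
bought_diapers_and_beer = 50

# Support as quoted on the slide: share of customers matching the antecedent
antecedent_support = bought_diapers / customers

# Confidence: share of diaper buyers who also bought beer
confidence = bought_diapers_and_beer / bought_diapers

# Support under the more common definition: share of customers buying both
joint_support = bought_diapers_and_beer / customers

print(f"{antecedent_support:.0%} {confidence:.0%} {joint_support:.0%}")
```

Which support definition applies depends on the textbook or tool in use, so it is worth checking before comparing figures.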
Association Techniques
A priori algorithm
FP-Growth algorithm
GRI algorithm
Association - Examples
Investigating the proportion of subscribers to a company's cell phone plan that respond positively to an offer of a service upgrade
Predicting degradation in telecommunications networks
Finding out which items in a supermarket are purchased together and which items are never purchased together
Determining the proportion of cases in which a new drug will exhibit dangerous side effects
Exercise (Classification)
1. Train on the election data (datakpu-training.xls) using the C4.5 algorithm
2. Test on datakpu-testing.xls
3. Measure the performance using:
1. Confusion Matrix (Accuracy)
2. ROC Curve (AUC)
Exercise (Estimation)
1. Train on the CPU data (cpu.arff) using linear regression
2. Test using XValidation (cross-validation)
3. Measure the performance using:
1. RMSE
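RMSE (root mean squared error), the measure named in this exercise, is the square root of the average squared difference between actual and estimated values. A minimal sketch follows; the function name `rmse` and the sample numbers are hypothetical and unrelated to cpu.arff.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical actual and estimated values
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
error = rmse(actual, predicted)
print(error)
```

A lower RMSE means the estimates are closer to the actual values, and the result is in the same unit as the target variable.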
Exercise (Time Series Prediction)

1. Train on the stock price data (hargasaham-training.xls) using a neural network
2. Test using the test data (hargasaham-testing.xls)
3. Measure the performance using:
1. Prediction Accuracy
2. RMSE
Assignment
1. Try all the data sets in the case study folder with various data mining methods. If a data set has no testing data, use X validation (cross-validation)
2. Study and try everything in rapidminer-movietutorial
3. Write a report on all the experiments from tasks 1 and 2, including screenshots, and send it by email to romi@brainmatics.com
1. Subject: [datamining1-udinus] name-NIM
2. Deadline: 5 August 2011
Data Mining Standard Process
Data Mining Standard Process (CRISP-DM)
A cross-industry standard was clearly required that is industry-neutral, tool-neutral, and application-neutral
The Cross-Industry Standard Process for Data Mining (CRISP-DM) was developed in 1996 (Chapman, 2000)
CRISP-DM provides a nonproprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit
CRISP-DM
1. Business Understanding Phase
Enunciate the project objectives and requirements clearly in terms of the business or research unit as a whole
Translate these goals and restrictions into the formulation of a data mining problem definition
Prepare a preliminary strategy for achieving these objectives
2. Data Understanding Phase
Collect the data
Use exploratory data analysis to familiarize yourself with the data and discover initial insights
Evaluate the quality of the data
If desired, select interesting subsets that may contain actionable patterns
3. Data Preparation Phase
Prepare from the initial raw data the final data set that is to be used for all subsequent phases. This phase is very labor intensive
Select the cases and variables you want to analyze and that are appropriate for your analysis
Perform transformations on certain variables, if needed
Clean the raw data so that it is ready for the modeling tools
4. Modeling phase
Select and apply appropriate modeling techniques
Calibrate model settings to optimize results
Remember that often, several different techniques may be used for the same data mining problem
If necessary, loop back to the data preparation phase to bring the form of the data into line with the specific requirements of a particular data mining technique
5. Evaluation phase
Evaluate the one or more models delivered in the modeling phase for quality and effectiveness before deploying them for use in the field
Determine whether the model in fact achieves the objectives set for it in the first phase
Establish whether some important facet of the business or research problem has not been accounted for sufficiently
Come to a decision regarding use of the data mining results
6. Deployment phase
Make use of the models created: model creation does not signify the completion of a project
Example of a simple deployment: Generate a report
Example of a more complex deployment: Implement a parallel data mining process in another department
For businesses, the customer often carries out the deployment based on your model
Exercise
Study and understand Case Studies 1-5 from Larose (2005), Chapter 1
Study and understand how CRISP-DM is applied in the thesis of Firmansyah (2011) on applying the C4.5 algorithm to determine creditworthiness
Data Mining Applications
Fielded Applications
Processing loan applications

Screening images for oil slicks
Electricity supply forecasting
Diagnosis of machine faults
Marketing and sales
Separating crude oil and natural gas
Reducing banding in rotogravure printing
Finding appropriate technicians for telephone faults
Scientific applications: biology, astronomy, chemistry
Automatic selection of TV programs
Monitoring intensive care patients
Processing Loan Applications (American Express)
Given: questionnaire with financial and personal information
Question: should money be lent?
Simple statistical method covers 90% of cases
Borderline cases referred to loan officers
But: 50% of accepted borderline cases defaulted!
Solution: reject all borderline cases?
No! Borderline cases are the most active customers
Enter Machine Learning

1000 training examples of borderline cases
20 attributes:
age
years with current employer
years at current address
years with the bank
other credit cards possessed,
Learned rules: correct on 70% of cases
human experts only 50%
Rules could be used to explain decisions to customers
Screening Images
Given: radar satellite images of coastal waters
Problem: detect oil slicks in those images
Oil slicks appear as dark regions with changing size and shape
Not easy: lookalike dark regions can be caused by weather conditions (e.g. high wind)
Expensive process requiring highly trained personnel
Enter Machine Learning
Extract dark regions from normalized image
Attributes:
size of region
shape, area
intensity
sharpness and jaggedness of boundaries
proximity of other regions
info about background
Constraints:
Few training examples: oil slicks are rare!
Unbalanced data: most dark regions aren't slicks
Regions from same image form a batch
Requirement: adjustable false-alarm rate
Load Forecasting
Electricity supply companies need forecasts of future demand for power
Forecasts of min/max load for each hour lead to significant savings
Given: manually constructed load model that assumes normal climatic conditions
Problem: adjust for weather conditions
Static model consists of:
base load for the year
load periodicity over the year
effect of holidays
Enter Machine Learning
Prediction corrected using most similar days
Attributes:
temperature
humidity
wind speed
cloud cover readings
plus difference between actual load and predicted load
Average difference among three most similar days added to static model
Linear regression coefficients form attribute weights in similarity function
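The correction step described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the method as actually fielded: the weighted Euclidean distance, the function name `corrected_forecast`, the weights and all of the sample data are hypothetical. The weights are assumed non-negative.

```python
def corrected_forecast(static_prediction, today, history, weights, k=3):
    """Add the mean load difference of the k most similar days to the static prediction.

    today:   tuple of weather attributes for the day being forecast
    history: list of (attributes, actual_load - predicted_load) pairs
    weights: one non-negative weight per attribute
    """
    def distance(a, b):
        # Assumed similarity function: weighted Euclidean distance
        return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5

    nearest = sorted(history, key=lambda day: distance(today, day[0]))[:k]
    correction = sum(diff for _, diff in nearest) / len(nearest)
    return static_prediction + correction

# Hypothetical history: (temperature, humidity, wind speed, cloud cover), load difference
history = [
    ((30.0, 70.0, 5.0, 0.2), 12.0),
    ((31.0, 72.0, 6.0, 0.3), 9.0),
    ((29.0, 68.0, 4.0, 0.1), 15.0),
    ((10.0, 40.0, 20.0, 0.9), -30.0),
]
weights = (1.0, 0.5, 0.2, 0.1)  # assumed values, not taken from the slides
forecast = corrected_forecast(500.0, (30.0, 70.0, 5.0, 0.2), history, weights)
print(forecast)
```

In this made-up case the three warm days are the nearest neighbors, so their average difference is added to the static prediction and the cold, windy day is ignored.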
Diagnosis of Machine Faults

Diagnosis: classical domain of expert systems
Given: Fourier analysis of vibrations measured at various points of a device's mounting
Question: which fault is present?
Preventative maintenance of electromechanical motors and generators
Information very noisy
So far: diagnosis by expert/hand-crafted rules
Enter Machine Learning
Available: 600 faults with expert's diagnosis
~300 unsatisfactory, rest used for training
Attributes augmented by intermediate concepts that embodied causal domain knowledge
Expert not satisfied with initial rules because they did not relate to his domain knowledge
Further background knowledge resulted in more complex rules that were satisfactory
Learned rules outperformed hand-crafted ones
Marketing and Sales I

Companies precisely record massive amounts of marketing and sales data
Applications:
Customer loyalty:
identifying customers that are likely to defect by detecting changes in their behavior
(e.g. banks/phone companies)
Special offers:
identifying profitable customers
(e.g. reliable owners of credit cards that need extra money during the holiday season)
Marketing and Sales II

Market basket analysis
Association techniques find groups of items that tend to occur together in a transaction
(used to analyze checkout data)
Historical analysis of purchasing patterns
Identifying prospective customers
Focusing promotional mailouts
(targeted campaigns are cheaper than mass-marketed ones)
Data Mining and Ethics
Data Mining and Ethics I

Ethical issues arise in practical applications
Anonymizing data is difficult
85% of Americans can be identified from just zip code, birth date and sex
Data mining often used to discriminate
E.g. loan applications: using some information (e.g. sex, religion, race) is unethical
Ethical situation depends on application
E.g. same information ok in medical application
Attributes may contain problematic information
E.g. area code may correlate with race
Data Mining and Ethics II

Important questions:
Who is permitted access to the data?
For what purpose was the data collected?
What kind of conclusions can be legitimately drawn from it?
Caveats must be attached to results
Purely statistical arguments are never sufficient!
Are resources put to good use?
References
Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Elsevier, 2006
Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, Second Edition, Springer, 2010
Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithm
