
Data Mining

Romi Satria Wahono


romi@romisatriawahono.net
http://romisatriawahono.net
0878-804804-85

Romi Satria Wahono

SD Sompok Semarang (1987)


SMPN 8 Semarang (1990)
SMA Taruna Nusantara, Magelang (1993)
Bachelor's, Master's, and Doctoral (on-leave) degrees
Department of Computer Sciences
Saitama University, Japan (1994-2004)
Research Interests: Software Engineering, Intelligent Systems
Founder and Coordinator of IlmuKomputer.Com
Researcher at LIPI (2004-2009)
Founder and CEO of PT Brainmatics Cipta Informatika

Learning Methods
1. Lecture
2. Discussion
3. Case Study
4. Practice

Textbooks

Course Outline
1. Introduction to Data Mining
2. Input - Concept, Instance and Attributes
3. Output - Knowledge Representation
4. Methods and Algorithm
5. Evaluation and Validation
6. Data Mining Research

Introduction to Data Mining

Contents
1. What is Data Mining
2. Main Task of Data Mining
3. Data Mining Standard Process
4. Data Mining Applications
5. Data Mining and Ethics

What is Data Mining

Why Data Mining?


Society produces huge amounts of data
Sources: business, science, medicine, economics, geography, environment, sports, ...
Potentially valuable resource
Raw data is useless: we need techniques to automatically extract information (recognize patterns) from it
Data: recorded facts
Information: patterns underlying the data

Knowledge Discovery in Database (KDD)

Definition of Data Mining


Extracting implicit, previously unknown, potentially useful information from data (Witten, 2011)
The process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques (Gartner Group)

Definition of Data Mining


The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner (Hand et al., 2001)
An activity that covers collecting and using historical data to find regularities, patterns, and relationships in large data sets (Santosa, 2007)

Definition of Data Mining


An interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases (Cabena et al., 1998)

Data Mining at the Intersection of Disciplines

Statistics:
More theoretical
Focuses on hypothesis testing

Machine Learning:
More heuristic
Focuses on improving the performance of a learning technique

Data Mining:
Combines theory and heuristics
Focuses on the whole process of discovering knowledge and patterns
Includes data cleaning, learning, and visualization of results

Data Mining Tools

WEKA
RapidMiner
Clementine
Matlab
R

Learning Methods - 1

1. Unsupervised Learning:
the data mining algorithm searches for patterns and structure among all the variables
no target variable is identified as such
clustering is an unsupervised learning method

2. Supervised Learning:
most data mining methods (classification and prediction) are supervised methods
the algorithm is given many examples where the value of the target variable is provided
the algorithm may learn which values of the target variable are associated with which values of the predictor
Learning Methods - 2

Another data mining method, which may be supervised or unsupervised, is association rule mining
In market basket analysis, for example, one may simply be interested in which items are purchased together, in which case no target variable would be identified
The problem here is that there are so many items for sale that searching for all possible associations may present a daunting task, due to the resulting combinatorial explosion
The a priori algorithm, attack this problem cle

Main Task of Data Mining

Main Task of Data Mining


1. Description
2. Estimation
3. Prediction
4. Classification
5. Clustering
6. Association

Description
Researchers and analysts are simply trying to find ways to describe patterns and trends lying within data
Data mining models should describe clear patterns that are amenable to intuitive interpretation and explanation. Some data mining methods are more suited than others to transparent interpretation:
decision trees provide an intuitive and human-friendly explanation of their results
neural networks are comparatively opaque to nonspecialists, due to the nonlinearity and complexity of the model

Description Techniques
Graphical Description
Dot Plot
Histogram

Measures of Location
Mean (Average)
Median (Middle Value)
Mode (Most Frequent Value)
Quartiles (Values at Each Quarter)
Percentiles

Measures of Spread
Range
Variance and Standard Deviation
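
The location and spread measures listed above can be computed with Python's standard `statistics` module. This is only a small sketch: the sample values are hypothetical and were chosen to make the results easy to check by hand, and the variance shown is the sample variance (divisor n - 1), which is an assumption about which variant is meant.

```python
import statistics

# Hypothetical sample values, chosen only to illustrate the measures.
values = [2, 4, 4, 5, 7, 9, 11]

# Measures of location
mean = statistics.mean(values)                 # average
median = statistics.median(values)             # middle value
mode = statistics.mode(values)                 # most frequent value
quartiles = statistics.quantiles(values, n=4)  # three cut points (Python 3.8+)

# Measures of spread
value_range = max(values) - min(values)
variance = statistics.variance(values)         # sample variance
std_dev = statistics.stdev(values)             # sample standard deviation

print(mean, median, mode, quartiles)
print(value_range, variance, round(std_dev, 4))
```

Graphical descriptions such as histograms need a plotting library and are left out of this sketch.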

Estimation
Estimation is similar to classification except that the target variable is numerical rather than categorical
Models are built using complete records, which provide the value of the target variable as well as the predictors
Then, for new observations, estimates of the value of the target variable are made, based on the values of the predictors

Estimation Techniques
The field of statistical analysis supplies several venerable and widely used estimation methods
These include point estimation and confidence interval estimations, simple linear regression and correlation, and multiple regression
Neural networks may also be used for estimation

Estimation - Examples
Estimating the amount of money a randomly chosen family of four will spend for back-to-school shopping this fall
Estimating the percentage decrease in rotary movement sustained by a National Football League running back with a knee injury
Estimating the number of points per game that Patrick Ewing will score when double-teamed in the playoffs
Estimating the grade-point average (GPA) of a graduate student, based on that student's undergraduate GPA

Regression estimates lie on the regression line

Estimating CPU Performance


Example: 209 different computer configurations

       Cycle time (ns)   Main memory (Kb)    Cache (Kb)   Channels         Performance
       MYCT              MMIN      MMAX      CACH         CHMIN   CHMAX    PRP
1      125               256       6000      256          16      128      198
2      29                8000      32000     32           8       32       269
208    480               512       8000      32           0       0        67
209    480               1000      4000      0            0       0        45

Linear regression function


PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
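
To see how the regression function is used for estimation, the sketch below plugs one configuration into it. The coefficients are copied from the function above and the attribute values from the first configuration in the table (MYCT 125, PRP 198); the names `estimate_prp` and `first_config` are illustrative only.

```python
# Coefficients copied from the linear regression function above.
INTERCEPT = -55.9
COEFFICIENTS = {
    "MYCT": 0.0489,
    "MMIN": 0.0153,
    "MMAX": 0.0056,
    "CACH": 0.6410,
    "CHMIN": -0.2700,
    "CHMAX": 1.480,
}

def estimate_prp(config):
    """Estimate performance (PRP) as the intercept plus a weighted sum of attributes."""
    return INTERCEPT + sum(weight * config[name] for name, weight in COEFFICIENTS.items())

# First configuration from the table; its recorded PRP is 198.
first_config = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
                "CACH": 256, "CHMIN": 16, "CHMAX": 128}

estimate = estimate_prp(first_config)
print(round(estimate, 1))  # 336.9
```

The estimate differs noticeably from the recorded value of 198 for this configuration, which is a reminder that a fitted line summarizes all 209 records and need not match any single one.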

Prediction
Prediction is similar to classification and estimation, except that for prediction, the results lie in the future

Prediction Techniques
Any of the methods and techniques used for classification and estimation may also be used, under appropriate circumstances, for prediction
Statistical methods: point estimation and confidence interval estimations, simple linear regression and correlation, and multiple regression
Data mining methods: neural network, decision tree, and k-nearest neighbor

Prediction - Examples
Predicting the price of a stock three months into the future
Predicting the percentage increase in traffic deaths next year if the speed limit is increased
Predicting the winner of this fall's baseball World Series, based on a comparison of team statistics
Predicting whether a particular molecule in drug discovery will lead to a profitable new drug for a pharmaceutical company

Predicting the price of a stock

Classification
In classification, there is a target categorical variable, such as income bracket, which, for example, could be partitioned into three classes or categories:
1. high income
2. middle income
3. low income

The data mining model examines a large set of records, each record containing information on the target variable as well as a set of input or predictor variables

Classification Techniques

neural network
decision tree
k-nearest neighbor
naive bayes

Classification - Examples
Determining whether a particular credit card transaction is fraudulent
Placing a new student into a particular track with regard to special needs
Assessing whether a mortgage application is a good or bad credit risk
Diagnosing whether a particular disease is present
Identifying whether or not certain financial or personal behavior indicates a possible terrorist threat

The Contact Lenses Data

A Complete and Correct Rule Set

A Decision Tree for This Problem

The Weather Problem


Example: Conditions for playing a certain game

Rules:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
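
The rules above can be read as an ordered list in which the first matching rule decides the outcome. The following sketch encodes them that way; the function name `play` and the string values are assumptions made for illustration.

```python
def play(outlook, humidity, windy):
    """Apply the weather rules in order; the first rule that matches decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    # None of the above
    return "yes"

print(play("sunny", "high", False))   # no
print(play("rainy", "normal", True))  # no
print(play("overcast", "high", True)) # yes
```

The second call shows why the order matters: a rainy, windy day with normal humidity is caught by the second rule before the humidity rule is ever reached.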

Weather Data with Mixed Attributes

Example: Some attributes have numeric values

Rules:
If outlook = sunny and humidity = high then play = no
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes

Classifying Iris Flowers

A Complete and Correct Rule Set

Data from Labor Negotiations


Attribute                         Type                          1      2      3      40
Duration                          (Number of years)             1      2      3      2
Wage increase first year          Percentage                    2%     4%     4.3%   4.5
Wage increase second year         Percentage                    ?      5%     4.4%   4.0
Wage increase third year          Percentage                    ?      ?      ?      ?
Cost of living adjustment         {none,tcf,tc}                 none   tcf    ?      none
Working hours per week            (Number of hours)             28     35     38     40
Pension                           {none,ret-allw,empl-cntr}     none   ?      ?      ?
Standby pay                       Percentage                    ?      13%    ?      ?
Shift-work supplement             Percentage                    ?      5%     4%     4
Education allowance               {yes,no}                      yes    ?      ?      ?
Statutory holidays                (Number of days)              11     15     12     12
Vacation                          {below-avg,avg,gen}           avg    gen    gen    avg
Long-term disability assistance   {yes,no}                      no     ?      ?      yes
Dental plan contribution          {none,half,full}              none   ?      full   full
Bereavement assistance            {yes,no}                      no     ?      ?      yes
Health plan contribution          {none,half,full}              none   ?      full   half
Acceptability of contract         {good,bad}                    bad    good   good   good

Decision Trees for the Labor Data

Clustering
Clustering refers to the grouping of records, observations, or cases into classes of similar objects
A cluster is a collection of records that are similar to one another, and dissimilar to records in other clusters
Clustering differs from classification in that there is no target variable for clustering (unsupervised learning)
The clustering task does not try to classify, estimate, or predict the value of a target variable
Clustering is often performed as a preliminary step in a data mining process, with the resulting cl

Clustering Techniques
Hierarchical clustering
K-means clustering
Self Organizing Map (SOM)
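
As an illustration of how one of these techniques groups similar records without a target variable, here is a minimal k-means sketch for one-dimensional data. The helper name `k_means_1d`, the sample values, and the starting centers are all hypothetical; production implementations handle many dimensions and choose starting centers more carefully.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster 1-D values around the given starting centers (illustrative helper)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    return centers, clusters

centers, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```

No record carries a label here: the two groups emerge only from how close the values are to one another.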

Clustering - Examples
Target marketing of a niche product for a small-capitalization business that does not have a large marketing budget
For accounting auditing purposes, to segment financial behavior into benign and suspicious categories
As a dimension-reduction tool when the data set has hundreds of attributes
For gene expression clustering, where very large quantities of genes may exhibit similar behavior

Clustering the Lifestyle Types


Claritas, Inc. provides a demographic profile of each of the geographic areas in the country, as defined by zip code. One of the clustering mechanisms they use is the PRIZM segmentation system, which describes every U.S. zip code area in terms of distinct lifestyle types. Just go to the company's Web site, enter a particular zip code, and you are shown the most common PRIZM clusters for that zip code.
What do these clusters mean? For illustration, let's look up the clusters for zip code 90210, Beverly Hills, California. The resulting clusters for zip code 90210 are:
1. Cluster 01: Blue Blood Estates
2. Cluster 10: Bohemian Mix
3. Cluster 02: Winner's Circle
4. Cluster 07: Money and Brains

Association
The association task for data mining is the job of finding which attributes go together
Most prevalent in the business world, where it is known as affinity analysis or market basket analysis, the task of association seeks to uncover rules for quantifying the relationship between two or more attributes
Association rules are of the form "If antecedent, then consequent", together with a measure of the support and confidence associated with the rule

Association
For example, a particular supermarket may find that of the 1000 customers shopping on a Thursday night:
200 bought diapers
of those 200 who bought diapers, 50 bought beer

Thus, the association rule would be "If buy diapers, then buy beer", with a support of 200/1000 = 20% (the share of customers who bought diapers) and a confidence of 50/200 = 25%
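
The arithmetic behind these figures can be sketched as below, using the counts from the example. The 20% figure is the share of customers who bought diapers, that is, the support of the antecedent. As an added point of comparison, many texts define the support of a rule as the share of transactions containing both items, which for these counts would be 50/1000 = 5%; the variable names are illustrative.

```python
customers = 1000
bought_diapers = 200
bought_diapers_and_beer = 50

# Share of customers who bought diapers (the 20% quoted in the example).
antecedent_support = bought_diapers / customers

# Share of customers who bought both items (a common alternative definition of rule support).
rule_support = bought_diapers_and_beer / customers

# Of those who bought diapers, the share who also bought beer.
confidence = bought_diapers_and_beer / bought_diapers

print(f"{antecedent_support:.0%} {rule_support:.0%} {confidence:.0%}")  # 20% 5% 25%
```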

Association Techniques
A priori algorithm
FP-Growth algorithm
GRI algorithm

Association - Examples
Investigating the proportion of subscribers to a company's cell phone plan that respond positively to an offer of a service upgrade
Predicting degradation in telecommunications networks
Finding out which items in a supermarket are purchased together and which items are never purchased together
Determining the proportion of cases in which a new drug will exhibit dangerous side effects

Exercise (Classification)
1. Train a model on the election data (datakpu-training.xls) using the C4.5 algorithm
2. Test it on datakpu-testing.xls
3. Measure its performance using:
1. Confusion Matrix (Accuracy)
2. ROC Curve (AUC)
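
For the accuracy measure, a confusion matrix for a two-class problem reduces to four counts, and accuracy is the share of correct predictions among all predictions. The counts below are hypothetical and are not results from the election data set.

```python
# Hypothetical counts from a two-class confusion matrix.
true_positive = 40
false_positive = 10
false_negative = 5
true_negative = 45

total = true_positive + false_positive + false_negative + true_negative

# Accuracy: correct predictions (the diagonal of the matrix) divided by all predictions.
accuracy = (true_positive + true_negative) / total
print(accuracy)  # 0.85
```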

Exercise (Estimation)
1. Train a model on the CPU data (cpu.arff) using linear regression
2. Test it with cross-validation (X-Validation)
3. Measure its performance using:
1. RMSE
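
RMSE (root mean squared error) is the square root of the average squared difference between actual and estimated values. A minimal sketch, with hypothetical numbers rather than values from cpu.arff:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical actual and estimated values.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

error = rmse(actual, predicted)
print(round(error, 3))  # 1.291
```

Because the errors are squared before averaging, one large miss raises RMSE more than several small ones.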

Exercise (Time Series Prediction)

1. Train a model on the stock price data (hargasaham-training.xls) using a neural network
2. Test it on the test data (hargasaham-testing.xls)
3. Measure its performance using:
1. Prediction Accuracy
2. RMSE

Assignment
1. Try all the data sets in the case study folder with a variety of data mining methods. If a data set has no testing file, use cross-validation (X-Validation)
2. Study and try everything in rapidminer-movietutorial
3. Write a report on all the experiments from assignments 1 and 2, including screenshots, and send it by email to romi@brainmatics.com
1. Subject: [datamining1-udinus] nama-nim (your name and student ID)
2. Deadline: 5 August 2011

Data Mining Standard Process

Data Mining Standard Process (CRISP-DM)

A cross-industry standard was clearly required that is industry-neutral, tool-neutral, and application-neutral
The Cross-Industry Standard Process for Data Mining (CRISP-DM) was developed in 1996 (Chapman, 2000)
CRISP-DM provides a nonproprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit

CRISP-DM

1. Business Understanding Phase

Enunciate the project objectives and requirements clearly in terms of the business or research unit as a whole
Translate these goals and restrictions into the formulation of a data mining problem definition
Prepare a preliminary strategy for achieving these objectives

2. Data Understanding Phase


Collect the data
Use exploratory data analysis to familiarize yourself with the data and discover initial insights
Evaluate the quality of the data
If desired, select interesting subsets that may contain actionable patterns

3. Data Preparation Phase


Prepare from the initial raw data the final data set that is to be used for all subsequent phases. This phase is very labor intensive
Select the cases and variables you want to analyze and that are appropriate for your analysis
Perform transformations on certain variables, if needed
Clean the raw data so that it is ready for the modeling tools

4. Modeling phase
Select and apply appropriate modeling techniques
Calibrate model settings to optimize results
Remember that often, several different techniques may be used for the same data mining problem
If necessary, loop back to the data preparation phase to bring the form of the data into line with the specific requirements of a particular data mining technique

5. Evaluation phase
Evaluate the one or more models delivered in the modeling phase for quality and effectiveness before deploying them for use in the field
Determine whether the model in fact achieves the objectives set for it in the first phase
Establish whether some important facet of the business or research problem has not been accounted for sufficiently
Come to a decision regarding use of the data mining results

6. Deployment phase
Make use of the models created: model creation does not signify the completion of a project
Example of a simple deployment: generate a report
Example of a more complex deployment: implement a parallel data mining process in another department
For businesses, the customer often carries out the deployment based on your model

Exercise
Study and understand Case Studies 1-5 from Larose (2005), Chapter 1
Study and understand how CRISP-DM is applied in Firmansyah's thesis (2011) on applying the C4.5 algorithm to determine creditworthiness

Data Mining Applications

Fielded Applications

Processing loan applications


Screening images for oil slicks
Electricity supply forecasting
Diagnosis of machine faults
Marketing and sales
Separating crude oil and natural gas
Reducing banding in rotogravure printing
Finding appropriate technicians for telephone faults
Scientific applications: biology, astronomy, chemistry
Automatic selection of TV programs
Monitoring intensive care patients

Processing Loan Applications (American Express)

Given: questionnaire with financial and personal information
Question: should money be lent?
Simple statistical method covers 90% of cases
Borderline cases referred to loan officers
But: 50% of accepted borderline cases defaulted!
Solution: reject all borderline cases?
No! Borderline cases are the most active customers

Enter Machine Learning


1000 training examples of borderline cases
20 attributes:
age
years with current employer
years at current address
years with the bank
other credit cards possessed, ...

Learned rules: correct on 70% of cases (human experts only 50%)
Rules could be used to explain decisions to customers

Screening Images
Given: radar satellite images of coastal waters
Problem: detect oil slicks in those images
Oil slicks appear as dark regions with changing size and shape
Not easy: lookalike dark regions can be caused by weather conditions (e.g. high wind)
Expensive process requiring highly trained personnel

Enter Machine Learning


Extract dark regions from normalized image
Attributes:
size of region
shape, area
intensity
sharpness and jaggedness of boundaries
proximity of other regions
info about background

Constraints:
Few training examples: oil slicks are rare!
Unbalanced data: most dark regions aren't slicks
Regions from same image form a batch
Requirement: adjustable false-alarm rate

Load Forecasting
Electricity supply companies need forecasts of future demand for power
Forecasts of min/max load for each hour yield significant savings
Given: manually constructed load model that assumes normal climatic conditions
Problem: adjust for weather conditions
Static model consists of:
base load for the year
load periodicity over the year
effect of holidays

Enter Machine Learning


Prediction corrected using most similar days
Attributes:
temperature
humidity
wind speed
cloud cover readings
plus difference between actual load and predicted load

Average difference among three most similar days added to static model
Linear regression coefficients form attribute weights in similarity function
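
The correction step described above can be sketched as a weighted nearest-neighbor lookup. Everything concrete below is hypothetical: the history records, the attribute weights (which the slide says come from linear regression coefficients), and the choice of a weighted Euclidean distance as the similarity function.

```python
import math

# Hypothetical past days: (temperature, humidity, wind speed, cloud cover,
# actual load minus the static model's predicted load).
HISTORY = [
    (30, 70, 5, 0.2, 120.0),
    (31, 65, 6, 0.3, 100.0),
    (29, 72, 4, 0.1, 110.0),
    (10, 40, 20, 0.9, -80.0),
    (12, 45, 18, 0.8, -60.0),
]

# Hypothetical attribute weights for the similarity function.
WEIGHTS = (1.0, 0.5, 0.5, 2.0)

def distance(a, b):
    """Weighted Euclidean distance between two weather readings (assumed form)."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(WEIGHTS, a, b)))

def corrected_forecast(static_load, weather, k=3):
    """Add the average load difference of the k most similar days to the static model."""
    nearest = sorted(HISTORY, key=lambda day: distance(day[:4], weather))[:k]
    correction = sum(day[4] for day in nearest) / k
    return static_load + correction

forecast = corrected_forecast(1000.0, (30, 68, 5, 0.2))
print(forecast)  # 1110.0
```

The three warm, humid days are the closest matches, so their average difference of 110 is added to the static forecast of 1000.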

Diagnosis of Machine Faults


Diagnosis: classical domain of expert systems
Given: Fourier analysis of vibrations measured at various points of a device's mounting
Question: which fault is present?
Preventative maintenance of electromechanical motors and generators
Information very noisy
So far: diagnosis by expert/hand-crafted rules

Enter Machine Learning


Available: 600 faults with expert's diagnosis
~300 unsatisfactory, rest used for training
Attributes augmented by intermediate concepts that embodied causal domain knowledge
Expert not satisfied with initial rules because they did not relate to his domain knowledge
Further background knowledge resulted in more complex rules that were satisfactory
Learned rules outperformed hand-crafted ones

Marketing and Sales I


Companies precisely record massive amounts of marketing and sales data
Applications:
Customer loyalty:
identifying customers that are likely to defect by detecting changes in their behavior
(e.g. banks/phone companies)
Special offers:
identifying profitable customers
(e.g. reliable owners of credit cards that need extra money during the holiday season)

Marketing and Sales II


Market basket analysis
Association techniques find groups of items that tend to occur together in a transaction
(used to analyze checkout data)

Historical analysis of purchasing patterns
Identifying prospective customers
Focusing promotional mailouts
(targeted campaigns are cheaper than mass-marketed ones)

Data Mining and Ethics

Data Mining and Ethics I


Ethical issues arise in practical applications
Anonymizing data is difficult
85% of Americans can be identified from just zip code, birth date and sex

Data mining often used to discriminate
E.g. loan applications: using some information (e.g. sex, religion, race) is unethical

Ethical situation depends on application
E.g. same information ok in medical application

Attributes may contain problematic information
E.g. area code may correlate with race

Data Mining and Ethics II


Important questions:
Who is permitted access to the data?
For what purpose was the data collected?
What kind of conclusions can be legitimately drawn from it?

Caveats must be attached to results
Purely statistical arguments are never sufficient!
Are resources put to good use?

References
Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Elsevier, 2006
Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, Second Edition, Springer, 2010
Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithm
