Romi DM 01 Introduction 1juli2011
Learning Methods
1. Lecture
2. Discussion
3. Case Study
4. Practice
Textbooks
Course Outline
Introduction to Data Mining
Contents
Machine Learning:
More heuristic in nature
Focuses on improving the performance of a learning technique
Data Mining:
Combines theory and heuristics
Focuses on the whole process of discovering knowledge and patterns
Includes data cleaning, learning, and visualization of results
WEKA
RapidMiner
Clementine
Matlab
R
2. Supervised Learning:
most data mining methods (classification and prediction) are supervised methods
the algorithm is given many examples where the value of the target variable is provided
the algorithm may learn which values of the target variable are associated with which values of the predictor variables
Description
Researchers and analysts are simply trying to find ways to describe patterns and trends lying within data
A data mining model should describe clear patterns that are amenable to intuitive interpretation and explanation. Some data mining methods are more suited than others to transparent interpretation
decision trees provide an intuitive and human-friendly explanation of their results
neural networks are comparatively opaque to nonspecialists, due to the nonlinearity and complexity of the model
Description Techniques
Graphical Description
Dot Plot
Histogram
Description of Location
Mean (Average)
Median (Middle Value)
Mode (Most Frequent Value)
Quartiles (Values at Each Quarter of the Data)
Percentiles
Description of Spread
Range
Variance and Standard Deviation
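The location and spread measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch; the `scores` sample is hypothetical and does not come from the slides.

```python
import statistics

# Hypothetical sample (an assumption for illustration): nine exam scores.
scores = [55, 60, 60, 70, 75, 80, 85, 90, 100]

# Measures of location
mean = statistics.mean(scores)                    # arithmetic average
median = statistics.median(scores)                # middle value of the sorted data
mode = statistics.mode(scores)                    # most frequent value
q1, q2, q3 = statistics.quantiles(scores, n=4)    # quartile cut points

# Measures of spread
value_range = max(scores) - min(scores)           # largest minus smallest value
variance = statistics.variance(scores)            # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)                # square root of the sample variance

print(mean, median, mode, (q1, q2, q3), value_range, variance, std_dev)
```

Note that `statistics.variance` and `statistics.stdev` are the sample versions; `pvariance` and `pstdev` give the population versions.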
Estimation
Estimation is similar to classification except that the target variable is numerical rather than categorical
Models are built using complete records, which provide the value of the target variable as well as the predictors
Then, for new observations, estimates of the value of the target variable are made, based on the values of the predictors
Estimation Techniques
The field of statistical analysis supplies several venerable and widely used estimation methods
These include point estimation and confidence interval estimations, simple linear regression and correlation, and multiple regression
Neural networks may also be used for estimation
Estimation - Examples
Estimating the amount of money a randomly chosen family of four will spend for back-to-school shopping this fall
Estimating the percentage decrease in rotary movement sustained by a National Football League running back with a knee injury
Estimating the number of points per game that Patrick Ewing will score when double-teamed in the playoffs
Estimating the grade-point average (GPA) of a graduate student, based on that student's undergraduate GPA
MMIN  MMAX  CACH  CHMIN  CHMAX  PRP
125  256  6000  256  16  128  198
29  8000  32000  32  32  269
208  480  512  8000  32  67
209  480  1000  4000  45
Prediction
Prediction is similar to classification and estimation, except that for prediction, the results lie in the future
Prediction Techniques
Any of the methods and techniques used for classification and estimation may also be used, under appropriate circumstances, for prediction
Statistical methods: point estimation and confidence interval estimations, simple linear regression and correlation, and multiple regression
Data mining methods: neural network, decision tree, and k-nearest neighbor
Prediction - Examples
Predicting the price of a stock three months into the future
Predicting the percentage increase in traffic deaths next year if the speed limit is increased
Predicting the winner of this fall's baseball World Series, based on a comparison of team statistics
Predicting whether a particular molecule in drug discovery will lead to a profitable new drug for a pharmaceutical company
Classification
In classification, there is a target categorical variable, such as income bracket, which, for example, could be partitioned into three classes or categories:
1. high income
2. middle income
3. low income
Classification Techniques
neural network
decision tree
k-nearest neighbor
naive bayes
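As an illustration of one technique from this list, the following is a minimal sketch of k-nearest neighbor classification for the income bracket example. The predictors (age, years of education), the training records, and the function name `knn_classify` are all hypothetical and do not come from the slides.

```python
from collections import Counter
from math import dist

def knn_classify(training, query, k=3):
    """Return the majority class among the k training records closest to query."""
    # Sort the labeled records by Euclidean distance to the query and keep k.
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    # Each neighbor votes with its class label.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical records: ((age, years of education), income bracket).
training = [
    ((22, 10), "low"), ((25, 11), "low"),
    ((35, 14), "middle"), ((38, 15), "middle"),
    ((50, 18), "high"), ((55, 20), "high"),
]

print(knn_classify(training, (36, 14)))
print(knn_classify(training, (52, 19)))
```

In practice the predictors would usually be rescaled first, since the unscaled Euclidean distance lets the attribute with the largest numeric range dominate.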
Classification - Examples
Determining whether a particular credit card transaction is fraudulent
Placing a new student into a particular track with regard to special needs
Assessing whether a mortgage application is a good or bad credit risk
Diagnosing whether a particular disease is present
Identifying whether or not certain financial or personal behavior indicates a possible terrorist threat
Rules:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
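The rules above are meant to be read in order, with the first matching rule deciding the outcome. A minimal sketch of that decision list follows; the function name `play_decision` and the argument encoding are assumptions made for illustration.

```python
def play_decision(outlook: str, humidity: str, windy: bool) -> str:
    """Apply the rules top to bottom; the first rule that matches wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    # "If none of the above then play = yes"
    return "yes"

print(play_decision("sunny", "high", False))
print(play_decision("overcast", "high", True))
```

Because the list is ordered, a rule such as "if humidity = normal then play = yes" applies only to cases that the earlier rules did not already cover.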
Rules:
If outlook = sunny and humidity = high then play = no
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
Type                         1      2      3      40
(Number of years)            1      2      3      2
Percentage                   2%     4%     4.3%   4.5
Percentage                   ?      5%     4.4%   4.0
Percentage                   ?      ?      ?      ?
{none,tcf,tc}                none   tcf    ?      none
(Number of hours)            28     35     38     40
{none,ret-allw,empl-cntr}    none   ?      ?      ?
Percentage                   ?      13%    ?      ?
Percentage                   ?      5%     4%     4
{yes,no}                     yes    ?      ?      ?
(Number of days)             11     15     12     12
{below-avg,avg,gen}          avg    gen    gen    avg
{yes,no}                     no     ?      ?      yes
{none,half,full}             none   ?      full   full
{yes,no}                     no     ?      ?      yes
{none,half,full}             none   ?      full   half
{good,bad}                   bad    good   good   good
Clustering
Clustering refers to the grouping of records, observations, or cases into classes of similar objects
A cluster is a collection of records that are similar to one another, and dissimilar to records in other clusters
Clustering differs from classification in that there is no target variable for clustering (unsupervised learning)
The clustering task does not try to classify, estimate, or predict the value of a target variable
Clustering is often performed as a preliminary step in a data mining process, with the resulting cl
Clustering Techniques
Hierarchical clustering
K-means clustering
Self Organizing Map (SOM)
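To make one of these techniques concrete, the following is a minimal sketch of k-means clustering. The points, the starting centroids, and the function name `k_means` are hypothetical; real implementations usually choose starting centroids at random and repeat the run several times.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D records forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No target variable appears anywhere in the code: the groups come only from the distances between records.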
Clustering - Examples
Target marketing of a niche product for a small-capitalization business that does not have a large marketing budget
For accounting auditing purposes, to segmentize financial behavior into benign and suspicious categories
As a dimension-reduction tool when the data set has hundreds of attributes
For gene expression clustering, where very large quantities of genes may exhibit similar behavior
Association
The association task for data mining is the job of finding which attributes go together
Most prevalent in the business world, where it is known as affinity analysis or market basket analysis, the task of association seeks to uncover rules for quantifying the relationship between two or more attributes
Association rules are of the form "If antecedent, then consequent", together with a measure of the support and confidence associated with the rule
Association
For example, a particular supermarket may find that of the 1000 customers shopping on a Thursday night:
200 bought diapers
of those 200 who bought diapers, 50 bought beer
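The counts in this example are enough to compute both measures for the rule "If buy diapers, then buy beer". The sketch below assumes the common definition in which support is the share of all transactions containing both the antecedent and the consequent; some texts instead report the antecedent's share (200/1000) as the support, so the definition in use should be checked.

```python
# Counts taken from the supermarket example.
total_customers = 1000
bought_diapers = 200            # antecedent
bought_diapers_and_beer = 50    # antecedent and consequent together

# Support: fraction of all transactions that contain both items
# (assumed definition, see the note above).
support = bought_diapers_and_beer / total_customers

# Confidence: of the transactions containing the antecedent,
# the fraction that also contain the consequent.
confidence = bought_diapers_and_beer / bought_diapers

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```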
Association Techniques
Apriori algorithm
FP-Growth algorithm
GRI algorithm
Association - Examples
Investigating the proportion of subscribers to a company's cell phone plan that respond positively to an offer of a service upgrade
Predicting degradation in telecommunications networks
Finding out which items in a supermarket are purchased together and which items are never purchased together
Determining the proportion of cases in which a new drug will exhibit dangerous side effects
Exercise (Classification)
1. Train on the election data (datakpu-training.xls) using the C4.5 algorithm
2. Test on datakpu-testing.xls
3. Measure the performance using:
1. Confusion Matrix (Accuracy)
2. ROC Curve (AUC)
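Both performance measures can be computed directly from a model's outputs. The following is a minimal sketch on hypothetical labels and scores, not on the election data. The AUC is computed through its ranking interpretation: the share of (positive, negative) pairs in which the positive case receives the higher score.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual class, predicted class) combination."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of records on the diagonal of the confusion matrix."""
    matrix = confusion_matrix(actual, predicted)
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / len(actual)

def auc(actual, scores, positive="yes"):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test-set results (assumptions for illustration only).
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no", "yes", "yes"]
scores = [0.9, 0.4, 0.3, 0.2, 0.8, 0.6]   # model confidence for class "yes"

print(confusion_matrix(actual, predicted))
print(accuracy(actual, predicted))
print(auc(actual, scores))
```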
Exercise (Estimation)
1. Train on the CPU data (cpu.arff) using linear regression
2. Test using XValidation (cross-validation)
3. Measure the performance using:
1. RMSE
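The workflow in this exercise can be sketched in a few lines. The example below is a simplification under stated assumptions: it fits a regression line on a single hypothetical predictor rather than on the attributes of cpu.arff, and it splits the folds by record position rather than at random.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def cross_validated_rmse(xs, ys, folds):
    """Hold out each fold in turn, train on the rest, and pool the predictions."""
    actual, predicted = [], []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            actual.append(ys[i])
            predicted.append(slope * xs[i] + intercept)
    return rmse(actual, predicted)

# Hypothetical data lying exactly on y = 2x + 1, so the error is (near) zero.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 7, 9, 11, 13]
print(cross_validated_rmse(xs, ys, folds=3))
```

Each record is predicted by a model that never saw it during training, which is what makes the cross-validated RMSE an estimate of performance on new data.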
Assignment
1. Try all the data sets in the case study folder with various data mining methods. If a data set has no testing data, use X-Validation (cross-validation)
2. Study and try everything in rapidminer-movietutorial
3. Write a report on all the experiments from assignments 1 and 2, including screenshots, and send it by email to romi@brainmatics.com
1. Subject: [datamining1-udinus] name-NIM (student ID number)
2. Deadline: 5 August 2011
CRISP-DM
4. Modeling phase
Select and apply appropriate modeling techniques
Calibrate model settings to optimize results
Remember that often, several different techniques may be used for the same data mining problem
If necessary, loop back to the data preparation phase to bring the form of the data into line with the specific requirements of a particular data mining technique
5. Evaluation phase
Evaluate the one or more models delivered in the modeling phase for quality and effectiveness before deploying them for use in the field
Determine whether the model in fact achieves the objectives set for it in the first phase
Establish whether some important facet of the business or research problem has not been accounted for sufficiently
Come to a decision regarding use of the data mining results
6. Deployment phase
Make use of the models created: Model creation does not signify the completion of a project
Example of a simple deployment: Generate a report
Example of a more complex deployment: Implement a parallel data mining process in another department
For businesses, the customer often carries out the deployment based on your model
Exercise
Study and understand Case Studies 1-5 from Larose (2005), Chapter 1
Study and understand how CRISP-DM is applied in Firmansyah's thesis (2011) on applying the C4.5 algorithm to determine creditworthiness
Fielded Applications
(American Express)
Screening Images
Given: radar satellite images of coastal waters
Problem: detect oil slicks in those images
Oil slicks appear as dark regions with changing size and shape
Not easy: lookalike dark regions can be caused by weather conditions (e.g. high wind)
Expensive process requiring highly trained personnel
size of region
shape, area
intensity
sharpness and jaggedness of boundaries
Constraints:
Load Forecasting
Electricity supply companies need forecasts of future demand for power
Forecasts of min/max load for each hour can yield significant savings
Given: manually constructed load model that assumes normal climatic conditions
Problem: adjust for weather conditions
Static model consists of:
base load for the year
load periodicity over the year
effect of holidays
temperature
humidity
wind speed
cloud cover readings
plus difference between actual load and predicted load
References
Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Elsevier, 2006
Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, Second Edition, Springer, 2010
Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithm