An Introduction To WEKA


Content
• What is WEKA?
• The Explorer:
– Preprocess data
– Classification
– Clustering
– Association Rules
– Attribute Selection
– Data Visualization
• References and Resources
5/18/2019 2
What is WEKA?
• Waikato Environment for Knowledge Analysis
– It’s a data mining/machine learning tool
developed by the Department of Computer Science,
University of Waikato, New Zealand.
– Weka is also a bird found only on the islands of
New Zealand.

Download and Install WEKA
• Website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
• Supports multiple platforms (written in Java):
– Windows, Mac OS X and Linux

Main Features
• 49 data preprocessing tools
• 76 classification/regression algorithms
• 8 clustering algorithms
• 3 algorithms for finding association rules
• 15 attribute/subset evaluators + 10 search
algorithms for feature selection

Main GUI
• Three graphical user interfaces
– “The Explorer” (exploratory data
analysis)
– “The Experimenter” (experimental
environment)
– “The KnowledgeFlow” (new interface
inspired by a process model)

Explorer: pre-processing the data
• Data can be imported from a file in various
formats: ARFF, CSV, C4.5, binary
• Data can also be read from a URL or from an
SQL database (using JDBC)
• Pre-processing tools in WEKA are called
“filters”
• WEKA contains filters for:
– Discretization, normalization, resampling,
attribute selection, transforming and combining
attributes, …
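As a rough illustration of what two of these filters compute, the sketch below applies min-max normalization and equal-width discretization to the known cholesterol values from the heart-disease ARFF sample in these slides. It is plain Python, not WEKA's filter API; the function names and the choice of three bins are assumptions made for this example.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: map everything to 0
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_discretize(values, bins):
    """Map each numeric value to a bin index 0..bins-1 of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Known (non-missing) cholesterol values from the sample rows.
cholesterol = [233, 286, 229]
normalized = min_max_normalize(cholesterol)
binned = equal_width_discretize(cholesterol, bins=3)

print(normalized)
print(binned)
```

Both functions look only at the attribute values and never at the class label, so they are unsupervised transformations in this sketch.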
WEKA only deals with “flat” files
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
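To make the layout of the file concrete, here is a minimal Python sketch that reads the subset of ARFF shown above: a relation name, numeric and nominal attributes, and comma-separated data rows with "?" marking a missing value. The function name `parse_arff` is hypothetical, and the sketch ignores other ARFF features such as quoted values, sparse rows, and date or string attributes.

```python
SAMPLE = """\
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""


def parse_arff(text):
    """Return (relation, attributes, rows) for a simple, dense ARFF text."""
    relation = None
    attributes = []  # (name, "numeric") or (name, [nominal values])
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        keyword = line.lower()
        if in_data:
            cells = [cell.strip() for cell in line.split(",")]
            row = []
            for cell, (_, kind) in zip(cells, attributes):
                if cell == "?":
                    row.append(None)  # missing value
                elif kind == "numeric":
                    row.append(float(cell))
                else:
                    row.append(cell)
            rows.append(row)
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [value.strip() for value in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), len(rows))
```

Every row has exactly one value per declared attribute, which is what "flat" means here: one table, with no nested or relational structure.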

5/18/2019 University of Waikato 11
Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting
nominal or numeric quantities
• Implemented learning schemes include:
– Decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons,
logistic regression, Bayes’ nets, …

Decision Tree Induction: Training Dataset

This follows an example of Quinlan’s ID3 (Playing Tennis).

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no


Output: A Decision Tree for “buys_computer”

age?
– <=30: student?
    – no: no
    – yes: yes
– 31…40: yes
– >40: credit rating?
    – excellent: no
    – fair: yes
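One way to see why “age” ends up at the root is to compute the information gain of each attribute on the 14 training rows, which is the split criterion ID3 uses. The Python sketch below does this by hand. The function and variable names are assumptions for this example, and it only scores the root split; it does not build the full tree.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# The 14 training rows; the last field is the class buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[attr_index] for row in rows):
        subset = [row[-1] for row in rows if row[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root attribute:", best)
```

The 31…40 branch contains only “yes” rows, so its entropy is zero; that is a large part of why splitting on age gives the highest gain.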


Explorer: finding associations
• WEKA contains an implementation of the
Apriori algorithm for learning association rules
– Works only with discrete data
• Can identify statistical dependencies between
groups of attributes:
– milk, butter → bread, eggs (with confidence 0.9
and support 2000)
• Apriori can compute all rules that have a given
minimum support and exceed a given
confidence
Basic Concepts: Frequent Patterns

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

• itemset: A set of one or more items
• k-itemset X = {x1, …, xk}
• (absolute) support, or, support count of X: Frequency or
occurrence of an itemset X
• (relative) support, s, is the fraction of transactions that
contains X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a
minsup threshold
Basic Concepts: Association Rules

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
– support, s, probability that a transaction contains X ∪ Y
– confidence, c, conditional probability that a transaction
having X also contains Y
• Let minsup = 50%, minconf = 50%
– Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
• Association rules: (many more!)
– Beer → Diaper (60%, 100%)
– Diaper → Beer (60%, 75%)
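The support and confidence figures on this slide can be checked with a few lines of Python over the five transactions. This is a brute-force sketch, not the Apriori algorithm: it simply counts every 1-itemset and 2-itemset, which is only practical for a toy table like this one. All function names are assumptions for this example.

```python
from itertools import combinations

TRANSACTIONS = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}


def support_count(itemset):
    """Absolute support: number of transactions containing every item."""
    return sum(1 for items in TRANSACTIONS.values() if itemset <= items)


def support(itemset):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset) / len(TRANSACTIONS)


def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs, i.e. P(rhs | lhs)."""
    return support_count(lhs | rhs) / support_count(lhs)


MINSUP = 0.5
all_items = sorted(set().union(*TRANSACTIONS.values()))

# Enumerate all 1- and 2-itemsets and keep those meeting minsup.
frequent = {}
for size in (1, 2):
    for combo in combinations(all_items, size):
        itemset = frozenset(combo)
        if support(itemset) >= MINSUP:
            frequent[itemset] = support_count(itemset)

for itemset, count in frequent.items():
    print(sorted(itemset), count)
print("Beer -> Diaper:", support({"Beer", "Diaper"}), confidence({"Beer"}, {"Diaper"}))
print("Diaper -> Beer:", support({"Beer", "Diaper"}), confidence({"Diaper"}, {"Beer"}))
```

Both rules share the same support because support depends only on the combined itemset {Beer, Diaper}; the confidences differ because the denominators (3 beer transactions versus 4 diaper transactions) differ.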
Explorer: attribute selection
• Panel that can be used to investigate which
(subsets of) attributes are the most predictive
ones
• Attribute selection methods contain two
parts:
– A search method: best-first, forward selection,
random, exhaustive, genetic algorithm, ranking
– An evaluation method: correlation-based,
wrapper, information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary
combinations of these two
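The split into a search method and an evaluation method can be sketched in Python as a search function that accepts any evaluator. Below, greedy forward selection is paired with a simple training-set consistency score (the fraction of rows that match the majority class among rows sharing the same values on the chosen attributes). Both are stand-ins chosen for this illustration and are not WEKA's implementations; the data is the buys_computer table from the decision-tree slide, and all names are assumptions.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# buys_computer training rows; the last field is the class.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]


def consistency(rows, subset):
    """Evaluation method: majority-class agreement within each value group."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[ATTRIBUTES.index(name)] for name in subset)
        groups[key][row[-1]] += 1
    return sum(max(counts.values()) for counts in groups.values()) / len(rows)


def forward_selection(rows, attributes, evaluate):
    """Search method: add the best attribute until the score stops rising."""
    selected = []
    best = evaluate(rows, selected)
    while len(selected) < len(attributes):
        candidates = [name for name in attributes if name not in selected]
        score, name = max(
            ((evaluate(rows, selected + [c]), c) for c in candidates),
            key=lambda pair: pair[0],
        )
        if score <= best:  # no candidate improves the subset: stop
            break
        selected.append(name)
        best = score
    return selected, best


selected, score = forward_selection(ROWS, ATTRIBUTES, consistency)
print(selected, score)
```

Because `forward_selection` only calls `evaluate(rows, subset)`, a different scoring function can be swapped in without touching the search, which is the flexibility this slide describes.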
Explorer: data visualization
• Visualization very useful in practice: e.g. helps
to determine difficulty of the learning
problem
• WEKA can visualize single attributes (1-d) and
pairs of attributes (2-d)
– To do: rotating 3-d visualizations (Xgobi-style)
• Color-coded class values
• “Jitter” option to deal with nominal attributes
(and to detect “hidden” data points)
• “Zoom-in” function
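The idea behind jitter can be shown with a short Python sketch: nominal values plotted as integer codes land on exactly the same coordinate, so a small random offset is added to each point to pull overlapping points apart. This is a generic illustration and not how WEKA implements its option; the function name, the offset size, and the fixed seed are assumptions.

```python
import random


def jitter(values, amount, seed=0):
    """Add uniform noise in [-amount, +amount] to each plotted coordinate."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# Five points whose nominal attribute is coded as 0 or 1: without
# jitter, only two distinct positions would be visible on the axis.
codes = [0, 0, 0, 1, 1]
jittered = jitter(codes, amount=0.2)

print(len(set(codes)), "distinct positions before jitter")
print(len(set(jittered)), "distinct positions after jitter")
```

Keeping the offset well below half the distance between neighboring codes means each point still sits visibly closest to its own category.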
References and Resources
• References:
– WEKA website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
– WEKA Tutorial:
    – Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in
    Weka.
    – A presentation which explains how to use Weka for exploratory data
    mining.
– WEKA Data Mining Book:
    – Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning
    Tools and Techniques (Second Edition)
– WEKA Wiki:
http://weka.sourceforge.net/wiki/index.php/Main_Page
• Others:
– Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques, 2nd ed.
