WEKA Manual

2015-16
Malla Reddy Engineering College (Autonomous)

LTP
- - 4
Course Code: 50530 Credits: 2
B.Tech. – VII Semester
DATA WAREHOUSING AND DATA MINING LAB
1:
Title: Introduction to the Weka machine learning toolkit
Aim: To learn to use the Weka machine learning toolkit
Requirements: How do you load Weka?

1. What options are available on the main panel?
Explorer, Experimenter, KnowledgeFlow, Workbench and Simple CLI.
2. What is the purpose of the following in Weka:

1. The Explorer: an environment for exploring data. (It gives access to all the facilities of Weka
using menu selection and form filling. It allows loading the prepared data. We can flip back and forth
between results, evaluate models built on different datasets, and visualize graphically both models and
datasets, including classification errors.)
2. The Knowledge Flow interface: a Java-Beans-based interface for setting up and running
machine learning experiments. (Data sources, classifiers, etc. are beans and can be connected graphically.
Data "flows" through components, e.g., "data source" -> "filter" -> "classifier" -> "evaluator". Layouts
can be saved and loaded again later.)
3. The Experimenter: an environment for performing experiments and conducting statistical
tests between learning schemes. (The Experimenter, which can be run from both the command line and a
GUI (easier to use), is a tool that allows you to perform more than one experiment at a time, for example applying
different techniques to a dataset, or the same technique repeatedly with different parameters.)
The Experimenter makes it easy to compare the performance of different learning schemes:
• For classification and regression problems
• Results can be written to a file or database
• Evaluation options: cross-validation, learning curve, hold-out
• Can also iterate over different parameter settings
• Significance testing built in
4. The command-line interface: provides a simple command-line interface and allows direct
execution of Weka commands.
3. Describe the .arff file format.

• Data can be imported into WEKA from files in various formats (ARFF, CSV, C4.5, binary). It can
also be read from a URL or from an SQL database (using JDBC). The easiest and most common
way of getting data into WEKA is to store it as an Attribute-Relation File Format (ARFF) file.
• An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes.
• ARFF files have two distinct sections. The first section is the Header information, which
is followed by the Data information.
• The Header of the ARFF file contains the name of the relation, a list of the attributes (the
columns in the data), and their types. An example header on the standard IRIS dataset
looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC

@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments.

The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.
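To make the header/data split concrete, here is a minimal Python sketch of a reader for simple ARFF text. It is an illustration only, not Weka's own loader: the function name parse_arff and the SAMPLE string are made up for this example, and the sketch assumes dense data with unquoted names and values (no sparse rows, quoted strings or date attributes).

```python
def parse_arff(text):
    """Parse a small, dense ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blank lines and comments
            continue
        if in_data:                               # after @DATA every line is one instance
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()  # declarations are case insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):              # nominal attribute: list of allowed values
                spec = [value.strip() for value in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# A shortened copy of the iris example (assumption: two data rows are enough here).
SAMPLE = """% 1. Title: Iris Plants Database
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), len(rows))
```

The values are kept as strings; converting the NUMERIC columns to floats would be a natural next step.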
4. Press the Explorer button on the main panel and load the weather dataset and answer the
following questions
1. How many instances are there in the dataset?
There are 14 instances.
2. State the names of the attributes along with their types and values.
S.No  Name of the Attribute  Data Type  Values
1.    Outlook                Nominal    sunny, overcast, rainy
2.    Temperature            Numeric    max 85, min 64
3.    Humidity               Numeric    max 96, min 65
4.    Windy                  Nominal    TRUE, FALSE
5.    Play                   Nominal    yes, no
3. What is the class attribute?

‘Play’ is the class attribute.
4. In the histogram on the bottom-right, which attributes are plotted on the X and Y axes? How
do you change the attributes plotted on the X and Y axes?
The histogram at the bottom right of the tool shows the attribute currently selected in the
Attributes panel; selecting a different attribute there changes the plot.
5. How will you determine how many instances of each class are present in the data?
By clicking on the 'play' attribute. It shows 9 instances of yes and 5 instances of no.
6. What happens when the Visualize All button is pressed?

We can visualize all the attributes at a time.
7. How will you view the instances in the dataset? How will you save the changes?
By clicking the 'Edit' button we can see all the instances. Clicking on any value
allows the data to be modified. The Save button on the Preprocess panel writes the changed dataset to a file.
2: https://www.cs.waikato.ac.nz/~ml/weka/gui_explorer.html
a .What is the purpose of the following in the Explorer Panel?
1. The Preprocess panel
The Preprocess panel is the starting point for knowledge exploration. From this panel you can
load datasets, browse the characteristics of attributes and apply any combination of Weka's
unsupervised filters to the data.
1. What are the main sections of the Preprocess panel?
Open file, open URL, open DB, Generate, Undo, Edit, Save.
2. What are the primary sources of data in Weka?
.ARFF, .CSV, C4.5, binary
2. The Classify panel

The Classify panel allows you to configure and execute any of the Weka classifiers on the
current dataset. You can choose to perform a cross-validation or test on a separate dataset.
Classification errors can be visualized in a pop-up data visualization tool. If the classifier
produces a decision tree it can be displayed graphically in a pop-up tree visualizer.
3. The Cluster panel
From the cluster panel you can configure and execute any of the weka clusterers on the
current dataset. Clusters can be visualized in a pop-up data visualization tool.
4. The Associate panel

From the associate panel you can mine the current dataset for association rules using the
weka associators.
5. The Select Attributes panel

This panel allows you to configure and apply any combination of weka attribute evaluator
and search method to select the most pertinent attributes in the dataset. If an attribute
selection scheme transforms the data then the transformed data can be visualized in a pop-
up data visualization tool.
6. The Visualize panel.
This panel displays a scatter plot matrix for the current dataset. The size of the individual
cells and the size of the points they display can be adjusted using the slider controls at the
bottom of the panel. The number of cells in the matrix can be changed by pressing the
"Select Attributes" button and then choosing the attributes to be displayed. When a dataset
is large, plotting performance can be improved by displaying only a subsample of the
current dataset. Clicking on a cell in the matrix pops up a larger plot panel window that
displays the view from that cell. This panel allows you to visualize the current dataset in
one and two dimensions. When the colouring attribute is discrete, each value is displayed
as a different colour; when the colouring attribute is continuous, a spectrum is used to
indicate the value. Attribute "bars" (down the right hand side of the panel) provide a
convenient summary of the discriminating power of the attributes individually. This panel
can also be popped up in a separate window from the classifier panel and the cluster panel
to allow you to visualize predictions made by classifiers/clusterers. When the class is
discrete, misclassified points are shown by a box in the color corresponding to the class
predicted by the classifier; when the class is continuous, the size of each plotted point varies
in proportion to the magnitude of the error made by the classifier.
b. Load the weather dataset and perform the following tasks:
1. Use the unsupervised filter RemoveWithValues to remove all instances where
the attribute 'humidity' has the value 'high'.
Load weather.arff into Weka. Click on Choose and select weka > filters >
unsupervised > instance > RemoveWithValues. RemoveWithValues now appears
in the text box next to the Choose button. Click on that text box
to open the weka.gui.GenericObjectEditor dialog. Set attributeIndex to 3,
invertSelection to True, nominalIndices to 3 and splitPoint to 96.0 (the highest value of that
attribute). Then click the OK button, and click the Apply button (next to the text box). To verify,
check the maximum value of the attribute 'humidity' in the Selected attribute section.
2. Undo the effect of the filter.
Clicking the Undo button restores the original data.
c. Answer the following questions:

1.What is meant by filtering in Weka?
Weka includes many filters that can be used before invoking a classifier to clean
up the dataset, or alter it in some way. Filters help with data preparation. For
example, you can easily remove an attribute. Or you can remove all instances that
have a certain value for an attribute (e.g. instances for which humidity has the
value high). Surprisingly, removing attributes sometimes leads to better
classification! – and also simpler decision trees.
2. Which panel is used for filtering a dataset?

Filter section under Preprocess tab will allow the user to perform filter operations
on the uploaded dataset.
3. What are the two main types of filters in Weka?

Supervised: attribute and instance
Unsupervised: attribute and instance
4. What is the difference between the two types of filters? What is the difference
between an attribute filter and an instance filter?
In WEKA, filters are used to preprocess the data, and they can be found in the
package weka.filters. Each filter falls into one of the following two categories:
• Supervised: the filter requires a class attribute to be set
• Unsupervised: a class attribute is not required to be present
And into one of the two sub-categories:
• Attribute-based: columns are processed, e.g., added or removed.
• Instance-based: rows are processed, e.g., added or deleted.
Discretization illustrates the difference between the two categories. The supervised
Discretize filter takes the class attribute and its distribution over the dataset into
account in order to determine the optimal number and size of bins, whereas the
unsupervised one relies on a user-specified number of bins.
So, Weka filters are used to manipulate a dataset in order to obtain data
which can be processed by a classifier or a clusterer.
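The attribute/instance distinction and the idea of a user-specified number of bins can be sketched in a few lines of Python. This is only an illustration of the concepts, not Weka's implementation: the function names and the four-row dataset are hypothetical, and the binning shown is plain equal-width binning.

```python
# A tiny made-up dataset: each dict is one instance (row).
rows = [
    {"outlook": "sunny", "humidity": 85, "play": "no"},
    {"outlook": "overcast", "humidity": 86, "play": "yes"},
    {"outlook": "rainy", "humidity": 70, "play": "yes"},
    {"outlook": "sunny", "humidity": 95, "play": "no"},
]


def remove_attribute(rows, name):
    """Attribute-based filter: drop one column from every instance."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


def keep_instances_below(rows, name, split_point):
    """Instance-based filter: keep only rows whose numeric value is below split_point."""
    return [row for row in rows if row[name] < split_point]


def equal_width_bin(value, low, high, bins):
    """Unsupervised discretization: the number of bins is chosen by the user."""
    width = (high - low) / bins
    return min(int((value - low) / width), bins - 1)


without_humidity = remove_attribute(rows, "humidity")
low_humidity = keep_instances_below(rows, "humidity", 86)
humidity_bins = [equal_width_bin(r["humidity"], 70, 95, 2) for r in rows]

print(without_humidity[0])   # columns changed, row count unchanged
print(len(low_humidity))     # rows changed, columns unchanged
print(humidity_bins)
```

Neither function looks at the class attribute, so both count as unsupervised in the sense described above; a supervised filter would also use the 'play' column to decide what to do.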
d. Load the iris dataset and perform the following tasks:

1. Press the Visualize tab to view the Visualizer panel.
The Visualizer panel consists of a plot matrix. From left to right, the columns are sepal length,
sepal width, petal length, petal width and class. From top to bottom, the rows are class,
petal width, petal length, sepal width and sepal length.
Plot matrix layout:
              Sepal length  Sepal width  Petal length  Petal width  Class
Class         |
Petal width   |
Petal length  |
Sepal width   |
Sepal length  |
2. What is the purpose of the Visualizer?
For successful data mining we must know our data. WEKA's Visualize panel
lets us look at a dataset and select different attributes, preferably
numeric ones, for the X and Y axes.
Instances are shown as points, with different colors for different classes.
We can sweep out a rectangle and focus the dataset on the points inside it.
We can also apply a classifier and visualize the errors it makes by plotting the "class"
against the "predicted class".
3. Select one panel in the Visualizer and experiment with the buttons on the panel.
We selected the plot of:
X-coordinate: sepalwidth (num)
Y-coordinate: petalwidth (num)
Experiment with the buttons:
We selected colour: sepallength (num)
Selection shape: Polygon
Observation:
The graph changed with respect to its colouring and selection shape.
3. Classification using the WEKA tool kit.
Aim:
To perform classification on datasets using the WEKA machine learning toolkit.
Requirements:
1. Load the “weather.nominal.arff” dataset into WEKA and run ID3 classification
algorithm.
Answer the following questions:

1. List the attributes of the given relation along with the type details.
Ans. Relation: weather.symbolic
No.  Outlook    Temperature  Humidity   Windy      Play
     (Nominal)  (Nominal)    (Nominal)  (Nominal)  (Nominal)
1    Sunny      Hot          High       False      No
2    Sunny      Hot          High       True       No
3    Overcast   Hot          High       False      Yes
4    Rainy      Mild         High       False      Yes
5    Rainy      Cool         Normal     False      Yes
6    Rainy      Cool         Normal     True       No
7    Overcast   Cool         Normal     True       Yes
8    Sunny      Mild         High       False      No
9    Sunny      Cool         Normal     False      Yes
10   Rainy      Mild         Normal     False      Yes
11   Sunny      Mild         Normal     True       Yes
12   Overcast   Mild         High       True       Yes
13   Overcast   Hot          Normal     False      Yes
14   Rainy      Mild         High       True       No
2. Create a table of the weather.nominal.arff data
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0        0        0          0       0          0.5       Sunny
0        0        0          0       0          0.5       Overcast
1        1        0.2        1       0.333      0.5       Rainy
0.2      0.2      0.04       0.2     0.067      0.5       Weighted Average
3. Study the classifier Output and answer the following questions:

1. Draw the decision tree generated by the classifier.
Select>>Classify>>Trees>>J48 Algorithm>>Use Training set>>Start>>Right
click>>Visualize tree
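2. Compute the entropy values for each of the attributes

The dataset D contains 9 "yes" and 5 "no" instances, so:
\[ Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \]
The expected information after splitting D on an attribute A with r distinct values is:
\[ E(A) = \sum_{j=1}^{r} \frac{s_{1j} + s_{2j} + \cdots + s_{nj}}{s} \, I(s_{1j}, s_{2j}, \ldots, s_{nj}) \]
where s_{ij} is the number of instances of class i in partition j and s is the total number of
instances. The information gain of A is Gain(A) = Info(D) - E(A):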
Gain(outlook): 0.247
Gain(temp): 0.029
Gain(humidity): 0.152
Gain(wind): 0.048
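The entropy and gain figures above can be checked with a short Python sketch. The rows are transcribed from the standard 14-instance weather.nominal.arff file; the helper names info and gain are chosen for this example and are not part of Weka.

```python
from collections import Counter
from math import log2

# (outlook, temperature, humidity, windy, play), as in weather.nominal.arff
DATA = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]


def info(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def gain(rows, column):
    """Information gain obtained by splitting rows on the given column index."""
    expected = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        expected += len(subset) / len(rows) * info(subset)
    return info([row[-1] for row in rows]) - expected


info_d = info([row[-1] for row in DATA])
gains = {name: gain(DATA, index) for index, name in enumerate(ATTRIBUTES)}

print(f"Info(D) = {info_d:.3f}")
for name, value in gains.items():
    print(f"Gain({name}) = {value:.3f}")
```

Outlook has the largest gain, which is why a tree learner based on information gain tests it first.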
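3. What is the relationship between the attribute entropy values and nodes of
decision tree?
The entropy values and information gain decide which attribute is tested at each node:
the attribute with the highest information gain (here outlook, 0.247) becomes the splitting
attribute at the root, and the same procedure is repeated for each branch.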
4. Draw the confusion matrix? What information does the confusion matrix
provide?
A confusion matrix is a technique for summarizing the performance of a
classification algorithm.
Classification accuracy alone can be misleading if there is an unequal
number of observations in each class or if there are more than two classes in the
dataset.
Calculating a confusion matrix gives us a better idea of what our
classification model is getting right and what types of errors it is making.
Test option: Use training set
Class attribute: (Nominal) Outlook
Select Start; the confusion matrix is obtained as follows in the WEKA tool.
a b c   <-- classified as
4 1 0 | a = sunny
0 4 0 | b = overcast
3 0 2 | c = rainy
5. Describe the following quantities:
TP Rate:
Rate of true positives (instances correctly classified as a given class)
FP Rate:
Rate of false positives (instances falsely classified as a given class)
Precision:
Number of instances that are truly of a class divided by the total number of instances
classified as that class
Recall:
Number of instances correctly classified as a given class divided by the actual total in that
class (equivalent to TP rate)
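As a worked example, the four quantities can be computed per class from the 3 x 3 confusion matrix shown earlier in this exercise (rows are actual classes, columns are predicted classes). This is a plain Python sketch with a made-up helper name, not output copied from Weka.

```python
LABELS = ["sunny", "overcast", "rainy"]
# Rows: actual class. Columns: predicted class.
MATRIX = [
    [4, 1, 0],
    [0, 4, 0],
    [3, 0, 2],
]


def per_class_metrics(matrix, index):
    """Return TP rate, FP rate, precision and recall for one class."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[index][index]
    fn = sum(matrix[index]) - tp                  # actual class, predicted as another
    fp = sum(row[index] for row in matrix) - tp   # other classes, predicted as this one
    tn = total - tp - fn - fp
    # No zero denominators occur for this matrix; a general version should guard them.
    return {
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


metrics = {label: per_class_metrics(MATRIX, i) for i, label in enumerate(LABELS)}
accuracy = sum(MATRIX[i][i] for i in range(len(LABELS))) / sum(map(sum, MATRIX))

for label, values in metrics.items():
    print(label, {name: round(value, 3) for name, value in values.items()})
print("accuracy", round(accuracy, 3))
```

For the class sunny, for instance, 4 of the 5 actual sunny instances are recovered (TP rate 0.8), while 3 of the 9 non-sunny instances are wrongly labelled sunny (FP rate 0.333).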
4. Performing data preprocessing tasks for data mining in WEKA.
Aim:
To learn how to use various data preprocessing methods as part of data mining.
Requirements:
Applications of Discretization filters
Performing the following tasks:

1. Load the ‘sick.arff’ dataset
2. How many instances does this dataset have
a. There are 3772 instances in sick.arff dataset
3. How many attributes does it have?
a. There are 30 attributes:
4. Which is the class attribute and what are the characteristics of this attribute?
a. 'Class' is the class attribute:
S.No  Label     Count
1     Negative  3541
2     Sick      231
The class attribute gives the class label of each instance. It is nominal with two labels,
and most instances (3541 of 3772) are negative.
5. How many attributes are numeric? What are the attribute indices of the numeric attributes?
a. The following 7 attributes are numeric:
Index  Name  Type     Minimum  Maximum  Mean     StdDev
1      Age   Numeric  18       87       56.347   19.687
18     TSH   Numeric  0.03     45       3.02     6.986
20     T3    Numeric  0.3      5.5      1.992    0.87
22     TT4   Numeric  39       199      109.729  34.903
24     T4U   Numeric  0.56     1.55     0.957    0.18
26     FTI   Numeric  33       190      113.711  31.824
28     TBG   Numeric  NaN      NaN      NaN      NaN
5. Performing clustering using the data mining toolkit
Aim:
To learn to use clustering techniques
Requirements:
Perform the following tasks:

1. Load the ‘bank.arff’ dataset in WEKA
2. Write down the following details regarding the attributes:
a. Names
b. Types
c. Values
1. Age
Type: Numeric
Statistic Value
Minimum 18
Maximum 67
Mean 42.57
StdDev 14.22
2. Sex
Type: Nominal
Label Count
Male 154
Female 146
3. Region
Type: Nominal
Label Count
Inner city 137
Rural 51
Town 87
Sub Urban 25
4. Income
Type: Numeric
Statistic Value
Minimum 5014.21
Maximum 63130.1
Mean 27655.498
StdDev 12956.7
5. Married
Type: Nominal
Label Count
Yes 202
No 98
6. Children
Type: Nominal
Label Count
Yes 171
No 129
7. Car
Type: Nominal
Label Count
Yes 147
No 153
8. Mortgage
Type: Nominal
Label Count
Yes 105
No 195
9. Pep
Type: Nominal
Label Count
Yes 138
No 162
3. Run the SimpleKMeans clustering algorithm on the dataset.
i. How many clusters are created?
2 clusters are created.
ii. What are the number of instances and percentage figures in each cluster?
Clustered instances
0 172 (57%)
1 128 (43%)
iii. What is the number of iterations that were required?
The number of iterations required is 3.
iv. What is the sum of squared errors? What does it represent?
Within cluster sum of squared errors: 615.6202745877614
It represents the sum of the squared distances between each instance and the centroid of
its cluster; a smaller value indicates more compact clusters.
Missing values globally replaced with mean/mode.
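To show what the within-cluster sum of squared errors measures, here is a small Python sketch on made-up two-dimensional points. It uses plain Euclidean distance on raw values; the figure Weka reports also depends on how it scales numeric attributes and handles nominal ones, which this sketch does not try to reproduce.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def within_cluster_sse(clusters):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        centre = centroid(points)
        total += sum(squared_distance(point, centre) for point in points)
    return total


# Hypothetical clustering of five points into two clusters.
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],                # centroid (1, 2)
    [(5.0, 5.0), (7.0, 5.0), (6.0, 8.0)],    # centroid (6, 6)
]
sse = within_cluster_sse(clusters)
print(sse)
```

The first cluster contributes 1 + 1 = 2 and the second 2 + 2 + 4 = 8, giving 10 in total. K-means tries to choose the centroids and assignments that make this total small.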
v. Tabulate the characteristics of the centroid of each cluster.
Cluster centroids (with the number of clusters set to 3)
Attribute  Full Data   Cluster 0   Cluster 1   Cluster 2
           (300)       (123)       (99)        (78)
Age        43.57       40.6179     46.6869     40.4231
Sex        M           F           F           M
Region     INNER_CITY  INNER_CITY  TOWN        INNER_CITY
Income     27655.4981  26439.1302  30579.2044  25862.7588
Married    Yes         Yes         No          No
Children   Yes         No          Yes         No
Car        No          Yes         No          Yes
Mortgage   No          No          No          Yes
vi. Visualize the results of this clustering (let the x-axis represent the cluster name, and
the y-axis represent the instance number).
Select the Cluster tab.
Choose>>SimpleKMeans, then click on the text box beside Choose.
Number of clusters: set to 3
Select the cluster mode: Classes to clusters evaluation
Click Start.
In the result list, right click on the result>>Visualize cluster assignments
1) Is there a significant variation in age between clusters?
No, there is not much variation in age between clusters (the centroid ages are 40.62, 46.69 and 40.42).
2) Which clusters are predominated by males and which by females?
Males predominate in cluster 2.
Females predominate in clusters 0 and 1.
3) What can be said about the values of the region attribute in each cluster?
Clusters 0 and 2 are centred on INNER_CITY and cluster 1 on TOWN.
4) What can be said about the variation of income between clusters?
The mean income of cluster 0 minus that of cluster 1 is -4140.0742.
The mean income of cluster 1 minus that of cluster 2 is 4716.4456.
5) Which clusters are dominated by married people and which clusters are
dominated by unmarried people?
Cluster 0 is dominated by married people; clusters 1 and 2 are
dominated by unmarried people.
6) How do the clusters differ with respect to the number of children?
Cluster 0: No
Cluster 1: Yes
Cluster 2: No
7) Which cluster has the highest number of people with cars?
Cluster 0
8) Which clusters are predominated by people with savings accounts?
Cluster 2
9) What can be said about the variation of current accounts between clusters?
For cluster 0 and 1 – No
For cluster 1 and 2 – Yes
10) Which clusters comprise mostly people who buy the PEP product and
which ones comprise mostly people who do not buy the PEP product?
Class attribute: PEP
Classes to clusters:
0 1 2   <-- assigned to cluster
42 58 38 | Yes
81 41 40 | No
Cluster 1 consists mostly of PEP buyers (58 of 99); clusters 0 and 2 consist mostly of non-buyers.
6. Using WEKA to determine association rules
Aim:
To learn to use association algorithms on datasets
Requirements:
1) Perform the following tasks:
1. Define the following terms:
a. Item and Item Set:
Item: An item is treated as a binary variable whose value is one if the item is present in a
transaction and zero otherwise.
Item set: An item set is a collection of one or more items.
b. Association:
An association is a relationship between items that tend to occur together in the same
transactions. WEKA provides a collection of tools for mining such relationships as association rules.
c. Association rule:
Association rules are used to find connections between data items in a
transactional database.
Association data mining algorithms are used to discover frequent associations.
d. Support of an association rule:
The support of an association pattern is the percentage of task-relevant data
tuples for which the pattern is true.
For association rules of the form "A ⇒ B",
where A and B are sets of items, it is defined as
\[ Support(A \Rightarrow B) = \frac{\text{number of tuples containing both } A \text{ and } B}{\text{total number of tuples}} \]
e. Confidence of an association rule:
Confidence is a certainty measure for association rules of the form "A ⇒ B", where A and B are
sets of items:
\[ Confidence(A \Rightarrow B) = \frac{\text{number of tuples containing both } A \text{ and } B}{\text{number of tuples containing } A} \]
f. Large item set:
A large item set is an item set that meets a minimum support threshold; such item sets are
also referred to as frequent item sets. A collection of large item sets represents a potential
wealth of information, given adequate methods of transforming the data into meaningful
information.
g. Association rule problem:
Given a set of transactions T, Find out the rules having support≥minsup and
confidence≥minconf, where minsup and minconf are the corresponding support
and confidence thresholds
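The support and confidence definitions can be tried out on a toy database. The four transactions below are an assumption made for this sketch (they are not taken from any dataset used in this manual), and the function names are illustrative.

```python
# Hypothetical transaction database: each transaction is a set of item numbers.
TRANSACTIONS = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing both sides of the rule."""
    both = support_count(antecedent | consequent, transactions)
    return both / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)


def is_strong(antecedent, consequent, transactions, minsup, minconf):
    """A rule is strong when it meets both thresholds."""
    return (support(antecedent, consequent, transactions) >= minsup
            and confidence(antecedent, consequent, transactions) >= minconf)


print(support({2, 3}, {5}, TRANSACTIONS))      # 2 of 4 transactions
print(confidence({2, 3}, {5}, TRANSACTIONS))   # 2 of the 2 transactions with {2,3}
print(confidence({2}, {3, 5}, TRANSACTIONS))   # 2 of the 3 transactions with {2}
```

With minsup = 0.5 and minconf = 0.9, the rule {2,3} ⇒ {5} is strong on this toy data, while {2} ⇒ {3,5} is not because its confidence is only 2/3.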
2. What is the purpose of the Apriori algorithm?
Apriori is an influential algorithm for mining frequent item sets for Boolean
association rules. It uses a "bottom-up" approach, where frequent subsets are extended one
item at a time (candidate generation). Its main properties are:
• It uses the large item set property (every subset of a large item set is also large)
• It is easily parallelized
• It is easy to implement
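The bottom-up idea can be sketched in Python as a level-wise search: count the candidates of size k, keep the frequent ones, join them into candidates of size k + 1, and prune any candidate that has an infrequent subset. This is a simplified teaching sketch, not Weka's Apriori class; the transactions and the min_count parameter are assumptions chosen for the example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {item set: support count} for every frequent item set."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]   # candidate 1-item sets
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those meeting the minimum support count.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: combine frequent k-item sets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset of a candidate must be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


# Hypothetical transaction database (sets of item numbers).
TRANSACTIONS = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
]

frequent = apriori(TRANSACTIONS, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this toy data, item 4 is dropped at the first level, so no larger candidate containing it is ever generated; that pruning is what the large item set property buys.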

1. Load the ‘vote.arff’ dataset
2. Apply the Apriori Association rule.
Select Associate>>Choose>>Apriori
Start
Minimum support: 0.45 (196 instances)
Minimum metric < confidence >: 0.9
Number of cycles performed: 11
3) What is the support threshold used? What is the confidence threshold used?
Minimum support: 0.45 (196 instances)
Minimum metric <confidence>:0.9
4) Write down the top 6 rules along with the support and confidence values?
Rule 1: {2,3} → {5}
Confidence = support count of {2,3,5} / support count of {2,3} = 2/2 = 100%
Rule 2: {2,5} → {3}
Rule 3: {3,5} → {2}
Rule 4: {2} → {3,5}
Confidence = support count of {2,3,5} / support count of {2} = 2/3 = 67%
Rule 5: {3} → {2,5}
Rule 6: {5} → {2,3}

5) What does the figure to the left of the arrow in the association rule represent?
Figure to the left of the arrow in the association rule represents antecedent.
6) What does the figure to the right of the arrow in the association rule represent?
Consequent.
7) For Rule 8, verify the numerical values used for computation of support and confidence
is in accordance with the data by using the preprocess panel. Then compute the support.
7.
i. Load the dataset ‘weather.nominal.arff’
ii. Apply the Apriori Association rule.
1) Consider the rule
"Temperature = hot ⇒ Humidity = normal".
Compute the support and confidence for this rule.
\[ Confidence(A \Rightarrow B) = P(B \mid A) = \frac{Support\_Count(A \cup B)}{Support\_Count(A)} \]
Temperature = hot occurs in 4 of the 14 instances: with Humidity = high in 3 cases and with
Humidity = normal in 1 case.
Confidence = 1/4 = 25%
Support = 1/14 ≈ 7%
2) Consider the rule "Temperature = hot, Humidity = high ⇒ Windy = true". Compute the
support and confidence for this rule.
In weather.nominal.arff, Temperature = hot and Humidity = high occur together in 3 instances,
and only 1 of them has Windy = true.
Confidence = 1/3 ≈ 33%
Support = 1/14 ≈ 7%
3) Is it possible to have a rule like the following rule?

"Outlook = sunny, Temperature = cool ⇒ Humidity = normal, Play = yes"
Yes, a rule may have more than one item on each side, but no strong rule of this form exists for this dataset.
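The three rules in this exercise can be checked by counting directly. The sketch below is plain Python rather than a Weka call; the rows are transcribed from the standard 14-instance weather.nominal.arff, and rule_stats is a helper name made up for the example.

```python
COLUMNS = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3, "play": 4}
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def count(conditions):
    """Number of instances matching every attribute=value pair in conditions."""
    return sum(
        1 for row in WEATHER
        if all(row[COLUMNS[name]] == value for name, value in conditions.items())
    )


def rule_stats(antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    both = count({**antecedent, **consequent})
    return both / len(WEATHER), both / count(antecedent)


s1, c1 = rule_stats({"temperature": "hot"}, {"humidity": "normal"})
s2, c2 = rule_stats({"temperature": "hot", "humidity": "high"}, {"windy": "TRUE"})
s3, c3 = rule_stats({"outlook": "sunny", "temperature": "cool"},
                    {"humidity": "normal", "play": "yes"})

for name, s, c in [("rule 1", s1, c1), ("rule 2", s2, c2), ("rule 3", s3, c3)]:
    print(f"{name}: support = {s:.3f}, confidence = {c:.3f}")
```

Each rule is backed by a single instance, so all three have a support of 1/14. The third rule has high confidence but, with such low support, it would not pass a typical minimum support threshold.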
a) Load the ‘bank-data.csv’ file

b) Apply the Apriori association rule algorithm. What is the result? Why?
If we apply Apriori directly from the Associate panel, it will not produce any
output because the data is not nominal. Instead we must follow these steps:
Select Preprocess>>Choose>>filters>>unsupervised>>attribute>>NumericToNominal>>Apply.
Then select Associate>>Start. Association rules are generated.
c) Apply the supervised discretization filter to the age and income attributes.
Select Preprocess>>Choose>>filters>>supervised>>attribute>>Discretize
Click on the text box beside Choose; in attributeIndices change first-last to 2,5, since in the given
dataset attribute 2 is age and attribute 5 is income.
Select OK>>Apply
Associate>>Start to run the Apriori algorithm.

d) Run the Apriori rule algorithm:
In Association output, strong rules have been generated by selecting the option Apriori.
e) List the results that were generated?
Best/strong rules found:
1. Children = 0, Save.act = Yes, Mortgage = No, PEP = No (74) ⇒ Married = Yes (73); confidence = 0.99
2. Sex = Female, Children = 0, Mortgage = No, PEP = No (64) ⇒ Married = Yes; confidence = 0.98
3. Children = 0 (32) ⇒ Married = Yes; confidence = 0.98
4. Children = 0, Married = No, PEP = No (107) ⇒ Married (104); confidence = 0.97
5. Children = 0, Car = No, PEP = No (62) ⇒ Married (60); confidence = 0.97
6. Married = Yes, Children = 0, Save.act = Yes (87) ⇒ PEP = No; confidence = 0.92
7. Married = Yes, Children = 0, Save.act = Yes, Mortgage = No ⇒ PEP = No (73); confidence = 0.91
8. Married = Yes, Children = 0, Current.act = Yes, Mortgage = No ⇒ PEP = No; confidence = 0.91
9. Sex = Female, Married = Yes, Children = 0, Mortgage = No (70) ⇒ PEP = No (63); confidence = 0.9
