WEKA Manual
2. State the names of the attributes along with their types and values.
4. In the histogram on the bottom-right, which attributes are plotted on the X,Y-axes? How
do you change the attributes plotted on the X,Y-axes?
By selecting an attribute in the Attributes panel, we can visualize that attribute in the
histogram at the bottom right of the tool.
5. How will you determine how many instances of each class are present in the data?
Click on the ‘play’ attribute. The Selected attribute panel then shows the count for each
class: yes = 9 instances, no = 5 instances.
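4. What is the difference between the two types of filters? What is the difference
between an attribute filter and an instance filter?
In WEKA, filters are used to preprocess the data, and they can be found under the
package weka.filters. Each filter falls into one of the following two categories:
supervised filters, which take the class attribute into account, and unsupervised
filters, which do not. Within each category, an attribute filter adds, removes, or
transforms attributes (columns), while an instance filter adds, removes, or transforms
instances (rows).
So, WEKA filters are used to manipulate a dataset into a form that can be
processed by a classifier or a clusterer.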
The Visualize panel consists of a plot matrix. The rows are sepal length, sepal width,
petal length, petal width, and class. The columns are class, petal width, petal length,
sepal width, and sepal length.
For successful data mining we must know our data. WEKA’s Visualize panel
lets us look at a dataset and select different attributes, preferably
numeric ones, for the X and Y axes.
Instances are shown as points, with different colors for different classes.
We can sweep out a rectangle and focus the dataset on the points inside it.
We can also apply a classifier and visualize the errors it makes by plotting the “class”
against the “predicted class”.
3. Select one panel in the Visualizer and experiment with the buttons on the panel.
Observation:
Graph has been changed with respect to color and design.
3. Classification using the WEKA tool kit.
Aim:
To perform classification on datasets using the WEKA machine learning toolkit.
Requirements:
1. Load the “weather.nominal.arff” dataset into WEKA and run ID3 classification
algorithm.
Gain(outlook): 0.247
Gain(temp): 0.029
Gain(humidity): 0.152
Gain(wind): 0.048
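The gain values above can be checked by hand. The following is a minimal Python sketch, not WEKA's own code; it assumes the standard 14-instance weather.nominal data shipped with WEKA (the rows are typed in here, they are not listed in this manual) and computes the information gain of each attribute with respect to ‘play’. The names ROWS, entropy and information_gain are illustrative.

```python
from collections import Counter
from math import log2

# Assumed contents of weather.nominal.arff (standard 14-instance dataset).
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index, class_index=-1):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    base = entropy([r[class_index] for r in rows])
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[class_index] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES[:-1])}
for name, gain in gains.items():
    print(f"Gain({name}): {gain:.3f}")
```

Under that assumption the largest gain belongs to outlook, which is why ID3 places it at the root of the tree.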
3. What is the relationship between the attribute entropy values and nodes of
decision tree?
The entropy values and the information gain derived from them decide which attribute
becomes the deciding factor at each node of the decision tree: the attribute with the
highest information gain is chosen as the split.
4. Draw the confusion matrix. What information does the confusion matrix
provide?
A confusion matrix is a technique for summarizing the performance of a
classification algorithm.
Classification accuracy alone can be misleading if we have an unequal
number of observations in each class or if we have more than two classes in
the dataset.
Calculating a confusion matrix gives us a better idea of what our
classification model is getting right and what types of errors it is making.
Under Test options select "Use training set", choose (Nominal) Outlook as the class
attribute, and select Start. The confusion matrix is obtained as follows in the WEKA tool.
a  b  c    Classified as
4  1  0  | a = sunny
0  4  0  | b = overcast
3  0  2  | c = rainy
5. Describe the following quantities:
TP Rate:
Rate of true positives (instances correctly classified as a given class)
FP Rate:
Rate of false positives (instances falsely classified as a given class)
Precision:
Number of instances that are truly of a class divided by the total number of instances
classified as that class
Recall:
Number of instances correctly classified as a given class divided by the actual total in
that class (equivalent to TP rate)
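These four quantities can be derived per class from a confusion matrix. The sketch below is illustrative Python, not WEKA output; it uses the 3x3 matrix shown in question 4 (rows are actual classes, columns are predicted classes), and the function name per_class_metrics is an assumption made for this example.

```python
LABELS = ["sunny", "overcast", "rainy"]
# Rows = actual class, columns = predicted class (matrix from question 4).
MATRIX = [
    [4, 1, 0],
    [0, 4, 0],
    [3, 0, 2],
]


def per_class_metrics(matrix, k):
    """TP rate, FP rate, precision and recall for class index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                   # actual k, predicted as something else
    fp = sum(row[k] for row in matrix) - tp    # predicted k, actually something else
    tn = total - tp - fn - fp
    tp_rate = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"tp_rate": tp_rate, "fp_rate": fp_rate,
            "precision": precision, "recall": tp_rate}


accuracy = sum(MATRIX[i][i] for i in range(len(MATRIX))) / sum(map(sum, MATRIX))
print(f"accuracy = {accuracy:.3f}")
for k, label in enumerate(LABELS):
    m = per_class_metrics(MATRIX, k)
    print(label, {name: round(value, 3) for name, value in m.items()})
```

Recall is returned as the same value as the TP rate, matching the definition above.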
4. Performing data preprocessing tasks for data mining in WEKA.
Aim:
To learn how to use various data preprocessing methods as part of data mining.
Requirements:
Applications of Discretization filters
4. Which is the class attribute and what are the characteristics of this attribute?
a. The class attribute is Class:
Class
S.No  Label     Count
1     Negative  3541
2     Sick      231
The class attribute is the attribute that the class index of the data Instances object
points to, i.e. the attribute to be predicted.
5. How many attributes are numeric? What are the attribute indices of the numeric attributes?
a.
1 Age
18 TSH
20 T3
22 TT4
24 T4U
26 FTI
28 TBG
The 7 attributes above are numeric.
Statistics of the numeric attributes:
Name  Type     Minimum  Maximum  Mean     StdDev
Age   Numeric  18       87       56.347   19.687
TSH   Numeric  0.03     45       3.02     6.986
T3    Numeric  0.3      5.5      1.992    0.87
TT4   Numeric  39       199      109.729  34.903
T4U   Numeric  0.56     1.55     0.957    0.18
FTI   Numeric  33       190      113.711  31.824
TBG   Numeric  NaN      NaN      NaN      NaN
5. Performing clustering using the data mining toolkit
Aim:
To learn to use clustering techniques
Requirements:
1. Age
Type: Numeric
Statistic Value
Minimum 18
Maximum 67
Mean 42.57
StdDev 14.22
2. Sex
Type: Nominal
Label Count
Male 154
Female 146
3. Region
Type: Nominal
Label Count
Inner city 137
Rural 51
Town 87
Suburban 25
4. Income
Type: Numeric
Statistic Value
Minimum 5014.21
Maximum 63130.1
Mean 27655.498
StdDev 12956.7
5. Married
Type: Nominal
Label Count
Yes 202
No 98
6. Children
Type: Nominal
Label Count
Yes 171
No 129
7. Car
Type: Nominal
Label Count
Yes 147
No 153
8. Mortgage
Type: Nominal
Label Count
Yes 105
No 195
9. Pep
Type: Nominal
Label Count
Yes 138
No 162
3. Run the simple K-means clustering algorithm on the dataset.
i. How many clusters are created?
2 clusters are created
ii. What are the number of instances and percentages figures in each cluster?
Clustered instances
0 172 (57%)
1 128 (43%)
iii. What is the number of iterations that were required?
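The number of iterations required is 3.
iv. What is the sum of squared errors? What does it represent?
Within cluster sum of squared errors: 615.6202745877614
It represents the sum, over all instances, of the squared distance between each instance
and the centroid of the cluster it is assigned to. A smaller value means tighter clusters.
WEKA also reports: Missing values globally replaced with mean/mode.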
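To make the meaning of the within-cluster sum of squared errors concrete, here is a small Python sketch on hypothetical two-dimensional points (the points and assignments are invented for illustration and are not taken from the bank data). WEKA's reported figure depends on how it scales attributes and handles nominal values, which this sketch does not attempt to mimic.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def squared_distance(p, c):
    """Squared Euclidean distance between point p and centroid c."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def within_cluster_sse(points, assignments):
    """Sum of squared distances from each point to its own cluster centroid."""
    clusters = {}
    for point, label in zip(points, assignments):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(squared_distance(p, c) for p in members)
    return total


# Hypothetical toy data: two clusters of two points each.
points = [(1, 1), (1, 3), (5, 5), (7, 5)]
assignments = [0, 0, 1, 1]
print(within_cluster_sse(points, assignments))
```

Here the centroids are (1, 2) and (6, 5), each point lies at squared distance 1 from its centroid, and the total is 4.0.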
v. Tabulate the characteristics of the centroid of each cluster.
Cluster centroids
Attribute  Full Data    Cluster 0    Cluster 1    Cluster 2
           (300)        (123)        (99)         (78)
Age        43.57        40.6179      46.6869      40.4231
Sex        M            F            F            M
Region     INNER_CITY   INNER_CITY   TOWN         INNER_CITY
Income     27655.4981   26439.1302   30579.2044   25862.7588
Married    Yes          Yes          No           No
Children   Yes          No           Yes          No
Car        No           Yes          No           Yes
Mortgage   No           No           No           Yes
vi. Visualize the results of this clustering (let the x-axis represent the cluster name, and
the y-axis represent the instance number).
Select the Cluster tab.
Click Choose and select SimpleKMeans, then click on the algorithm name in the bar beside Choose to open its options.
Set the number of clusters to 3.
Select the cluster mode: Classes to clusters evaluation.
Click Start.
In the result list, right-click the result and select Visualize cluster assignments.
1) Is there a significant variation in age between clusters?
No, there is not much variation in age between clusters: the centroid ages are 40.62,
46.69 and 40.42.
2) Which clusters are predominated by males and which clusters are predominated by females?
Cluster 2 is predominantly male.
Clusters 0 and 1 are predominantly female.
3) What can be said about the values of the region attribute in each cluster?
Clusters 0 and 2 are centered on INNER_CITY; cluster 1 is centered on TOWN.
4) What can be said about the variation of income between clusters?
The difference in mean income between clusters 0 and 1 is -4140.0742.
The difference in mean income between clusters 1 and 2 is 4716.4456.
5) Which clusters are dominated by married people and which clusters are
dominated by unmarried people?
Cluster 0 is dominated by married people; clusters 1 and 2 are dominated by
unmarried people.
6) How do the clusters differ with respect to the number of children?
Cluster 0: No
Cluster 1: Yes
Cluster 2: No
7) Which cluster has the highest number of people with cars?
Cluster 0
8) Which clusters are predominated by people with savings accounts?
Cluster 2
9) What can be said about the variation of current accounts between clusters?
For cluster 0 and 1 – No
For cluster 1 and 2 – Yes
10) Which clusters comprise mostly of people who buy the PEP product and
which ones are comprised of people who do not buy the PEP product?
Class Attribute: PEP
Classes to clusters:
0   1   2    Assigned to cluster
42  58  38 | Yes
81  41  40 | No
6. Using WEKA to determine association rules
Aim:
To learn to use association algorithms on datasets
Requirements:
1) Perform the following tasks:
1. Define the following terms:
a. Item and Item Set:
Item: An item is a binary variable whose value is one if the item is present in a
transaction and zero otherwise.
Item set: An item set is a set of one or more items.
b. Association:
Association is the data mining task of discovering relationships between items that
occur together in a dataset. WEKA provides a collection of tools for carrying out
data mining with association rules.
c. Association rule:
Association rules are applied to find the connections between data items in a
transactional database.
Association data mining algorithms are used to discover frequent associations.
d. Support of an association rule:
The support of an association pattern refers to the percentage of task-relevant data
tuples for which the pattern is true.
For association rules of the form “A ⇒ B”,
where A and B are sets of items, it is defined as
$$\text{Support}(A \Rightarrow B) = \frac{\#\text{ tuples containing both } A \text{ and } B}{\text{total } \#\text{ of tuples}}$$
e. Confidence of an association rule:
Confidence is a certainty measure for association rules of the form “A ⇒ B”, where A
and B are sets of items.
$$\text{Confidence}(A \Rightarrow B) = \frac{\#\text{ tuples containing both } A \text{ and } B}{\#\text{ tuples containing } A}$$
f. Large item set:
A large collection of item sets represents a potential wealth of information, given
adequate methods of transforming the data into meaningful information. Item sets
that meet a minimum support threshold are referred to as large (frequent) item sets.
g. Association rule problem:
Given a set of transactions T, find the rules having support ≥ minsup and
confidence ≥ minconf, where minsup and minconf are the corresponding support
and confidence thresholds.
2. What is the purpose of the Apriori algorithm?
It uses the large item set property.
It is easily parallelized.
It is easy to implement.
The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean
association rules. Apriori uses a “bottom-up approach”, where frequent subsets are extended one
item at a time (candidate generation).
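The bottom-up idea can be sketched in a few lines of Python. This is a simplified illustration of level-wise frequent item set mining, not WEKA's Apriori implementation; the transactions and the function name apriori are hypothetical, and rule generation is left out.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the item set.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1 candidates: every single item that occurs in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that reach the minimum support.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(sub) in level for sub in combinations(c, k - 1))}
    return frequent


# Hypothetical market-basket transactions.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
result = apriori(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is where the large item set property is used: a candidate is dropped without counting if any of its subsets is already known to be infrequent.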
3) What is the support threshold used? What is the confidence threshold used?
Minimum support: 0.45 (196 instances)
Minimum metric <confidence>: 0.9
4) Write down the top 6 rules along with the support and confidence values?
5) What does the figure to the left of the arrow in the association rule represent?
The figure to the left of the arrow in the association rule represents the antecedent.
6) What does the figure to the right of the arrow in the association rule represent?
The figure to the right of the arrow in the association rule represents the consequent.
7) For Rule 8, verify the numerical values used for computation of support and confidence
is in accordance with the data by using the preprocess panel. Then compute the support.
7.
i. Load the dataset ‘weather.nominal.arff’
ii. Apply the Apriori Association rule.
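2) Consider the rule "Temperature = hot, Humidity = high ⇒ Windy = TRUE". Compute the
support and confidence for this rule.
Temperature = hot, Humidity = high ⇒ Windy = TRUE
In the 14-instance dataset, 3 instances have Temperature = hot and Humidity = high, and
only 1 of them also has Windy = TRUE. Support = 1/14 ≈ 7%; confidence = 1/3 ≈ 33%.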
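The support and confidence of such a rule can be counted directly from the data. The Python sketch below assumes the standard 14-instance weather.nominal data shipped with WEKA (typed in here, not listed in this manual); the helper names matches and support_and_confidence are illustrative and not part of WEKA.

```python
# Assumed contents of weather.nominal.arff (standard 14-instance dataset).
NAMES = ("outlook", "temperature", "humidity", "windy", "play")
RAW = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
ROWS = [dict(zip(NAMES, values)) for values in RAW]


def matches(row, conditions):
    """True if the row satisfies every attribute = value condition."""
    return all(row[attr] == value for attr, value in conditions.items())


def support_and_confidence(rows, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent."""
    n_antecedent = sum(1 for r in rows if matches(r, antecedent))
    n_both = sum(1 for r in rows
                 if matches(r, antecedent) and matches(r, consequent))
    support = n_both / len(rows)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


support, confidence = support_and_confidence(
    ROWS,
    antecedent={"temperature": "hot", "humidity": "high"},
    consequent={"windy": "TRUE"},
)
print(f"support = {support:.3f}, confidence = {confidence:.3f}")
```

Under that assumption, one instance out of 14 satisfies both sides of the rule and three instances satisfy the antecedent, so the support is 1/14 and the confidence is 1/3.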
If we apply the Apriori association rule directly from the Associate tab, it will not produce any
output because not all attributes are nominal. Instead we must follow these steps:
c) Apply the supervised discretization filter to the age and income attributes.
Click on the bar beside Choose; in attribute indices change first-last to 2,5, since in the given
dataset attribute 2 is age and attribute 5 is income.
Select OK>>Apply
In the Associate output, strong rules are then generated by selecting the option Apriori.