Demonstration of WEKA Tool
VNRVJIET,
DATA MINING LABORATORY
IT Department
EXP #1:
EXPLORE CONTACT LENS DATA SET.
contact-lens.arff
The contact-lens.arff dataset is a database for fitting contact lenses. It was donated by Benoit Julien in 1990.
Database: This database is complete and noise-free. It has 24 instances and 4 attributes.
Attributes: All four attributes are nominal. There are no missing attribute values.
The attributes and their values are:
age: young, pre-presbyopic, presbyopic
spectacle prescription: myope, hypermetrope
astigmatism: no, yes
tear production rate: reduced, normal
Class Distribution: The instances that are classified into class labels are listed below:
1. Hard contact lenses: 4
2. Soft contact lenses: 5
3. No contact lenses: 15
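For reference, the header of such an ARFF file looks roughly like this (a sketch based on the standard contact-lenses dataset; exact attribute spellings may differ slightly in your copy of WEKA):

```
@relation contact-lenses

@attribute age {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip {myope, hypermetrope}
@attribute astigmatism {no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}

@data
young, myope, no, reduced, none
```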
EXP #2:
EXPLORE IRIS DATASET WITH WEKA TOOL.
iris.arff
The iris.arff dataset is the Iris Plants database; it was donated to the UCI repository by Michael Marshall in 1988.
Database: This database is used for pattern recognition. The data set contains 3 classes of 50 instances each, where each class represents a type of iris plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other. The task is to predict to which of the 3 iris species an observation belongs, so this is a multi-class classification dataset.
Attributes: It has 4 numeric, predictive attributes, and the class. There are no missing attribute values.
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class:
Iris Setosa
Iris Versicolour
Iris Virginica
Summary Statistics:
                 Min  Max  Mean  SD    Class Correlation
sepal length:    4.3  7.9  5.84  0.83   0.7826
sepal width:     2.0  4.4  3.05  0.43  -0.4194
petal length:    1.0  6.9  3.76  1.76   0.9490 (high!)
petal width:     0.1  2.5  1.20  0.76   0.9565 (high!)
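The per-attribute statistics that WEKA displays can be reproduced by hand; a minimal sketch (the four sample values below are illustrative, not the full 150-instance iris data):

```python
import statistics

# Illustrative sepal-length values (cm), not the full iris dataset
values = [5.1, 4.9, 6.3, 7.9]

stats = {
    "Min": min(values),
    "Max": max(values),
    "Mean": round(statistics.mean(values), 2),
    "SD": round(statistics.stdev(values), 2),  # sample standard deviation
}
print(stats)
```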
EXP #3:
Explore CREDIT Dataset with respect to Weka Tool. Answer the
following questions.
Answer:
Follow the steps enlisted below to use WEKA for identifying real values and nominal
attributes in the dataset.
#2) Select the “Pre-Process” tab. Click on “Open File”. As a WEKA user, you can access the WEKA sample files.
#3) Select the input file from the WEKA 3.8 folder stored on the local system. Select the predefined .arff file “credit-g.arff” and click on “Open”.
#4) An attribute list will open on the left panel. Selected attribute statistics will be shown on
the right panel along with the histogram.
In the right panel, the selected attribute statistics are displayed. Select the attribute
“checking_status”.
It shows:
Type: The attribute is of the nominal type, that is, it does not take any numeric value.
Count: Among the 1000 instances, the count of each distinct value is given in
the count column.
Histogram: It displays the output class label for the attribute. The class label in
this dataset is either good or bad. There are 700 instances of good (marked in blue) and
300 instances of bad (marked in red).
For the label “<0”, the numbers of instances with decision good and bad are almost the same.
For the label “0<=X<200”, instances with decision good outnumber those with
decision bad.
Similarly, for the label “>=200” most instances are good, and the “no checking”
label has the most instances with decision good.
Unique: It has 5 unique values, i.e., values that occur only once in the dataset.
The class is the classification feature of the nominal type. It has two distinct values: good
and bad. The good class label has 700 instances and the bad class label has 300 instances.
#5) To find out only numeric attributes, click on the Filter button. From there, click
on Choose -> weka -> filters -> unsupervised -> attribute -> RemoveType.
WEKA filters have many functionalities to transform the attribute values of the dataset to
make it suitable for the algorithms. For example, the numeric transformation of attributes.
Filtering the nominal and real-valued attributes from the dataset is another example of using
WEKA filters.
#6) Click on the RemoveType in the filter tab. An object editor window will open. Select
attributeType “Delete numeric attributes” and click on OK.
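Conceptually, RemoveType drops every attribute of a given type. A small sketch of that idea in Python (the attribute names here are illustrative, not the real credit-g schema):

```python
# Illustrative (name, type) schema, not the actual credit-g attribute list
attributes = [
    ("checking_status", "nominal"),
    ("duration", "numeric"),
    ("credit_history", "nominal"),
    ("credit_amount", "numeric"),
    ("class", "nominal"),
]

def remove_type(attrs, type_to_delete):
    """Keep every attribute whose type differs from type_to_delete."""
    return [(name, t) for name, t in attrs if t != type_to_delete]

print(remove_type(attributes, "numeric"))
```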
The class attribute is of the nominal type. It classifies the output and hence is not deleted; it is therefore still shown alongside the numeric attributes.
Output:
The real-valued and nominal values attributes in the dataset are identified. Visualization with
the class label is seen in the form of histograms.
EXP #4:
DEMONSTRATE THE WEKA DECISION TREE
CLASSIFICATION ALGORITHMS FOR
WEATHER.NOMINAL DATASET.
Now, we will see how to implement decision tree classification on
WEATHER.NOMINAL.ARFF dataset using the J48 classifier.
WEATHER.NOMINAL.ARFF
It is a sample dataset present in the data directory of WEKA. This dataset predicts if the weather is
suitable for playing cricket. The dataset has 5 attributes and 14 instances. The class label
“play” classifies the output as “yes” or “no”.
A decision tree is a classification technique that consists of three components: the root node,
branches (edges or links), and leaf nodes. The root represents the test condition on an attribute,
each branch represents a possible outcome of that test, and each leaf node holds the label of the
class to which an instance belongs. The root node is at the start of the tree, which is also
called the top of the tree.
J48 Classifier
It is WEKA's implementation of the C4.5 algorithm (an extension of ID3) for generating a decision tree.
It is also known as a statistical classifier. For decision tree classification, we need a database.
Steps include:
#2) Select weather.nominal.arff file from the “choose file” under the preprocess tab option.
#3) Go to the “Classify” tab for classifying the unclassified data. Click on the “Choose”
button. From this, select “trees -> J48”. Let us also have a quick look at other options in
the Choose button:
#4) Click on the Start button. The classifier output will be seen in the right-hand panel. It
shows the run information in the panel as:
The number of leaves and the size of the tree, which describe the decision tree.
The full J48 pruned tree with the attributes and number of
instances.
#5) To visualize the tree, right-click on the result in the Result list and select “Visualize tree”.
Output:
The output is in the form of a decision tree. The root attribute is “outlook”.
If the outlook is sunny, then the tree further tests the humidity. If humidity is high, the
class label is play = “no”; if humidity is normal, play = “yes”.
If the outlook is overcast, the class label play is “yes”. The number of instances which obey
the classification is 4.
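The tree J48 typically learns for this dataset can be written out as plain conditional logic; a sketch (note the rainy branch, which tests windiness, is included here for completeness):

```python
def play(outlook, humidity, windy):
    """Sketch of the decision tree J48 typically learns for weather.nominal."""
    if outlook == "sunny":
        # Sunny days depend on humidity
        return "no" if humidity == "high" else "yes"
    if outlook == "overcast":
        # Overcast days are always playable
        return "yes"
    # outlook == "rainy": the decision depends on wind
    return "no" if windy else "yes"

print(play("overcast", "high", False))  # -> "yes"
```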
Conclusion
WEKA offers a wide range of sample datasets to apply machine learning algorithms. Users
can perform machine learning tasks such as classification, regression, attribute
selection, and association on these sample datasets, and can also learn the tool using them.
WEKA explorer is used for performing several functions, starting from preprocessing.
Preprocessing takes input as a .arff file, processes the input, and gives an output that can be
used by other computer programs. In WEKA the output of preprocessing gives the attributes
present in the dataset which can be further used for statistical analysis and comparison with
class labels.
WEKA also offers many decision tree classification algorithms. J48 is one of the popular
classification algorithms and outputs a decision tree. Using the Classify tab the user can
visualize the decision tree. If the decision tree is too populated, it can be simplified
from the Pre-process tab by removing the attributes which are not required and starting the
classification process again.
EXP #5:
DEMONSTRATE KNN CLASSIFIER FOR THE IONOSPHERE
DATASET USING WEKA.
Ionosphere Dataset
The Ionosphere Dataset is a classic machine learning dataset. The problem is to predict the
presence (or not) of free electron structure in the ionosphere given radar signals. It
comprises 17 pairs of real-valued radar signals (34 attributes) and a single class attribute
with two values: good and bad radar returns.
You can read more about this problem on the UCI Machine Learning Repository page for the
Ionosphere dataset.
The IBk algorithm does not build a model, instead it generates a prediction for a test instance
just-in-time. The IBk algorithm uses a distance measure to locate k “close” instances in the
training data for each test instance and uses those selected instances to make a prediction.
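That just-in-time prediction can be sketched as follows (a generic k-NN sketch with Euclidean distance and majority vote, not WEKA's own code; the training points are made up for illustration):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label) pairs.
    Predict by majority vote among the k training instances closest to query."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(train, key=lambda inst: dist(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy two-class training data (illustrative, not the Ionosphere dataset)
train = [((0.0, 0.0), "bad"), ((0.1, 0.2), "bad"),
         ((5.0, 5.0), "good"), ((5.1, 4.9), "good"), ((4.8, 5.2), "good")]
print(knn_predict(train, (5.0, 5.1)))  # -> "good"
```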
In this experiment, we are interested in finding which distance measure to use in the IBk
algorithm on the Ionosphere dataset. We will add 3 versions of this algorithm to our
experiment:
Euclidean Distance
This will add the IBk algorithm with Euclidean distance, the default distance measure.
Manhattan Distance
This will add the IBk algorithm with Manhattan Distance, also known as city block distance.
Chebyshev Distance
This will add the IBk algorithm with Chebyshev Distance, also known as chessboard
distance.
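The three distance measures differ only in how per-attribute differences are combined; a self-contained sketch:

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences ("city block")
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest single-attribute difference ("chessboard")
    return max(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))  # 5.0 7.0 4.0
```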
4. Run Experiment
This tab is the control panel for running the currently configured experiment.
Click the big “Start” button to start the experiment and watch the “Log” and “Status” sections
to keep an eye on how it is doing.
5. Review Results
Algorithm Rank
The first thing we want to know is which algorithm was the best. We can do that by ranking
the algorithms by the number of times a given algorithm beat the other algorithms.
1. Click the “Select” button for the “Test base” and choose “Ranking“.
The ranking table shows the number of statistically significant wins each algorithm has had
against all other algorithms on the dataset. A win means an accuracy that is better than the
accuracy of another algorithm, where the difference was statistically significant.
We can see that the Manhattan Distance variation is ranked at the top and the Euclidean
Distance variation is ranked at the bottom. This is encouraging; it looks like we have
found a configuration that is better than the algorithm default for this problem.
Algorithm Accuracy
1. Click the “Select” button for the “Test base” and choose the “IBk” algorithm with
“Manhattan Distance” in the list and click the “Select” button.
In the “Test output” we can see a table with the results for 3 variations of the IBk algorithm.
Each algorithm was run 10 times on the dataset, and the accuracy reported is the mean of those
10 runs, with the standard deviation in brackets.
Table of algorithm classification accuracy on the Ionosphere dataset in the Weka Explorer
We can see that IBk with Manhattan Distance achieved an accuracy of 90.74% (+/- 4.57%)
which was better than the default of Euclidean Distance that had an accuracy of 87.10% (+/-
5.12%).
The little “*” next to the result for IBk with Euclidean Distance tells us that the accuracy
results for the Manhattan Distance and Euclidean Distance variations of IBk were drawn
from different populations, i.e., that the difference in the results is statistically significant.
We can also see that there is no “*” for the results of IBk with Chebyshev Distance indicating
that the difference in the results between the Manhattan Distance and Chebyshev Distance
variations of IBk was not statistically significant.
Summary
In this experiment you discovered how to configure a machine learning experiment with one dataset
and three variations of an algorithm in Weka. You discovered how you can use the Weka
experimenter to tune the parameters of a machine learning algorithm on a dataset and analyse
the results.
EXP #6:
DEMONSTRATE THE CLUSTERING ALGORITHM FOR
IRIS DATASET USING WEKA.
A clustering algorithm finds groups of similar instances in the entire dataset. WEKA
supports several clustering algorithms such as EM, FilteredClusterer, HierarchicalClusterer,
SimpleKMeans and so on. You should understand these algorithms completely to fully
exploit the WEKA capabilities.
As in the case of classification, WEKA allows you to visualize the detected clusters
graphically. To demonstrate the clustering, we will use the provided iris database. The data
set contains three classes of 50 instances each. Each class refers to a type of iris plant.
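As an illustration of what a clusterer such as SimpleKMeans computes, here is a generic 1-D k-means sketch (not WEKA's implementation, and the data points are made up):

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their previous center
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious groups around 1 and 10
print(kmeans_1d([0.9, 1.1, 1.0, 9.8, 10.2, 10.0], centers=[0.0, 5.0]))
```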
Loading Data
In the WEKA explorer select the Preprocess tab. Click on the Open file ... option and select
the iris.arff file in the file selection dialog. When you load the data, the screen looks as
shown below −
You can observe that there are 150 instances and 5 attributes. The names of attributes are
listed as sepallength, sepalwidth, petallength, petalwidth and class. The first four
attributes are of numeric type while the class is a nominal type with 3 distinct values.
Examine each attribute to understand the features of the database. We will not do any
preprocessing on this data and straight-away proceed to model building.
Clustering
Click on the Cluster TAB to apply the clustering algorithms to our loaded data. Click on
the Choose button. You will see the following screen −
Click on the Start button to process the data. After a while, the results will be presented on
the screen.
Examining Output
If you scroll up the output window, you will also see some statistics that give the mean and
standard deviation for each of the attributes in the various detected clusters. This is shown in
the screenshot given below −
Visualizing Clusters
To visualize the clusters, right click on the EM result in the Result list. You will see the
following options −
As in the case of classification, you will notice the distinction between the correctly and
incorrectly identified instances. You can play around by changing the X and Y axes to
analyze the results. You may use jittering as in the case of classification to find out the
concentration of correctly identified instances. The operations in visualization plot are
similar to the one you studied in the case of classification.
To demonstrate the power of WEKA, let us now look into an application of another
clustering algorithm. In the WEKA explorer, select the HierarchicalClusterer as your ML
algorithm as shown in the screenshot below −
Notice that in the Result list, there are two results listed: the first one is the EM result and
the second one is the current Hierarchical. Likewise, you can apply multiple ML algorithms
to the same dataset and quickly compare their results.
If you examine the tree produced by this algorithm, you will see the following output −
EXP #7:
EXPLAIN THE PROCESS OF DATA PREPROCESSING IN
WEKA.
Step 1: Data Pre-Processing or Cleaning
3. Click on the PreProcess tab, then in the lower right-hand bottom window
click on the drop-down arrow and choose “No Class”.
4. Click on the “Edit” tab; a new window opens up that will show you the loaded
data file. By looking at your dataset you can also find out whether there are missing
values in it or not. Also note the attribute types on the column header. Each
would be either ‘nominal’ or ‘numeric’.
1) If your data has missing values, then it's best to clean it first before you apply any form of
mining algorithm to it. Look below at Figure 1: the highlighted fields
are blank, which means the data at hand is dirty and it first needs to be cleaned.
Figure 1
2) Data Cleaning: To clean the data, you apply “Filters” to it. Generally the data will have
missing values, so the filter to apply is “ReplaceMissingWithUserConstant” (the filter
choice may vary according to your need; for more information please consult the
resources). Click on the Choose button below Filters -> unsupervised -> attribute ->
ReplaceMissingWithUserConstant.
Refer below to Figure 2 to see how to edit the filter values.
Figure: 2
A good choice for replacing missing numeric values is a constant like -1 or 0; for
string values it could be NULL. Refer to Figure 3.
Figure: 3
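The effect of ReplaceMissingWithUserConstant can be sketched as follows (the column names, rows, and chosen constants here are illustrative, not taken from the actual dataset):

```python
# Illustrative rows with missing values marked as None
rows = [
    {"duration": 24, "purpose": "radio/tv"},
    {"duration": None, "purpose": None},
]

def replace_missing(rows, numeric_default=-1, string_default="NULL"):
    """Fill each missing value with a user-chosen constant,
    based on whether the column is numeric or string."""
    numeric_cols = {"duration"}  # assumed numeric columns
    filled = []
    for row in rows:
        filled.append({
            col: (numeric_default if col in numeric_cols else string_default)
                 if val is None else val
            for col, val in row.items()
        })
    return filled

print(replace_missing(rows))
```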
It’s worthwhile to also know how to check the total number of data values or instances in
your dataset.
Refer to Figure: 4.
Figure: 4
As you can see in Figure 4, the number of instances is 345446. The reason to note this is
that later, when we apply clustering to this data, your Weka software may crash with an
“OutOfMemory” problem.
It logically follows that we need to partition or sample the dataset so that we
have a smaller data content which Weka can process. For this, again, we use the Filter
option.
4.3 Sampling the Dataset: Click Filters -> unsupervised -> instance, and then you can choose any of
the following options below
3. RemoveWithValues
4. Resample
5. ReservoirSample
To know about each of these, place your mouse cursor on its name and you will see a tooltip
that explains it.
For this dataset I'm using the ‘ReservoirSample’ filter. In my experiments I have found that
Weka is unable to handle sample sizes equal to or greater than 999999. Therefore, when you
are sampling your data, I suggest choosing a sample size less than or equal to
9999. The default value of the sample size is 100. Change it to 9999 as shown below in
Figure 5, and then click on the Apply button to apply the filter to the dataset. Once the filter has
been applied, if you look at the Instances value, also shown in Figure 6, you will see that the
sample size is now 9999 as compared to the previous complete instance count of 345446.
Figure: 5
Figure: 6
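ReservoirSample implements the classic reservoir-sampling idea: one pass over the data, fixed memory, and a uniform sample of fixed size. A sketch:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # Replace an existing element with probability k/(i+1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(100000), k=10)
print(sample)
```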
If you now click on the “Edit” tab at the top of the explorer screen, you will see the dataset
cleaned. All missing values have been replaced with your user-specified constants. See
below at Figure 7. Congratulations! Step 1 of data pre-processing or cleaning has been
completed.
Figure: 7
It’s always a good idea to save the cleaned dataset. To do so, click on the save button as
shown below in Figure: 8.
Figure: 8