A new Fuzzy and Hard Clustering Software Tool kit

Mariana Soffer, M. Daniela López De Luise


AIGroup. Universidad de Palermo. Buenos Aires. Argentina.
marianasoffer@gmail.com
lopezdeluise@yahoo.com.ar

Abstract

This paper presents the FuzzyToolkit prototype, which has a flexible software design for handling fuzzy and hard clustering algorithms. It provides modules that allow easy data input/output manipulation; multiple functionalities and clustering algorithms, including the calculation of clustering-performance evaluations; parameter selection via meta-heuristics; and 3D result visualizations. The toolkit is compared here against other open source clustering tools, and a very short review of some of its graphic options is included.

1. Introduction

Clustering [1] is an unsupervised machine learning technique that consists in assigning the items of a dataset to a certain number of groups. These groups may or may not overlap. The methods where groups do not overlap are known as hard clustering algorithms [4]; in this case each instance belongs to exactly one group. The methods where groups are able to overlap are generally referred to as fuzzy clustering algorithms; in this less constrained case [2] [3], items do not belong exclusively to one cluster but are associated with a value that indicates the strength of their membership in each group.

This paper presents a functional prototype developed in the Java programming language, called FuzzyToolkit. This working prototype implements several fuzzy and non-fuzzy clustering algorithms. The remaining sections are organized as follows: section 2 describes the main functionality and structure of the toolkit; section 3 presents a case study showing its usage and a comparison of the FuzzyToolkit with other open source clustering software (xPloRe [9] and Weka [10]); the results are presented in section 4; and section 5 includes conclusions and the future development plans for the toolkit.

2. FuzzyToolkit design

The overall architecture of the FuzzyToolkit was designed to be very flexible. Object-oriented design patterns such as Observer, Singleton, Strategy, Factory and Delegator were implemented in it. The objective of these patterns is to allow adding new clustering algorithms, distance metrics and visualization procedures as natural extensions of the current system.

The main components of the FuzzyToolkit architecture are:
1. Distance function
2. Clustering algorithm
3. Meta learning
4. Dimension reduction approach
5. Visualization strategy
6. Support function

Each one has been implemented as a class or interface hierarchy following a generic-to-specific approach.

2.1. Architecture

The prototype implements a specific hierarchy for each of the main components. There is a set of standard algorithms codified as extensions of the generic modules. Any alternate algorithm should be implemented as a class extending an original component. In the next sections, the main hierarchies are described.

2.1.1. Distance function

This class defines the specific metrics that can be used to evaluate distances between dots, where each dot represents a particular instance. The interface is inherited by two main abstract classes whose respective goals are:
a) A generic function to calculate real-number distances.
b) A generic function to calculate nominal distances.
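The real-distance side of this hierarchy can be sketched as a small interface with concrete metric subclasses. The names below (`Distance`, `EuclideanDistance`, `ManhattanDistance`) are invented for illustration; the actual FuzzyToolkit class names are not shown in this paper.

```java
// Hypothetical sketch of the distance-function hierarchy described in
// Section 2.1.1; all names are invented, not the toolkit's real API.
interface Distance {
    double between(double[] a, double[] b);
}

class EuclideanDistance implements Distance {
    public double between(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d; // squared difference per dimension
        }
        return Math.sqrt(sum);
    }
}

class ManhattanDistance implements Distance {
    public double between(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]); // absolute difference per dimension
        }
        return sum;
    }
}
```

A new metric would be added as one more `implements Distance` class, which is the extension mechanism the design-pattern choice aims for.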
Each one is further extended by implementing a specific distance calculation according to the type of data. Distances such as Manhattan, Euclidean and Chebyshev [4] are already implemented in the FuzzyToolkit; all of them are extensions of the generalized real distance. In the case of nominal distances, binary and correlation-based distances are already implemented as extensions of the generalized nominal distance.

Table 1
FuzzyToolkit Structure

2.1.2. Clustering algorithm

This class covers several approaches to clustering methods. It has three extended basic classes:
a) A fuzzy clustering general algorithm class for items composed of fields with real values.
b) A hard clustering general algorithm class for items composed of fields with real values.
c) A hard clustering general algorithm class for items composed of fields with nominal values.

From each one, several specific algorithms are implemented according to different approaches:
- Among the implementations of the general fuzzy clustering algorithms are CMeans [3], FuzzyKMeans [6] and GKCmeans [3].
- As implementations of hard clustering algorithms there are KMeans [4], FarthestFirst [4] and KMedoids [4].
- Some of the nominal hard clustering algorithms are KMeans and KMedoids.

2.1.3. Meta learning

The meta-learning component looks for the best parameter for an algorithm under certain restrictions. Therefore it needs a clustering method. It also requires the specification of the domain of the parameter to be optimized; in some cases the step increment value should also be specified. With all this information it executes the algorithm several times, changing the parameter value, and then chooses and returns the best solution. To do this it uses an objective function to be either minimized or maximized, according to the specific parameter.

This meta-learning was extended into the following more specific classes:
a) A general meta-learning approach for hard clustering algorithms with nominal values.
b) A general meta-learning algorithm for fuzzy clustering algorithms with real values.

2.1.4. Dimension reduction approach

The dimensionality of a model has a big impact on the solution itself, but it can also become a problem when the number of dimensions becomes difficult to manage. Hence, there must be strategies to reduce the dimension of the problem without losing model accuracy.

The toolkit has the ability to reduce the dataset dimensionality using different methodologies, such as PCA and Gain Ratio. This is useful for high-dimensionality datasets, since these methodologies can transform the data into a chosen number of dimensions or factors, losing the minimum amount of information in the process. It is also good for visualization of the results, reducing a large number of variables to either 2 or 3 factors.

There are two classes that extend the general dimension reduction class:
a) Dimensionality reduction for the graphical interface.
b) Dimensionality reduction for clustering processing.

The first one was designed just for better visibility of the solution in the result graphics, whereas the second is intended to reduce the amount of information handed to the clustering algorithms. Among the algorithms implemented for the latter case are PCA [7], RedRelief [5] and Gain Ratio [5].

2.1.5. Visualization strategy

This covers all the classes related to the graphical interface for showing clustering results. The main classes defined in this category are:
a) An algorithm to graph results in 2D.
b) An algorithm to graph results in 2D, also showing cluster radii as circles/ellipses.
c) An algorithm to graph results in 3D.
d) An algorithm to graph results in 3D with the circular or ellipsoidal cluster limits depicted.
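The meta-learning loop of Section 2.1.3 amounts to a one-dimensional sweep over the parameter domain, keeping the value with the best objective score. A minimal sketch, with invented names (`MetaLearner`, `bestParameter`) since the toolkit's real API is not shown in the paper:

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical sketch of the Section 2.1.3 meta-learning sweep: run the
// clustering once per candidate value in [min, max] (stepped by `step`)
// and keep the value minimizing the objective (e.g. the resulting SSE).
class MetaLearner {
    static int bestParameter(int min, int max, int step, IntToDoubleFunction objective) {
        int best = min;
        double bestScore = Double.POSITIVE_INFINITY;
        for (int k = min; k <= max; k += step) {
            double score = objective.applyAsDouble(k); // e.g. SSE after clustering with k
            if (score < bestScore) {
                bestScore = score;
                best = k;
            }
        }
        return best;
    }
}
```

A maximization variant would simply negate the objective; this matches the paper's statement that the objective is "either minimized or maximized according to the specific parameter".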
2.1.6. Support objects

In this part of the hierarchy there are several methods that implement minor functions of data manipulation. Each one is a kind of interface between items and the implemented algorithms. A short list of them:
- Data normalization, useful to balance dimensions when needed.
- Result calculation upon applying a certain clustering algorithm. The result can be a cluster number (in hard strategies) or a membership value (in fuzzy approaches). Other information is also part of the result: the clusters' maximum and average radii, the SSE [4], the number of clusters and other measures to evaluate the goodness of the calculation.
- Data uploading from a CSV file.
- Result saving into a CSV file.
- Retrieval of previous results and the dataset from a CSV file.
- Learned clustering model saving.
- Saved clustering model retrieval from file.
- New instance classification using a retrieved model.
- Confusion matrix calculation for comparison between a saved clustering result and a newly calculated clustering result.

It is important to note that these algorithms give no guarantee of an optimal result. In general, the related parameters require careful research, performed either manually or with meta-heuristics.

3. Case Study

A. Data set specification

The Iris Plants Database was created by R.A. Fisher and donated by Michael Marshall in 1988. It is perhaps the best-known database to be found in the pattern recognition literature; Fisher's paper [8] is a classic in the field and is referenced frequently to this day. The data set contains 150 instances of the iris plant. Each instance belongs to one of 3 classes (Iris-Setosa, Iris-Versicolor and Iris-Virginica), and in this particular dataset each class contains exactly 50 plants.

Each instance reflects information for 4 attributes:
- Sepal length, ranging from 4.3 to 7.9 cm.
- Sepal width, ranging from 2 to 4.4 cm.
- Petal length, ranging from 1 to 6.9 cm.
- Petal width, ranging from 0.1 to 2.5 cm.

B. Experiment description

The KMeans algorithm is tested with two open source programs and with the FuzzyToolkit in order to compare their efficiency and accuracy.

For this experiment the Euclidean distance is used, since all of the software packages are able to compute with it. The same parameter values are set for all the tested software; those parameters that do not exist in the other tools are adjusted to provide the best possible outcome.

The resulting confusion matrices of each algorithm, along with the percentage of incorrectly classified instances, are compared in order to evaluate the tests. To do that, the hand-made classification of the instances is taken as a reference. The resulting cluster-to-species association can be easily obtained by performing a simple exploratory analysis.

As part of the test, the meta-algorithm is used to search for the best number of clusters for this dataset. This parameter (ranging between 2 and 10) is obtained by minimizing the resulting SSE value. As will be shown in the next section, 3 is the optimal number of clusters chosen by the FuzzyToolkit meta-algorithm, which is the real number of species or classes in the dataset (Iris-Setosa, Iris-Versicolor and Iris-Virginica).

All the tests are compared with the WEKA software, an open source tool developed by the Waikato University of New Zealand. Its distribution already comes with the iris dataset and the correctly assigned class of each instance. The other software used is xPloRe.

Whereas WEKA can only display results in 2D, the FuzzyToolkit graphic module can be used to display the solution in 2D and 3D (like other frameworks such as SPSS). The GUI can also rotate the graph, adjust the size and customize the graphic type, among other functionalities, in a similar fashion to SPSS. In WEKA there is an important number of features to visualize the graph in several ways. Finally, in the prototype the type of data does not require extra effort, whereas xPloRe requires an extra module to show the resulting clustering visualization graph.

4. Results

The dataset was processed with xPloRe. A pre-processing of the iris dataset that comes with the standard distribution of WEKA was required in order to make it suitable for xPloRe. A small script was developed to load the data, execute the KMeans algorithm and extract the solution. Since this software does not have the ability to visualize clustering graphically, no graph is shown here. The related confusion matrix is depicted in Table 2.
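The cluster-to-species association used in this evaluation can be recovered by a simple majority vote: each cluster is matched to the true class that occurs most often inside it. A sketch with invented names (this is not the toolkit's actual code):

```java
// Hypothetical sketch of the "simple exploratory analysis" mapping
// clusters to species: for each cluster, pick the true class that
// occurs most often among the instances assigned to it.
class ClusterMapping {
    static int[] majorityClassPerCluster(int[] cluster, int[] truth, int k, int classes) {
        // counts[c][t] = instances of true class t assigned to cluster c
        int[][] counts = new int[k][classes];
        for (int i = 0; i < cluster.length; i++) {
            counts[cluster[i]][truth[i]]++;
        }
        int[] mapping = new int[k];
        for (int c = 0; c < k; c++) {
            int best = 0;
            for (int t = 1; t < classes; t++) {
                if (counts[c][t] > counts[c][best]) best = t;
            }
            mapping[c] = best;
        }
        return mapping;
    }
}
```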
Table 2
Confusion Matrix for xPloRe KMeans

The same processing was repeated with WEKA. The corresponding confusion matrix is shown in Table 3, and the graphical display of the tool, with the cluster assignments, is included in Table 4.

Table 3
Confusion Matrix for WEKA KMeans

Table 4
Graph of clustering results in WEKA

The meta-heuristic module was tested first, in order to demonstrate that it could select the best number of clusters for this dataset. The selection criterion was the minimization of the SSE (Sum of Squared Errors), the parameter to be tuned was the number of clusters, and the domain specified for it ranged between 2 and 10 clusters. The result indicated that the best number of groups was 3.

The next test performed was KMeans, in order to compare results on the same data with 3 clusters, sticking to the Euclidean distance, setting the epsilon parameter to 0.01 and the number of iterations to 999. The results are in Table 5.

Table 5
Confusion Matrix for FuzzyToolkit KMeans

As can be seen from the previous figures, all software package outputs include the cluster assignment of each instance and the centroid of each cluster. WEKA also provides the confusion matrix of the clustered dataset when the real classes are provided, including the percentage of misclassified instances and the percentage of instances that belong to each cluster.
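Once the clusters have been matched to classes, the misclassification percentage reported by these tools is just the off-diagonal mass of the confusion matrix. A small sketch (invented names, and a confusion-matrix layout assumed to have true classes as rows and matched clusters as columns):

```java
// Hypothetical sketch: percentage of incorrectly classified instances
// from a confusion matrix whose rows are true classes and whose columns
// are the clusters already matched to those classes. Everything off the
// main diagonal counts as a misclassification.
class Misclassification {
    static double percentIncorrect(int[][] confusion) {
        int total = 0, correct = 0;
        for (int i = 0; i < confusion.length; i++) {
            for (int j = 0; j < confusion[i].length; j++) {
                total += confusion[i][j];
                if (i == j) correct += confusion[i][j];
            }
        }
        return 100.0 * (total - correct) / total;
    }
}
```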
FuzzyToolkit provides the previously mentioned information plus some additional evaluation measures, used to assess how good the results are and to describe each cluster. For the hard clustering methods, these measures include (among others):
- SSE: sum of the squared errors of the items of each cluster.
- Inter-cluster distance: sum of the squared distances between each pair of cluster centroids.
- Intra-cluster distance for each cluster: sum of the squared distances from the items of each cluster to its centroid.
- Maximum radius: largest distance from an instance to its cluster centroid.
- Average radius: sum of the largest distances from an instance to its cluster centroid, divided by the number of clusters.

The FuzzyToolkit can also produce additional graphics. The system applies the PCA dimensionality reduction algorithm in order to show the same data in two alternate 3D visualizations, shown in Table 6 and Table 7 respectively.

Table 6
Graph I of clustering results in FuzzyToolkit

Table 7
Graph II of clustering results in FuzzyToolkit

5. Conclusions

As can be seen from the confusion matrices produced by each software package, the accuracy of the FuzzyToolkit is better than that of xPloRe and similar to that of WEKA. This higher accuracy is provided in part by the possibility of adjusting more parameters, such as epsilon (the minimum change of the objective function below which the algorithm stops) and the maximum number of iterations allowed. These parameters can be used together or separately.

The meta-algorithm method can provide a good approximation of the required parameter value. It is useful in cases where the exact number of clusters is unknown and not obvious.

The FuzzyToolkit is a more flexible tool: it performs cluster-number selection automatically based on the desired parameters; it allows tuning more parameters than the other programs; it gives more evaluation measures, enabling better judgment of clustering results; it lets the user select a dimensionality reduction algorithm to graph the clustering assignments; and it provides much better visualizations of the obtained results.

Other advantages not shown in this paper are the number of options this software provides, which can be combined as desired, such as:
- 20 different clustering algorithms.
- 10 different distance functions.
- 5 different dimensionality reduction algorithms.
- 4 kinds of elaborate and interactive visualizations.
- Retrieving different kinds of datasets.
- Storing the data along with the clustering results.
- Comparing different algorithms.
- Storing and retrieving clustering models.
- Easily extending the toolkit.

The prototype clustering methods provided results as good as, and in some cases better than, those of the open source software packages. This holds for each algorithm tested.

In the future we will test whether fuzzy algorithms can outperform their analogous non-fuzzy ones; implement fuzzy clustering with nominal data; implement new algorithms; and finally develop clustering algorithms able to evaluate datasets containing instances with both nominal and real attributes.

6. References

[1] Johnson R., Wichern D. "Applied Multivariate Statistical Analysis". 5th ed. Prentice Hall Publishers. 2002.
[2] Tran D., Sharma D. "Generalized Fuzzy Clustering". University of Canberra, Australia.
[3] Bezdek J. "Pattern Recognition with Fuzzy Objective Function Algorithms". Plenum Press. 1981.
[4] Witten I. H., Frank E. "Data Mining: Practical Machine Learning Tools and Techniques". 2nd ed. Morgan Kaufmann Publishers. 2005.
[5] Hall M., Holmes G. "Benchmarking Attribute Selection Techniques for Data Mining". University of Waikato. 2000.
[6] Krishnapuram R., Joshi A., Yi L. "A Fuzzy Relative of the k-Medoids Algorithm with Application to Web Document and Snippet Clustering". IEEE International Fuzzy Systems Conference. 1999.
[7] Smith L. "A Tutorial on Principal Component Analysis". Wiley and Sons. 2002.
[8] Fisher R. "The Use of Multiple Measurements in Taxonomic Problems". Annals of Eugenics. 1936.
[9] Härdle W. "xPloRe: The Statistical Computing Environment". Method and Data Technologies. 2008.
[10] Waikato University. "Data Mining Software in Java". Waikato University. 2008.
