Exploring K-Means with Internal Validity Indexes for Data Clustering in Traffic Management System
Shamim Akhter et al., Ahsanullah University of Science & Technology
Abstract—Traffic Management System (TMS) is used to improve traffic flow by integrating information from different data repositories and online sensors, detecting incidents, and taking actions on traffic routing. In general, two decision-making systems, weight updating and forecasting, are integrated inside the TMS. These models need numerous data sets to make appropriate decisions. To determine the dynamic road weights in TMS, four (4) different environmental attributes are considered, which are directly or indirectly related to increasing traffic jams: rainfall, temperature, wind, and humidity. In addition, peak hour is taken as an additional attribute. Usually, the data sets are classified intuitively. However, optimum classification of the data sets is vital to improve the decision accuracy of the TMS. The collected data sets have no class labels, and thus cluster-based unsupervised classifications (partitioning, hierarchical, grid-based, density-based) can be used to find the optimum number of classes in each attribute, which is expected to improve the performance of the TMS. The two most popular and frequently used classifiers are hierarchical clustering and partition clustering. K-means is simple, easy to implement, and its clustering results are easy to interpret. It is also faster, because its time complexity is linear in the number of data points. Thus, in this paper we demonstrate the performance of partition k-means and hierarchical k-means, implemented with the Davies-Bouldin Index (DBI), Dunn Index (DI), and Silhouette Coefficient (SC) methods, to outline the optimal number of classes inside each attribute of the TMS data sets. Subsequently, the optimal classes are validated using within sum of squares (WSS) errors and correlation methods. The validation results conclude that k-means with DI performs better on all attributes of the TMS data sets and provides more accurate optimum classification numbers. Thereafter, the dynamic road weights for TMS are generated and classified using the combined k-means and DI method.

Keywords—Traffic Management System (TMS); Data Clustering; K-means; Hierarchical Clustering; Cluster Validation

I. INTRODUCTION

A new low-cost, flexible, maintainable, and secure internet-based traffic management system with real-time bi-directional communication was proposed and implemented (in [1][2][3][4]) to assist in reducing traffic congestion. To determine the dynamic road weights in TMS, four (4) different environmental attributes are considered: rainfall, temperature, wind, and humidity. Rainfall is one of the most influential weather attributes in determining road congestion in a metro city, as road segments are submerged by heavy rains, which slows traffic movement. The heat released from the engines and air-conditioners of stacked vehicles may raise the overall temperature of the area; thus, the current temperature helps to classify the traffic congestion status of a particular road segment. Gusts of wind have a direct influence on road safety, which also pushes vehicle movement slower. In addition, temperature, wind, and humidity have a direct influence on predicting future rainfall in a particular area. Peak hour is one of the most influential attributes causing traffic congestion in metro cities. Thus, these four (4) environmental attributes and peak hour have a direct or indirect relationship with traffic congestion as well as vehicle movement, which motivates choosing them as decision-making parameters.

The values of these attributes (features) are intelligently crawled by a search engine, with metadata indexing (title, description, keywords, etc.), directly from multiple data feeds (web sites, RSS feeds, web services, etc.) from the web page in [5]. Crawled data are simplified (structured) and stored in a historic table. However, the number of attributes can be changed according to the system requirements. We collect more than two (2) years, or 750 days (1/12/2006 to 20/12/2008), of data on five features from the web page in [5].

Initially, a decision tree (DT) [1][2][3] was used to classify road weights, and a weighted moving average analytic was implemented to estimate or predict feature values in the DT-based system [28][29], achieving 16.45% accuracy. However, the model's data sets were classified intuitively. Cluster-based classifications (K-means, Locality-Sensitive Hashing (LSH), etc.) can be used to find the optimum number of classes in each feature and can improve the performance of the TMS. With this hypothesis, we implement two unsupervised clustering techniques: partition k-means and hierarchical k-means. There are several methods (internal/external) to measure the similarity between two clustering steps, used to compare how well different clustering algorithms perform on a set of data. Only internal methods - Davies-Bouldin Index (DBI), Dunn Index (DI), and Silhouette Coefficient (SC) - are used to choose the optimum number of classes, as they do not require any external information. Subsequently, the optimal classes are cross-validated using statistical analytics: correlation and within sum of squares (WSS) errors.

Results highlight that the Dunn Index (DI) performs better for both partition k-means and hierarchical k-means algorithms by …
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 3, 2017
clusters are split (top-down view). There are two variants of hierarchical clustering methods (in Fig. 2): i) the Agglomerative Hierarchical Clustering algorithm (HAC), or AGNES (bottom-up approach), and ii) the Divisive Hierarchical Clustering algorithm (HDC), or DIANA (top-down approach). In this paper, we implement divisive hierarchical clustering to classify the feature data, as it has less computational cost compared to AGNES. For example, C3 = 4 input nodes → 3-6 (1 split into 2 subgroups, 5 & 6). The split is repeated until Ckmax is reached, where Ckmax is the maximum number of clusters; we stop the iteration when the optimal clustering number is reached.

B. Partitional Clustering

Partitional clustering determines a flat clustering into k clusters with minimal cost. It partitions the data set into k clusters and assigns each object to its nearest center. Here (in Fig. 4), k is the number of centroids.
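The partitional step just described (assign each object to its nearest center, then recompute centers) can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the 1-D feature values, the seed, and the mean-update rule are assumptions, and the nearest-center test uses the Manhattan distance d(x, y) from the paper's notation.

```python
import random

def kmeans(data, k, max_iter=100, seed=0):
    """Plain partitional k-means on 1-D feature values.

    Distance is Manhattan (|x - c|), matching the paper's d(x, y);
    centers are updated as cluster means."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)              # initial centroids
    for _ in range(max_iter):
        # assignment step: each object goes to its nearest center
        clusters = [[] for _ in range(k)]
        for x in data:
            i = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        # update step: recompute each center as its cluster mean
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:             # converged
            break
        centers = new_centers
    return centers, clusters
```

On well-separated data such as `[1, 2, 3, 50, 51, 52]` with `k=2`, the loop converges to centers 2.0 and 51.0 regardless of initialization.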
TABLE. I. SYMBOLS/NOTATIONS AND THEIR DESCRIPTIONS

| SL No | Symbol/Notation | Description |
|-------|-----------------|-------------|
| 1. | nc | Number of total clusters |
| 2. | Ci | i-th cluster |
| 3. | d(x,y) | Manhattan distance between two data elements |
| 4. | ni | Number of elements in the i-th cluster |
| 5. | | Value of the center of the i-th cluster |
| 7. | Si | Variance of the i-th cluster |
| 8. | Ckmax | Maximum number of clusters |
| 9. | d | Number of dimensions |
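With this notation, the Dunn index [22] used throughout the paper is the minimum inter-cluster distance divided by the maximum intra-cluster diameter; larger values mean compact, well-separated clusters. A minimal sketch, assuming 1-D feature values, non-degenerate clusters, and the Manhattan distance d(x, y) from the table above:

```python
def manhattan(x, y):
    return abs(x - y)  # d(x, y) for 1-D feature values

def dunn_index(clusters):
    """Dunn index: min inter-cluster distance / max intra-cluster
    diameter. Assumes at least one cluster has two distinct points."""
    # largest diameter over all clusters
    diam = max(max((manhattan(a, b) for a in c for b in c), default=0.0)
               for c in clusters)
    # smallest distance between points of different clusters
    sep = min(manhattan(a, b)
              for i, ci in enumerate(clusters)
              for cj in clusters[i + 1:]
              for a in ci for b in cj)
    return sep / diam
```

For two tight, distant clusters such as `[[1, 2], [10, 11]]` the separation is 8 and the diameter is 1, giving a Dunn index of 8.0.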
Fig. 6. Correlation

Here,
r = correlation of the data and its cluster,
Distance matrix D = {d11, d22, d33, …, dnn},
Incidence matrix C = {c11, c22, c33, …, cnn},
D̄ = mean of the distance matrix, and
C̄ = mean of the incidence matrix.
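This correlation validation can be sketched as follows. The reading is illustrative and hedged: it assumes C is the usual 0/1 incidence matrix (1 when two points share a cluster) and that D holds pairwise Manhattan distances, so a good clustering gives a strongly negative r, consistent with the negative values reported later in Table IV.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def cluster_correlation(data, labels):
    """r between the pairwise-distance matrix D and the incidence
    matrix C. Same-cluster pairs should have small distances, so a
    good clustering yields r close to -1."""
    d, c = [], []
    n = len(data)
    for i in range(n):
        for j in range(i + 1, n):
            d.append(abs(data[i] - data[j]))          # Manhattan distance
            c.append(1.0 if labels[i] == labels[j] else 0.0)
    return pearson(d, c)
```

For two clean 1-D clusters, e.g. `cluster_correlation([1, 2, 10, 11], [0, 0, 1, 1])`, r falls below -0.9.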
TABLE. II. OPTIMAL NUMBER OF CLUSTERS AND VALUE OF SSE OF RAINFALL, TEMPERATURE, WIND, HUMIDITY AND PEAK HOUR USING HDC WITH DB, DUNN AND SC INDICES

| Method | Rainfall k | SSE | Temperature k | SSE | Wind k | SSE | Humidity k | SSE | Peak hour k | SSE |
|--------|-----------|----------|---------------|---------|--------|---------|------------|----------|-------------|-----------|
| SC | 2 | 25051.32 | 2 | 3690.40 | 3 | 4850.62 | 3 | 31152.06 | 2 | 183452.98 |
| Optimal k | 2 | | 2 | | 3 | | 5 | | 4 | |
TABLE. III. OPTIMAL NUMBER OF CLUSTERS AND ITS VALUE OF SSE OF RAINFALL, TEMPERATURE, WIND, HUMIDITY AND PEAK HOUR USING K-MEANS WITH DB, DUNN AND SC INDICES

| Method | Rainfall k | SSE | Temperature k | SSE | Wind k | SSE | Humidity k | SSE | Peak hour k | SSE |
|--------|-----------|----------|---------------|---------|--------|---------|------------|----------|-------------|-----------|
| DB | 3 | 9574.97 | 2 | 3690.40 | 2 | 9301.24 | 3 | 23146.38 | 2 | 183452.98 |
| Dunn | 3 | 9574.97 | 3 | 2106.56 | 4 | 4657.67 | 6 | 6339.761 | 5 | 49541.539 |
| SC | 2 | 25051.32 | 2 | 3690.40 | 2 | 9301.24 | 3 | 23146.38 | 2 | 183452.98 |
| Optimal k | 3 | | 3 | | 4 | | 6 | | 5 | |
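The SSE (WSS) entries in Tables II and III, and the scan over candidate k values behind the "Optimal k" rows, can be sketched as below. This is a hedged illustration, not the paper's code: the helper names and the equal-stride `cluster_fn` used in the test are assumptions, and `index_fn` stands in for any score to maximize (e.g. the Dunn index; for DBI one would minimize instead).

```python
def wss(clusters):
    """Within sum of squares: for each cluster, the sum of squared
    deviations from the cluster mean (the SSE reported in the tables)."""
    total = 0.0
    for c in clusters:
        m = sum(c) / len(c)
        total += sum((x - m) ** 2 for x in c)
    return total

def pick_optimal_k(data, cluster_fn, index_fn, k_max):
    """Scan k = 2..k_max, cluster with cluster_fn(data, k), and keep
    the k that maximizes the validity index."""
    best_k, best_score = None, float("-inf")
    for k in range(2, k_max + 1):
        score = index_fn(cluster_fn(data, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

The WSS of `[[1, 2, 3], [10, 12]]` is (1 + 0 + 1) + (1 + 1) = 4.0; note that WSS alone always improves with more clusters, which is why the paper cross-checks it against the validity indexes.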
2) Result of K-means Clustering Method: The sum of square error (SSE) [26] of all features using the k-means clustering algorithm with the Davies-Bouldin index, Dunn index, and Silhouette index is presented in Table III. The shaded block (in Table III) indicates the minimum value of SSE. Table III reflects that the Dunn index provides the minimum value of the SSE for all features. Thus, we conclude that the Dunn index performs better for the k-means algorithm in finding optimal cluster numbers. The optimal classes of each feature using k-means are Rainfall (k=3), Temperature (k=3), Wind (k=4), Humidity (k=6), and Peak hour (k=5).

3) Comparison of HDC and K-means: Hierarchical clustering and k-means clustering are compared by computing the correlation on their optimal cluster numbers for each feature. It is clear from Table IV that the correlations of k-means are higher than the correlations of HDC, for all features. Thus, we conclude that k-means performs better than HDC.

TABLE. IV. OPTIMAL NUMBER OF CLUSTERS (USING DUNN INDEX) AND CORRELATION OF HDC AND K-MEANS

| Feature | HDC: Optimal k (Dunn) | HDC: Correlation | K-means: Optimal k (Dunn) | K-means: Correlation |
|-------------|-----|--------|-----|--------|
| Rainfall | 2 | -0.801 | 3 | -0.789 |
| Temperature | 2 | -0.736 | 3 | -0.721 |
| Wind | 3 | -0.580 | 4 | -0.405 |
| Humidity | 5 | -0.555 | 6 | -0.521 |
| Peak hour | 4 | -0.639 | 5 | -0.578 |

4) Optimal Cluster of Road Weight: From the previous experiments it is clear that k-means with the Dunn index performs better for all features. Thus, k-means with the Dunn index is chosen for the classification of the road weight. Table V shows the number of clusters of the road weight and the Dunn index value of the corresponding clustering. The table shows that the maximum value of the Dunn index is achieved at k=7. Thus, the optimal cluster size of the road weight is seven (7), and there should be seven (7) different types of classes for road weight updates. Table VI presents some sample experimental results of road weight updates.

[Figure: the Data block is clustered per feature (Rainfall, Temperature, Wind, Humidity, and Peak hour clusters), and the cluster labels are combined to generate the road weight.]

TABLE. V. CLUSTER SIZE AND DUNN INDEX VALUE OF THE ROAD WEIGHT

| No. of clusters k | Dunn index value |
|----|------|
| 2 | 0.08 |
| 3 | 0.09 |
| 4 | 0.10 |
| 5 | 0.11 |
| 6 | 0.11 |
| 7 | 0.13 |
| 8 | 0.12 |
| 9 | 0.12 |

TABLE. VI. SAMPLE EXPERIMENTAL RESULTS OF ROAD WEIGHT UPDATES

| SL | Rainfall | Temperature | Wind | Humidity | Peak hour | Road weight |
|----|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 3 | 1 | 0 |
| 2 | 0 | 1 | 0 | 4 | 3 | 5 |
| 3 | 0 | 0 | 0 | 3 | 2 | 2 |
| 4 | 0 | 0 | 0 | 3 | 3 | 5 |
| 5 | 0 | 2 | 1 | 4 | 0 | 4 |
| 6 | 0 | 2 | 2 | 4 | 1 | 6 |
| 7 | 0 | 2 | 0 | 3 | 1 | 0 |
| 8 | 0 | 2 | 1 | 4 | 3 | 3 |
| 9 | 0 | 2 | 0 | 3 | 0 | 4 |
| 10 | 0 | 2 | 1 | 5 | 0 | 4 |

VII. CONCLUSION AND FUTURE WORKS

In this section, we summarize our work. The feature data are collected from external feeds (web sites, RSS feeds, web services, etc.) for classification. We cluster the data using two approaches (partition k-means and hierarchical k-means) and find the optimal number of clusters for each feature using the Davies-Bouldin index, Dunn index, and Silhouette coefficient. Thereafter, a conclusion is drawn as to which algorithm is better for which feature data, and the optimal number of clusters of road weights is then found with the input of the measured five (5) feature clusters.
In future, we can also measure the validity of the classes by other probabilistic and statistical methods. The Dunn index method has a large computational cost; the computation cost and the error of the cluster-building procedure can be reduced using other statistical models. At present, we are not considering other characteristics of the environmental and road status, such as accidents, road works, etc. Roads and Highways authorities in Bangladesh do not provide/publish any road construction or maintenance status, and thus these attributes will be considered in our future research direction.

The online multi-data-feed capability allows the proposed model to be connected with different social media (Facebook, Twitter, etc.), to collect necessary information (mishaps, disaster situations), and to use analytical tools to make proper decisions. However, special consideration is required for internet security, as all of the information is available on the internet. Recently, deep learning (DL) techniques have also been used to solve unsupervised clustering problems. Interpretation of deep learning is much more complex than that of k-means. In addition, deep learning works with multi-layer data representations and sometimes degrades in performance due to a limited amount of data; it also faces the overfitting problem. Thus, a comparative study of simple k-means and DL is required and will be carried out in the near future.

Still, the proposed TMS is in the construction phase and covers small road networks. A broader, city-level area will be considered in the near future. A GSM- and GPS-based microcontroller with different embedded sensors is in the development phase. This device will help to collect real-time environmental data at an instant in time.

ACKNOWLEDGEMENTS

East West University (EWU), Bangladesh is gratefully acknowledged for providing the article processing charge (APC) to cover the costs of publication.

REFERENCES

[1] Rahman, M. R. and Akhter, S. 2015. Real time bi-directional traffic management support system with GPS and websocket. Proc. of the 15th IEEE International Conference on Computer and Information Technology (CIT-2015). Liverpool, UK. 26-28 Oct. 2015.
[2] Rahman, M. R. and Akhter, S. 2015. Bidirectional traffic management with multiple data feeds for dynamic route computation and prediction system. International Journal of Intelligent Computing Research (IJICR). Special Issue. Volume 7. Issue 2. ISSN: 2042 4655. Mar/2015. http://infonomics-society.ie/ijicr/
[3] Rahman, M. R. and Akhter, S. 2015. Bi-directional traffic management support system with decision tree based dynamic routing. Proc. of the 10th International Conference for Internet Technology and Secured Transactions, ICITST 2015. London, United Kingdom. December 14-16, 2015.
[4] Akhter, S., Rahman, M. R., and Islam, M. A. 2016. Neural Network (NN) based route computation for bi-directional traffic management system. International Journal of Applied Evolutionary Computation - special issue on Emerging Research Trend in Computing and Communication Technologies. Volume 7. Issue 4.
[5] AccuWeather. (2016, March 7). Retrieved from http://www.accuweather.com/en/bd/dhaka/28143/january-weather/28143?monyr=1%2F1%2F2016&view=table
[6] Moghadassi, F. Parvizian, and S. Hosseini. 2009. A new approach based on artificial neural networks for prediction of high pressure vapor-liquid equilibrium. Australian Journal of Basic and Applied Sciences. Vol. 3. No. 3. pp. 1851-1862. 2009.
[7] Asadi, R., Mustapha, N., Sulaiman, N. and Shiri, N. 2009. New supervised multi layer feed forward neural network model to accelerate classification with high accuracy. European Journal of Scientific Research. Vol. 33. No. 1. 2009. pp. 163-178.
[8] Pelliccioni, R. Cotroneo, and F. Pung. 2010. Optimization of neural net training using patterns selected by cluster analysis: a case-study of ozone prediction level. 8th Conference on Artificial Intelligence and its Applications to the Environmental Sciences, 2010.
[9] Kayri, M. and Çokluk, Ö. 2010. Using multinomial logistic regression analysis in artificial neural network: an application. Ozean Journal of Applied Sciences. Vol. 3. No. 2. 2010.
[10] Khan, U., Bandopadhyaya, T. K. and Sharma, S. 2009. Classification of Stocks Using Self Organizing Map. International Journal of Soft Computing Applications. Issue 4. pp. 19-24.
[11] Hagan, M. T., Demuth, H. B., and Beale, M. 1996. Neural Network Design. PWS Publishing Company, Boston, Massachusetts.
[12] Ray, S., and Turi, R. H. 1999. Determination of number of clusters in K-means clustering and application in color image segmentation. 4th International Conference on Advances in Pattern Recognition and Digital Techniques. pp. 137-143. Calcutta, India.
[13] Maulik, U., and Bandyopadhyay, S., "Performance Evaluation of Some Clustering Algorithms and Validity Indices", IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 24, Issue 12, pp. 1650-1654, December 2002.
[14] Steinbach, M., Karypis, G., and Kumar, V., "A Comparison of Document Clustering Techniques", 6th ACM SIGKDD World Text Mining Conference, 2000.
[15] Begum, S. F., Kaliyamurthie, K. P., and Rajesh, A., "Comparative Study of Clustering Methods over Ill-Structured Datasets using Validity Indices", Indian Journal of Science and Technology, Vol 9(12), March 2016.
[16] Reddy, C. K. and Vinzamuri, B., "A Survey of Partitional and Hierarchical Clustering Algorithms", Data Clustering: Algorithms and Applications. CRC Press, 2014, ISBN 978-1-46-655821-2, pp. 88-91.
[17] Hierarchical clustering algorithm, available at: https://sites.google.com/site/dataclusteringalgorithms/hierarchical-clustering-algorithm, 15-5-2016.
[18] Ray, S., and Turi, R. H., "Determination of Number of Clusters in K-Means Clustering and Application in Colour Image Segmentation", 4th International Conference on Advances in Pattern Recognition and Digital Techniques, pp. 137-143.
[19] K-means algorithm, available at: http://www.academia.edu/3438357/K_Means_Algorithm_Example, 12-4-2016.
[20] Kovács, F., Legány, C. and Babos, A., "Cluster Validity Measurement Techniques", AIKED'06 Proceedings of the 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (2006), ISBN: 111-2222-33-9, pp. 388-393.
[21] Davies, D. L. and Bouldin, D. W., "A Cluster Separation Measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), pp. 224-227.
[22] Dunn, J. C., "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", J. Cybernetics, vol. 3, 1973, pp. 32-57.
[23] Liu, Y., Li, Z., Xiong, H., Gao, X., and Wu, J., "Understanding of Internal Clustering Validation Measures", Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM '10), ISBN: 978-0-7695-4256-0, pp. 911-916.
[24] Zoubi, M., and Rawi, M., "An efficient approach for computing silhouette coefficients", Journal of Computer Science 4 (2008), pp. 252-255.
[25] Correlation definition, available at: www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_basics.pdf, 23-4-2016.
[26] Sum of square error, available at: https://hlab.stanford.edu/brian/error_sum_of_squares.html, 15-5-2016.
[27] Pelleg, D., Moore, A. W., "X-means: Extending K-means with Efficient Estimation of the Number of Clusters", Proceedings of the Seventeenth International Conference on Machine Learning (ICML '00), ISBN: 1-55860-707-2, pp. 727-734.
[28] ID3 Decision Tree Algorithm - Part 1, available at: http://www.codeproject.com/Articles/259241/ID-Decision-Tree-Algorithm-Part
[29] Rokach and Maimon, Decision Trees, available at: https://datajobs.com/data-science-repo/Decision-Trees-[Rokach-and-Maimon].pdf
[30] Liu, Y., Li, Z., Xiong, H., Gao, X., and Wu, J., "Understanding of Internal Clustering Validation Measures", Proc. of the IEEE International Conference on Data Mining, 2010. http://datamining.rutgers.edu/publication/internalmeasures.pdf
[31] Wang, Y., Chen, Y., Qin, M. and Zhu, Y., "Dynamic traffic prediction based on traffic flow mining", Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), vol. 2, pp. 6078-6081, Dalian, China, June 2006.
[32] Caceres, N., Romero, L. M. and Benitez, F. G., "Estimating traffic flow profiles according to a relative attractiveness factor", Procedia - Social and Behavioral Sciences, vol. 54, pp. 1115-1124, 2012.
[33] Guardiola, I. G., Leon, T. and Mallor, F., "A functional approach to monitor and recognize patterns of daily traffic profiles", Transportation Research, Part B: Methodological, vol. 65, pp. 119-136, 2014.
[34] Mirakhorli, A., "A Comparative Study: Utilizing Data Mining Techniques to Classify Traffic Congestion Status", UNLV Theses, Dissertations, Professional Papers, and Capstones. Paper 2197.