Exploring K-Means with Internal Validity Indexes for Data Clustering in Traffic Management System
Shamim Akhter et al., Ahsanullah University of Science & Technology
Abstract—Traffic Management System (TMS) is used to improve traffic flow by integrating information from different data repositories and online sensors, detecting incidents, and taking actions on traffic routing. In general, two decision-making systems, weight updating and forecasting, are integrated inside the TMS. These models need numerous data sets to make appropriate decisions. To determine the dynamic road weights in TMS, four (4) different environmental attributes are considered, which are directly or indirectly related to increasing traffic jams: rainfall, temperature, wind, and humidity. In addition, peak hour is taken as an additional attribute. Usually, the data sets are classified intuitively. However, optimum classification of the data sets is vital to improve the decision accuracy of the TMS. The collected data sets have no class labels, and thus cluster-based unsupervised classifications (partitioning, hierarchical, grid-based, density-based) can be used to find the optimum number of classes in each attribute, which is expected to improve the performance of the TMS. The two most popular and frequently used classifiers are hierarchical clustering and partition clustering. K-means is simple, easy to implement, and its clustering results are easy to interpret. It is also faster, because its time complexity is linear in the number of data points. Thus, in this paper we demonstrate the performance of partition k-means and hierarchical k-means, implemented with the Davies-Bouldin Index (DBI), Dunn Index (DI), and Silhouette Coefficient (SC) methods, to outline the optimal number of classes inside each attribute of the TMS data sets. Subsequently, the optimal classes are validated using within sum of squares (WSS) errors and correlation methods. The validation results conclude that k-means with DI performs better on all attributes of the TMS data sets and provides more accurate optimum classification numbers. Thereafter, the dynamic road weights for TMS are generated and classified using the combined k-means and DI method.

Keywords—Traffic Management System (TMS); Data Clustering; K-means; Hierarchical Clustering; Cluster Validation

I. INTRODUCTION

A new low-cost, flexible, maintainable, and secure internet-based traffic management system with real-time bi-directional communication was proposed and implemented (in [1][2][3][4]) to assist in reducing traffic congestion. To determine the dynamic road weights in TMS, four (4) different environmental attributes are considered: rainfall, temperature, wind, and humidity. Rainfall is one of the most influential weather attributes in determining road congestion in a metro city, as road segments are submerged by heavy rains, which slows traffic movement. The heat released from the engines and air-conditioners of stacked vehicles may raise the overall temperature of the area; thus, the current temperature helps to classify the traffic congestion status of a particular road segment. Gusts of wind have a direct influence on road safety, which also pushes vehicle movement slower. In addition, temperature, wind, and humidity have a direct influence on predicting future rainfall in a particular area. Peak hour is one of the most influential attributes causing traffic congestion in metro cities. Thus, these four (4) environmental attributes and peak hour have a direct or indirect relationship with traffic congestion as well as vehicle movement, which motivates choosing them as decision-making parameters.

The values of these attributes (features) are intelligently crawled by a search engine, with metadata indexing (title, description, keywords, etc.), directly from multiple data feeds (web sites, RSS feeds, web services, etc.) from the web page in [5]. Crawled data are simplified (structured) and stored in a historic table. However, the number of attributes can be changed according to the system requirements. We collect more than two (2) years, or 750 days (1/12/2006 to 20/12/2008), of data on five features from the web page in [5].

Initially, a decision tree (DT) [1][2][3] was used to classify road weights, and a weighted moving average analytic was implemented to estimate or predict feature values in the DT-based system [28][29], achieving 16.45% accuracy. However, the model's data sets were classified intuitively. Cluster-based classifications (K-means, Locality-Sensitive Hashing (LSH), etc.) can be used to find the optimum number of classes in each feature and can improve the performance of the TMS. With this hypothesis, we implement two unsupervised clustering techniques: partition k-means and hierarchical k-means. There are several methods (internal/external) to measure the similarity between two clustering steps, used to compare how well different clustering algorithms perform on a set of data. Only internal methods - Davies-Bouldin Index (DBI), Dunn Index (DI), and Silhouette Coefficient (SC) - are used to choose the optimum number of classes, as they do not require any external information. Subsequently, the optimal classes are cross-validated using statistical analytics: correlation and within sum of squares (WSS) errors.

Results highlight that the Dunn Index (DI) performs better for both partition k-means and hierarchical k-means algorithms by …
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 3, 2017
clusters are split (top-down view). There are two variants of hierarchical clustering methods (in Fig. 2): i) the Agglomerative Hierarchical Clustering algorithm (HAC), or AGNES (bottom-up approach), and ii) the Divisive Hierarchical Clustering algorithm (HDC), or DIANA (top-down approach). In this paper, we implement divisive hierarchical clustering to classify the feature data, as it has less computational cost compared to AGNES. For example, C3 = 4 input nodes → 3-6 (1 split into 2 subgroups, 5 & 6). The split is repeated until Ckmax is reached, where Ckmax is the maximum number of clusters; we stop the iteration when the optimal clustering number is reached.

B. Partitional Clustering

Partitional clustering determines a flat clustering into k clusters with minimal cost. It partitions the data set into k clusters and assigns each object to its nearest center. Here (in Fig. 4), k is the number of centroids.
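The partitional step just described (assign each object to its nearest center, then recompute centers) can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the 1-D feature values, the seed, and the mean-update rule are assumptions, and the nearest-center test uses the Manhattan distance d(x, y) from the paper's notation.

```python
import random

def kmeans(data, k, max_iter=100, seed=0):
    """Plain partitional k-means on 1-D feature values.

    Distance is Manhattan (|x - c|), matching the paper's d(x, y);
    centers are updated as cluster means."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)              # initial centroids
    for _ in range(max_iter):
        # assignment step: each object goes to its nearest center
        clusters = [[] for _ in range(k)]
        for x in data:
            i = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        # update step: recompute each center as its cluster mean
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:             # converged
            break
        centers = new_centers
    return centers, clusters
```

On well-separated data such as `[1, 2, 3, 50, 51, 52]` with `k=2`, the loop converges to centers 2.0 and 51.0 regardless of initialization.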
TABLE. I. SYMBOLS/NOTATIONS AND THEIR DESCRIPTIONS

| SL No | Symbol/Notation | Description |
|-------|-----------------|-------------|
| 1. | nc | Number of total clusters |
| 2. | Ci | i-th cluster |
| 3. | d(x,y) | Manhattan distance between two data elements |
| 4. | ni | Number of elements in the i-th cluster |
| 5. | | Value of the center of the i-th cluster |
| 7. | Si | Variance of the i-th cluster |
| 8. | Ckmax | Maximum number of clusters |
| 9. | d | Number of dimensions |
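With this notation, the Dunn index [22] used throughout the paper is the minimum inter-cluster distance divided by the maximum intra-cluster diameter; larger values mean compact, well-separated clusters. A minimal sketch, assuming 1-D feature values, non-degenerate clusters, and the Manhattan distance d(x, y) from the table above:

```python
def manhattan(x, y):
    return abs(x - y)  # d(x, y) for 1-D feature values

def dunn_index(clusters):
    """Dunn index: min inter-cluster distance / max intra-cluster
    diameter. Assumes at least one cluster has two distinct points."""
    # largest diameter over all clusters
    diam = max(max((manhattan(a, b) for a in c for b in c), default=0.0)
               for c in clusters)
    # smallest distance between points of different clusters
    sep = min(manhattan(a, b)
              for i, ci in enumerate(clusters)
              for cj in clusters[i + 1:]
              for a in ci for b in cj)
    return sep / diam
```

For two tight, distant clusters such as `[[1, 2], [10, 11]]` the separation is 8 and the diameter is 1, giving a Dunn index of 8.0.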
Fig. 6. Correlation

Here,
r = correlation of the data and its cluster,
Distance matrix D = {d11, d22, d33, …, dnn},
Incidence matrix C = {c11, c22, c33, …, cnn},
D̄ = mean of the distance matrix, and
C̄ = mean of the incidence matrix.
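This correlation validation can be sketched as follows. The reading is illustrative and hedged: it assumes C is the usual 0/1 incidence matrix (1 when two points share a cluster) and that D holds pairwise Manhattan distances, so a good clustering gives a strongly negative r, consistent with the negative values reported later in Table IV.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def cluster_correlation(data, labels):
    """r between the pairwise-distance matrix D and the incidence
    matrix C. Same-cluster pairs should have small distances, so a
    good clustering yields r close to -1."""
    d, c = [], []
    n = len(data)
    for i in range(n):
        for j in range(i + 1, n):
            d.append(abs(data[i] - data[j]))          # Manhattan distance
            c.append(1.0 if labels[i] == labels[j] else 0.0)
    return pearson(d, c)
```

For two clean 1-D clusters, e.g. `cluster_correlation([1, 2, 10, 11], [0, 0, 1, 1])`, r falls below -0.9.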
TABLE. II. OPTIMAL NUMBER OF CLUSTERS AND VALUE OF SSE OF RAINFALL, TEMPERATURE, WIND, HUMIDITY AND PEAK HOUR USING HDC WITH DB, DUNN AND SC INDICES

| Method | Rainfall k | SSE | Temperature k | SSE | Wind k | SSE | Humidity k | SSE | Peak hour k | SSE |
|--------|-----------|----------|---------------|---------|--------|---------|------------|----------|-------------|-----------|
| SC | 2 | 25051.32 | 2 | 3690.40 | 3 | 4850.62 | 3 | 31152.06 | 2 | 183452.98 |
| Optimal k | 2 | | 2 | | 3 | | 5 | | 4 | |
TABLE. III. OPTIMAL NUMBER OF CLUSTERS AND ITS VALUE OF SSE OF RAINFALL, TEMPERATURE, WIND, HUMIDITY AND PEAK HOUR USING K-MEANS WITH DB, DUNN AND SC INDICES

| Method | Rainfall k | SSE | Temperature k | SSE | Wind k | SSE | Humidity k | SSE | Peak hour k | SSE |
|--------|-----------|----------|---------------|---------|--------|---------|------------|----------|-------------|-----------|
| DB | 3 | 9574.97 | 2 | 3690.40 | 2 | 9301.24 | 3 | 23146.38 | 2 | 183452.98 |
| Dunn | 3 | 9574.97 | 3 | 2106.56 | 4 | 4657.67 | 6 | 6339.761 | 5 | 49541.539 |
| SC | 2 | 25051.32 | 2 | 3690.40 | 2 | 9301.24 | 3 | 23146.38 | 2 | 183452.98 |
| Optimal k | 3 | | 3 | | 4 | | 6 | | 5 | |
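The SSE (WSS) entries in Tables II and III, and the scan over candidate k values behind the "Optimal k" rows, can be sketched as below. This is a hedged illustration, not the paper's code: the helper names and the equal-stride `cluster_fn` used in the test are assumptions, and `index_fn` stands in for any score to maximize (e.g. the Dunn index; for DBI one would minimize instead).

```python
def wss(clusters):
    """Within sum of squares: for each cluster, the sum of squared
    deviations from the cluster mean (the SSE reported in the tables)."""
    total = 0.0
    for c in clusters:
        m = sum(c) / len(c)
        total += sum((x - m) ** 2 for x in c)
    return total

def pick_optimal_k(data, cluster_fn, index_fn, k_max):
    """Scan k = 2..k_max, cluster with cluster_fn(data, k), and keep
    the k that maximizes the validity index."""
    best_k, best_score = None, float("-inf")
    for k in range(2, k_max + 1):
        score = index_fn(cluster_fn(data, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

The WSS of `[[1, 2, 3], [10, 12]]` is (1 + 0 + 1) + (1 + 1) = 4.0; note that WSS alone always improves with more clusters, which is why the paper cross-checks it against the validity indexes.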
2) Result of K-means Clustering Method: The sum of square error (SSE) [26] of all features using the k-means clustering algorithm with the Davies-Bouldin index, Dunn index, and Silhouette index is presented in Table III. The shaded block (in Table III) indicates the minimum value of SSE. Table III reflects that the Dunn index provides the minimum value of the SSE for all features. Thus, we conclude that the Dunn index performs better for the k-means algorithm in finding optimal cluster numbers. The optimal classes of each feature using k-means are Rainfall (k=3), Temperature (k=3), Wind (k=4), Humidity (k=6), and Peak hour (k=5).

3) Comparison of HDC and K-means: Hierarchical clustering and k-means clustering are compared by computing the correlation on their optimal cluster numbers for each feature. It is clear from Table IV that the correlations of k-means are higher than the correlations of HDC, for all features. Thus, we conclude that k-means performs better than HDC.

TABLE. IV. OPTIMAL NUMBER OF CLUSTERS (USING DUNN INDEX) AND CORRELATION OF HDC AND K-MEANS

| Feature | HDC: Optimal k (Dunn) | HDC: Correlation | K-means: Optimal k (Dunn) | K-means: Correlation |
|-------------|-----|--------|-----|--------|
| Rainfall | 2 | -0.801 | 3 | -0.789 |
| Temperature | 2 | -0.736 | 3 | -0.721 |
| Wind | 3 | -0.580 | 4 | -0.405 |
| Humidity | 5 | -0.555 | 6 | -0.521 |
| Peak hour | 4 | -0.639 | 5 | -0.578 |

4) Optimal Cluster of Road Weight: From the previous experiments it is clear that k-means with the Dunn index performs better for all features. Thus, k-means with the Dunn index is chosen for the classification of the road weight. Table V shows the number of clusters of the road weight and the Dunn index value of the corresponding clustering. The table shows that the maximum value of the Dunn index is achieved at k=7. Thus, the optimal cluster size of the road weight is seven (7), and there should be seven (7) different types of classes for road weight updates. Table VI presents some sample experimental results of road weight updates.

[Figure: the Data block is clustered per feature (Rainfall, Temperature, Wind, Humidity, and Peak hour clusters), and the cluster labels are combined to generate the road weight.]

TABLE. V. CLUSTER SIZE AND DUNN INDEX VALUE OF THE ROAD WEIGHT

| No. of clusters k | Dunn index value |
|----|------|
| 2 | 0.08 |
| 3 | 0.09 |
| 4 | 0.10 |
| 5 | 0.11 |
| 6 | 0.11 |
| 7 | 0.13 |
| 8 | 0.12 |
| 9 | 0.12 |

TABLE. VI. SAMPLE EXPERIMENTAL RESULTS OF ROAD WEIGHT UPDATES

| SL | Rainfall | Temperature | Wind | Humidity | Peak hour | Road weight |
|----|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 3 | 1 | 0 |
| 2 | 0 | 1 | 0 | 4 | 3 | 5 |
| 3 | 0 | 0 | 0 | 3 | 2 | 2 |
| 4 | 0 | 0 | 0 | 3 | 3 | 5 |
| 5 | 0 | 2 | 1 | 4 | 0 | 4 |
| 6 | 0 | 2 | 2 | 4 | 1 | 6 |
| 7 | 0 | 2 | 0 | 3 | 1 | 0 |
| 8 | 0 | 2 | 1 | 4 | 3 | 3 |
| 9 | 0 | 2 | 0 | 3 | 0 | 4 |
| 10 | 0 | 2 | 1 | 5 | 0 | 4 |

VII. CONCLUSION AND FUTURE WORKS

In this section, we summarize our work. The feature data are collected from external feeds (web sites, RSS feeds, web services, etc.) for classification. We cluster the data using two approaches (partition k-means and hierarchical k-means) and find the optimal number of clusters for each feature using the Davies-Bouldin index, Dunn index, and Silhouette coefficient. Thereafter, a conclusion is drawn as to which algorithm is better for which feature data, and the optimal number of clusters of road weights is then found with the input of the measured five (5) feature clusters.
In future, we can also measure the validity of the classes by other probabilistic and statistical methods. The Dunn index method has a large computational cost; the computation cost and the error of the cluster-building procedure can be reduced using other statistical models. At present, we are not considering other characteristics of the environmental and road status, such as accidents, road works, etc. Roads and Highways authorities in Bangladesh do not provide/publish any road construction or maintenance status, and thus these attributes will be considered in our future research direction.

The online multi-data-feed capability allows the proposed model to be connected with different social media (Facebook, Twitter, etc.), to collect necessary information (mishaps, disaster situations), and to use analytical tools to make proper decisions. However, special consideration is required for internet security, as all of the information is available on the internet. Recently, deep learning (DL) techniques have also been used to solve unsupervised clustering problems. Interpretation of deep learning is much more complex than that of k-means. In addition, deep learning works with multi-layer data representations and sometimes degrades in performance due to a limited amount of data; it also faces the overfitting problem. Thus, a comparative study of simple k-means and DL is required and will be carried out in the near future.

Still, the proposed TMS is in the construction phase and covers small road networks. A broader, city-level area will be considered in the near future. A GSM- and GPS-based microcontroller with different embedded sensors is in the development phase. This device will help to collect real-time environmental data at an instant in time.

ACKNOWLEDGEMENTS

East West University (EWU), Bangladesh is gratefully acknowledged for providing the article processing charge (APC) to cover the costs of publication.

REFERENCES

[1] Rahman, M. R. and Akhter, S. 2015. Real time bi-directional traffic management support system with GPS and websocket. Proc. of the 15th IEEE International Conference on Computer and Information Technology (CIT-2015). Liverpool, UK. 26-28 Oct. 2015.
[2] Rahman, M. R. and Akhter, S. 2015. Bidirectional traffic management with multiple data feeds for dynamic route computation and prediction system. International Journal of Intelligent Computing Research (IJICR). Special Issue. Volume 7. Issue 2. ISSN: 2042 4655. Mar/2015. http://infonomics-society.ie/ijicr/
[3] Rahman, M. R. and Akhter, S. 2015. Bi-directional traffic management support system with decision tree based dynamic routing. Proc. of the 10th International Conference for Internet Technology and Secured Transactions, ICITST 2015. London, United Kingdom. December 14-16, 2015.
[4] Akhter, S., Rahman, M. R., and Islam, M. A. 2016. Neural Network (NN) based route computation for bi-directional traffic management system. International Journal of Applied Evolutionary Computation - special issue on Emerging Research Trend in Computing and Communication Technologies. Volume 7. Issue 4.
[5] AccuWeather. (2016, March 7). Retrieved from http://www.accuweather.com/en/bd/dhaka/28143/january-weather/28143?monyr=1%2F1%2F2016&view=table
[6] Moghadassi, F. Parvizian, and S. Hosseini. 2009. A new approach based on artificial neural networks for prediction of high pressure vapor-liquid equilibrium. Australian Journal of Basic and Applied Sciences. Vol. 3. No. 3. pp. 1851-1862. 2009.
[7] Asadi, R., Mustapha, N., Sulaiman, N. and Shiri, N. 2009. New supervised multi layer feed forward neural network model to accelerate classification with high accuracy. European Journal of Scientific Research. Vol. 33. No. 1. 2009. pp. 163-178.
[8] Pelliccioni, R. Cotroneo, and F. Pung. 2010. Optimization of neural net training using patterns selected by cluster analysis: a case-study of ozone prediction level. 8th Conference on Artificial Intelligence and its Applications to the Environmental Sciences, 2010.
[9] Kayri, M. and Çokluk, Ö. 2010. Using multinomial logistic regression analysis in artificial neural network: an application. Ozean Journal of Applied Sciences. Vol. 3. No. 2. 2010.
[10] Khan, U., Bandopadhyaya, T. K. and Sharma, S. 2009. Classification of Stocks Using Self Organizing Map. International Journal of Soft Computing Applications. Issue 4. pp. 19-24.
[11] Hagan, M. T., Demuth, H. B., and Beale, M. 1996. Neural Network Design. PWS Publishing Company, Boston, Massachusetts.
[12] Ray, S., and Turi, R. H. 1999. Determination of number of clusters in K-means clustering and application in color image segmentation. 4th International Conference on Advances in Pattern Recognition and Digital Techniques. pp. 137-143. Calcutta, India.
[13] Maulik, U., and Bandyopadhyay, S., "Performance Evaluation of Some Clustering Algorithms and Validity Indices", IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 24, Issue 12, pp. 1650-1654, December 2002.
[14] Steinbach, M., Karypis, G., and Kumar, V., "A Comparison of Document Clustering Techniques", 6th ACM SIGKDD World Text Mining Conference, 2000.
[15] Begum, S. F., Kaliyamurthie, K. P., and Rajesh, A., "Comparative Study of Clustering Methods over Ill-Structured Datasets using Validity Indices", Indian Journal of Science and Technology, Vol 9(12), March 2016.
[16] Reddy, C. K. and Vinzamuri, B., "A Survey of Partitional and Hierarchical Clustering Algorithms", Data Clustering: Algorithms and Applications. CRC Press, 2014, ISBN 978-1-46-655821-2, pp. 88-91.
[17] Hierarchical clustering algorithm, available at: https://sites.google.com/site/dataclusteringalgorithms/hierarchical-clustering-algorithm, 15-5-2016.
[18] Ray, S., and Turi, R. H., "Determination of Number of Clusters in K-Means Clustering and Application in Colour Image Segmentation", 4th International Conference on Advances in Pattern Recognition and Digital Techniques, pp. 137-143.
[19] K-means algorithm, available at: http://www.academia.edu/3438357/K_Means_Algorithm_Example, 12-4-2016.
[20] Kovács, F., Legány, C. and Babos, A., "Cluster Validity Measurement Techniques", AIKED'06 Proceedings of the 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (2006), ISBN: 111-2222-33-9, pp. 388-393.
[21] Davies, D. L. and Bouldin, D. W., "A Cluster Separation Measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), pp. 224-227.
[22] Dunn, J. C., "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", J. Cybernetics, vol. 3, 1973, pp. 32-57.
[23] Liu, Y., Li, Z., Xiong, H., Gao, X., and Wu, J., "Understanding of Internal Clustering Validation Measures", Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM '10), ISBN: 978-0-7695-4256-0, pp. 911-916.
[24] Zoubi, M., and Rawi, M., "An efficient approach for computing silhouette coefficients", Journal of Computer Science 4 (2008), pp. 252-255.
[25] Correlation definition, available at: www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_basics.pdf, 23-4-2016.
[26] Sum of square error, available at: https://hlab.stanford.edu/brian/error_sum_of_squares.html, 15-5-2016.
[27] Pelleg, D., Moore, A. W., "X-means: Extending K-means with Efficient Estimation of the Number of Clusters", Proceedings of the Seventeenth International Conference on Machine Learning (ICML '00), ISBN: 1-55860-707-2, pp. 727-734.
[28] ID3 Decision Tree Algorithm - Part 1, available at: http://www.codeproject.com/Articles/259241/ID-Decision-Tree-Algorithm-Part
[29] Rokach and Maimon, Decision Trees, available at: https://datajobs.com/data-science-repo/Decision-Trees-[Rokach-and-Maimon].pdf
[30] Liu, Y., Li, Z., Xiong, H., Gao, X., and Wu, J., "Understanding of Internal Clustering Validation Measures", Proc. of the IEEE International Conference on Data Mining, 2010. http://datamining.rutgers.edu/publication/internalmeasures.pdf
[31] Wang, Y., Chen, Y., Qin, M. and Zhu, Y., "Dynamic traffic prediction based on traffic flow mining", Proceedings of the 6th World Congress on Intelligent Control and Automation (WCICA '06), vol. 2, pp. 6078-6081, Dalian, China, June 2006.
[32] Caceres, N., Romero, L. M. and Benitez, F. G., "Estimating traffic flow profiles according to a relative attractiveness factor", Procedia - Social and Behavioral Sciences, vol. 54, pp. 1115-1124, 2012.
[33] Guardiola, I. G., Leon, T. and Mallor, F., "A functional approach to monitor and recognize patterns of daily traffic profiles", Transportation Research, Part B: Methodological, vol. 65, pp. 119-136, 2014.
[34] Mirakhorli, A., "A Comparative Study: Utilizing Data Mining Techniques to Classify Traffic Congestion Status", UNLV Theses, Dissertations, Professional Papers, and Capstones. Paper 2197.