Enhancement of Qualities of Clusters by Eliminating Outlier For Data Mining Application in Education
Presentation Outline
Introduction
Problem Definition
Literature Survey
Plan of Research Work
Research Methodology
Framework
Expected Outcomes & Observations
Publications
References
Introduction
Data mining is the process of extracting hidden information from large databases. It is also known as Knowledge Discovery in Databases (KDD), that is, the extraction of useful knowledge from large volumes of stored data.
Problem Definition
Design and implementation of a scalable, parameter-free MPC (Multiple-Phase Clustering) algorithm for outlier detection and elimination on high-dimensional categorical data.
Figure: lattice of cuboids over the dimensions time, item, location, supplier
0-D (apex) cuboid
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
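The lattice of cuboids over time, item, location and supplier is simply the set of all subsets of the dimension list, grouped by size. A minimal Python sketch follows; the function name cuboid_lattice is hypothetical and used only for illustration.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of `dimensions` by its size.

    Key 0 holds the apex cuboid (no dimensions), key len(dimensions)
    holds the base cuboid (all dimensions).
    """
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids:", cuboids)
```

With n dimensions the lattice contains 2^n cuboids, which is one reason high-dimensional data is costly to summarise exhaustively.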
Literature Survey
Previous research includes the distance-based clustering algorithms CLARANS [1] and DBSCAN [2] (Density-Based Spatial Clustering of Applications with Noise). Both DBSCAN and CLARANS were implemented in C++.
Both DBSCAN and CLARANS require user-supplied input parameters, and neither copes well with high-dimensional data. Yang et al. proposed a histogram-based approach called CLOPE [3] (Clustering with Slope). CLOPE is very fast, scales to high-dimensional data, and requires one input parameter.
CLIQUE [4] automatically finds subspaces that contain high-density clusters. It generates cluster descriptions in the form of DNF expressions and strives to keep those descriptions minimal for ease of understanding. CURE [5] is robust to outliers and identifies clusters with non-spherical shapes and wide variances in size. BIRCH [6] clusters the data points incrementally and dynamically to produce the best-quality clustering possible with the available resources.
For clustering categorical data, algorithms such as LIMBO [7], COOLCAT [8] and CACTUS [9] have been proposed. The available algorithms range from subspace clustering for categorical data to approaches built on a concept of similarity/dissimilarity. Other algorithms, such as CLICKS [10], CABOSFV_C [11], SWHOT [12] and ROCK [13], have also been proposed for clustering.
Research Methodology
1. Problem definition
2. Deciding the required framework of work to be carried out
3. Collection of data
4. Selecting a sample institutional high-dimensional categorical data set
5. Writing a research proposal: steps in conducting the study with an industry and social perspective
Proposed Framework
Dataset -> Detection of Outliers -> Clustering Algorithm -> No. of Clusters
Outliers will be detected by the Frequent Patterns based Outlier Test (FPOT) algorithm, and clustering will be performed by the Multi-Phase Clustering (MPC) algorithm.
Frequent item-sets will be used in place of distance metrics, so no distance-based computation is needed. This is expected to make the method more robust when handling data with missing values.
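The slides name FPOT but do not define it, so the following is only a sketch of the general idea of scoring records by frequent item-sets rather than by distance. It assumes a frequent-pattern outlier factor style score (sum of the supports of the frequent item-sets a record contains, divided by the number of frequent item-sets); the function names, the brute-force miner, the thresholds and the sample records are all hypothetical.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(records, min_support, max_len=2):
    """Brute-force miner: item-sets of size <= max_len with support >= min_support."""
    n = len(records)
    counts = Counter()
    for record in records:
        items = sorted(set(record))          # sorted so each item-set has one canonical tuple
        for k in range(1, max_len + 1):
            counts.update(combinations(items, k))
    return {frozenset(s): c / n for s, c in counts.items() if c / n >= min_support}

def pattern_scores(records, min_support=0.5, max_len=2):
    """Score each record by the frequent item-sets it contains; low scores suggest outliers."""
    patterns = frequent_itemsets(records, min_support, max_len)
    if not patterns:
        return [0.0] * len(records)
    scores = []
    for record in records:
        items = set(record)
        total = sum(sup for itemset, sup in patterns.items() if itemset <= items)
        scores.append(total / len(patterns))
    return scores

# Categorical records encoded as attribute=value items (illustrative data only).
records = [
    {"dept=CS", "grade=A", "mode=full"},
    {"dept=CS", "grade=A", "mode=full"},
    {"dept=CS", "grade=A"},                  # "mode" is missing; the record is still scored
    {"dept=CS", "grade=B", "mode=full"},
    {"dept=ME", "grade=C", "mode=part"},
]
scores = pattern_scores(records)
outlier_index = min(range(len(records)), key=scores.__getitem__)
print(scores, outlier_index)
```

Because a record is scored only on the items it actually has, a missing value lowers its score somewhat but never requires imputing a value or defining a distance to it.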
Quality of Cluster
The quality of a cluster is defined in terms of the following quantities:
Ck: a cluster
Pr(Ck): relative strength of Ck
a ∈ M: an item
M = {a1, ..., am}: the set of Boolean attributes
Pr(a|Ck): relative strength of a within Ck
Pr(a|D): relative strength of a within D
D = {x1, ..., xn}: the data set of tuples defined on M
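The formula itself did not survive in the slide text. One form that is consistent with the quantities listed, and that is assumed here rather than taken from the source, is Q(Ck) = Pr(Ck) * sum over a in M of [Pr(a|Ck)^2 - Pr(a|D)^2], with Pr(Ck) = |Ck|/|D| and Pr(a|X) the fraction of tuples in X that contain a. The sketch below computes that assumed measure; the function names and the toy data are hypothetical.

```python
def relative_strength(item, tuples):
    """Fraction of `tuples` that contain `item` (assumed meaning of Pr(a|X))."""
    return sum(item in t for t in tuples) / len(tuples)

def cluster_quality(cluster, dataset, items):
    """Assumed quality: Pr(Ck) * sum_a (Pr(a|Ck)^2 - Pr(a|D)^2).

    `cluster` and `dataset` are non-empty lists of item sets; `items` is M.
    """
    pr_ck = len(cluster) / len(dataset)
    gain = sum(
        relative_strength(a, cluster) ** 2 - relative_strength(a, dataset) ** 2
        for a in items
    )
    return pr_ck * gain

# Toy data set D over M = {a1, a2, a3, a4} (illustrative only).
M = ["a1", "a2", "a3", "a4"]
D = [{"a1", "a2"}, {"a1", "a2"}, {"a3"}, {"a3", "a4"}]
C1, C2 = D[:2], D[2:]

q1 = cluster_quality(C1, D, M)
q2 = cluster_quality(C2, D, M)
q_all = cluster_quality(D, D, M)
print(q1, q2, q_all)
```

Under this assumed form, a cluster whose items are no more concentrated than in the whole data set scores zero, and a homogeneous cluster such as C1 scores higher than the more mixed C2.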
Journal Publication
1. Effective Multi-Stage Clustering for Inter- and Intra-Cluster Homogeneity, IJCSIS (International Journal of Computer Science and Information Security), USA, Vol. 8, No. 6, September 2010, ISSN: 1947-5500.
http://www.docstoc.com/docs/56990080/Effective-Multi-Stage-Clustering-for-Interand-Intra-Cluster-Homogeneity
2. Efficient Scalable Multi-Level Classification Scheme for Credit Card Fraud Detection, IJCSNS (International Journal of Computer Science and Network Security), South Korea, Vol. 10, No. 8, August 2010, ISSN: 1738-7906.
http://paper.ijcsns.org/07_book/201008/20100819.pdf
References
[1] R. Ng and J. Han, CLARANS: A Method for Clustering Objects for Spatial Data Mining, IEEE Trans. Knowledge and Data Eng., vol. 14, no. 5, pp. 1003-1016, Sept./Oct. 2002.
[2] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proc. Eighth ACM Int'l Conf. Knowledge Discovery and Data Mining (SIGKDD '96), pp. 226-231, 1996.
[3] Y. Yang, X. Guan, and J. You, CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data, Proc. Eighth ACM Conf. Knowledge Discovery and Data Mining (KDD '02), pp. 682-687, 2002.
[4] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 94-105, 1998.
[5] S. Guha, R. Rastogi, and K. Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proc. ACM SIGMOD Conf. Management of Data (SIGMOD '98), pp. 73-84, 1998.
[6] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proc. ACM SIGMOD Conf. Management of Data (SIGMOD '96), pp. 103-114, 1996.
[7] P. Andritsos, P. Tsaparas, R. Miller, and K. Sevcik, LIMBO: Scalable Clustering of Categorical Data, Proc. Ninth Int'l Conf. Extending Database Technology (EDBT '04), pp. 123-146, 2004.
[8] D. Barbara, J. Couto, and Y. Li, COOLCAT: An Entropy-Based Algorithm for Categorical Clustering, Proc. 11th ACM Conf. Information and Knowledge Management (CIKM '02), pp. 582-589, 2002.
[9] V. Ganti, J. Gehrke, and R. Ramakrishnan, CACTUS: Clustering Categorical Data Using Summaries, Proc. Fifth ACM Conf. Knowledge Discovery and Data Mining (KDD '99), pp. 73-83, 1999.
[10] M. Zaki and M. Peters, CLICKS: Mining Subspace Clusters in Categorical Data via k-Partite Maximal Cliques, Proc. 21st Int'l Conf. Data Eng. (ICDE '05), 2005.
[11] Sen Wu and Guiying Wei, High Dimensional Data Clustering Algorithm Based on Sparse Feature Vector for Categorical Attributes, International Conference on Logistics Systems and Intelligent Management, vol. 2, pp. 973-976, IEEE Xplore, 9-10 Jan. 2010.
[12] YinZhao Li, Di Wu, JiaDong Ren, and ChangZhen Hu, An Improved Outlier Detection Method in High-Dimension Based on Weighted Hypergraph, Second International Symposium on Electronic Commerce and Security, IEEE, 2009.
[13] S. Guha, R. Rastogi, and K. Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, Information Systems, vol. 25, no. 5, pp. 345-366, 2001.
[14] Sunita M. Karad, M. U. Kharat, V. M. Wadhai, Prasad S. Halgaonkar, and Dipti D. Patil, Effective Multi-Stage Clustering for Inter- and Intra-Cluster Homogeneity, IJCSIS (International Journal of Computer Science and Information Security), USA, Vol. 8, No. 6, September 2010, ISSN: 1947-5500.