Enhancement of Qualities of Clusters by Eliminating Outlier For Data Mining Application in Education

ENHANCEMENT OF QUALITIES OF CLUSTERS BY ELIMINATING OUTLIER FOR DATA MINING APPLICATION IN EDUCATION

Presented by: Mrs. Sunita M. Karad
Under the Guidance of: Dr. M. U. Kharat

Presentation Outline
Introduction
Problem Definition
Literature Survey
Plan of Research Work
Research Methodology
Framework
Expected Outcomes & Observations
Publications
References

Introduction
Data mining is the process of extracting hidden information from large collections of data. It is also known as Knowledge Discovery in Databases (KDD): the extraction of useful knowledge from large databases.

Importance of Data Mining


Data mining is used in several applications, such as understanding consumer research and marketing, product analysis, demand and supply analysis, and so on. The data involved are often vast and noisy, that is, imprecise and with a complex structure. Purely statistical techniques do not succeed here, so data mining offers a solution. The types of problems that occur in large amounts of data are: noisy data (discrepancies due to human error in data entry, outdated data); missing values (no recorded values for some attributes, such as a student's family income);

Importance of Data Mining (cont.)


sparse data (not enough data samples available); relevance (whether the data are relevant to the problem definition); interestingness (selecting the attributes required for the analysis, such as a student's marks); and heterogeneity (attributes having different names but the same role, such as customer_no and cust_no).

Types of Clustering Techniques


There are mainly two types of clustering techniques.

1. Agglomerative or bottom-up clustering (see the sketch below):
i. Consider each object as an individual cluster.
ii. Find the two closest objects and merge them into a cluster.
iii. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
iv. If more than one cluster remains, return to step (ii).
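A minimal sketch of the bottom-up procedure above, assuming Euclidean distance and single-linkage merging (both are assumptions; the slide does not prescribe a distance or linkage):

```python
# Minimal single-linkage agglomerative clustering sketch (illustrative only).
from itertools import combinations
import math

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerative(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    # Step i: every object starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Steps ii-iii: find the pair of clusters with the smallest
        # single-linkage (closest-pair) distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(euclidean(p, q)
                               for p in clusters[ij[0]] for q in clusters[ij[1]]),
        )
        # Step iv: merge them and continue.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

if __name__ == "__main__":
    data = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.8), (9.0, 0.5)]
    print(agglomerative(data, 2))
```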

Types of Clustering Techniques


2. Divisive or top-down clustering (see the sketch below):
i. Consider the whole dataset as one partition.
ii. Make the first object the centroid of the first cluster.
iii. For the next object, calculate its similarity, S, with each existing cluster centroid, using some similarity coefficient.
iv. If the highest calculated S is greater than some specified threshold value, add the object to the corresponding cluster and recompute the centroid; otherwise, use the object to initiate a new cluster. If any objects remain to be clustered, return to step (iii).
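A minimal sketch of the threshold-based procedure above, assuming cosine similarity and an arbitrary threshold value (the slide leaves the similarity coefficient and threshold unspecified):

```python
# Illustrative single-pass, threshold-based clustering sketch following the
# steps listed above; similarity measure and threshold are assumptions.
import math

def cosine_similarity(p, q):
    """Cosine similarity between two numeric vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def centroid(members):
    """Component-wise mean of the cluster members."""
    return [sum(col) / len(members) for col in zip(*members)]

def threshold_clustering(objects, threshold=0.9):
    clusters, centroids = [], []
    for obj in objects:
        if not clusters:                     # step ii: first object seeds cluster 1
            clusters.append([obj])
            centroids.append(list(obj))
            continue
        # step iii: similarity S to every existing centroid
        sims = [cosine_similarity(obj, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] > threshold:           # step iv: join best cluster, recompute centroid
            clusters[best].append(obj)
            centroids[best] = centroid(clusters[best])
        else:                                # otherwise start a new cluster
            clusters.append([obj])
            centroids.append(list(obj))
    return clusters

if __name__ == "__main__":
    data = [(1.0, 0.1), (0.9, 0.12), (0.1, 1.0), (0.15, 0.9)]
    print(threshold_clustering(data))
```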

Problem Definition
Design and implementation of a scalable, parameter-free MPC (Multi-Phase Clustering) algorithm for outlier detection and elimination on high-dimensional categorical data.

Cube: A Lattice of Cuboids


For the four dimensions time, item, location, and supplier, the data cube forms a lattice of cuboids:

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
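The full lattice can be enumerated programmatically, since each cuboid is simply one subset of the base cuboid's dimensions; a small sketch (dimension names follow the slide, everything else is illustrative):

```python
# Enumerate every cuboid in the lattice: level k holds all k-D cuboids,
# from the 0-D apex cuboid up to the 4-D base cuboid.
from itertools import combinations

def cuboid_lattice(dimensions):
    """Map level k to the list of all k-D cuboids (subsets of dimensions)."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}

if __name__ == "__main__":
    dims = ("time", "item", "location", "supplier")
    for level, cuboids in cuboid_lattice(dims).items():
        print(f"{level}-D cuboids ({len(cuboids)}):", cuboids)
```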

Literature Survey
Previous research includes the distance-based clustering algorithms CLARANS [1] and DBSCAN [2] (Density-Based Spatial Clustering of Applications with Noise). Both DBSCAN and CLARANS were developed in C++.

Literature Survey
Both require user-supplied input parameters and do not cope well with high-dimensional data. Yang et al. proposed a histogram-based approach called CLOPE [3] (Clustering with Slope). CLOPE is very fast and scalable to high-dimensional data, but still requires one input parameter.
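For illustration only, a minimal DBSCAN run using scikit-learn's implementation (not the surveyed C++ code); eps and min_samples are the kind of user-supplied parameters these distance-based methods depend on:

```python
# Minimal DBSCAN usage sketch with scikit-learn (illustrative only).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 5.2], [20.0, 20.0]])

# eps and min_samples must be supplied by the user; points that belong to
# no dense region are labeled -1 (noise/outliers).
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # e.g. [ 0  0  1  1 -1]
```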

Literature Survey
CLIQUE [4] automatically finds subspaces with high-density clusters. It generates cluster descriptions in the form of DNF expressions and strives to keep those descriptions minimal for ease of understanding. CURE [5] is robust to outliers and identifies clusters with non-spherical shapes and wide variances in size. BIRCH [6] incrementally and dynamically clusters the data points to produce the best-quality clustering possible with the available resources.

Literature Survey
For clustering categorical data, algorithms such as LIMBO [7], COOLCAT [8], and CACTUS [9] have been proposed. The available algorithms range from subspace clustering of categorical data to approaches built on notions of similarity/dissimilarity. Other algorithms such as CLICKS [10], CABOSFV_C [11], SWHOT [12], and ROCK [13] have also been proposed for clustering.

Drawbacks of Previous algorithms


All of these algorithms share some common drawbacks. They all depend on one or more input parameters. They do not scale well to high-dimensional data. None of them supports outlier elimination from high-dimensional categorical data.

Plan of Research Work


Months 1-6: Literature survey.
Months 7-12: Study of existing algorithms for outlier detection and clustering of high-dimensional categorical data.
Months 13-18: Theoretical understanding of the experimental setup and evaluation.
Months 19-24: Implementation, analysis, and validation of the results.
Months 25-30: Comparison of the implemented results and extraction of conclusions from the research work carried out.
Months 31-36: Report writing based on the findings.

Research Methodology
1. Problem definition.
2. Deciding the required framework of the work to be carried out.
3. Collection of data.
4. Selecting a sample of institutional high-dimensional categorical data.
5. Writing a research proposal: steps in conducting the study from industry and social perspectives.

Research Methodology Contd..


6. Modification and corrective measures as per requirements, if any.
7. Implementation, validation, and testing of the proposed MPC and FPOT algorithms.

Proposed Framework
Dataset → Detection of Outliers → Clustering Algorithm → No. of Clusters

Detection of outliers will be performed by the Frequent Patterns based Outlier Test (FPOT) algorithm. Clustering will be performed by the Multi-Phase Clustering (MPC) algorithm.
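A minimal sketch of how the framework's stages could be wired together; fpot_outliers and mpc_cluster are hypothetical placeholders, since the presentation does not give the algorithms' implementations:

```python
# Hypothetical glue code for the proposed framework:
# dataset -> outlier detection (FPOT) -> clustering (MPC) -> number of clusters.
# fpot_outliers() and mpc_cluster() are placeholders, not the authors' code.

def run_framework(dataset, fpot_outliers, mpc_cluster):
    """Remove detected outliers, then cluster the remaining records."""
    outlier_ids = set(fpot_outliers(dataset))            # FPOT stage
    cleaned = [row for i, row in enumerate(dataset) if i not in outlier_ids]
    clusters = mpc_cluster(cleaned)                       # MPC stage
    return clusters, len(clusters)                        # clusters and their count
```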

Frequent Patterns based Outlier Test


FPOT will process categorical data effectively, as it will be able to overcome the curse of dimensionality (as the number of dimensions and the size of the associated concept hierarchies grow, computations require an excessive amount of storage space).

FPOT will be robust to incomplete data because it is independent of distance. Frequent itemsets will be used instead of distance metrics, so distance-based computation is not needed. This will enhance robustness when handling data with missing values.
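In that spirit, a sketch of a frequent-pattern outlier score for categorical records: a record supporting few frequent patterns gets a low score and is a candidate outlier. The actual FPOT algorithm is not given in the presentation; min_support and the scoring rule here are assumptions used only for illustration.

```python
# Illustrative frequent-pattern outlier scoring sketch (not the authors' FPOT).
from itertools import combinations
from collections import Counter

def frequent_itemsets(records, min_support=0.5, max_len=2):
    """Itemsets (up to max_len items) whose relative support >= min_support."""
    counts = Counter()
    for rec in records:
        items = sorted(set(rec))
        for k in range(1, max_len + 1):
            counts.update(frozenset(c) for c in combinations(items, k))
    n = len(records)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

def outlier_scores(records, patterns):
    """Low score = record supports few frequent patterns = likely outlier."""
    scores = []
    for rec in records:
        items = set(rec)
        supported = [sup for pat, sup in patterns.items() if pat <= items]
        scores.append(sum(supported) / max(len(patterns), 1))
    return scores

if __name__ == "__main__":
    data = [
        {"grade=A", "attendance=high"},
        {"grade=A", "attendance=high"},
        {"grade=B", "attendance=high"},
        {"grade=F", "attendance=low"},   # rare combination -> lowest score
    ]
    pats = frequent_itemsets(data, min_support=0.5)
    print(outlier_scores(data, pats))
```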

Multi-Phase Clustering


MPC scales well to high-dimensional data, in terms of both effectiveness and efficiency. MPC is a parameter-free, fully automatic approach. The clusters formed will be assessed by the quality of clusters, which is represented as a numerical value. For calculating the quality of a cluster, both inter-cluster homogeneity and intra-cluster homogeneity will be considered.

Quality of Cluster
The quality of a cluster is computed from the following quantities:

Ck - a cluster
Pr(Ck) - relative strength of Ck
a ∈ M - an item
M = {a1, ..., am} - the set of Boolean attributes
Pr(a|Ck) - relative strength of a within Ck
Pr(a|D) - relative strength of a within D
D = {x1, ..., xn} - the data set of tuples defined on M
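The expression itself is not reproduced in this text version of the slide. One plausible candidate, consistent with the quantities above and with the stated inter-/intra-cluster homogeneity goal, is a category-utility-style measure (an assumption, not the authors' confirmed formula):

Q(Ck) = Pr(Ck) * sum over a in M of [ Pr(a|Ck)^2 - Pr(a|D)^2 ]

A direct computation of this assumed form for Boolean attribute data:

```python
# Sketch of the assumed category-utility-style quality measure above; the
# presentation only names the ingredients, so this exact form is hypothetical.
def cluster_quality(cluster, dataset, attributes):
    """cluster/dataset: lists of sets containing the Boolean attributes that are 'on'."""
    pr_ck = len(cluster) / len(dataset)                         # Pr(Ck)
    total = 0.0
    for a in attributes:
        pr_a_ck = sum(a in x for x in cluster) / len(cluster)   # Pr(a|Ck)
        pr_a_d = sum(a in x for x in dataset) / len(dataset)    # Pr(a|D)
        total += pr_a_ck ** 2 - pr_a_d ** 2
    return pr_ck * total
```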

Expected Outcomes & Observations


Various outlier detection and clustering algorithms will be studied and analyzed thoroughly to check their impact and effectiveness when implemented on high-dimensional categorical data. The qualities of the resulting clusters will be verified with respect to the homogeneity of the data. The proposed algorithms, labeled FPOT for outlier detection and MPC for clustering, will be implemented and verified for their effectiveness and efficiency.

Expected Outcomes & Observations (contd.)


The proposed algorithms and their implementations, with support for high-dimensional categorical data and a parameter-free design, are expected to hold up in industrial and educational applications. Quality enhancement of data mining clusters, through the elimination of outliers, is expected for educational applications.

Journal Publications
1. "Effective Multi-Stage Clustering for Inter- and Intra-Cluster Homogeneity," IJCSIS (International Journal of Computer Science and Information Security), USA, vol. 8, no. 6, September 2010, ISSN: 1947-5500.
http://www.docstoc.com/docs/56990080/Effective-Multi-Stage-Clustering-for-Interand-Intra-Cluster-Homogeneity

2. "Efficient Scalable Multi-Level Classification Scheme for Credit Card Fraud Detection," IJCSNS (International Journal of Computer Science and Network Security), South Korea, vol. 10, no. 8, August 2010, ISSN: 1738-7906.
http://paper.ijcsns.org/07_book/201008/20100819.pdf

References
[1] R. Ng and J. Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining," IEEE Trans. Knowledge and Data Eng., vol. 14, no. 5, pp. 1003-1016, Sept./Oct. 2002.
[2] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. Int'l Conf. Knowledge Discovery and Data Mining (KDD '96), pp. 226-231, 1996.
[3] Y. Yang, X. Guan, and J. You, "CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data," Proc. Eighth ACM Conf. Knowledge Discovery and Data Mining (KDD '02), pp. 682-687, 2002.
[4] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 94-105, 1998.
[5] S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Proc. ACM SIGMOD Conf. Management of Data (SIGMOD '98), pp. 73-84, 1998.

[6] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proc. ACM SIGMOD Conf. Management of Data (SIGMOD '96), pp. 103-114, 1996.
[7] P. Andritsos, P. Tsaparas, R. Miller, and K. Sevcik, "LIMBO: Scalable Clustering of Categorical Data," Proc. Ninth Int'l Conf. Extending Database Technology (EDBT '04), pp. 123-146, 2004.
[8] D. Barbará, J. Couto, and Y. Li, "COOLCAT: An Entropy-Based Algorithm for Categorical Clustering," Proc. 11th ACM Conf. Information and Knowledge Management (CIKM '02), pp. 582-589, 2002.
[9] V. Ganti, J. Gehrke, and R. Ramakrishnan, "CACTUS: Clustering Categorical Data Using Summaries," Proc. Fifth ACM Conf. Knowledge Discovery and Data Mining (KDD '99), pp. 73-83, 1999.
[10] M. Zaki and M. Peters, "CLICK: Mining Subspace Clusters in Categorical Data via K-Partite Maximal Cliques," Proc. 21st Int'l Conf. Data Eng. (ICDE '05), 2005.

[11] Sen Wu and Guiying Wei, "High Dimensional Data Clustering Algorithm Based on Sparse Feature Vector for Categorical Attributes," Proc. Int'l Conf. Logistics Systems and Intelligent Management, vol. 2, pp. 973-976, 9-10 Jan. 2010.
[12] YinZhao Li, Di Wu, JiaDong Ren, and ChangZhen Hu, "An Improved Outlier Detection Method in High-Dimension Based on Weighted Hypergraph," Proc. Second Int'l Symp. Electronic Commerce and Security, IEEE, 2009.
[13] S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," Information Systems, vol. 25, no. 5, pp. 345-366, 2001.
[14] Sunita M. Karad, M. U. Kharat, V. M. Wadhai, Prasad S. Halgaonkar, and Dipti D. Patil, "Effective Multi-Stage Clustering for Inter- and Intra-Cluster Homogeneity," IJCSIS (International Journal of Computer Science and Information Security), USA, vol. 8, no. 6, September 2010, ISSN: 1947-5500.
