International Conference on Computer Science and Information Technology 2008
Handling Noisy Data using Attribute Selection and Smart Tokens
J. Jebamalar Tamilselvi
PhD Research Scholar
Department of Computer Application
Karunya University
Coimbatore – 641 114, Tamilnadu, INDIA
E-mail: jebamalar@karunya.edu

Dr. V. Saravanan
Associate Professor & HOD
Department of Computer Application
Karunya University
Coimbatore – 641 114, Tamilnadu, INDIA
E-mail: saravanan@karunya.edu
Abstract

Data cleaning is the process of identifying and resolving the problems that arise when integrating data from different sources or from a single source. Many problems can occur in a data warehouse while loading or integrating data; the main one is noisy data. This error is due to the misuse of abbreviations, data entry mistakes, duplicate records and spelling errors. The proposed algorithm handles noisy data efficiently by expanding abbreviations, removing unimportant characters and eliminating duplicates. An attribute selection algorithm is applied before token formation; together, the attribute selection algorithm and the token formation algorithm reduce the complexity of the data cleaning process and clean data flexibly and effortlessly, without confusion. This research work uses smart tokens to increase the speed of the mining process and to improve the quality of the data.

1. Introduction

The data cleaning process improves the quality of the data before the mining process. Many errors are introduced while integrating data warehouses, or while loading a single data warehouse, through data entry problems. One of the main errors in a data warehouse is noisy data: random error or variance in a measured variable. Noisy data errors are due to the misuse of abbreviations, data entry mistakes, duplicate records and spelling errors [16].

Attribute selection is very important for reducing the time of the data cleaning process. An attribute selection algorithm is effective in reducing the number of attributes, removing irrelevant attributes, increasing the speed of the data cleaning process, and improving the clarity of the results. An intelligent attribute selection is proposed as an initial step in data cleaning. Many approaches are available for selecting the attributes used in the mining process so as to reduce the dimensionality of the data warehouse. However, the recent increase in the dimensionality of data poses difficulty with respect to the efficiency and effectiveness of the data cleaning process. The efficiency and effectiveness of the attribute selection method are demonstrated through extensive comparisons of the proposed method on noisy, real-world data of high dimensionality [3], [5].

The token based approach is applied to the selected attribute fields only; these attributes are selected based on certain criteria, mainly for the data cleaning process. A similarity function over long strings takes more time in the comparison process and requires a multi-pass approach. The proposed token formation algorithm forms a token for each selected attribute field. The token based approach is proposed to reduce the time spent on comparisons and to increase the speed of the data cleaning process [8], [9]. This research paper deals with developing an approach to handle noisy data using attribute selection and smart tokens.

2. Related Work

A large data file is sorted on designated fields to bring potentially identical records together. However, the sorting is based on "dirty" fields, which may fail to bring matching records together, and its time complexity is quadratic in the number of records, so this sorting technique is inefficient for large data files [BD, 83]. The merge/purge problem in a large database is solved by forming keys from selected fields, sorting the entire data set on the keys, clustering the sorted records, and using a scanning window of fixed size to reduce the number of comparisons [6], [7].

Several steps are used to clean the data warehouse [10]. The first step is to scrub dirty data fields. This step attempts to remove typographical
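The token formation step described in the Introduction (expanding abbreviations, removing unimportant characters, eliminating duplicates) can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the abbreviation table and the choice of "unimportant" characters are assumptions.

```python
# Sketch of smart-token formation for one selected attribute field:
# expand abbreviations, strip unimportant characters, drop duplicates.
import re

# Hypothetical abbreviation dictionary for an address-like field.
ABBREVIATIONS = {"st": "street", "rd": "road", "dr": "doctor"}

def form_tokens(value: str) -> list[str]:
    """Turn one attribute value into a sorted list of clean, unique tokens."""
    # Remove unimportant characters (punctuation), keep letters/digits/spaces.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", value.lower())
    tokens, seen = [], set()
    for tok in cleaned.split():
        tok = ABBREVIATIONS.get(tok, tok)   # expand known abbreviations
        if tok not in seen:                 # eliminate duplicate tokens
            seen.add(tok)
            tokens.append(tok)
    return sorted(tokens)                   # canonical order for comparison

print(form_tokens("12, Park St.; park street"))  # → ['12', 'park', 'street']
```

Comparing such short, canonical token lists is cheaper than running a similarity function over long raw strings, which is the motivation the paper gives for the token based approach.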
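The merge/purge technique summarized in Related Work (key formation, sorting on the keys, fixed-size scanning window) can be sketched as below. The record layout and key scheme here are illustrative assumptions, not those of the cited work.

```python
# Sketch of the sorted-neighborhood idea: sort records on a derived key,
# then compare each record only with neighbors inside a fixed-size window,
# avoiding the quadratic all-pairs comparison.
def make_key(record: dict) -> str:
    # Hypothetical key: first 3 chars of name + first 3 chars of city.
    return (record["name"][:3] + record["city"][:3]).lower()

def candidate_pairs(records: list[dict], window: int = 3) -> list[tuple[int, int]]:
    """Return index pairs of records falling in the same sliding window."""
    order = sorted(range(len(records)), key=lambda i: make_key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Compare only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            pairs.append((i, j))
    return pairs

records = [
    {"name": "Jon Smith", "city": "Chennai"},
    {"name": "John Smith", "city": "Chennai"},
    {"name": "Mary Jones", "city": "Mumbai"},
]
print(candidate_pairs(records))
```

As the text notes, if the sort key is built from "dirty" fields, truly matching records may sort far apart and never share a window, which is the weakness the attribute selection and token formation steps aim to reduce.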