

International Journal of Engineering Trends and Technology (IJETT) - Volume4 Issue7- July 2013

A Cryptographic Privacy Preserving Approach over Classification


Sivasankar Vakkalagadda *, Satyanarayana Mummana#
* Final M.Tech Student, # Assistant Professor, Department of CSE, Avanthi Institute of Engineering & Technology, Visakhapatnam, Andhra Pradesh

Abstract: We propose an efficient privacy-preserving technique for the classification of data. We introduce a cryptography-based approach that protects centralized sample data sets used for decision tree mining. Privacy preservation is applied to sanitize the samples before their release to third parties, in order to mitigate the threat of their inadvertent disclosure. In contrast to other sanitization approaches, our approach does not affect the accuracy of the data mining results. The decision tree can be built directly from the pre-processed data sets, so the originals do not need to be reconstructed. Moreover, the approach can be applied at any time during the data collection process, so that privacy protection is in effect even while samples are still being collected.

I. INTRODUCTION
Explosive progress in networking, storage, and processor technologies has led to the creation of ultra-large databases that record an unprecedented amount of transactional information. In tandem with this dramatic growth in digital data, concerns about informational privacy have emerged globally [1] [2] [3]. Privacy issues are further exacerbated now that the World Wide Web makes it easy for new data to be automatically collected and added to databases [4] [5] [6] [17]. The concerns over massive collection of data naturally extend to the analytic tools applied to that data. Data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, is particularly vulnerable to misuse [9] [10] [18].

A fruitful direction for future research in data mining will be the development of techniques that incorporate privacy concerns. In particular, we address the following question: since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? The underlying assumption is that a person will be willing to selectively divulge information in exchange for the value such models can provide. Examples of the value provided include filtering to weed out unwanted information, better search results with less effort, and automatic triggers [11].

A recent survey of web users [12] classified 17% of respondents as privacy fundamentalists who will not provide data to a web site even if privacy protection measures are in place. However, the concerns of 56% of respondents, constituting the pragmatic majority, were significantly reduced by the presence of privacy protection measures. The remaining 27% were marginally concerned and generally willing to provide data to web sites, although they often expressed a mild general concern about privacy. Another recent survey of web users [Wes99] found that 86% of respondents believe that participation in information-for-benefits programs is a matter of individual privacy choice. A resounding 82% said that having a privacy policy would matter; only 14% said that it was not important as long as they got a benefit.

Furthermore, people are not equally protective of every field in their data records [14] [16]. Specifically, a person may not divulge the values of certain fields at all, may not mind giving true values of certain fields, and may be willing to give modified rather than true values of certain other fields. Given a population that satisfies the above assumptions, we address the concrete problem of building decision-tree classifiers [14] [15] and show that it is possible to develop accurate models while respecting users' privacy concerns. Classification is one of the most used tasks in data mining. Decision-tree classifiers are relatively fast and efficient, yield comprehensible models, and obtain similar and sometimes better accuracy than other classification methods [13]. We introduce a cryptographic approach that protects centralized sample data sets used for decision tree mining.

II. RELATED WORK
Previous work in privacy-preserving data mining has addressed two issues. In one, the aim is to preserve customer privacy by perturbing the data values [1]. In this scheme, random noise is introduced to distort sensitive values, and the distribution of the random noise is used to generate a new data distribution that is close to the original data distribution without revealing the original data values. The estimated original data distribution is used to reconstruct the data, and data mining techniques, such as classifiers and association rules, are applied to the reconstructed data set. Later refinements of this approach have tightened the estimation of original values based on the distorted data [2]. The data distortion approach has also been applied to Boolean values. Perturbation methods and their privacy protection have been criticized because, for some methods, private information may be derived from the reconstruction step [9].

ISSN: 2231-5381

http://www.ijettjournal.org

Page 3191

Unlike the original additive-noise method in [1], many distinct perturbation methods have been proposed. One important category is multiplicative perturbation. From a geometric point of view, multiplying the original data values by a random noise matrix rotates the matrix representation of the original data, so this is also called rotation-based perturbation. In [4], the authors gave a sound proof of rotation-invariant classifiers, showing that some data mining tools can be applied directly to rotation-perturbed data. In later work [11], Liu et al. proposed multiplicative random projection, which provides stronger privacy protection. There are other interesting techniques, such as the condensation-based approach [10], matrix decomposition, and so on. As pointed out in [12], this recent research on perturbation-based approaches applies data mining techniques directly to the perturbed data, skipping the reconstruction step. The choice of a suitable data mining technique is determined by the way in which the noise has been introduced; to our knowledge, very few works focus on mapping or modifying the data mining techniques to meet the needs of perturbed data.

The other approach uses cryptographic tools to build data mining models. For example, in [10], the goal is to securely build an ID3 decision tree where the training set is distributed between two parties. Different solutions have been given to address different data mining problems using cryptographic techniques (e.g., [6, 8, 18]). This approach treats privacy-preserving data mining as a special case of secure multiparty computation; it aims not only to preserve individual privacy but also to prevent leakage of any information other than the final result.

In this paper we introduce an efficient privacy-preserving cryptographic approach for classifying datasets without exposing sensitive user information to the outside world.

III. PROPOSED WORK
In this paper we propose a cryptographic classification approach.

[Figure 1: block diagram with the labels Data Owner, Analyst, Plain Dataset, Encoder, Cipher Dataset, Classifier, Cipher Classification Rules, Cipher Rules, Decoder, Original Rules]

Figure 1. Privacy preserving architecture

The architecture works as follows.

Initialize Training Datasets for Machine Learning: A dataset is a collection of tuples over a set of attributes, with the possible values of each attribute and a class label per tuple. It is supplied to the classification process so that the behaviour of the testing set can be analyzed with a machine learning approach. A synthetic dataset can be used for the classification experiments. Initially the dataset is forwarded to the encoder, and the encoder returns the cipher dataset.

Unrealized Dataset Creation: Data is usually passed to analysts for machine learning purposes, but this raises a privacy issue when the data contains confidential information. In this paper we therefore apply the AES algorithm to address the privacy issue. After this mechanism is applied, the dataset becomes an unrealized dataset; that is, the cipher dataset is passed to the analyst for classification instead of the plain sensitive or confidential information.

Classification with ID3: ID3 is an efficient machine learning approach for building decision trees, which are used for classification. The tree is constructed from the entropy or information gain values of the attributes. The classification rules can then be evaluated by running the testing data against the tree built from the training dataset.
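The encoder step (plain dataset in, cipher dataset out) can be sketched as follows. The paper names AES but does not state the mode of operation or the key handling, so this sketch rests on two assumptions that are not from the paper. First, the encoding is taken to be deterministic per attribute, since a decision tree learner can only group equal values if equal plaintexts produce equal ciphertexts. Second, because the Python standard library has no AES, a keyed HMAC-SHA256 token with an owner-side codebook stands in for AES encryption. The names `make_encoder`, `encode_dataset`, the key, and the toy rows are all hypothetical.

```python
import hashlib
import hmac


def make_encoder(key: bytes):
    """Return (encode, codebook). HMAC-SHA256 is a stand-in for AES (assumption)."""
    codebook = {}  # token -> plaintext value; kept only by the data owner

    def encode(attribute: str, value: str) -> str:
        # Binding the attribute name into the message keeps equal strings in
        # different columns from sharing a token.
        message = f"{attribute}={value}".encode("utf-8")
        token = hmac.new(key, message, hashlib.sha256).hexdigest()[:16]
        codebook[token] = value
        return token

    return encode, codebook


def encode_dataset(rows, encode):
    """Encode every cell of a list-of-dicts dataset; attribute names stay plain."""
    return [{attr: encode(attr, val) for attr, val in row.items()} for row in rows]


# Hypothetical toy data, not the dataset used in the paper.
plain_rows = [
    {"age": "young", "income": "high", "eligible": "no"},
    {"age": "young", "income": "low", "eligible": "no"},
    {"age": "senior", "income": "high", "eligible": "yes"},
]
encode, codebook = make_encoder(b"owner-secret-key")
cipher_rows = encode_dataset(plain_rows, encode)
```

Only `cipher_rows` would be handed to the analyst; the key and `codebook` stay with the data owner. A deterministic encoding reveals which cells are equal, which is exactly the information the tree learner needs and therefore also what the analyst learns.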


Retrieval of Original Classified Results: After the classification results are generated, they are passed to the data owner, where the administrator performs attribute-oriented decryption on the result set. The original data set is reconstructed by the decoder, and the classification rules are finally obtained at the data owner end.

Experimental Analysis
ID3 builds a decision tree from a fixed set of sanitized examples, and the resulting tree is used to classify future samples. Each example has several attributes and belongs to a class (such as a yes or no decision label). The leaf nodes of the decision tree contain the class name, whereas a non-leaf node is a decision node: an attribute test in which each branch (to another decision tree) is a possible value of the attribute. ID3 uses information gain to decide which attribute goes into a decision node. The advantage of learning a decision tree is that a program, rather than a knowledge engineer, elicits knowledge from an expert.

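Gain measures how well a given attribute separates the training examples into the target classes. The attribute with the highest information gain is selected. To define gain, we first borrow an idea from information theory called entropy, which measures the amount of information in an attribute. The formula for calculating the homogeneity of a sample $S$ is

$$E(S) = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the proportion of examples in $S$ that belong to class $i$. Entropy is then used to measure the information gain with respect to an attribute $A$:

$$\mathrm{Gain}(A) = E(\text{current set}) - \sum E(\text{all child sets})$$

where the sum runs over the child sets produced by splitting on $A$, and each child's entropy is weighted by its share of the current set.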
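The entropy and gain computation can be illustrated with a minimal sketch. It assumes the standard ID3 definitions, with each child set's entropy weighted by its share of the current set; the function names and the four toy rows are hypothetical and are not taken from the paper's dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """E(current set) minus the size-weighted entropy of the child sets."""
    current = entropy([row[target] for row in rows])
    children = {}
    for row in rows:
        children.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return current - weighted


# Hypothetical toy data: "income" separates the classes perfectly, "age" not at all.
rows = [
    {"age": "young", "income": "high", "eligible": "yes"},
    {"age": "young", "income": "low", "eligible": "no"},
    {"age": "senior", "income": "high", "eligible": "yes"},
    {"age": "senior", "income": "low", "eligible": "no"},
]
gain_income = information_gain(rows, "income", "eligible")
gain_age = information_gain(rows, "age", "eligible")
```

On these rows the current set has entropy 1 bit (two "yes", two "no"). Splitting on `income` gives two pure child sets, so its gain is 1; splitting on `age` leaves both child sets evenly mixed, so its gain is 0 and ID3 would pick `income`.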


For our experiments we use a synthetic dataset. The following dataset resides at the data owner side before conversion to the unrealized dataset; after conversion, the data owner forwards the unrealized dataset to the analyst.

Figure 2. Original Data at Owner

At the analyst end, the decision tree is constructed from the unrealized (encrypted) dataset based on information gain, and the testing data is analyzed against the training, or unrealized, dataset.
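Why the analyst can build the tree on encrypted data at all: ID3 on categorical attributes only looks at which cells are equal, never at the values themselves, so any deterministic one-to-one encoding of the values leaves every entropy and gain unchanged. The sketch below checks this on hypothetical toy rows. As assumptions not taken from the paper, a keyed HMAC-SHA256 token stands in for AES (which the Python standard library does not provide), attribute names are left in the clear, and the helper names are illustrative.

```python
import hashlib
import hmac
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain(rows, attr, target):
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def id3(rows, attrs, target):
    """Return a nested {attribute: {value: subtree_or_label}} decision tree."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(attrs, key=lambda a: gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in dict.fromkeys(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, rest, target)
    return tree


def token(key, attr, value):
    # Deterministic keyed token; a stand-in for AES encryption (assumption).
    return hmac.new(key, f"{attr}={value}".encode(), hashlib.sha256).hexdigest()[:16]


def encode_tree(tree, key, target):
    """Map a plaintext tree through token() for comparison with the cipher tree."""
    if not isinstance(tree, dict):
        return token(key, target, tree)
    name, branches = next(iter(tree.items()))
    return {name: {token(key, name, v): encode_tree(sub, key, target)
                   for v, sub in branches.items()}}


KEY, ATTRS, TARGET = b"owner-secret-key", ["age", "income"], "eligible"
plain_rows = [  # hypothetical data, not the paper's dataset
    {"age": "young", "income": "high", "eligible": "no"},
    {"age": "young", "income": "low", "eligible": "no"},
    {"age": "middle", "income": "high", "eligible": "yes"},
    {"age": "middle", "income": "low", "eligible": "yes"},
    {"age": "senior", "income": "high", "eligible": "yes"},
    {"age": "senior", "income": "low", "eligible": "no"},
]
cipher_rows = [{a: token(KEY, a, v) for a, v in row.items()} for row in plain_rows]

plain_tree = id3(plain_rows, ATTRS, TARGET)    # what the owner would get
cipher_tree = id3(cipher_rows, ATTRS, TARGET)  # what the analyst actually builds
same_structure = cipher_tree == encode_tree(plain_tree, KEY, TARGET)
```

Here `same_structure` compares the analyst's tree with the owner's tree after the owner's tree is pushed through the same encoding, which is the sense in which classification accuracy is unaffected. The same argument would not hold for a randomized encryption mode, where equal plaintexts yield different ciphertexts.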


Figure 3. Unrealized Dataset

The decision tree is constructed with the class labels based on information gain, computed in terms of entropy; the tree is shown as follows.

Figure 4. Tree and Eligible Data

The final eligible data after decryption at the data owner end is shown as follows.

Figure 5. Eligible Data after Classification and Decryption
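The decoding step at the data owner can be sketched as follows, under stated assumptions: the analyst returns a tree whose branch values and leaf labels are cipher tokens, and the owner holds a codebook from token to plaintext. The token strings, the tree, and the function names below are hypothetical placeholders; the paper does not give the rule format it uses.

```python
def tree_to_rules(tree, conditions=()):
    """Flatten a nested {attribute: {value: subtree}} tree into (conditions, label) rules."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    attr, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules


def decode_rules(rules, codebook):
    """Replace every cipher token in the rules with its plaintext value."""
    return [([(attr, codebook[tok]) for attr, tok in conds], codebook[label])
            for conds, label in rules]


# Hypothetical owner-side codebook and analyst-side cipher tree.
codebook = {
    "a1f3": "young", "9c0e": "senior",
    "77b2": "high", "e5d4": "low",
    "0b9a": "yes", "c3c3": "no",
}
cipher_tree = {"age": {"a1f3": "c3c3",
                       "9c0e": {"income": {"77b2": "0b9a", "e5d4": "c3c3"}}}}

cipher_rules = tree_to_rules(cipher_tree)
plain_rules = decode_rules(cipher_rules, codebook)
for conds, label in plain_rules:
    print(" AND ".join(f"{a} = {v}" for a, v in conds), "=>", label)
```

Each root-to-leaf path becomes one if-then rule, so decoding is a per-attribute lookup and never requires the analyst to see plaintext values.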

IV. CONCLUSION
In this paper we proposed an efficient privacy preservation technique for the classification of unreal datasets. It protects the data owner from unauthorized access and privacy issues, and it works efficiently without violating the classification properties. Meanwhile, an accurate decision tree can be built directly from those unreal data sets. Finally, the results remain accurate even though classification is applied to the cipher dataset.

REFERENCES
[1] R. Agrawal and R. Srikant, "Privacy Preserving Data Mining," Proc. ACM SIGMOD Conf. Management of Data (SIGMOD '00), pp. 439-450, May 2000.
[2] S.L. Wang and A. Jafari, "Hiding Sensitive Predictive Association Rules," Proc. IEEE Int'l Conf. Systems, Man and Cybernetics, pp. 164-169, 2005.
[3] S. Ajmani, R. Morris, and B. Liskov, "A Trusted Third-Party Computation Service," Technical Report MIT-LCS-TR-847, MIT, 2001.
[4] Q. Ma and P. Deng, "Secure Multi-Party Protocols for Privacy Preserving Data Mining," Proc. Third Int'l Conf. Wireless Algorithms, Systems, and Applications (WASA '08), pp. 526-537, 2008.
[5] N. Lomas, "Data on 84,000 United Kingdom Prisoners is Lost," Retrieved Sept. 12, 2008, http://news.cnet.com/83011009_3-10024550-83.html, Aug. 2008.

[6] J. Gitanjali, J. Indumathi, N.C. Iyengar, and N. Sriman, "A Pristine Clean Cabalistic Foruity Strategize Based Approach for Incremental Data Stream Privacy Preserving Data Mining," Proc. IEEE Second Int'l Advance Computing Conf. (IACC), pp. 410-415, 2010.
[7] S. Bu, L. Lakshmanan, R. Ng, and G. Ramesh, "Preservation of Patterns and Input-Output Privacy," Proc. IEEE 23rd Int'l Conf. Data Eng., pp. 696-705, Apr. 2007.
[8] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2/E. Prentice-Hall, 2002.

[9] D. Goodin, "Hackers Infiltrate TD Ameritrade Client Database," Retrieved Sept. 2008, http://www.channelregister.co.uk/2007/09/15/ameritrade_database_burgled/, Sept. 2007.
[10] L. Liu, M. Kantarcioglu, and B. Thuraisingham, "Privacy Preserving Decision Tree Mining from Perturbed Data," Proc. 42nd Hawaii Int'l Conf. System Sciences (HICSS '09), 2009.
[11] Y. Zhu, L. Huang, W. Yang, D. Li, Y. Luo, and F. Dong, "Three New Approaches to Privacy-Preserving Add to Multiply Protocol and Its Application," Proc. Second Int'l Workshop Knowledge Discovery and Data Mining (WKDD '09), pp. 554-558, 2009.
[12] J. Vaidya and C. Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '02), pp. 23-26, July 2002.
[13] J. Dowd, S. Xu, and W. Zhang, "Privacy-Preserving Decision Tree Mining Based on Random Substitutions," Proc. Int'l Conf. Emerging Trends in Information and Comm. Security (ETRICS '06), pp. 145-159, 2006.
[14] C. Aggarwal and P. Yu, Privacy-Preserving Data Mining: Models and Algorithms. Springer, 2008.
[15] L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," Int'l J. Uncertainty, Fuzziness and Knowledge-based Systems, vol. 10, pp. 557-570, May 2002.
[16] M. Shaneck and Y. Kim, "Efficient Cryptographic Primitives for Private Data Mining," Proc. 43rd Hawaii Int'l Conf. System Sciences (HICSS), pp. 1-9, 2010.
[17] BBC News, "Brown Apologises for Records Loss," Retrieved Sept. 12, 2008, http://news.bbc.co.uk/2/hi/uk_news/politics/7104945.stm, Nov. 2007.
[18] D. Kaplan, "Hackers Steal 22,000 Social Security Numbers from Univ. of Missouri Database," Retrieved Sept. 2008, http://www.scmagazineus.com/Hackers-steal22000-Social-Security-numbers-from-Univ.-of-Missouridatabase/article/34964/, May 2007.
[19] P.K. Fong, "Privacy Preservation for Training Data Sets in Database: Application to Decision Tree Learning," master's thesis, Dept. of Computer Science, Univ. of Victoria, 2008.
[20] R. Buyya, C. S. Yeo, and S. Venugopal, "Market-oriented cloud computing: Vision, hype, and reality for delivering IT services as computing utilities," in Proc. IEEE Conf. High Performance Comput. Commun., Sep. 2008, pp. 5-13.
[21] W. K. Wong, D. W. Cheung, E. Hung, B. Kao, and N. Mamoulis, "Security in outsourcing of association rule mining," in Proc. Int. Conf. Very Large Data Bases, 2007, pp. 111-122.
[22] F. Giannotti, L. V. Lakshmanan, A. Monreale, D. Pedreschi, and H. Wang, "Privacy-preserving data mining from outsourced databases," in Proc. SPCC2010 Conjunction with CPDP, 2010, pp. 411-426.
[23] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proc. Int. Conf. Very Large Data Bases, 2002, pp. 682-693.

BIOGRAPHIES
Satyanarayana Mummana is working as an Asst. Professor at Avanthi Institute of Engineering & Technology, Visakhapatnam, Andhra Pradesh. He received his Master's degree (MCA) from Gandhi Institute of Technology and Management (GITAM), Visakhapatnam, and his M.Tech (CSE) from Avanthi Institute of Engineering & Technology, Visakhapatnam, Andhra Pradesh. His research areas include image processing, computer networks, data mining, distributed systems, and cloud computing.

Sivasankar Vakkalagadda completed his B.Tech and is pursuing his M.Tech at Avanthi Institute of Engineering & Technology, Visakhapatnam, Andhra Pradesh. His areas of interest are Java, data mining, web technologies, and Oracle database.
