2015 International Conference on Pervasive Computing (ICPC)

Clustering Based Anonymization For Privacy Preservation

Rashmi B. Ghate
PG Student, Dept. CT
Y.C.C.E.
Nagpur, India
ghaterashmi0@gmail.com

Rasika Ingle
Asst. Prof., Dept. CT
Y.C.C.E.
Nagpur, India
meet.rasika@gmail.com

978-1-4799-6272-3/15/$31.00 (c) 2015 IEEE

Abstract— While registering on a social networking site, users must provide personal information; some of this information is sensitive and needs to be protected. To sustain the privacy of users on a social network, an anonymization technique is employed. In the anonymization approach, an individual's personal information is either masked or removed from the dataset so that the individual's data becomes anonymous. When a dataset is released, it is important to prevent unwanted disclosure of the data and to balance the usefulness and privacy of the published dataset. The proposed work gives the anonymized view of a dataset and the result of implementing the Single Pass K-Means Anonymization algorithm. To anonymize the dataset, generalization and suppression approaches are used.

Keywords— Anonymization; Generalization; K-Means; Social Network; Suppression

I. INTRODUCTION
Social networks have gained tremendous popularity on the internet. On social networks, individuals share data about their professional, business and private lives. These networks store personal information about individuals; thus social networks become a vast pool of data. This information holds users' public data as well as private data. Public information is shared openly by individuals and is, by default, visible to everyone, whereas private information contains personal data that the individual wants to be protected. Nowadays social network data is released and used by researchers and business people for professional and analysis purposes. Directly releasing such information on the internet may harm the privacy of social network users and lead to direct disclosure of individuals' information. Thus, to preserve privacy on social networks, many users prefer to hide their real identity; this is achieved by using an anonymization technique. Anonymization is the technique of encrypting or removing personal information (the Quasi Identifier, or QI, attributes) from the dataset so that the individuals whom the dataset describes remain protected. When a dataset is released, it is important to prevent unwanted disclosure of the data and to balance the usefulness and privacy of the released dataset. To minimize the risk of direct access to the Quasi Identifier attributes in the dataset, generalization and suppression techniques are employed. In generalization, a Quasi Identifier attribute value is replaced with a more general value; in suppression, the original Quasi Identifier attribute value is replaced with a special value. Applying generalization and suppression to the dataset prevents the danger of unwanted disclosure, maintains the confidentiality of the dataset, and preserves its secrecy when it is published for research and business use. Anonymization is used to increase the privacy of social network users.

This work proposes anonymization techniques based on clustering. The clustering uses a modified K-Means clustering algorithm to cluster the dataset. Generalization and suppression are the techniques used to anonymize the dataset and avoid direct access to it, which also protects the Quasi Identifier attributes from unwanted disclosure and preserves privacy. The paper is structured as follows. Section II gives a literature survey on anonymization. Section III gives an overview of clustering based anonymization for privacy preservation of a dataset. Section IV gives the implementation of the Single Pass K-Means Anonymization algorithm, and Section V concludes the paper with future work.

II. LITERATURE SURVEY
Nowadays, maintaining privacy while revealing individuals' information over social networks is essential. In a brief survey on anonymization techniques, [1] Zhou, Pei and Luk review the existing anonymization methods for sustaining privacy when revealing data on social networks, recognize the challenges in maintaining secrecy while publishing social network information, and analyze the feasible issues in three important categories: privacy, background knowledge, and data utility. Anonymization approaches are either clustering based or graph modification based. [2] Elena Zheleva and Lise Getoor present potential privacy breaches in OSNs (online social networks), along with some current privacy definitions and techniques for maintaining the confidentiality of users. [3] Prateek Joshi and C.-C. Jay Kuo give a mathematical formulation as well as computational models of privacy and
security of social network data, and present metrics for computing the total amount of privacy as well as security in a social network. [4] Meng-Cheng Wei and Jun-Lin Lin present a model for k-anonymity that reduces the loss of information during the generalization process used to anonymize the data. For this, it is necessary to combine data of the same type into a single group, each group forming an anonymous set of data; their clustering based k-anonymity technique executes in O(n^2/k) time. The authors experimentally compare their technique with other clustering based k-anonymity techniques. [5] A. Machanavajjhala, J. Gehrke and D. Kifer describe two types of attacks on a k-anonymized dataset and give a new, more powerful privacy approach called "l-diversity". [6] Mingxuan Yuan and Lei Chen present a k-degree l-diversity model for preserving structural data as well as the personal labels of individuals, and create a new anonymization method that adds noise nodes, resulting in a novel algorithm. [7] N. Mogre, G. Agarwal and P. Patil give state-of-the-art methods for the privacy of high dimensional databases and adopt a new slicing technique for high dimensional datasets. [8] Dror J. Cohen and Tamir Tassa discuss the issues of privacy protection on social networks, where the goal is to obtain an anonymized view of the data without revealing any information about individuals or the relationships between individuals on the social network. The authors start with the centralized setting and provide two forms of anonymization algorithm based on sequential clustering.

Thus the proposed work holds that anonymization of a social network dataset is an important concern for keeping the dataset secret when it is published. To anonymize the dataset, a clustering technique is applied. Clustering of the dataset is done by implementing a K-Means algorithm (the Single Pass K-Means Anonymization algorithm). Generalization and suppression approaches are used to maintain the confidentiality of the dataset and to obtain its anonymized view.

III. AN OVERVIEW OF CLUSTERING BASED ANONYMIZATION
In data mining, clustering is an effective and efficient technique that partitions the data in a dataset into clusters such that records within a cluster are similar to each other while records in different clusters are distinct; it can therefore be used for k-anonymization. The primary goal in publishing the anonymized dataset of a social network is to reduce the ability to predict private data from public data [9]. K-Means clustering is widely used for clustering data on a social network. As a technique for clustering the data in a dataset, the K-Means algorithm is widely practiced because of its simplicity and its ability to converge very fast in practice [10]. To preserve privacy on social networks, the dataset needs to be anonymized. To achieve anonymization, clusters of the dataset have to be created. The K-Means clustering algorithm is used to achieve k-anonymity. In the proposed work, a sequential K-Means clustering algorithm is used for anonymization [9]. The anonymized view of the dataset protects individuals' data from unwanted disclosure.

Fig. 1. Block Diagram of Complete Anonymization Process

IV. IMPLEMENTATION

A. Creation of Data Set
The social network dataset was created in MS Excel. The dataset has ten fields (attributes): UID, E-Mail, Age, Gender, Mobile No., Zip Code, Country, Relationship Status, Income, and Profession. Of these attributes, Age, Gender, Mobile No., and Relationship Status are the Quasi Identifier attributes, which are anonymized.

Fig. 2. Data Set of Social Network

B. Creating Clusters by Implementing Single Pass K-Means Anonymization
To implement the Single Pass K-Means Anonymization algorithm [9], all the records in dataset D are first sorted by their QI (quasi identifiers). Then K records are arbitrarily selected as the initial cluster centers, giving K clusters. Each remaining record r in D is assigned to its closest cluster center. Unlike the standard K-Means clustering algorithm, the center of a cluster is updated every time a record is added to that cluster, which improves the effectiveness of subsequent insertions of records into the clusters. After the clusters are formed, they are adjusted to minimize data loss. From clusters with more than k records, some records are removed and afterwards added to clusters with fewer than k records. If every cluster contains no fewer than k records and some records have still not been assigned to any cluster, those records are assigned to their respective closest cluster centers.
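The clustering and adjustment steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the quasi identifiers are assumed to be numeric, closeness is measured with squared Euclidean distance, the number of clusters is taken as K = floor(n / k), and oversized clusters give up the records farthest from their center. The function names (`sq_dist`, `centroid`, `single_pass_kmeans`) are hypothetical.

```python
import random
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two numeric QI vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of QI vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def single_pass_kmeans(records: Sequence[Sequence[float]], k: int,
                       seed: int = 0) -> List[List[int]]:
    """Group record indices into clusters of at least k members each."""
    n = len(records)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k records")
    num_clusters = n // k  # assumption: K = floor(n / k)

    # Sort record indices by their quasi-identifier values.
    order = sorted(range(n), key=lambda i: tuple(records[i]))

    # Arbitrarily pick K records as the initial cluster centers.
    seeds = random.Random(seed).sample(order, num_clusters)
    clusters = [[s] for s in seeds]
    centers = [list(records[s]) for s in seeds]
    seeded = set(seeds)

    # Single pass: assign each record to its closest center and
    # update that center immediately after the insertion.
    for i in order:
        if i in seeded:
            continue
        c = min(range(num_clusters),
                key=lambda j: sq_dist(records[i], centers[j]))
        clusters[c].append(i)
        centers[c] = centroid([records[m] for m in clusters[c]])

    # Adjustment: trim clusters larger than k (farthest records first,
    # an assumption) and collect the removed records.
    spare: List[int] = []
    for c in range(num_clusters):
        if len(clusters[c]) > k:
            clusters[c].sort(key=lambda m: sq_dist(records[m], centers[c]))
            spare.extend(clusters[c][k:])
            clusters[c] = clusters[c][:k]
            centers[c] = centroid([records[m] for m in clusters[c]])

    # Give removed records to the closest cluster that is still below k;
    # once none is below k, use the closest cluster overall.
    for i in spare:
        small = [c for c in range(num_clusters) if len(clusters[c]) < k]
        candidates = small or list(range(num_clusters))
        c = min(candidates, key=lambda j: sq_dist(records[i], centers[j]))
        clusters[c].append(i)
        centers[c] = centroid([records[m] for m in clusters[c]])

    return clusters


if __name__ == "__main__":
    # Hypothetical (age, zip code) pairs used only for illustration.
    data = [(21, 47824), (23, 47865), (27, 47843), (34, 47901), (36, 47910),
            (38, 47899), (52, 48011), (55, 48020), (58, 48002), (60, 48015)]
    for members in single_pass_kmeans(data, k=3):
        print(sorted(members))
```

Because n is at least K times k, the trimmed records always suffice to fill every undersized cluster, so each returned cluster has at least k members. Categorical attributes such as Gender or Relationship Status would need a different distance measure, which this sketch does not cover.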
Fig. 3. Cluster of Data Set

C. Generalization And Suppression
In the generalization technique, quasi-identifiers and sets of non-identifying attributes such as age or zip code are replaced with less specific and less informative but semantically consistent values; that is, an attribute value is replaced with a more general value. For example, the zip codes {47824, 47865, 47843} can be generalized to 478** by stripping the rightmost digits. Suppression does not release the value at all, or replaces the attribute value with a special value. For example, the gender attribute {Male, Female} is completely suppressed to *. This can also be done by removing individual attribute values or by partitioning the attribute domain into intervals [11].

Fig. 4. Generalization of Data Set

Fig. 5. Suppression of Data Set

V. CONCLUSION AND FUTURE WORK
In this paper, an anonymization technique is used to achieve clustering based anonymization for preserving privacy on social networks. We successfully implemented the Single Pass K-Means Anonymization algorithm and applied the generalization and suppression processes to the dataset to obtain its anonymized view. In future work, we will measure the information loss during the generalization process, apply the t-closeness measure, study and implement different approaches to the Single Pass K-Means Anonymization algorithm, and conclude with a comparative study of these approaches.

REFERENCES
[1] Bin Zhou, Jian Pei and Wo-Shun Luk, "A Brief Survey on Anonymization Techniques for Privacy Preserving Publishing of Social Network Data", ACM Newsletter Journal, Vol. 10, pp. 12-22, December 2008.
[2] Elena Zheleva and Lise Getoor, "Privacy in Social Networks: A Survey", Journal Springer Science+Business Media, pp. 277-306, March 2011.
[3] Prateek Joshi and C.-C. Jay Kuo, "Security and Privacy in Online Social Networks: A Survey", IEEE International Conference on Multimedia and Expo, pp. 11-15, July 2011.
[4] Jun-Lin Lin and Meng-Cheng Wei, "An Efficient Method for K-anonymization", Journal ACM 08 Proceedings of International Workshop on Privacy and Anonymity in Information Society, pp. 46-50, 2008.
[5] Ashwin Machanavajjhala, Johannes Gehrke and Daniel Kifer, "l-Diversity: Privacy Beyond k-Anonymity", Journal ACM Transactions on Knowledge Discovery from Data, Vol. 33, pp. 451-464, March 2007.
[6] Mingxuan Yuan and Lei Chen, "Protecting Sensitive Labels in Social Network Data Anonymization", IEEE Transactions on Knowledge and Data Engineering, Vol. 25, pp. 633-647, March 2013.
[7] Neha V. Mogre, Girish Agarwal and Pragati Patil, "Privacy Preserving for High-dimensional Data using Anonymization Technique", International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, pp. 185-189, June 2013.
[8] Tamir Tassa and Dror J. Cohen, "Anonymization of centralized and distributed social networks by sequential clustering", IEEE Transactions on Knowledge and Data Engineering, Vol. 25, pp. 311-324, Feb 2013.
[9] J. Poulin, IIM. Mathina Kani, "Preserving the Privacy on Social Network by Clustering Based Anonymization", International Journal of Advanced Research in Computer Science & Technology, Vol. 2, pp. 11-14, March 2014.
[10] Zekeriya Erkin, Thijs Veugen, Tomas Toft and Reginald L. Lagendijk, "Privacy-Preserving User Clustering in a Social Network", IEEE Workshop on Information Forensics and Security (WIFS), pp. 180-187, 2009.
[11] L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression", International Journal of Uncertainty, pp. 571-588, 2002.
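As a supplementary illustration, the generalization and suppression step of Section IV-C can be sketched in Python, together with a simple check of whether the result is k-anonymous. This is a minimal sketch under assumptions: the field names (`zip`, `age`, `gender`), the ten-year age interval, and all function names are hypothetical and are not taken from the paper's implementation.

```python
from collections import Counter
from typing import Dict, List, Sequence


def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep the leading digits and mask the rest, e.g. 47824 -> 478**."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with the interval that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress(_value: object) -> str:
    """Suppression: release only the special value '*'."""
    return "*"


def anonymize(record: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of the record with its quasi identifiers masked."""
    out = dict(record)
    out["zip"] = generalize_zip(str(record["zip"]))
    out["age"] = generalize_age(int(record["age"]))
    out["gender"] = suppress(record["gender"])
    return out


def is_k_anonymous(rows: List[Dict[str, object]],
                   qi: Sequence[str], k: int) -> bool:
    """True if every QI value combination occurs in at least k rows."""
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(c >= k for c in counts.values())


if __name__ == "__main__":
    # Hypothetical records; the zip codes are the ones used in Section IV-C.
    rows = [
        {"uid": 1, "zip": "47824", "age": 23, "gender": "Male"},
        {"uid": 2, "zip": "47865", "age": 27, "gender": "Female"},
        {"uid": 3, "zip": "47843", "age": 21, "gender": "Male"},
    ]
    qi = ["zip", "age", "gender"]
    released = [anonymize(r) for r in rows]
    print(released[0]["zip"], released[0]["age"], released[0]["gender"])
    print(is_k_anonymous(rows, qi, 2), is_k_anonymous(released, qi, 3))
```

In this example the three raw records are all distinct on their quasi identifiers, while the three released records share the same generalized values and so form one group of size three. In practice the masking would be applied per cluster, with the level of generalization chosen so that all records in a cluster become indistinguishable.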
