
Mukt Shabd Journal ISSN NO : 2347-3150

Protecting User Privacy from External Applications Using Semi-Supervised Based PDS Systems
Thurimerla Prasanth*1, J.Sai#2, K.Govardhan#3, J.Pradeep#4, Ch.Naveen#5
Asst. Professor*1, UG Scholar #2,#3,#4,#5
Department of Computer Science and Engineering, Geethanjali Institute of Science & Technology, Nellore, Andhra Pradesh-524137

Abstract— Personal Data Storage (PDS) has brought a substantial change to the way individuals and enterprises can
store and control their personal data. A PDS offers individuals the ability to keep their data in a single logical repository, which can
be connected to and exploited by proper analytical tools, or shared with third parties under the control of end users. Previous studies
on PDS have focused on how to enforce user privacy preferences and how to secure data stored in the PDS. Our aim is to design a
Privacy-aware Personal Data Storage (P-PDS), that is, a PDS which is able to automatically take privacy-aware decisions on
third-party access requests in accordance with user preferences. In this paper, we have deeply revised the learning process to obtain a more
usable P-PDS, in terms of reduced effort for the training phase, as well as a more conservative approach w.r.t. user privacy when
handling conflicting access requests.

Keywords— Personal Data Storage, Cloud Computing, Active Learning, Ensemble Learning, Data Protection
I. INTRODUCTION
Nowadays, the personal data we digitally produce are scattered across different online systems
managed by different providers (e.g., online social media, hospitals, banks, airlines, etc.). In this way, on the
one hand, users are losing control over their data, whose protection is under the responsibility of the data
provider, and, on the other, they cannot fully exploit their data, since each provider keeps a separate view of
them. To overcome this scenario, Personal Data Storage (PDS) [2]–[4] has inaugurated a substantial change
to the way people can store and control their personal data, by moving from a service-centric to a
user-centric model.
PDSs enable individuals to collect into a single logical vault personal information they are
producing. Such data can then be connected and exploited by proper analytical tools, as well as shared with
third parties under the control of end users. This view is also enabled by recent developments in privacy
legislation and, in particular, by the new EU General Data Protection Regulation (GDPR), whose art. 20
states the right to data portability, according to which the data subject shall have the right to receive the
personal data concerning him or her, which he or she has provided to a controller, in a structured, commonly
used and machine-readable format, thus making possible data collection into a PDS.
Up to now, most of the research on PDS has focused on how to enforce user privacy preferences
and how to secure data stored in the PDS. In contrast, the key issue of helping users specify
their privacy preferences on PDS data has not so far been deeply investigated. This is a fundamental issue,
since average PDS users are not skilled enough to understand how to translate their privacy requirements
into a set of privacy preferences. As several studies have shown, average users might have difficulties in
properly setting potentially complex privacy preferences [5]–[7]. For example, let us consider Facebook's
privacy settings, where users need to configure the options manually according to their wishes. In [8], [9],

Volume IX, Issue VII, JULY/2020 Page No : 2956


the authors survey users' awareness, attitudes, and privacy concerns about profile information and find that only a
few users change the default privacy preferences on Facebook. Interestingly, the authors find that even when
users have changed their default privacy settings, the modified settings do not match their expectations (these
are met for only 39% of users). Moreover, another survey has shown that Facebook users are not sufficiently
aware of the protection tools designed to protect their personal data. According to that study, the majority
(about 88%) of users had never read the Facebook privacy policy.
To help users protect their PDS data, in [1] we evaluated the use of different
semi-supervised machine learning approaches for learning the privacy preferences of PDS owners. The idea is to find
a learning algorithm that, after a training period by the PDS owner, returns a classifier able to automatically
decide whether access requests submitted by third parties are to be authorized or denied.
To further reduce the required user effort, in the current paper we leverage active learning (AL)
to minimize the user burden of building the training dataset while, at the same time, achieving better accuracy in
determining user privacy preferences. The main idea of active learning is to select from the training dataset
the most representative instances to be labeled by users. The literature offers several methods for driving the
selection of these new instances. The most commonly adopted method is uncertainty sampling. According
to this approach, active learning selects, for labeling by human annotators, those instances whose label is
highly uncertain according to the preliminary model. As reported, this improvement
brings benefits in terms of accuracy and usability. Additionally, to further improve the performance of the
system, we define an alternative uncertainty sampling strategy, which is based on the observation that, for
taking a privacy-related decision, some fields of access requests (i.e., data consumer and type of service
requesting the data) are more informative than others. Thus, if a new access request presents new values for
these fields, the system pushes for a new training step (i.e., asking the data owner for a label for the access request). To
enforce this behavior, we introduce a penalization of the uncertainty measure based on the distance of the
new access request w.r.t. the access requests previously labeled by the P-PDS owner (we call this strategy
history-based active learning). As the experiments show, history-based active learning achieves better
results than AL in terms of user satisfaction.

As a further improvement, in this paper we propose a revised
version of the ensemble learning algorithm proposed in [1], to enforce a more conservative approach w.r.t.
user privacy. In particular, we reconsider how ensemble learning handles decisions for access requests for
which classifiers return conflicting classes. In general, the final decision is taken by selecting the class with the
highest aggregated probability. However, this has the limitation of not considering the user perspective, in that
it does not take into account which classifier is more relevant for the considered user. To cope with this
issue, we propose an alternative strategy for aggregating the class labels returned by the classifiers.
According to this approach, we assign a personalized weight to each single classifier used in ensemble
learning. We also show how it is possible to learn these weights from the training dataset, thus without the
need for further input from the P-PDS owner. Experiments show that this approach increases user
satisfaction as well as learning effectiveness.
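The weighted aggregation described above can be illustrated with a minimal sketch. The text does not state how the weights are learned, so this example assumes, purely for illustration, that each classifier's weight is its accuracy on the owner-labeled training requests. The classifier names (`dc_st`, `d_p`) and all values are hypothetical.

```python
from collections import defaultdict


def learn_weights(classifier_predictions, true_labels):
    """Assumption: weight each classifier by its accuracy on the labeled training requests."""
    weights = {}
    for name, preds in classifier_predictions.items():
        correct = sum(p == t for p, t in zip(preds, true_labels))
        weights[name] = correct / len(true_labels)
    return weights


def weighted_decision(class_probs, weights):
    """Return the class with the highest weighted sum of class probabilities."""
    totals = defaultdict(float)
    for name, probs in class_probs.items():
        for label, p in probs.items():
            totals[label] += weights[name] * p
    return max(totals, key=totals.get)


# Hypothetical predictions of two classifiers on four training requests.
train_preds = {
    "dc_st": ["yes", "no", "no", "yes"],
    "d_p": ["yes", "yes", "yes", "yes"],
}
owner_labels = ["yes", "no", "no", "no"]
weights = learn_weights(train_preds, owner_labels)

# Conflicting outputs for a new access request (hypothetical values).
new_probs = {
    "dc_st": {"yes": 0.3, "no": 0.7},
    "d_p": {"yes": 0.9, "no": 0.1},
}
decision = weighted_decision(new_probs, weights)
print(weights, decision)
```

In this toy case an unweighted sum would favor "yes" (1.2 against 0.8), whereas the weighted sum favors "no" (0.55 against 0.45), because the classifier that agreed more often with the owner's past labels carries more weight.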

II. RELATED WORK

In [1], to learn the privacy habits of PDS users, we defined a learning model tailored to the
PDS scenario, that is, defined so as to consider different driving dimensions derived from a typical data
access request. More precisely, in [1], an access request AR is modeled as a tuple (DC, st, d0, p, o), where
DC is the third party requesting data from the PDS, st is the type of service provided by DC, d0 is the requested
data, p is the access purpose, and o is an offer that is modeled as a value in the range 0 to 100%.
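As an illustration of this tuple, the following minimal sketch represents an access request as an immutable record. The attribute names and the sample values are hypothetical and chosen only for readability; only the five components and the 0 to 100 range of the offer come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """Access request AR = (DC, st, d0, p, o); attribute names are illustrative."""
    consumer: str      # DC: third party requesting the data
    service_type: str  # st: type of service provided by DC
    data: str          # d0: requested data
    purpose: str       # p: access purpose
    offer: float       # o: offer, expressed as a percentage in [0, 100]

    def __post_init__(self):
        # Reject offers outside the range stated in the text.
        if not 0.0 <= self.offer <= 100.0:
            raise ValueError("offer must be in the range 0 to 100")


# Hypothetical request: a fitness service asks for heart-rate data.
ar = AccessRequest("FitnessApp", "health", "heart_rate", "research", 20.0)
print(ar)
```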

The learning model in [1] leveraged semi-supervised algorithms, showing that these
algorithms provide better accuracy than supervised learning (i.e., SVM) even with a small training dataset.
The distinguishing feature between supervised and semi-supervised learning is that supervised learning uses only
the labeled data, whereas semi-supervised learning exploits both labeled and unlabeled data to build
prediction models. Prediction models can then be used for predicting the class labels of new access requests.
Moreover, in selecting the semi-supervised learning algorithms to be used, in [1] we took into account
that individuals give different relevance to each field of an access request. As an example, a user might
prefer not to release highly sensitive information; in this case, the type of requested data impacts the user's
decision more than other fields in the access request. Additionally, decisions on releasing a piece of data
might be impacted by a combination of different access request fields. As a further example, users with
conservative decisions for highly sensitive information might be more prone to data release if data consumers
have a high reputation or the returned benefits are relevant (e.g., benefits in terms of service type and/or
offer). To cope with all the above dimensions, in [1], the following approaches have been evaluated:
single-view, where a classifier is built on the whole set of access request fields all together; multi-view, where two
disjoint views of access request fields (i.e., a view on fields about the requested data and one containing
fields describing how data are used) are used to build two separate classifiers, which are then combined to
take the final decision; and ensemble, which considers each combination of fields in the access request separately
and builds a classifier for each of them. In [1], it has been shown that this latter approach outperforms the
single-view and multi-view approaches. As such, in the following we briefly overview this latter approach.

Ensemble learning is a learning technique where multiple classifiers are trained and combined
to solve a particular computational problem. Generally, ensemble learning is used to improve the prediction
or minimize the error by combining the individual decisions of each classifier. In applying ensemble
learning, the work in [1] built a single classifier for each distinct pair of fields of an access request, to
model each possible relationship among AR fields. In particular, each classifier is built using the
Expectation-Maximization (EM) algorithm. Then, the 10 obtained classifiers, denoted as the ensemble, are used
to compute 10 different probabilities of a new access request AR being labeled as authorized (i.e.,
class yes), denied (i.e., class no), or to be evaluated by the user (i.e., class maybe). The final decision is taken as
the class with the greatest aggregate probability value.
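The count of 10 classifiers follows from the 5 fields of an access request, since there are C(5, 2) = 10 distinct pairs. The short sketch below enumerates the pairs and applies the aggregate-probability rule. The per-classifier probabilities are hypothetical placeholders, not outputs of the EM-based classifiers of [1].

```python
from itertools import combinations

FIELDS = ("DC", "st", "d0", "p", "o")

# One classifier per distinct pair of access request fields.
FIELD_PAIRS = list(combinations(FIELDS, 2))


def aggregate_decision(per_classifier_probs):
    """Sum the class probabilities over all classifiers and return the top class."""
    totals = {}
    for probs in per_classifier_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    return max(totals, key=totals.get)


# Hypothetical outputs: six classifiers lean towards "yes", four lean strongly towards "no".
outputs = (
    [{"yes": 0.6, "no": 0.3, "maybe": 0.1}] * 6
    + [{"yes": 0.2, "no": 0.7, "maybe": 0.1}] * 4
)

print(len(FIELD_PAIRS))
print(aggregate_decision(outputs))
```

Here the aggregate favors "no" (about 4.6 against 4.4) even though six of the ten classifiers individually prefer "yes", which shows that aggregating probabilities is not the same as majority voting.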

III. PROPOSED METHODOLOGY

The proposal discussed in [1] demonstrates that semi-supervised ensemble learning can be
exploited to train a classifier that makes a PDS able to automatically decide whether an access request has to
be authorized or not. However, to build a classifier using a predictive learning model, it is essential to label
an initial set of instances, called the training dataset. It is a matter of fact that obtaining a sufficient number of
labeled instances is time-consuming and costly due to the required human input. On the other hand, the size
and quality of the training dataset impact the accuracy the classifier might reach. Therefore, active learning
(AL) may be exploited to reduce the size of the training dataset. The key idea of AL is to build the training
dataset by properly selecting a reduced number of instances from unlabeled items, rather than randomly
choosing them as done by traditional supervised learning algorithms. This makes it possible to efficiently
exploit unlabeled instances for developing effective prediction models as well as to reduce the time and cost
of labeling.

More precisely, the main idea of AL is to first select very few instances to be labeled by
humans and build a preliminary prediction model on them. After that, AL exploits this preliminary model
to select new instances from the training dataset to be labeled, so as to reinforce the model. The literature offers several
methods for driving the selection of these new instances. The most commonly adopted method is uncertainty
sampling, where the instances whose label is highly uncertain according to the
preliminary model are selected to be labeled by human annotators.
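A minimal sketch of uncertainty sampling follows. The text does not specify the uncertainty measure, so Shannon entropy of the predicted class distribution is assumed here as one common choice. The request identifiers and probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a class distribution; an assumed uncertainty measure."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def select_most_uncertain(predictions, k):
    """Return the ids of the k instances whose predicted label is most uncertain."""
    ranked = sorted(predictions, key=lambda rid: entropy(predictions[rid]), reverse=True)
    return ranked[:k]


# Hypothetical outputs of the preliminary model on three unlabeled requests.
predictions = {
    "ar1": {"yes": 0.95, "no": 0.05},
    "ar2": {"yes": 0.50, "no": 0.50},
    "ar3": {"yes": 0.70, "no": 0.30},
}

# The two least certain requests are the ones sent to the human annotator.
to_label = select_most_uncertain(predictions, k=2)
print(to_label)
```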

Although AL greatly reduces human participation in labeling the training dataset and leads to good
performance, researchers have further investigated how to combine active learning with semi-supervised
approaches.

As depicted in Figure 1, the proposed P-PDS selects a first small set of incoming access requests
(see interaction a in Figure 1) in order to create an initial training dataset, to be labeled by the P-PDS owner,
which is then used to build the preliminary learning model. Then, using this preliminary model, P-PDS
measures the uncertainty of the newly arriving access request $\overline{AR}$ (see b in Figure 1) and asks the P-PDS owner
to directly label $\overline{AR}$ only if its uncertainty level is high (c). Otherwise, $\overline{AR}$ is immediately labeled by the
semi-supervised ensemble classifier using the preliminary model. Even if this improvement brings benefits
in terms of accuracy and usability, it can be further extended to be more protective w.r.t. the P-PDS owner's
privacy.
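The decision flow above, together with the history-based penalization outlined in the Introduction, can be sketched as follows. This is only an illustration under stated assumptions: entropy is used as the uncertainty measure, the distance to the labeling history is approximated by the fraction of informative fields (DC and st) whose values were never seen before, and the `threshold` and `penalty` values are arbitrary. None of these choices is taken from the paper.

```python
import math

# Fields the text names as most informative: data consumer and type of service.
INFORMATIVE_FIELDS = ("DC", "st")


def entropy(probs):
    """Shannon entropy (in bits) of a class distribution; an assumed uncertainty measure."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def novelty(request, history):
    """Assumed distance: fraction of informative fields with values unseen in the history."""
    unseen = 0
    for field in INFORMATIVE_FIELDS:
        seen_values = {past[field] for past in history}
        if request[field] not in seen_values:
            unseen += 1
    return unseen / len(INFORMATIVE_FIELDS)


def handle_request(request, probs, history, threshold=0.8, penalty=1.0):
    """Ask the owner when the penalized uncertainty is high; otherwise label automatically."""
    score = entropy(probs) + penalty * novelty(request, history)
    if score >= threshold:
        return "ask_owner"
    return max(probs, key=probs.get)


# Hypothetical labeling history and a confident model output.
history = [{"DC": "BankA", "st": "finance"}, {"DC": "ClinicB", "st": "health"}]
confident = {"yes": 0.95, "no": 0.05}

known = handle_request({"DC": "BankA", "st": "finance"}, confident, history)
unknown = handle_request({"DC": "ShopC", "st": "retail"}, confident, history)
print(known, unknown)
```

With the same confident model output, the request from a known consumer and service type is labeled automatically, while the request with unseen values for both fields is routed to the owner, which is the more conservative behavior.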


Fig. 1 System Overview

A. Implementation Modules
1) Data Owner:
• In this module, the data owner can register and log in to the system, and upload patient health details to the untrusted
cloud server.
• The data owner can view the details of the uploaded files and set the access control permissions that govern user access to each file.
2) User:
• In this module, the data user (end user), who is a doctor or physician, can register and log in to the system, and search for patient
details based on their attribute set.
• The user can view the file details and request the secret key from the cloud server.
• After obtaining the secret key from the cloud server, the user can decrypt and download files.
3) Authority:
• In this module, the data user (end user), who is a doctor or physician, can register and log in to the system, and search for patient
details based on their attribute set.
• The user can view the file details and request the secret key from the cloud server.
• After obtaining the secret key from the cloud server, the user can decrypt and download files.
4) Cloud Server:
• In this module, the cloud server logs in to the system, views all users' details, and authorizes them.
• The cloud provider can view the files uploaded to the system, check the file decrypt and download requests, and
accept or reject each request.
• The cloud provider can also generate various reports.


B. Proposed Algorithm

Algorithm 1 provides a summary of the ensemble approach presented in [1]. It takes as input the training dataset LAR and the set
of unlabeled access requests UAR, and returns the updated classifiers $\Theta_{ensemble}$. More precisely, Algorithm 1 builds the set of
classifiers $\Theta_{ensemble}$ considering, for each classifier, only the projection of LAR on the corresponding pair of access request
fields. Then, it computes the probability of each class label (e.g., yes, no) produced by the classifiers $\Theta_{ensemble}$ for the newly arrived
access request $\overline{AR}$ and returns, as the final decision, the class label with the greatest aggregate probability value.
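The flow summarized above can be sketched as follows. This is not Algorithm 1 itself: for brevity, each per-pair classifier is replaced by a Laplace-smoothed frequency count over the labeled projections, as a simple stand-in for the EM-trained semi-supervised classifiers of [1], and the unlabeled set UAR is therefore not used. The sample requests are hypothetical, and the offer is assumed to be discretized into levels.

```python
from collections import Counter, defaultdict
from itertools import combinations

FIELDS = ("DC", "st", "d0", "p", "o")
LABELS = ("yes", "no")


def train_ensemble(labeled_requests):
    """Build one count-based classifier per pair of fields, using only the projection on that pair."""
    ensemble = {}
    for pair in combinations(FIELDS, 2):
        counts = defaultdict(Counter)
        for request, label in labeled_requests:
            projection = tuple(request[f] for f in pair)
            counts[projection][label] += 1
        ensemble[pair] = counts
    return ensemble


def class_probabilities(counts, projection):
    """Laplace-smoothed label probabilities for one projected request (stand-in for EM)."""
    seen = counts.get(projection, Counter())
    total = sum(seen.values()) + len(LABELS)
    return {label: (seen[label] + 1) / total for label in LABELS}


def decide(ensemble, request):
    """Return the class label with the greatest aggregate probability over all classifiers."""
    totals = dict.fromkeys(LABELS, 0.0)
    for pair, counts in ensemble.items():
        projection = tuple(request[f] for f in pair)
        for label, p in class_probabilities(counts, projection).items():
            totals[label] += p
    return max(totals, key=totals.get)


# Hypothetical labeled training set LAR.
LAR = [
    ({"DC": "ClinicB", "st": "health", "d0": "heart_rate", "p": "treatment", "o": "high"}, "yes"),
    ({"DC": "AdCo", "st": "marketing", "d0": "heart_rate", "p": "advertising", "o": "low"}, "no"),
    ({"DC": "AdCo", "st": "marketing", "d0": "location", "p": "advertising", "o": "low"}, "no"),
]
ensemble = train_ensemble(LAR)

# New request from a known consumer, asking for a data type not seen before.
new_ar = {"DC": "AdCo", "st": "marketing", "d0": "steps", "p": "advertising", "o": "low"}
decision = decide(ensemble, new_ar)
print(decision)
```

The six classifiers that do not involve the requested data recognize the consumer's earlier denials, while the four that involve the unseen data type fall back to a uniform distribution, so the aggregate decision is "no".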

IV. RESULTS ANALYSIS

Fig. 2 Cloud Server Login


Fig. 3 View All Patient Details

Fig. 4 View No. of Same Disease


Fig. 5 PDS Owner Login

Fig. 6 Grant Access Permissions


Fig. 7 Professional Login Page

Fig. 8 Search Patient Details

V. CONCLUSION AND FUTURE ENHANCEMENT


This paper implemented a Privacy-aware Personal Data Storage, which automatically takes
privacy-aware decisions on third-party access requests in accordance with user preferences. The system relies on active
learning complemented with strategies to strengthen user privacy protection. The obtained results show the
effectiveness of the proposed approach. We plan to extend this work along several directions. First, we
are interested in investigating how P-PDS could scale in the IoT scenario, where access request decisions
might depend also on contexts, not only on user preferences. Also, we would like to integrate P-PDS with
cloud computing services (e.g., storage and computing) to design a more powerful P-PDS while, at the same
time, protecting user privacy.

In future work, we will design new approaches to ensure data integrity. To further tap
the potential of PDS, we will also study how to improve the extent to which inconsistent evaluators are satisfied by the
decisions taken in the worst cases.

REFERENCES
[1] B. C. Singh, B. Carminati, and E. Ferrari, “Learning privacy habits of PDS owners,” in Distributed Computing Systems (ICDCS), 2017 IEEE 37th International Conference on. IEEE, 2017, pp. 151–161.
[2] Y.-A. De Montjoye, E. Shmueli, S. S. Wang, and A. S. Pentland, “openPDS: Protecting the privacy of metadata through SafeAnswers,” PloS one, vol. 9, no. 7, p. e98790, 2014.
[3] B. M. Sweatt et al., “A privacy-preserving personal sensor data ecosystem,” Ph.D. dissertation, Massachusetts Institute of Technology, 2014.
[4] B. C. Singh, B. Carminati, and E. Ferrari, “A risk-benefit driven architecture for personal data release,” in Information Reuse and Integration (IRI), 2016 IEEE 17th International Conference on. IEEE, 2016, pp. 40–49.
[5] M. Madejski, M. Johnson, and S. M. Bellovin, “A study of privacy settings errors in an online social network,” in Pervasive Computing and Communications Workshops (PERCOM Workshops), 2012 IEEE International Conference on. IEEE, 2012, pp. 340–345.
[6] L. N. Zlatolas, T. Welzer, M. Heričko, and M. Hölbl, “Privacy antecedents for SNS self-disclosure: The case of Facebook,” Computers in Human Behavior, vol. 45, pp. 158–167, 2015.
[7] D. A. Albertini, B. Carminati, and E. Ferrari, “Privacy settings recommender for online social network,” in Collaboration and Internet Computing (CIC), 2016 IEEE 2nd International Conference on. IEEE, 2016, pp. 514–521.
[8] A. Acquisti and R. Gross, “Imagined communities: Awareness, information sharing, and privacy on the Facebook,” in International workshop on privacy enhancing technologies. Springer, 2006, pp. 36–58.
[9] R. Gross and A. Acquisti, “Information revelation and privacy in online social networks,” in Proceedings of the 2005 ACM workshop on Privacy in the electronic society. ACM, 2005, pp. 71–80.
