Nuicone 2015 7449597
Abstract: Several real-world applications generate data streams where the opportunity to examine each instance is brief. Effective classification of such data streams is an emerging issue in data mining. However, such classification can cause severe threats to privacy. There are several applications, such as credit card fraud detection, disease outbreak or biological attack detection, and loan approval, where the data is homogeneously distributed among different parties. These parties may wish to collaboratively build a classifier to obtain certain global patterns but will be reluctant to disclose their private data. Privacy-preserving classification of such homogeneously distributed data is a challenging issue too. In this paper, we present a brief review of the work carried out in data stream classification and privacy-preserving classification of homogeneously distributed data, followed by an empirical evaluation and performance comparison of some methods in both these areas. We also propose and evaluate an approach of creating an ensemble of anonymous decision trees to classify homogeneously distributed data in a privacy-preserving manner. We further identify the need to develop efficient methods for privacy-preserving classification of homogeneously distributed data streams and propose a suitable approach for the same.

Keywords: Data Streams Classification; Privacy-Preserving Classification; Privacy-Preserving Classification of Homogeneously Distributed Data Streams

I. INTRODUCTION

With technological advancements, several applications generate a vast amount of data. Such data, also known as data streams, are characterized as continuously arriving at unprecedented rates [1]. Examples of data streams include internet traffic streams, stock trading streams, streams generated by e-commerce sites, data generated from scientific projects, supermarket data, multimedia data, medical data streams, data streams generated through industry production processes, remote sensor and video surveillance streams, etc. As these data streams contain valuable knowledge, there is an enormous demand for analyzing and mining them.

Over the last few decades, data mining techniques have been successfully applied to several real-world problems [2]. However, these traditional data mining methods are not feasible for applications generating stream data, as they require the data to be stored first and then scanned multiple times in order to mine it. But storing the data streams consistently and reliably is not possible, as their daily volume may range in the millions. Even if storage is possible, gathering the infinite streams at one time, in one place, in a format suitable for mining, and performing multiple scans on them is not cost-effective [3]. Extracting patterns and models from such continuous data streams is called data stream mining.

In recent years, data stream classification has been an active area of research in data stream mining due to its several real-world applications. Some examples of such applications are e-mail spam detection, credit card fraud detection, malicious web page detection, intrusion detection, detection of any abnormal disease outbreaks from continuous streams of data arriving at hospitals, etc. Predicting the class labels of incoming data streams, that is, data stream classification, is an emerging issue and is discussed in this paper.

Further, advancement in networking technologies has triggered mining of distributed data. Different organizations (data holders) want to undertake a joint data mining task to obtain certain global patterns. Such collaboration is essential because of the mutual benefits it brings. However, free sharing of data is restricted due to privacy and security concerns. For example, the Health Insurance Portability and Accountability Act (HIPAA) laws [1] require that medical data not be released without appropriate anonymization. Similar constraints arise in many applications. Hence, such collaborating parties are reluctant to disclose their private data sets to one another. The problem of preserving the privacy of individuals while mining such distributed data is a challenge and is known as privacy-preserving distributed data mining.

The need for privacy-preserving distributed data mining exists in several domains. For example, several pharmaceutical companies want to collaboratively mine their data and perform genetic experiments on it to discover meaningful patterns among genes [4]. Multiple competing supermarkets wish to conduct data mining on their joint dataset because the result of this collaboration will be valuable to them [5]. The field of homeland security, with its aim of counter-terrorism, depends on collaboration and information sharing across different mission areas and nations [5]. Sharing healthcare and consumer data also enables study and early detection of a disease outbreak.

Recent developments in the field of privacy-preserving data mining focus on two major subjects: preserving the privacy of homogeneously distributed data and preserving the privacy of heterogeneously distributed data. Homogeneously distributed data, or horizontally partitioned data, are the terms used when different sites collect similar kinds of data over different individuals. Heterogeneously distributed data, or vertically partitioned data, refers to data collected by different sites on the same individuals but with different feature sets.

The issue of privacy-preserving data classification has emerged to address several problems that abound in diverse domains. The goal is to build accurate classifiers without compromising the privacy of the data being mined [6, 7]. We focus on homogeneously distributed data [8, 9] as numerous applications fall under this data model. For example, several banks collect transactional information for credit card