An Empirical Study of Differentially-Private Analytics For High-Speed Network Data
ACM Reference Format:
Oana-Georgiana Niculaescu and Gabriel Ghinita. 2018. An Empirical Study of Differentially-Private Analytics for High-Speed Network Data. In Proceedings of Eighth ACM Conference on Data and Application Security and Privacy (CODASPY’18). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3176258.3176944

1 INTRODUCTION
High-speed research networks are essential to support scientific projects and applications that have high bandwidth demands. These networks differ from conventional ones in that they provide much higher line rates (up to 100 Gbps), which are often required by scientific research. To efficiently run and maintain such networks, it is necessary to develop monitoring tools that provide usage statistics and information about network health status. Such data are important to researchers who develop new transport protocols, or to network engineers who must maintain the network infrastructure within optimal working parameters. However, collecting and sharing such high-speed network data also poses serious privacy risks. Through careful analysis of network data, an adversary may be able to determine the identity of a user associated with a specific network flow, and can subsequently infer potentially sensitive information about that individual from her usage pattern, e.g., health status, political affiliation, personal lifestyle, etc.

To counter such threats, it is important to sanitize network data before making them available for analysis. The current de facto standard in privacy protection is the differential privacy (DP) model [3]. DP provides formal protection guarantees, and ensures that an adversary cannot learn with significant probability whether a certain individual’s data is included in a dataset (in our case, an individual’s data consist of the network flows generated by a user). DP achieves protection by adding noise to the data, so one must be careful when deploying this model in practice, such that the distortion is minimized. In the case of high-speed network data, accurate and efficient sanitization is even more challenging, due to the high volumes of generated data.

In our work, we aim to achieve fast and accurate sanitization of high-speed network data, by focusing on network analytics in the form of statistical queries. For instance, one important use of such data is to determine the amount of total traffic flowing between distinct autonomous systems (AS), which are segments of the network under the management of a single organization. This is an important problem to solve for two reasons: (i) organizations are interested in how much traffic goes to/from their peers, which is useful for equipment provisioning or billing, and (ii) there is typically no established relationship of trust among peer organizations that would allow them to directly share internal information about their users, hence the need for privacy. In this context, we present an empirical study of a differentially-private analytics system for high-speed research networks that we developed. We consider two different aspects: the accuracy of answers returned by our system compared to non-sanitized data, and the response time to return the query results. Our proposed system uses Apache HDFS and HBase for data storage and indexing, and builds contingency (i.e., summary) tables to support fast and accurate private analytics.

2 BACKGROUND
2.1 HDFS and HBase. High-speed research networks generate large amounts of data that need to be collected and processed efficiently. We use Apache Hadoop and HBase for processing and storing network data at flow granularity. Hadoop and HBase run atop the Hadoop Distributed File System (HDFS) environment [1]. HDFS is a Java-based file system that provides scalable and reliable data storage across large clusters of commodity servers. The MapReduce computation model [2] is a popular model for distributed big data processing. The core idea behind MapReduce is mapping data into a collection of <key, value> pairs, and then reducing over all pairs with the same key. Both operations can be done in parallel. The overall concept is simple, but it is very powerful when we consider that: (i) most datasets can be meaningfully mapped into <key, value> pairs, and (ii) the keys and values can be of any type (e.g., strings, integers, etc.). Both map and reduce phases use HDFS files as input and output. However, HDFS is designed for sequential access, and does not work well for random access. Since flexible network analytics may need to access multiple data regions, we use HBase for effective data indexing. HBase is a column-oriented, highly distributed NoSQL solution that runs on top of Hadoop and HDFS. HBase supports efficient random access to data.

2.2 Autonomous System (AS) is a collection of networks managed and supervised by a single entity or organization. An AS comprises heterogeneous networks governed by a large enterprise, and has different subnetworks with combined routing logic and common policies. Each AS is assigned a globally unique 16-bit or 32-bit identification number (ASN).

2.3 Contingency Tables summarize data into a set of counts corresponding to all value combinations across certain attributes. A two-dimensional contingency table is based on two variables, one determining the row categories and the other defining the column categories. The combinations of row and column categories are

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
CODASPY’18, March 19–21, 2018, Tempe, AZ, USA
© 2018 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-5632-9/18/03.
https://doi.org/10.1145/3176258.3176944
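As a rough illustration of the two-dimensional contingency tables of Section 2.3 and of the noise addition mentioned in the introduction, the following Python sketch counts flows per (source AS, destination AS) pair and perturbs each count. It rests on several assumptions that do not come from the paper: the Laplace mechanism is used as one common way to achieve DP, each user is assumed to contribute at most one flow (sensitivity 1), and the function names, AS numbers, and the value of epsilon are hypothetical.

```python
import random
from collections import Counter

def contingency_table(records):
    """Count records for every (row category, column category) combination."""
    rows = sorted({r for r, _ in records})
    cols = sorted({c for _, c in records})
    counts = Counter(records)
    # Include empty combinations so that every cell of the table exists.
    return {(r, c): counts.get((r, c), 0) for r in rows for c in cols}

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def sanitize(table, epsilon, sensitivity=1.0, rng=None):
    """Add independent Laplace noise of scale sensitivity/epsilon to each cell."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return {cell: count + laplace_noise(scale, rng)
            for cell, count in table.items()}

# Hypothetical flows as (source AS, destination AS), using private-use ASNs.
flows = [(64512, 64513), (64512, 64513), (64512, 64514), (64513, 64514)]
table = contingency_table(flows)
noisy = sanitize(table, epsilon=0.5, rng=random.Random(7))
```

A smaller epsilon gives a larger noise scale and hence stronger protection at the cost of accuracy, which is the trade-off the accuracy experiments are concerned with.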
Reception and Posters CODASPY'18, March 19–21, 2018, Tempe, AZ, USA
the row key, and a single column family as illustrated in Table 3, with two columns: bytes and packets. This schema allows the creation of different versions of tables with varying time granularities and network addresses. These summaries have three types of

[Figure residue: plot panels comparing Real Bytes with Noisy Bytes (y-axis: Bytes, x 1e6) and Real Packets with Noisy Packets (y-axis: Packets, x 1e3).]
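The summary-table schema described above (a row key plus one column family holding the columns bytes and packets) can be sketched in plain Python without an HBase client. The row-key layout is not recoverable from this excerpt, so the layout used here (time bucket, source AS, destination AS), the column family name `stats`, and the sample flows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical flow records: (unix_time, src_as, dst_as, bytes, packets).
FLOWS = [
    (1000, 64512, 64513, 1500, 3),
    (1010, 64512, 64513, 500, 1),
    (1075, 64512, 64514, 4000, 8),
]

def row_key(ts, src, dst, granularity_s):
    """Assumed key layout: zero-padded time bucket, then source, then destination."""
    bucket = ts - ts % granularity_s
    return f"{bucket:010d}|{src}|{dst}"

def build_summary(flows, granularity_s):
    """Aggregate flows into rows with one column family and two columns."""
    table = defaultdict(lambda: {"stats:bytes": 0, "stats:packets": 0})
    for ts, src, dst, nbytes, npackets in flows:
        row = table[row_key(ts, src, dst, granularity_s)]
        row["stats:bytes"] += nbytes
        row["stats:packets"] += npackets
    return dict(table)

# Two versions of the same summary at different time granularities.
per_minute = build_summary(FLOWS, 60)
per_hour = build_summary(FLOWS, 3600)
```

Zero-padding the time bucket keeps the keys in chronological order under lexicographic sorting, which is one reason such a layout might suit a store that orders rows by key; whether the system described here does this is an assumption.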