An Empirical Study of Differentially-Private Analytics For High-Speed Network Data

Reception and Posters CODASPY'18, March 19–21, 2018, Tempe, AZ, USA

An Empirical Study of Differentially-Private Analytics for High-Speed Network Data
Oana-Georgiana Niculaescu, UMass Boston, onic@cs.umb.edu
Gabriel Ghinita, UMass Boston, gabriel.ghinita@umb.edu

ACM Reference Format:
Oana-Georgiana Niculaescu and Gabriel Ghinita. 2018. An Empirical Study of Differentially-Private Analytics for High-Speed Network Data. In Proceedings of Eighth ACM Conference on Data and Application Security and Privacy (CODASPY’18). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3176258.3176944

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
CODASPY’18, March 19–21, 2018, Tempe, AZ, USA
© 2018 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-5632-9/18/03.
https://doi.org/10.1145/3176258.3176944

1 INTRODUCTION
High-speed research networks are essential to support scientific projects and applications that have high bandwidth demands. These networks differ from conventional ones as they provide much higher line rates (up to 100 Gbps), which are often required by scientific research. To efficiently run and maintain such networks, it is necessary to develop monitoring tools that provide usage statistics and information about network health status. Such data are important to researchers who develop new transport protocols, or to network engineers who must maintain the network infrastructure within optimal working parameters. However, collecting and sharing such high-speed network data also poses serious privacy risks. Through careful analysis of network data, an adversary may be able to determine the identity of a user associated with a specific network flow, and can subsequently infer potentially sensitive information about that individual from her usage pattern, e.g., health status, political affiliation, personal lifestyle, etc.

To counter such threats, it is important to sanitize network data before making them available for analysis. The current de facto standard in privacy protection is the differential privacy (DP) model [3]. DP provides formal protection guarantees, and ensures that an adversary cannot learn with significant probability whether a certain individual’s data is included in a dataset (in our case, an individual’s data consist of the network flows generated by a user). DP achieves protection by adding noise to the data, so one must be careful when deploying this model in practice, such that the distortion is minimized. In the case of high-speed network data, accurate and efficient sanitization is even more challenging, due to the high volumes of generated data.

In our work, we aim to achieve fast and accurate sanitization of high-speed network data, by focusing on network analytics in the form of statistical queries. For instance, one important use of such data is to determine the amount of total traffic flowing between distinct autonomous systems (AS), which are segments of the network under the management of a single organization. This is an important problem to solve for two reasons: (i) organizations are interested in how much traffic goes to/from their peers, which is useful for equipment provisioning or billing, and (ii) there is typically no established relationship of trust among peer organizations, to the extent that allows them to directly share internal information about their users, hence the need for privacy. In this context, we present an empirical study of a differentially-private analytics system for high-speed research networks that we developed. We consider two different aspects: the accuracy of answers returned by our system compared to non-sanitized data, and the response time to return the query results. Our proposed system uses Apache HDFS and HBase for data storage and indexing, and builds contingency (i.e., summary) tables to support fast and accurate private analytics.

2 BACKGROUND
2.1 HDFS and HBase. High-speed research networks generate large amounts of data that need to be collected and processed efficiently. We use Apache Hadoop and HBase for processing and storing network data at flow granularity. Hadoop and HBase run atop the Hadoop Distributed File System (HDFS) environment [1]. HDFS is a Java-based file system that provides scalable and reliable data storage across large clusters of commodity servers. The MapReduce computation model [2] is a popular model for distributed big data processing. The core idea behind MapReduce is mapping data into a collection of <key, value> pairs, and then reducing over all pairs with the same key. Both operations can be done in parallel. The overall concept is simple, but it is very powerful when we consider that: (i) most datasets can be meaningfully mapped into <key, value> pairs, and (ii) the keys and values can be of any type (e.g., strings, integers, etc.). Both map and reduce phases use HDFS files as input and output. However, HDFS is designed for sequential access, and does not work well for random access. Since flexible network analytics may need to access multiple data regions, we use HBase for effective data indexing. HBase is a column-oriented, highly-distributed NoSQL solution that runs on top of Hadoop and HDFS. HBase supports efficient random access to data.

2.2 Autonomous System (AS) is a collection of networks managed and supervised by a single entity or organization. An AS comprises heterogeneous networks governed by a large enterprise, and has different subnetworks with combined routing logic and common policies. Each AS is assigned a globally unique identification number (ASN), originally a 16-bit value and since extended to 32 bits.

2.3 Contingency Tables summarize data into a set of counts corresponding to all value combinations across certain attributes. A two-dimensional contingency table is based on two variables, one determining the row categories and the other defining the column categories. The combinations of row and column categories are
called cells. In order to use the statistical methods usually applied to such tables, subjects must fall into one and only one row and column category. Such categories are said to be exclusive and exhaustive. Exclusive means the categories don’t overlap, so a subject falls into only one category. Exhaustive means that the categories include all possibilities, so every subject falls within a category.

2.4 Differential privacy (DP) [3] guarantees that for any two sibling datasets D1, D2 that differ in a single net flow π, the probability of an adversary learning which of the two datasets was used to obtain a certain output A is bounded by $\ln \frac{\Pr[A(D_1)]}{\Pr[A(D_2)]} \le \epsilon$, where parameter ϵ > 0 represents the privacy budget. To achieve privacy for numerical queries, the Laplace mechanism [3] adds to each query result noise randomly distributed according to a Laplace distribution with parameter λ = S/ϵ, where S is the sensitivity of the query, i.e., the maximum change in the result of the query for any two sibling databases. Sequential composability guarantees that executing algorithm A1 with privacy budget ϵ1 followed by algorithm A2 with budget ϵ2 produces a differentially private algorithm with parameter ϵ1 + ϵ2. This allows composing algorithms which use the results of simpler queries to produce a more accurate result for a highly sensitive query/algorithm.

3 SYSTEM OVERVIEW
The proposed system architecture is presented in Figure 1 (the high-level system architecture was introduced in [4], but this submission brings design and implementation details, as well as evaluation results that represent new contributions compared to [4]). The input flows are collected from network devices such as switches or routers compliant with the NetFlow v9 standard. The flow attributes are saved in an HBase column store that provides efficient random access to data. Contingency tables are created and maintained using MapReduce jobs on top of the raw flow data. At query time, instead of querying the raw data directly, we answer queries using contingency tables, which are more compact due to their summarizing role. The challenge is to find an appropriate set of contingency tables to materialize, such that the disk I/O is reduced at query time, while at the same time the sensitivity, and implicitly the amount of noise required by differential privacy, are kept low.

[Figure 1: System Architecture]

Flow data are collected from a series of devices located at several Internet2 academic participants. Each flow is stored in the raw flow table repository after the flow has terminated. The flows contain information about AS numbers that we save using the HBase schema presented in Table 1. In our schema, ts is the flow time stamp, and BC and PC are bytes and packet counts for that flow, respectively. The other attributes are source and destination AS numbers, and service ports.

Row Key            CF1             CF2                 CF3
ts|1|srcAS|md5     destAS, port    destAS, protocol    BC, PC
ts|2|destAS|md5    srcAS, port     srcAS, protocol     BC, PC
ts|3|port|md5      srcAS           destAS, protocol    BC, PC
Table 1: HBase schema for AS raw flow tables

After storing network flows into HBase raw tables, a query engine computes differentially-private results. However, using the raw flow table directly requires intensive disk scans, slowing performance. Since each flow is stored as an individual row in the table, the table is very large, and the required processing may take a long time to complete. Furthermore, if multiple queries are executed against the dataset such that their result sets contain a common record, it would be difficult to keep track of the sensitivity associated with that query set. Therefore, creating contingency tables that summarize the dataset on different attribute value combinations allows us to keep track of sensitivity and allocate the privacy budget appropriately. These summaries may include aggregating the flows based on source/destination AS number, service, or time resolution.

To illustrate how these network flows can be aggregated, consider the following example queries for a given time range:
(1) What is the total amount of traffic between all pairs of ASs?
    • This query is best answered by a table with schema ordered by timestamp at the front of the row key.
(2) How many packets originate from AS number 10437?
    • A table whose row keys are ordered by source AS number would best serve this query.
(3) What is the total amount of DNS traffic served by AS 10437?
    • A table whose row keys are ordered by protocol and port (protocol UDP, port 53), followed by the destination network for the DNS server, would answer this query most efficiently.

Given the above examples, it is clear that no single schema can serve all queries well. Each of these queries is best answered by scanning tables with different schemas. Query 1 does not specify any non-time predicates, and could be answered by tables whose row keys begin with timestamp, whereas queries 2 and 3 specify non-time predicates, and could be answered by tables whose row keys begin with such information, followed by timestamp.

TimeStamp        Src AS           Protocol    Port               Dst AS
4 bytes (int)    4 bytes (int)    1 byte      2 bytes (short)    4 bytes (int)
Table 2: HBase schema for AS contingency table

Column Family    size
Qualifier        bytes             packets
Type             8 bytes (long)    8 bytes (long)
Table 3: HBase column family schema for AS contingency table

Next, we focus on efficiently supporting type 1 queries. To that end, we create an HBase table with the schema from Table 2 for
the row key, and a single column family as illustrated in Table 3, with two columns: bytes and packets. This schema allows the creation of different versions of tables with varying time granularities and network addresses. These summaries have three types of fixed alignment for timestamps: minute (granule=60), hour (granule=3600), and day (granule=86400). Each timestamp is represented as a Unix time value. For example, aligned timestamps for 1492287912 are 1492287900 (minute), 1492286400 (hour), and 1492214400 (day). So, for hourly aligned timestamps, the contingency table for hours aggregates all flows with timestamps within the range [1492286400, 1492290000) into the row with aligned timestamp 1492286400. Considering those three granules (day, hour, minute), we create three contingency tables that allow us to answer query type 1 flexibly, while reducing the number of rows scanned.
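The timestamp alignment and per-granule aggregation described above can be sketched as follows. This is only an illustration under stated assumptions, not the system's implementation: the function names, the sample flows, and the big-endian key packing are hypothetical, while the granule sizes and the row-key field widths are taken from the text and Table 2.

```python
import struct
from collections import defaultdict

# Granule sizes in seconds, as listed in the text.
GRANULES = {"minute": 60, "hour": 3600, "day": 86400}


def align(ts: int, granule: int) -> int:
    """Round a Unix timestamp down to the start of its granule."""
    return ts - ts % granule


def row_key(ts: int, src_as: int, protocol: int, port: int, dst_as: int) -> bytes:
    """Pack a row key with the field widths of Table 2 (4, 4, 1, 2, 4 bytes).

    Big-endian packing is an assumption made here so that byte-wise key
    order matches numeric order; the paper does not state the encoding.
    """
    return struct.pack(">IIBHI", ts, src_as, protocol, port, dst_as)


def build_contingency(flows, granule: int) -> dict:
    """Sum bytes and packets of all flows that share an aligned row key."""
    table = defaultdict(lambda: [0, 0])  # key -> [bytes, packets]
    for ts, src_as, protocol, port, dst_as, n_bytes, n_packets in flows:
        key = row_key(align(ts, granule), src_as, protocol, port, dst_as)
        table[key][0] += n_bytes
        table[key][1] += n_packets
    return dict(table)


# Hypothetical flows: (ts, src AS, protocol, port, dst AS, bytes, packets).
flows = [
    (1492287912, 10437, 17, 53, 64512, 1200, 3),
    (1492288000, 10437, 17, 53, 64512, 800, 2),
    (1492290005, 10437, 17, 53, 64512, 500, 1),
]
hourly = build_contingency(flows, GRANULES["hour"])
```

With these sample flows, the first two records fall into the hour starting at 1492286400 and are merged into one row, while the third starts a new row at 1492290000. A time-range query then scans one row per hour and attribute combination instead of one row per flow.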

[Figure 2: Evaluation Results. (a) Real vs. noisy bytes (x 1e6) by rank. (b) Real vs. noisy packets (x 1e3) by rank. (c) Relative error (%) for bytes and packets vs. ε. (d) Execution time (msec) vs. data range (hours), MapReduce vs. contingency tables.]

In terms of differential privacy protection, each contingency table increases sensitivity by one (as time granularities overlap). So for three granularities, the respective sensitivities for bytes/packets are multiplied by 3. We believe that summarizing all these three levels can increase accuracy for broader queries (in such cases,
the error would be too large if considering only minute counts when answering a multi-day query). However, for specific scenarios, one can alter the design by only materializing a subset of these contingency tables. The decision would depend on the length of expected time ranges to analyze, and also the skewness of data over time. According to the Laplace mechanism outlined in Section 2 [3], the noise added to a query result is proportional to the sensitivity of the entire query set. In turn, sensitivity is proportional to the maximum number of queries that overlap a specific region of the attribute space. Thus, to maintain data accuracy, it is important to limit the amount of overlap. For our experiment, the amount of overlap is equal to the number of granules that we are interested in, in this case 3. The noise added to each of the answers to query type 1 is proportional to the maximum difference that adding or removing one flow from the database can produce. Since we are interested in the number of bytes and the number of packets, the noise is proportional to the maximum value of bytes or packets that a flow can contain.

4 EXPERIMENTAL EVALUATION
We evaluated our proposed system using real flow data collected over a period of 24 hours from an Internet2 site, resulting in a raw dataset of 25GB. We measure the total amount of traffic (in terms of both packets and bytes) across all pairs of AS in a specified time period, ranging from one to ten hours. Our testbed consists of a dual Xeon E5-2430 v2 2.5GHz CPU system with 128GB of RAM running Ubuntu 16, HBase 1.2.4 and Hadoop 2.7.3. We compare our approach against a baseline which uses MapReduce to compute analytics directly from the raw data. The baseline is similar in accuracy to our method, so the direct comparison is done only with respect to performance.

Fig. 2(a)-(b) shows the accuracy of the proposed approach for top-10 talker AS systems, i.e., the AS pairs with the highest amount of traffic across them, for a privacy budget ϵ = 0.4. We note that the error incurred by our approach is small, and that the precision and recall are at 100% for both bytes and packets (i.e., even after adding noise, the relative order of AS pairs when ordered according to traffic does not change). Fig. 2(c) presents the relative error for a wide range of privacy budgets, ranging from 0.2 to 1.0 (lower values correspond to stricter privacy constraints). We observe that the relative error is below 10% for most of the range, with slightly larger values at the lowest ϵ = 0.2 setting. The relative error is lower for bytes, since the absolute data values are considerably larger than those of packets.

Fig. 2(d) shows the performance of our approach (measured in msec) in comparison with the MapReduce-only baseline (without contingency tables). Even for queries with a small time range (1 hour), the baseline performs quite poorly, requiring close to 30 seconds of processing. This increases to several minutes for the longer time ranges. In contrast, our approach is able to keep the response time below 1 second in all cases, or two orders of magnitude lower than the baseline.

5 CONCLUSION
We proposed a system for computing private analytics at flow-level granularity on top of high-speed network data. Using the de facto standard of differential privacy, our system builds on open-source tools like Apache Hadoop/HBase, and maintains contingency tables that summarize data on relevant attributes. It significantly outperforms the MapReduce-only baseline in response time, while keeping data accuracy high. In future work, we will investigate approaches to efficiently support more complex query types by building more advanced contingency tables, while at the same time keeping sensitivity (and hence added noise) at low levels.

Acknowledgment: work supported by NSF grant 1450975.

REFERENCES
[1] R. Chansler, H. Kuang, S. Radia, K. Shvachko, and S. Srinivas. The Architecture of Open Source Applications. 2011.
[2] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation, pages 10–10, 2004.
[3] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
[4] O.-G. Niculaescu, M. Maruseac, and G. Ghinita. Differentially-private big data analytics for high-speed research network traffic measurement. In Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, CODASPY ’17, pages 151–153, New York, NY, USA, 2017. ACM.
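As a companion to the Laplace mechanism of Section 2.4 and the relative-error measurements of Section 4, the following minimal sketch shows how noise with parameter λ = S/ϵ could be drawn and how a relative error could be computed. It is an illustration under stated assumptions rather than the evaluated implementation: the function names, the per-flow maximum of 1500 bytes, and the true count are hypothetical, and the sensitivity is taken to be the per-flow maximum multiplied by the number of materialized granularities (3), following Section 3.

```python
import math
import random


def laplace_scale(max_per_flow: float, n_tables: int, epsilon: float) -> float:
    """Return lambda = S / epsilon, assuming S = n_tables * max_per_flow."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return n_tables * max_per_flow / epsilon


def laplace_from_uniform(u: float, scale: float) -> float:
    """Inverse-CDF sampling: map u in (0, 1) to a Laplace(0, scale) draw."""
    centered = u - 0.5
    return -scale * math.copysign(1.0, centered) * math.log(1.0 - 2.0 * abs(centered))


def noisy_answer(true_value: float, scale: float, rng: random.Random) -> float:
    """Add Laplace noise with the given scale to a query result."""
    u = rng.random()
    while u == 0.0:  # random() may return 0.0, which would give log(0)
        u = rng.random()
    return true_value + laplace_from_uniform(u, scale)


def relative_error(noisy: float, real: float) -> float:
    """Relative error of a noisy answer with respect to the real one."""
    return abs(noisy - real) / real


# Hypothetical values: 1500-byte per-flow maximum, 3 granularities, eps = 0.4.
rng = random.Random(0)
scale = laplace_scale(max_per_flow=1500, n_tables=3, epsilon=0.4)
real_bytes = 2.0e7  # hypothetical true byte count for one AS pair
noisy_bytes = noisy_answer(real_bytes, scale, rng)
error = relative_error(noisy_bytes, real_bytes)
```

Because the scale grows with the number of overlapping tables and shrinks with ϵ, the same true count yields a larger expected relative error at ϵ = 0.2 than at ϵ = 1.0, which is the trend reported for Fig. 2(c).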
