Big Data Analytics Over Encrypted Datasets With Seabed


Eng. Cosmin-Dumitru Oprea
Eng. Marius-Sorin Neculoiu
Eng. Ștefan Stănculescu
Abstract
● Large amounts of sensitive data are processed by operating mostly on
encrypted data, as in CryptDB and Monomi
● CryptDB and Monomi:
○ rely on expensive cryptographic techniques
○ use asymmetric encryption schemes
○ performance is limited in "true big data" scenarios
● Seabed
○ ASHE: additively symmetric homomorphic encryption scheme
■ faster than CryptDB and Monomi
○ SPLASHE: Splayed ASHE
■ a novel randomized encryption scheme
■ prevents frequency attacks based on auxiliary data
Introduction
● Consider a retail business which wants to analyze its records
● Sensitive business data => processing of encrypted data using CryptDB,
Monomi:
○ deterministic encryption schemes: vulnerable to frequency attacks
○ partially homomorphic cryptoschemes: high computational cost
● Existing solutions use asymmetric homomorphic encryption: Paillier
○ useful if data is produced by Alice and analysed by Bob
● In most cases Alice and Bob are from the same business
○ it is sufficient to use symmetric encryption, which is much faster
○ new additively symmetric homomorphic encryption scheme, ASHE
■ 3x faster than Paillier
Introduction(2)
● Second problem, frequency attacks using auxiliary data in the context of
deterministic encryption:
○ consider a column which contains gender: male, female
○ if the attacker knows which gender occurs more frequently, he can decode the column
● Solution: SPLASHE (Splayed ASHE):
○ splays sensitive columns to multiple columns
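The frequency attack described above can be sketched in a few lines. This is an illustration only: the keyed hash below is a stand-in for a deterministic encryption scheme (it is not decryptable and is not what CryptDB or Monomi use), and the key, function names, and sample column are all hypothetical.

```python
import hashlib
import hmac
from collections import Counter

KEY = b"demo-key"  # hypothetical secret, unknown to the attacker

def det_encrypt(value: str) -> str:
    # Deterministic stand-in: the same plaintext always yields the same token.
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:8]

def frequency_attack(ciphertexts, known_ranking):
    # Rank ciphertexts by how often they occur and pair them with the
    # attacker's auxiliary knowledge (plaintexts, most frequent first).
    ranked = [c for c, _ in Counter(ciphertexts).most_common()]
    return dict(zip(ranked, known_ranking))

column = ["female"] * 6 + ["male"] * 4
encrypted = [det_encrypt(v) for v in column]

# The attacker only knows that "female" is more frequent than "male".
guess = frequency_attack(encrypted, ["female", "male"])
recovered = [guess[c] for c in encrypted]
```

Without ever seeing the key, `recovered` matches the original column, because the ciphertext histogram has the same shape as the plaintext histogram.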
Seabed: a client-side planner and a proxy

Introduction(3)
● Seabed
○ planner + proxy
○ based on Apache Spark
○ increases query latency by only 8% to 45%; Monomi (based on Paillier) increases it by 100% to 200%
● Planner
○ is applied once to each new data set;
○ transforms the plain-text schema into an encrypted schema
○ chooses suitable encryption schemes for each column, based on the kinds of queries that the
user wants to perform
● Proxy
○ transparently rewrites queries for the encrypted schema
○ decrypts results that arrive from the cloud
○ performs any computations that cannot be performed directly on the cloud
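As a rough illustration of the planner's job, the sketch below maps each column to a scheme based on the operations that queries apply to it. The rules, the `choose_scheme` function, the operation labels, and the `"RANDOMIZED"` fallback are assumptions made for this example; they are not Seabed's actual planner logic.

```python
def choose_scheme(operations: set, sensitive: bool = False) -> str:
    """Pick an encryption scheme for one column (hypothetical rules)."""
    if "join" in operations:
        return "DET"      # joins need equal values to have equal ciphertexts
    if "range" in operations:
        return "OPE"      # range predicates need order-preserving ciphertexts
    if "filter" in operations:
        # Equality filters on sensitive columns avoid DET by splaying.
        return "SPLASHE" if sensitive else "DET"
    if "sum" in operations:
        return "ASHE"     # additive aggregation
    return "RANDOMIZED"   # assumed default: the column is only stored and returned

# Hypothetical retail schema: column -> (operations, sensitive?)
schema = {
    "revenue": ({"sum"}, False),
    "gender": ({"filter"}, True),
    "store_id": ({"join"}, False),
    "sale_date": ({"range"}, False),
    "notes": (set(), False),
}
plan = {name: choose_scheme(ops, s) for name, (ops, s) in schema.items()}
```

The point of the sketch is only that the choice is made once per dataset and is driven by the expected query workload.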
Overview
● An analyst can issue queries to a query processor on the cloud
● The responses will be encrypted
● The analyst can decrypt them with a secret key she shares with the data
collector
Background
● Homomorphic encryption
○ additive homomorphism (used by Paillier)
■ 2 ciphertexts C(x), C(y): C(x) + C(y) = C(x + y)
○ fully homomorphic
■ can be used to compute arbitrary functions on encrypted data
○ schemes are typically randomized: each value has many different possible ciphertexts
■ equal values cannot be matched, so a join cannot be computed
● in this case, one can use deterministic encryption: each value v is mapped to exactly
one ciphertext C(v), but this is vulnerable to frequency attacks
● order-preserving encryption (OPE): if x < y then C(x) < C(y), so comparisons need only C(x) and C(y)
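To make the additive property concrete, here is a toy Paillier sketch. The primes are tiny and chosen only for illustration, so this is not secure, and the function names are hypothetical. Note that in Paillier the "+" on ciphertexts written above is realized as multiplication modulo n².

```python
import math
import secrets

# Tiny primes for illustration only; real deployments use primes of 1024+ bits.
P, Q = 1789, 1867
N = P * Q
N2 = N * N
G = N + 1
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1)
MU = pow(LAM, -1, N)  # modular inverse; valid because g = n + 1

def encrypt(m: int) -> int:
    # Randomized: a fresh r gives a different ciphertext each time.
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N2) * pow(r, N, N2)) % N2

def decrypt(c: int) -> int:
    u = pow(c, LAM, N2)
    return ((u - 1) // N * MU) % N

def add(c1: int, c2: int) -> int:
    # Homomorphic addition of plaintexts = multiplication of ciphertexts.
    return (c1 * c2) % N2

total = decrypt(add(encrypt(20), encrypt(22)))
```

Each addition is a multiplication of large integers modulo n², which is where the high computational cost of Paillier-based aggregation comes from.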
Seabed Encryption Schemes
● ASHE - Additive Symmetric Homomorphic Encryption, improving performance
○ Encrypt by masking values with random numbers
○ No need to remember random numbers
○ ASHE ciphertexts are 32/64-bit integers
■ Homomorphic addition only takes a few nanoseconds
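A minimal sketch of the masking idea follows, assuming each row has an integer ID and the mask is derived from that ID with a pseudorandom function, which is why the random numbers do not have to be stored. HMAC-SHA256 is used here as a stand-in PRF, and the key and function names are hypothetical.

```python
import hashlib
import hmac

MOD = 2**64          # ciphertexts are 64-bit integers
KEY = b"demo-key"    # hypothetical shared secret key

def prf(i: int) -> int:
    # Pseudorandom mask derived from the row ID, so it never has to be stored.
    digest = hmac.new(KEY, i.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

def encrypt(m: int, i: int):
    # Mask the value with PRF(i) and PRF(i-1); remember which ID was used.
    return ((m - prf(i) + prf(i - 1)) % MOD, [i])

def add(a, b):
    # Homomorphic addition: one integer addition plus merging the ID lists.
    return ((a[0] + b[0]) % MOD, a[1] + b[1])

def decrypt(ct) -> int:
    c, ids = ct
    return (c + sum(prf(i) - prf(i - 1) for i in ids)) % MOD

def decrypt_range(c: int, lo: int, hi: int) -> int:
    # For consecutive IDs lo..hi the masks telescope to PRF(hi) - PRF(lo-1).
    return (c + prf(hi) - prf(lo - 1)) % MOD

values = [10, 20, 30, 40]
total = encrypt(values[0], 1)
for row_id, v in enumerate(values[1:], start=2):
    total = add(total, encrypt(v, row_id))

plain_sum = decrypt(total)
plain_sum_fast = decrypt_range(total[0], 1, 4)
```

`decrypt_range` needs only two PRF evaluations regardless of how many rows were summed, which illustrates why consecutive IDs are worth optimizing for.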
Seabed Encryption Schemes(2)
● ASHE - Optimizations
○ Optimize encryption so that the randomness cancels out for consecutive IDs
○ Fast evaluation of pseudorandom function via AES-NI
○ Compression techniques to make ID list as small as possible
● SPLASHE - Why are encrypted databases vulnerable?
● SPLASHE - How can we avoid deterministic encryption?
○ basic SPLASHE: use d columns instead of 1, where d is the number of values
■ increases a column’s storage by a factor of d, which is expensive if d is large
○ enhanced SPLASHE
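Basic SPLASHE can be sketched as follows: a column with d possible values is replaced by d columns of encrypted 0/1 indicators, and counting the rows that hold a value becomes a homomorphic sum over one of those columns. The additive masking used here is a simplified stand-in for ASHE, and all names and the sample data are hypothetical. Enhanced SPLASHE is not shown.

```python
import hashlib
import hmac

MOD = 2**64
KEY = b"demo-key"  # hypothetical shared secret key

def prf(column: str, row: int) -> int:
    msg = f"{column}:{row}".encode()
    return int.from_bytes(hmac.new(KEY, msg, hashlib.sha256).digest()[:8], "big")

def enc(m: int, column: str, row: int) -> int:
    # Simplified additive masking standing in for ASHE.
    return (m + prf(column, row)) % MOD

def splay(values, domain):
    # One column per possible value, holding an encrypted indicator (1 or 0).
    return {v: [enc(int(x == v), v, r) for r, x in enumerate(values)]
            for v in domain}

def server_sum(column_cells) -> int:
    # The server adds ciphertexts without learning any indicator.
    return sum(column_cells) % MOD

def client_count(total: int, column: str, n_rows: int) -> int:
    # The client removes the masks to recover the count.
    return (total - sum(prf(column, r) for r in range(n_rows))) % MOD

genders = ["female", "male", "female", "female", "male"]
table = splay(genders, ["female", "male"])

female_count = client_count(server_sum(table["female"]), "female", len(genders))
male_count = client_count(server_sum(table["male"]), "male", len(genders))
```

Every cell looks like a random 64-bit integer, so the server sees no repeated ciphertexts to count. The price is d columns instead of one.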
Limitations
● ASHE
○ growing ciphertexts can create memory stress at the workers
○ the ciphertext should not grow with the number of operations that are performed on it
● SPLASHE
○ its requirement for a-priori knowledge of query workload or data distribution
○ its difficulty in handling data with rapidly changing distribution
○ its storage overhead
Design
● Implemented on top of unmodified Spark
○ ASHE and SPLASHE implemented in Scala
● Seabed’s high-level design is similar to CryptDB’s
○ Accepts SQL queries;
○ Transparently answers them on encrypted data
○ Client proxy handles encryption/decryption
● The user can issue three kinds of requests:
○ Create Plan
○ Upload Data
○ Query Data
Applications
● Can Seabed support a wide range of big data analytics applications?
● Three studies were performed to understand this:
○ Analysing MDX and Spark (interfaces that BI applications use at the back-end)
○ Evaluating a month-long query log made on an OLAP platform to determine how effectively
Seabed can support the functionality of these systems
○ Analysing the TPC-DS query set
Applications(2)
Seabed’s functionality support falls into four categories:
1. Support fully on the server

○ Seabed’s techniques can fully support operations with no client support (e.g. computing the
sum, average, count, and min)
2. Support with client pre-processing
○ Seabed can support the quadratic computation necessary for more complex analytics (anomaly
detection, linear regression in one dimension)
Applications(3)
3. Support with client post-processing
○ All applications and APIs allow users to specify arbitrary functions of data
○ Seabed cannot perform them at the server
○ Data is post-processed at the client
4. Support with two client round-trips
○ The client computes an intermediate result, re-encrypts it and sends it back to the server for
further processing
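Category 2 (client pre-processing) can be illustrated with variance. The client computes x² before upload, so the server only ever needs additive sums. In this sketch the sums run over plaintext to keep it short; in Seabed both columns would be encrypted with an additive scheme. The function names and data are hypothetical.

```python
def client_preprocess(values):
    # Before upload, the client adds a squared column next to each value.
    return [(x, x * x) for x in values]

def server_aggregate(table):
    # Only additions are needed, which an additive scheme can do on ciphertexts.
    n = len(table)
    s = sum(x for x, _ in table)
    s2 = sum(x2 for _, x2 in table)
    return n, s, s2

def client_variance(n, s, s2):
    # Var(x) = E[x^2] - E[x]^2, computed by the client after decryption.
    mean = s / n
    return s2 / n - mean * mean

data = [2, 4, 4, 4, 5, 5, 7, 9]
n, s, s2 = server_aggregate(client_preprocess(data))
variance = client_variance(n, s, s2)
```

The same pattern (uploading x² and x·y columns) gives the sums needed for one-dimensional linear regression.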
Evaluation
● Report results from experimental evaluation of Seabed
● System was evaluated with microbenchmarks (Synthetic)
● Evaluation has two high-level goals
○ performance benefits of Seabed over systems that use the Paillier cryptosystem.
○ performance and storage overhead of Seabed as compared to a system with no encryption (NoEnc)
● Implementation and setup
○ Apache Spark with 100 cores
○ Seabed server side: Scala; Seabed client side: Scala and C++
○ MS Azure: 10 nodes, each equipped with a 16-core Intel Xeon E5 2.4 GHz processor and 112 GB of
memory
Evaluation - End to End Latency
● End-to-end latency for the three approaches with varying input sizes (250 million to 1.75 billion rows).
● Results
○ Paillier: up to 16.6 minutes
○ NoEnc: < 1 second
○ Seabed: between 1.8 s and 11 s
● Seabed is 100x faster than Paillier, even in the worst case
Evaluation - Server Scalability
● Using a larger cluster can only speed up the server side

● Dataset at 1.75 billion rows
● Varied the number of cores from 10 to 100
● Results
○ NoEnc, 20 cores -> 1 s
○ Paillier, 100 cores -> 1000 s
○ Seabed, 50 cores -> between 1.8 s and 8.0 s
Evaluation
● Datasets
○ 760M rows, real ad-analytics application from MS
● We replaced 10 DET columns with SPLASHE, one by one
● Measured: Relative size increase vs. plaintext dataset
● Results
○ SPLASHE has substantial storage cost
○ Enhanced SPLASHE reduces this cost by up to 10x
● With 10x more storage, we avoid DET entirely
○ Reduces risk of information leaks

Evaluation - Real World Applications
● Same ad-analytics application from Microsoft

○ Measured: End-to-end latency of 15 queries
● Results
○ NoEnc is about 10 times faster than Paillier across all queries
○ Seabed is almost as fast as no encryption (within 15-44%)
● It is possible to do analytics on encrypted big data