Big Data Analytics Over Encrypted Datasets With Seabed


Eng. Cosmin-Dumitru Oprea
Eng. Marius-Sorin Neculoiu
Eng. Ștefan Stănculescu
Abstract
● Large amounts of sensitive data are processed mostly by operating on
encrypted data: CryptDB, Monomi
● CryptDB and Monomi:
○ expensive
○ asymmetric encryption schemes
○ performance limited in "true big data"
● Seabed
○ ASHE: additively symmetric homomorphic encryption scheme
■ faster than CryptDB and Monomi
○ SPLASHE: Splayed ASHE
■ a novel randomized encryption scheme
■ prevent frequency attacks based on auxiliary data.
Introduction
● Consider a retail business which wants to analyze its records
● Sensitive business data => processing of encrypted data using CryptDB,
Monomi:
○ deterministic encryption schemes: vulnerable to frequency attacks
○ partially homomorphic cryptoschemes: high computational cost
● Existing solutions use asymmetric homomorphic encryption: Paillier
○ useful if data is produced by Alice and analysed by Bob
● In most cases Alice and Bob are from the same business
○ it is sufficient to use symmetric encryption, which is much faster
○ new additively symmetric homomorphic encryption scheme, ASHE
■ 3x faster than Paillier
Introduction(2)
● Second problem, frequency attacks using auxiliary data in the context of
deterministic encryption:
○ consider a column which contains gender: male, female
○ if the attacker knows which gender occurs more frequently, he can decode the column
● Solution: SPLASHE (Splayed ASHE):
○ splays sensitive columns to multiple columns
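
The frequency attack described above can be sketched in a few lines. This is a minimal illustration, not code from Seabed: `det_encrypt` is a hypothetical stand-in that uses a keyed hash to model deterministic encryption (equal plaintexts always give equal ciphertexts), and the column contents are made up.

```python
import hashlib
import hmac
from collections import Counter

KEY = b"demo-key"  # hypothetical secret key, for illustration only


def det_encrypt(value: str) -> str:
    """Deterministic stand-in: equal plaintexts always give equal ciphertexts."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:16]


def frequency_attack(ciphertexts, known_ranking):
    """Map ciphertexts to plaintexts by matching frequency ranks.

    known_ranking is the attacker's auxiliary data: the plaintext values
    ordered from most to least frequent.
    """
    by_count = [c for c, _ in Counter(ciphertexts).most_common()]
    return dict(zip(by_count, known_ranking))


# Made-up column: the attacker only knows that "female" is more common.
column = ["female"] * 6 + ["male"] * 4
encrypted = [det_encrypt(v) for v in column]

guess = frequency_attack(encrypted, ["female", "male"])
recovered = [guess[c] for c in encrypted]
```

The attacker never touches the key: counting identical ciphertexts is enough to recover every cell of the column.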

Seabed: a client-side planner and a proxy


Introduction(3)
● Seabed
○ planner + proxy
○ based on Apache Spark
○ increases the query latency by only 8% to 45%; Monomi (based on Paillier) increases it by
100%-200%
● Planner
○ is applied once to each new data set;
○ transforms the plain-text schema into an encrypted schema
○ chooses suitable encryption schemes for each column, based on the kinds of queries that the
user wants to perform
● Proxy
○ transparently rewrites queries for the encrypted schema
○ decrypts results that arrive from the cloud
○ performs any computations that cannot be performed directly on the cloud
Overview
● An analyst can issue queries to a query processor on the cloud
● The responses will be encrypted
● The analyst can decrypt them with a secret key she shares with the data
collector
Background
● Homomorphic encryption
○ additive homomorphism (used by Paillier)
■ 2 ciphertexts C(x), C(y): C(x) + C(y) = C(x + y)
○ fully homomorphic
■ can be used to compute arbitrary functions on encrypted data
○ schemes are typically randomized: there are many different possible ciphertexts for each value
■ equal values cannot be matched, so a join operation cannot be computed
● in this case, one can use deterministic encryption: each value v is mapped to exactly
one ciphertext C(v), but this is vulnerable to frequency attacks
● order-preserving encryption (OPE): whether x < y can be decided using only C(x) and C(y).
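
The additive property C(x) + C(y) = C(x + y) can be tried out with a toy Paillier implementation. This is a sketch for illustration only: the primes are far too small to be secure, and the function names are our own. Note that in Paillier the "+" on ciphertexts is realized as a multiplication modulo n².

```python
import math
import random

# Toy primes, far too small for real use (illustration only).
P, Q = 293, 433
N = P * Q
N2 = N * N
G = N + 1
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p - 1, q - 1)
MU = pow(LAM, -1, N)  # valid shortcut because g = n + 1


def encrypt(m: int) -> int:
    """Randomized encryption: c = g^m * r^n mod n^2 for a fresh random r."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    """m = L(c^lambda mod n^2) * mu mod n, with L(u) = (u - 1) // n."""
    u = pow(c, LAM, N2)
    return ((u - 1) // N * MU) % N


def add(c1: int, c2: int) -> int:
    """Homomorphic addition: multiply the ciphertexts modulo n^2."""
    return (c1 * c2) % N2


total = decrypt(add(encrypt(20), encrypt(22)))
```

Every addition here is a multiplication of large numbers, and encryption needs modular exponentiation, which is the computational cost that the symmetric scheme in the next slides avoids.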
Seabed Encryption Schemes
● ASHE - Additive Symmetric Homomorphic Encryption, improving performance
○ Encrypt by masking values with random numbers
○ No need to remember random numbers
○ ASHE ciphertexts are 32/64-bit integers
■ Homomorphic addition only takes a few nanoseconds
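
The masking idea can be sketched as follows. This is a simplified illustration under stated assumptions, not Seabed's implementation: HMAC-SHA256 stands in for the pseudorandom function, row IDs start at 1, and the ID list is kept as a plain Python list. The random masks do not need to be stored because they are regenerated from the key and the row ID at decryption time.

```python
import hashlib
import hmac

N = 2 ** 64  # ciphertext values are 64-bit integers


def prf(key: bytes, i: int) -> int:
    """Pseudorandom function F_k(i); HMAC-SHA256 is a stand-in choice here."""
    digest = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def mask(key: bytes, i: int) -> int:
    """Mask for row ID i, derived from the key instead of being stored."""
    return (prf(key, i) - prf(key, i - 1)) % N


def encrypt(key: bytes, m: int, i: int):
    """Hide m behind the mask for row ID i; the ciphertext carries the ID."""
    return ((m - mask(key, i)) % N, [i])


def add(c1, c2):
    """Homomorphic addition: one modular add plus merging the ID lists."""
    return ((c1[0] + c2[0]) % N, c1[1] + c2[1])


def decrypt(key: bytes, c) -> int:
    """Re-derive every mask from the ID list and add it back."""
    value, ids = c
    return (value + sum(mask(key, i) for i in ids)) % N


key = b"shared-secret"  # hypothetical key shared by collector and analyst
rows = [10, 20, 12]
ciphertexts = [encrypt(key, m, i) for i, m in enumerate(rows, start=1)]

acc = ciphertexts[0]
for c in ciphertexts[1:]:
    acc = add(acc, c)
total = decrypt(key, acc)
```

The server only runs `add`, which is a single integer addition per row; the price is that the ID list grows with every operation.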
Seabed Encryption Schemes(2)
● ASHE - Optimizations
○ Optimize encryption so that the randomness cancels out for consecutive IDs
○ Fast evaluation of pseudorandom function via AES-NI
○ Compression techniques to make ID list as small as possible
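
Two of these optimizations can be illustrated together. The sketch below assumes the mask for row i has the form F(i) - F(i-1), so that masks of consecutive IDs telescope, and it assumes a simple run-based encoding of the ID list; HMAC-SHA256 again stands in for the AES-based pseudorandom function, and all names are hypothetical.

```python
import hashlib
import hmac

N = 2 ** 64


def prf(key: bytes, i: int) -> int:
    """Stand-in pseudorandom function (the slides mention AES-NI instead)."""
    digest = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def to_ranges(ids):
    """Compress distinct IDs into inclusive (start, end) runs."""
    ranges = []
    for i in sorted(ids):
        if ranges and ranges[-1][1] == i - 1:
            ranges[-1] = (ranges[-1][0], i)  # extend the current run
        else:
            ranges.append((i, i))  # start a new run
    return ranges


def mask_sum_naive(key: bytes, ids) -> int:
    """Two PRF calls per ID."""
    return sum(prf(key, i) - prf(key, i - 1) for i in ids) % N


def mask_sum_telescoped(key: bytes, ranges) -> int:
    """Over a run [a, b] the masks collapse to F(b) - F(a - 1)."""
    return sum(prf(key, b) - prf(key, a - 1) for a, b in ranges) % N


key = b"shared-secret"  # hypothetical key
ids = list(range(1, 1001)) + [2000, 2001]
ranges = to_ranges(ids)
naive = mask_sum_naive(key, ids)
fast = mask_sum_telescoped(key, ranges)
```

For this example the 1002 IDs shrink to two runs, and decryption needs four PRF evaluations instead of 2004.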
Seabed Encryption Schemes(3)
● SPLASHE - Why are encrypted databases vulnerable?
Seabed Encryption Schemes(4)
● SPLASHE - How can we avoid deterministic encryption?
○ basic SPLASHE: use d columns instead of 1, where d is the number of values
■ increases a column’s storage by a factor of d, which is expensive if d is large
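
A minimal sketch of the splaying step, using the gender column from the introduction. The cells are shown in plaintext for readability; in this reading of the scheme, each indicator cell would be ASHE-encrypted, so the server can sum a column without learning which rows hold a 1. The function name and data are made up.

```python
def splay(column, domain):
    """Replace one column by len(domain) indicator columns (1 if match, else 0)."""
    return {v: [1 if cell == v else 0 for cell in column] for v in domain}


column = ["male", "female", "female", "male", "female"]
splayed = splay(column, ["female", "male"])

# A per-value count becomes a plain sum over one splayed column,
# which needs only additive homomorphism on the server.
counts = {v: sum(cells) for v, cells in splayed.items()}
```

With d = 2 the overhead is modest, but a column with thousands of distinct values would need thousands of splayed columns, which is the storage cost noted above.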
Seabed Encryption Schemes(5)
● SPLASHE - How can we avoid deterministic encryption?
○ enhanced SPLASHE
Limitations
● ASHE
○ growing ciphertexts can create memory stress at the workers
○ the ciphertext should not grow with the number of operations that are performed on it
● SPLASHE
○ its requirement for a-priori knowledge of query workload or data distribution
○ its difficulty in handling data with rapidly changing distribution
○ its storage overhead
Design
● Implemented on top of unmodified Spark
○ ASHE and SPLASHE implemented in Scala
● Seabed’s high-level design is similar to CryptDB’s
○ Accepts SQL queries;
○ Transparently answers them on encrypted data
○ Client proxy handles encryption/decryption
● The user can issue three kinds of requests:
○ Create Plan
○ Upload Data
○ Query Data
Applications
● Can Seabed support a wide range of big data analytics applications?
● Three studies were performed to understand this:
○ Analysing MDX and Spark (interfaces that BI applications use at the back-end)
○ Evaluating a month-long query log made on an OLAP platform to determine how effectively
Seabed can support the functionality of these systems
○ Analysing the TPC-DS query set
Applications(2)
Seabed’s functionality support falls into four categories:

1. Support fully on the server
○ Seabed’s techniques can fully support operations with no client support (e.g. computing the
sum, average, count, and min)
2. Support with client pre-processing
○ Seabed can support quadratic computation necessary for more complex analytics (anomaly
detection, linear regression in one dimension)
Applications(3)
3. Support with client post-processing
○ All applications and APIs allow users to specify arbitrary functions of data
○ Seabed cannot perform them at the server
○ Data is post-processed at the client
4. Support with two client round-trips
○ The client computes an intermediate result, re-encrypts it and sends it back to the server for
further processing
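
Category 2 (client pre-processing) can be illustrated with a variance computation. The sketch below shows one plausible reading: the client adds a squared column before upload, so the server needs only additions. Encryption is omitted to keep the arithmetic visible, and all function names are hypothetical.

```python
def preprocess(values):
    """Client side, before upload: add a squared column next to the original."""
    return [(x, x * x) for x in values]


def server_aggregate(rows):
    """Server side: only additions (performed on ciphertexts in a real system)."""
    count = len(rows)
    sum_x = sum(x for x, _ in rows)
    sum_x2 = sum(x2 for _, x2 in rows)
    return count, sum_x, sum_x2


def variance(count, sum_x, sum_x2):
    """Client side, after decryption: Var(x) = E[x^2] - E[x]^2."""
    mean = sum_x / count
    return sum_x2 / count - mean * mean


rows = preprocess([2, 4, 4, 4, 5, 5, 7, 9])
var = variance(*server_aggregate(rows))
```

The quadratic term is computed once at upload time, which is why an additively homomorphic scheme is enough for this class of analytics.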
Evaluation
● Report results from experimental evaluation of Seabed
● System was evaluated with microbenchmarks (Synthetic)
● Evaluation has two high-level goals
○ performance benefits of Seabed over systems that use the Paillier cryptosystem.
○ performance and storage overhead of Seabed as compared to a system with no encryption (NoEnc)
● Implementation and setup
○ Apache Spark with 100 cores
○ Seabed server-side: Scala
○ Seabed client-side: Scala & C++
○ MS Azure - 10 nodes, each equipped with a 16-core Intel Xeon E5 2.4 GHz processor and 112 GB of
memory
Evaluation - End to End Latency

● End-to-end latency for the three approaches with varying input sizes (250 million to 1.75 billion rows).
● Results
○ Paillier: up to 16.6 minutes
○ NoEnc: < 1 second
○ Seabed: between 1.8s and 11s
● Seabed is 100x faster than Paillier, even in the worst case
Evaluation - Server Scalability

● Using a larger cluster can only speed up the server side


● Dataset at 1.75 billion rows
● Varied the number of cores from 10 to 100
● Results
○ NoEnc 20 cores -> 1s
○ Paillier 100 cores -> 1000s
○ Seabed 50 cores -> between 1.8s and 8.0s
Evaluation
● Datasets
○ 760M rows, real ad-analytics application from MS
● We replaced 10 DET columns with SPLASHE, one by one
● Measured: Relative size increase vs. plaintext dataset
● Results
○ SPLASHE has substantial storage cost
○ Enhanced SPLASHE reduces this cost by up to 10x
● With 10x more storage, we avoid DET entirely

○ Reduces risk of information leaks


Evaluation - Real World Applications

● Same ad-analytics application from Microsoft


○ Measured: End-to-end latency of 15 queries
● Results
○ NoEnc is about 10 times faster than Paillier across all queries
○ Seabed is almost as fast as no encryption (within 15-44%)
● It is possible to do analytics on encrypted big data