FoundationDB Record Layer: A Multi-Tenant Structured Datastore
ABSTRACT
The FoundationDB Record Layer is an open source library that provides a record-oriented data store with semantics similar to a relational database, implemented on top of FoundationDB, an ordered, transactional key-value store. The Record Layer provides a lightweight, highly extensible way to store structured data. It offers schema management and a rich set of query and indexing facilities, some of which are not usually found in traditional relational databases, such as nested record types, indexes on commit versions, and indexes that span multiple record types. The Record Layer is stateless and built for massive multi-tenancy, encapsulating and isolating all of a tenant's state, including indexes, into a separate logical database. We demonstrate how the Record Layer is used by CloudKit, Apple's cloud backend service, to provide powerful abstractions to applications serving hundreds of millions of users. CloudKit uses the Record Layer to host billions of independent databases, many with a common schema. Features provided by the Record Layer enable CloudKit to provide richer APIs and stronger semantics with reduced maintenance overhead and improved scalability.

ACM Reference Format:
Christos Chrysafis, Ben Collins, Scott Dugas, Jay Dunkelberger, Moussa Ehsan, Scott Gray, Alec Grieser, Ori Herrnstadt, Kfir Lev-Ari, Tao Lin, Mike McMahon, Nicholas Schiefer, and Alexander Shraer. 2019. FoundationDB Record Layer: A Multi-Tenant Structured Datastore. In 2019 International Conference on Management of Data (SIGMOD '19), June 30-July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3299869.3314039

1 INTRODUCTION
Many applications require a backend that provides durable storage. Developing and operating such backends presents many challenges. First, the high volume of data collected by modern applications, coupled with the large number of users and high access rates, requires smart partitioning and placement solutions for storing data to achieve horizontal scalability. Modern storage systems must scale elastically in response to increases in user demand for both storage capacity and computation. Second, as systems become larger, formerly "rare" events such as network, disk, and machine failures become everyday occurrences. A core challenge of distributed systems is maintaining system availability and durability in the face of these problems. Third, many techniques for addressing scalability and availability challenges, such as eventual consistency, create immense challenges for application developers. Implementing transactions in a distributed setting remains one of the most challenging problems in distributed systems. Fourth, as stateful services grow, they must support the needs of many diverse users and applications. This multi-tenancy brings many challenges, including isolation, resource sharing, and elasticity in the face of growing load. Many database systems intermingle data from different tenants at both the compute and storage levels. Retrofitting resource isolation to such systems is challenging. Stateful services are especially difficult to scale elastically because state cannot be partitioned arbitrarily. For example, data and indexes cannot be stored on entirely separate storage clusters without sacrificing transactional updates, performance, or both.

These problems must be addressed by any company offering stateful services. Yet despite decades of academic and industrial research, they are notoriously difficult to solve correctly and require a high level of expertise and experience. Big companies use in-house solutions, developed and evolved over many years by large teams of experts. Smaller companies often have little choice but to pay larger cloud providers or sacrifice durability.

FoundationDB [10] democratizes industrial-grade highly-available and consistent storage, making it freely available to anyone as an open source solution [11]; it is currently used in production at companies such as Apple, Snowflake, and Wavefront.
While the semantics, performance, and reliability of FoundationDB make it extremely useful, FoundationDB's data model, an ordered mapping from binary keys to binary values, is often insufficient for applications. Many of them need structured data storage, indexing capabilities, a query language, and more. Without these, application developers are forced to reimplement common functionality, slowing development and introducing bugs.

To address these challenges, we present the FoundationDB Record Layer: an open source record-oriented data store built on top of FoundationDB with semantics similar to a relational database [19, 20]. The Record Layer provides schema management, a rich set of query and indexing facilities, and a variety of features that leverage FoundationDB's advanced capabilities. It inherits FoundationDB's strong ACID semantics, reliability, and performance in a distributed setting. These lightweight abstractions allow for multi-tenancy at an extremely large scale: the Record Layer allows creating isolated logical databases for each tenant (at Apple, it is used to manage billions of such databases), all while providing familiar features such as structured storage and transactional index maintenance.

The Record Layer represents structured values as Protocol Buffer [4] messages called records that include typed fields and even nested records. Since an application's schema inevitably changes over time, the Record Layer includes tools for schema management and evolution. It also includes facilities for planning and efficiently executing declarative queries using a variety of index types. The Record Layer leverages advanced features of FoundationDB; for example, many aggregate indexes are maintained using FoundationDB's atomic mutations, allowing concurrent, conflict-free updates. Beyond its rich feature set, the Record Layer provides a large set of extension points, allowing its clients to extend its functionality even further. For example, client-defined index types can be seamlessly "plugged in" to the index maintainer and query planner. Similarly, record serialization supports client-defined encryption and compression algorithms.

The Record Layer supports multi-tenancy at scale through two key architectural choices. First, the layer is completely stateless, so scaling the compute service is as easy as launching more stateless instances. A stateless design means that load balancers and routers need only consider where the data are located rather than which compute servers can serve them. Furthermore, a stateless server has fewer resources that need to be apportioned among isolated clients. Second, the layer achieves resource sharing and elasticity with its record store abstraction, which encapsulates the state of an entire logical database, including serialized records, indexes, and even operational state. Each record store is assigned a contiguous range of keys, ensuring that data belonging to different tenants is logically isolated. If needed, moving a tenant is as simple as copying the appropriate range of data to another cluster, as everything needed to interpret and operate each record store is found in its key range.

The Record Layer is used by multiple systems at Apple. We demonstrate the power of the Record Layer at scale by describing how CloudKit, Apple's cloud backend service, uses it to provide strongly-consistent data storage for a large and diverse set of applications [43]. Using the Record Layer's abstractions, CloudKit offers multi-tenancy at the extreme by maintaining independent record stores for each user of each application. As a result, we use the Record Layer on FoundationDB to host billions of independent databases sharing thousands of schemata. In the future, we envision that the Record Layer will be combined with other storage models, such as queues and graphs, leveraging FoundationDB as a general purpose storage engine to provide transactional consistency across all these data models. In summary, this work makes the following contributions:
• An open source layer on top of FoundationDB with semantics akin to those of a relational database.
• The record store abstraction and a suite of techniques to manipulate it, enabling billions of logical tenants to operate independent databases in a FoundationDB cluster.
• An extensible architecture allowing clients to customize core features including schema management and indexing.
• A lightweight design that provides rich features on top of the underlying key-value store.

2 BACKGROUND ON FOUNDATIONDB
FoundationDB is a distributed, ordered key-value store that runs on clusters of commodity servers and provides ACID transactions over arbitrary sets of keys using optimistic concurrency control. Its architecture draws from the virtual synchrony paradigm [23, 26] whereby FoundationDB is composed of two logical clusters: one that stores data and processes transactions, and another coordination cluster (running Active Disk Paxos [27]) that maintains membership and configuration for the first cluster. This allows FoundationDB to achieve high availability while requiring only F+1 storage replicas to tolerate F failures [23]. FoundationDB is distinguished by its deterministic simulation testing framework, which can quickly simulate entire clusters under a variety of failure conditions in a single thread with complete determinism. In the past year, we have run more than 250 million simulations, equivalent to more than 1870 years and 3.5 million CPU-hours. This rigorous testing in simulation makes FoundationDB extremely stable and allows its developers to introduce new features and releases at a rapid cadence, unusual among similar strongly-consistent distributed (or even centralized) databases.

Layers. Unlike most databases, which bundle together a storage engine, data model, and query language, forcing users to choose all three or none, FoundationDB takes a modular approach.
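As a toy illustration of both the layer concept and the record store abstraction described earlier, the following sketch models per-tenant record stores as contiguous key ranges in a single ordered keyspace. All names here (`Cluster`, `RecordStore`, `move_tenant`) are hypothetical, chosen for illustration; they are not the Record Layer's actual API.

```python
# Illustrative sketch: each record store owns a contiguous key range
# identified by a unique tenant prefix, so all of its state can be
# read, isolated, or moved with simple range operations.

class Cluster:
    """Stand-in for FoundationDB's single ordered keyspace."""
    def __init__(self):
        self.kv = {}

    def set(self, key: bytes, value: bytes):
        self.kv[key] = value

    def get_range(self, prefix: bytes):
        """Range read: all pairs whose key starts with `prefix`, in key order."""
        return sorted((k, v) for k, v in self.kv.items() if k.startswith(prefix))

    def clear_range(self, prefix: bytes):
        for k in [k for k in self.kv if k.startswith(prefix)]:
            del self.kv[k]

class RecordStore:
    """All of a tenant's state (records, indexes, metadata) lives under one prefix."""
    def __init__(self, cluster: Cluster, prefix: bytes):
        self.cluster, self.prefix = cluster, prefix

    def put(self, subkey: bytes, value: bytes):
        self.cluster.set(self.prefix + subkey, value)

    def scan(self):
        return self.cluster.get_range(self.prefix)

def move_tenant(src: Cluster, dst: Cluster, prefix: bytes):
    """Moving a tenant is just copying its key range: the range is
    self-contained, so nothing else needs to travel with it."""
    for k, v in src.get_range(prefix):
        dst.set(k, v)
    src.clear_range(prefix)
```

Because every key a tenant owns shares the tenant's prefix, logical isolation and tenant moves reduce to range reads, copies, and clears, which mirrors the paper's point that a record store's key range contains everything needed to operate it.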
It provides a highly scalable, transactional storage engine with a minimal yet carefully chosen set of features. For example, it provides no structured semantics, and each cluster provides a single, logical keyspace which it automatically partitions and replicates. Layers can be constructed on top to provide various data models and other capabilities. Currently, the Record Layer is the most substantial layer built on FoundationDB.

Transactions and Semantics. FoundationDB provides ACID multi-key transactions with strictly-serializable isolation, implemented using multi-version concurrency control (MVCC) for reads and optimistic concurrency control for writes. As a result, neither reads nor writes are blocked by other readers or writers. Instead, conflicting transactions fail at commit time and are usually retried by the client. Specifically, a client performing a transaction obtains a read version, chosen as the latest database commit version, by performing a getReadVersion (GRV) call, and performs reads at that version, effectively observing an instantaneous snapshot of the database. Transactions that contain writes can be committed only if none of the values they read have been modified by another transaction since the transaction's read version. Committed transactions are written to disk on multiple cluster nodes and then acknowledged to the client. FoundationDB executes operations within a transaction in parallel, while preserving the program order of accesses to each key and guaranteeing that a read following a write to the same key within a transaction returns the written value. FoundationDB imposes a 5-second transaction time limit, which the Record Layer compensates for using techniques described in Section 4.

Besides create, read, update, and delete (CRUD) operations, FoundationDB provides atomic read-modify-write operations on single keys (e.g., addition, min/max). Atomic operations occur within a transaction like other operations, but they do not create read conflicts, so a concurrent change to the affected value does not cause the transaction to abort. For example, a counter incremented concurrently by many clients would be best implemented using atomic addition.

FoundationDB allows clients to customize the default concurrency control behavior by trading off isolation semantics to reduce transaction conflicts. Snapshot reads do not cause an abort even if the read key was overwritten by a later transaction. For example, a transaction may wish to read a value that is monotonically increasing over time just to determine whether the value has reached some threshold. In this case, reading the value at the snapshot isolation level (or clearing the read conflict range for the value's key) allows the transaction to see the state of the value without conflicting with other transactions that may be modifying the value concurrently.

Keys, values, and order. Keys and values in FoundationDB are opaque binary values. FoundationDB imposes limits on key and value sizes (10 kB for keys and 100 kB for values), with much smaller recommended sizes (32 B for keys and up to 10 kB for values). Transactions are limited to 10 MB, including the key and value size of all keys written and the sizes of all keys in the read and write conflict ranges of the commit. In production at Apple, with workloads generated by CloudKit through the Record Layer, the median and 99th percentile transaction sizes are approximately 7 kB and 36 kB, respectively. Keys are part of a single global keyspace, and it is up to applications to divide and manage that keyspace with the help of several convenience APIs such as the tuple and directory layers described next. FoundationDB supports range reads based on the binary ordering of keys. Finally, range clear operations are supported and can be used to clear all keys in a certain range or starting with a certain prefix.

Tuple and directory layers. Key ordering in FoundationDB makes tuples a convenient and simple way to model data. The tuple layer, included within the FoundationDB client library, encodes tuples into keys such that the binary ordering of those keys preserves the ordering of tuples and the natural ordering of typed tuple elements. In particular, a common prefix of the tuple is serialized as a common byte prefix and defines a key subspace. For example, a client may store the tuple (state, city) and later read using a prefix like (state, *).

The directory layer provides an API for defining a logical directory structure which maps potentially long-but-meaningful binary strings to short binary strings, reducing the amount of space used by keys. For example, if all of an application's keys are prefixed with its name, the prefix might be added to the directory layer. Internally, the directory layer assigns values to its entries using a sliding-window allocation algorithm that concurrently allocates unique mappings while keeping the allocated integers small.
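The order-preserving property of the tuple layer can be sketched in a few lines. The encoding below is a simplification in the same spirit, not FoundationDB's actual wire format: it assumes every element is a byte string containing no NUL bytes (the real encoding escapes NULs and supports many more types).

```python
# Minimal sketch of an order-preserving tuple encoding. Each element
# is framed with a type byte and a NUL terminator, so byte-wise key
# order matches tuple order and a tuple prefix becomes a byte prefix.
# NOT FoundationDB's real format; assumes NUL-free byte strings.

def pack(tup):
    """Encode a tuple of byte strings into a single binary key."""
    out = b""
    for element in tup:
        out += b"\x02" + element + b"\x00"   # type byte, payload, terminator
    return out

# Byte-wise order of packed keys matches the order of the tuples...
tuples = [(b"CA", b"SF"), (b"CA", b"LA"), (b"NY", b"NYC"), (b"CA",)]
assert sorted(tuples) == sorted(tuples, key=pack)

# ...and a tuple prefix becomes a byte prefix, defining a subspace:
# a range read on pack((b"CA",)) returns every (state, city) key
# whose state is "CA".
assert pack((b"CA", b"SF")).startswith(pack((b"CA",)))
```

This is why a range read over a common byte prefix corresponds exactly to a query over a tuple subspace, as in the (state, *) example above.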
3 DESIGN GOALS AND PRINCIPLES
The Record Layer is designed to provide powerful abstractions for operations on structured data, including transaction processing, schema management, and query execution. It aims to provide this rich set of features in a massively multi-tenant environment where it must serve the needs of a great number of tenants with diverse storage and retrieval needs. For example, some tenants store and locally synchronize a small amount of data while others store large amounts of data and query it in complex ways.

To illustrate this challenge concretely, Figure 3 shows the distribution of the sizes of record stores used to store CloudKit's private databases [43]. Although the vast majority of databases are quite small, much of the data is stored in relatively large databases. Furthermore, the data in the figure excludes larger databases such as the "public" databases shared across all users of an application, which can be terabytes in size.

In the face of such a range of database sizes, it is tempting to provision and operate separate systems to store "small data" and "big data." However, such an approach poses both operational challenges and difficulties for application developers, who must contend with varying semantics across systems. Instead, the Record Layer offers the semantics of a structured data store with massive multi-tenancy in mind by providing features that serve the diverse needs of its tenants and mechanisms to isolate them from each other.

3.1 Design principles
Achieving these goals required a variety of careful design decisions, some of which differ from traditional relational databases. This section highlights several of these principles.

Statelessness. In many distributed databases, individual database servers maintain ephemeral, server-specific state, such as memory and disk buffers, in addition to persistent information such as data and indexes. In contrast, the Record Layer stores all of its state in FoundationDB or returns it to the client, so that the layer itself is completely stateless. For example, the Record Layer does not maintain any state about the position of a cursor in memory; instead, the context needed to advance a cursor in a future request is serialized and returned to the client as a continuation.

The stateless nature of the layer has three primary benefits, which we illustrate with the same example of streaming data from a cursor. First, it simplifies request routing: a request can be routed to any of the stateless servers, even if it requests more results from an existing cursor, since there is no buffer on a particular server that needs to be accessed. Second, it substantially simplifies the operation of the system at scale: if a server encounters a problem, it can be safely restarted without needing to transfer cursor state to another server. Lastly, storing state in FoundationDB ensures that all state has the same ACID semantics as the underlying key-value store: we do not need separate facilities for verifying the integrity of our cursor-specific metadata.

Streaming model for queries. The Record Layer controls its resource consumption by limiting its semantics to those that can be implemented on streams of records. For example, it supports ordered queries (as in SQL's ORDER BY clause) only when there is an available index supporting the requested sort order. This approach enables supporting concurrent workloads without requiring stateful memory pools in the server. This also reflects the layer's general design philosophy.
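The continuation mechanism described under Statelessness can be sketched as follows. For simplicity, this sketch assumes records are streamed in key order and uses the last returned key as the continuation; the function name and the continuation format are illustrative assumptions, and the Record Layer's real continuations are richer serialized objects.

```python
# Illustrative sketch of stateless cursor continuations: the server
# keeps no cursor state; it returns the last key it emitted, and any
# server can resume the scan from that point.

def scan(store, limit, continuation=None):
    """Return up to `limit` records after `continuation`, plus a new
    continuation (None when the scan is exhausted)."""
    keys = sorted(store)
    if continuation is not None:
        keys = [k for k in keys if k > continuation]
    page = keys[:limit]
    next_continuation = page[-1] if len(keys) > limit else None
    return [(k, store[k]) for k in page], next_continuation

store = {"r1": "a", "r2": "b", "r3": "c", "r4": "d", "r5": "e"}
results, cont = scan(store, limit=2)          # first page, from any server
while cont is not None:                       # any other server can resume
    page, cont = scan(store, limit=2, continuation=cont)
    results.extend(page)
assert [k for k, _ in results] == ["r1", "r2", "r3", "r4", "r5"]
```

Because the continuation fully describes the scan position, no request is pinned to the server that produced the previous page, which is exactly the routing and restart benefit described above.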
[Figure 3: distribution of record store sizes (y-axis: fraction of record stores).]

[Figure: CloudKit deployment of the Record Layer. Multiple stateless CloudKit Service instances, each embedding the Record Layer, operate on per-user, per-application record stores (containing data and indexes) and on a Metadata Store holding schemata.]
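Earlier, the paper notes that many aggregate indexes are maintained using FoundationDB's atomic mutations, allowing concurrent, conflict-free updates. The following toy model is an assumption-laden sketch of why that works under optimistic concurrency control, not FoundationDB's implementation: a plain read-modify-write puts the counter key in the transaction's read set, while an atomic add does not, so concurrent increments never invalidate each other.

```python
# Toy model of optimistic concurrency with atomic mutations. Normal
# reads record read conflicts; atomic_add records only a blind write,
# so two concurrent increments both commit. Class names and conflict
# bookkeeping here are simplified assumptions for illustration.

class DB:
    def __init__(self):
        self.data, self.changed = {}, set()

class Txn:
    def __init__(self, db):
        self.db, self.read_set, self.writes = db, set(), []

    def read(self, key):
        self.read_set.add(key)            # normal reads create read conflicts
        return self.db.data.get(key, 0)

    def atomic_add(self, key, n):
        self.writes.append((key, n))      # blind write: no read-set entry

    def commit(self):
        # Optimistic check: abort if any key we read has since changed.
        if self.read_set & self.db.changed:
            raise RuntimeError("conflict")
        for key, n in self.writes:
            self.db.data[key] = self.db.data.get(key, 0) + n
            self.db.changed.add(key)

db = DB()
t1, t2 = Txn(db), Txn(db)                 # two concurrent increments
t1.atomic_add("counter", 1)
t2.atomic_add("counter", 1)
t1.commit()
t2.commit()                               # no abort: neither txn read the key
assert db.data["counter"] == 2
```

Had either transaction read the counter before incrementing it, the commit-time check would have aborted the second one; this is why a hot counter is best implemented with atomic addition.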
Indexes are defined by an index type and a key expression, which defines a function from a record to one or more tuples consumed by the index maintainer and used to form the index key. The Record Layer includes a variety of key expressions and also allows clients to define custom ones.

The simplest key expression is field. When evaluated against the sample record in Figure 4, field("id") yields the tuple (1066). Unlike standard SQL columns, Protocol Buffer fields permit repeated fields and nested messages.

[Figure (a): skip-list layers stored as key ranges (Layer 0: prefix/0, Layer 1: prefix/1, Layer 2: prefix/2). Each contiguous key range used by the skip-list is represented by a blue line and prefixed with the leading prefix of that subspace. Key-value pairs are shown as points on the keyspace line with a key and value.]
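A minimal sketch of the field key expression may make this concrete. Records are modeled as dicts, and the handling of repeated and nested fields shown here (`field`, `nest`) is an illustrative assumption, not the Record Layer's exact evaluation rules.

```python
# Illustrative sketch of a `field` key expression: a function from a
# record to one or more tuples used to form index keys. Repeated
# fields fan out into one tuple per value; a nested expression
# evaluates against the embedded message.

def field(name):
    def evaluate(record):
        value = record[name]
        if isinstance(value, list):      # repeated field: one tuple per value
            return [(v,) for v in value]
        return [(value,)]                # scalar field: a single one-element tuple
    return evaluate

def nest(outer, inner):
    """Evaluate `inner` against the message stored in field `outer`."""
    def evaluate(record):
        return inner(record[outer])
    return evaluate

record = {"id": 1066, "name": "record", "tags": ["a", "b"],
          "parent": {"id": 1000}}

assert field("id")(record) == [(1066,)]           # as in the text: yields (1066)
assert field("tags")(record) == [("a",), ("b",)]  # repeated field fans out
assert nest("parent", field("id"))(record) == [(1000,)]
```

The tuples produced this way would be combined with the record's primary key and packed into binary index keys, so that an index maintainer can keep them up to date as records change.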