Advanced Data Management - For SQL, NoSQL, Cloud and Distributed Databases


Lena Wiese

Advanced Data Management


De Gruyter Graduate
Further recommended titles
Datenbanksysteme, 10. Auflage
A. Kemper, 2016
ISBN 978-3-11-044375-2

Analyse und Design mit der UML 2.5, 11. Auflage


B. Oestereich, A. Scheithauer, 2011
ISBN 978-3-486-72140-9

Die UML-Kurzreferenz 2.5 für die Praxis, 6. Auflage


B. Oestereich, A. Scheithauer, 2014
ISBN 978-3-486-74909-0

Algorithmen – Eine Einführung, 4. Auflage


T. Cormen et al., 2013
ISBN 978-3-486-74861-1
Lena Wiese
Advanced Data
Management

For SQL, NoSQL, Cloud and Distributed Databases


Author
Dr. Lena Wiese
Georg-August-Universität Göttingen
Fakultät für Mathematik und Informatik
Institut für Informatik
Goldschmidtstraße 7
37077 Göttingen
Germany
lena.wiese@udo.edu

ISBN 978-3-11-044140-6
e-ISBN (PDF) 978-3-11-044141-3
e-ISBN (EPUB) 978-3-11-043307-4

Library of Congress Cataloging-in-Publication Data


A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek


The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie;
detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2015 Walter de Gruyter GmbH, Berlin/Boston


Cover image: Tashatuvango/iStock/thinkstock
Printing and binding: CPI books GmbH, Leck
♾ Printed on acid-free paper
Printed in Germany

www.degruyter.com
To my family
Preface
During the last two decades, the landscape of database management systems has
changed immensely. Based on the fact that data are nowadays stored and managed in
networks of distributed servers (“clusters”) and these servers consist of cheap hardware
(“commodity hardware”), data of previously unthinkable magnitude (“big data”) are
produced, transferred, stored, modified, transformed, and in the end possibly deleted.
This form of continuous change calls for flexible data structures and efficient dis-
tributed storage systems with both a high read and write throughput. In many novel
applications, the conventional table-like (“relational”) data format may not be the data
structure of choice – for example, when easy exchange of data or fast retrieval become
vital requirements. For historical reasons, conventional database management sys-
tems are not explicitly geared toward distribution and continuous change, as most im-
plementations of database management systems date back to a time where distributed
storage was not a major requirement. These deficiencies may also be attributed
to the fact that conventional database management systems try to incorporate several
database standards as well as to provide high safety guarantees (for example, regarding
concurrent user accesses or correctness and consistency of data).
Several kinds of database systems have emerged and evolved over the last years
that depart from the established tracks of data management and data formats in differ-
ent ways. Development of these emergent systems started from scratch and gave rise to
new data models, new query engines and languages, and new storage organizations.
Two features of these systems are particularly remarkable: on the one hand, a
wide range of open source products are available (though some systems are supported
by or even originated from large international companies) and development can be
observed or even be influenced by the public; on the other hand, several results and
approaches achieved by long-standing database research (having its roots at least as
early as the 1960s) have been put into practice in these database systems and these
research results now show their merits for novel applications in modern data manage-
ment. On the downside, there are basically no standards (with respect to data formats
or query languages) in this novel area and hence portability of application code or
long-term support can usually not be guaranteed. Moreover, these emerging systems
are not as mature (and probably not as reliable) as conventional established systems.
The term NOSQL has been used as an umbrella term for several emerging database
systems without an exact formal definition. Starting with the notion of NoSQL (which
can be interpreted as saying no to SQL as a query language) it has evolved to mean
“not only SQL” (and hence written as NOSQL with a capital O). The actual origin
of the term is ascribed to the 2009 “NOSQL meetup”: a meeting with presentations
of six database systems (Voldemort, Cassandra, Dynomite, HBase, Hypertable, and
CouchDB). Still, the question of what exactly a NOSQL database system is cannot be
answered unanimously; nevertheless, some structure slowly becomes visible in the
NOSQL field and has led to a broad categorization of NOSQL database systems. Main
categories of NOSQL systems are key-value stores, document stores, extensible record
stores (also known as column family stores) and graph databases. Yet, other creatures
live out there in the database jungle: object databases and XML databases do not
espouse either the relational data model or SQL as a query language – but they typically
would not be considered NOSQL database systems (probably because they predate
the NOSQL systems). Moreover, column stores are an interesting variant of relational
database systems.
This book is meant as a textbook for computer science lectures. It is based on
Master-level database lectures and seminars held at the universities of Hildesheim
and Göttingen. As such it provides a formal analysis of alternative, non-relational data
models and storage mechanisms and gives a decent overview of non-SQL query lan-
guages. However, it does not put much focus on installing or setting up database sys-
tems and hence complements other books that concentrate on more technical aspects.
This book also surveys storage internals and implementation details from an abstract
point of view and describes common notions as well as possible design choices (rather
than singling out one particular database system and specializing on its technical fea-
tures).
This book intends to give students a perspective beyond SQL and relational
database management systems and thus covers the theoretical background of mod-
ern data management. Nevertheless this book is also aimed at database practitioners:
it wants to help developers and database administrators come to an informed de-
cision about what database systems are most beneficial for their data management
requirements.
Overview
This book consists of four parts. Part I Introduction commences the book with a general
introduction to the basics of data management and data modeling.
Chapter 1 Background (page 3) provides a justification why we need databases
in modern society. Desired properties of modern database systems like scalabil-
ity and reliability are defined. Technical internals of database management sys-
tems (DBMSs) are explained with a focus on memory management. Central com-
ponents of a DBMS (like buffer manager or recovery manager) are explored. Next,
database design is discussed; a brief review of Entity-Relationship Models (ERM)
and the Unified Modeling Language (UML) rounds this chapter off.
Chapter 2 Relational Database Management Systems (page 17) contains a review
of the relational data model by defining relation schemas, database schemas and
database constraints. It continues with a example of how to transform an ERM into
a relational database schema. Next, it illustrates the core concepts of relational
database theory like normalization to avoid anomalies, referential integrity, rela-
tional query languages (relational calculus, relational algebra and SQL), concur-
rency management and transactions (including the ACID properties, concurrency
control and scheduling).

Part II NOSQL And Non-Relational Databases comprises the main part of this book. In
its eight chapters it gives an in-depth discussion of data models and database systems
that depart from the conventional relational data model.
Chapter 3 New Requirements, “Not only SQL” and the Cloud (page 33) admits that
relational database management systems (RDBMSs) have their strengths and mer-
its but then contrasts them with cases where the relational data model might
be inadequate and touches on weaknesses that current implementations of re-
lational DBMSs might have. The chapter concludes with a description of current
challenges in data management and a definition of NOSQL databases.
Chapter 4 Graph Databases (page 41) begins by explaining some basics of graph
theory. Having presented several choices for graph data structures (from adja-
cency matrix to incidence list), it describes the predominant data model for graph
databases: the property graph model. After a brief digression on how to map
graphs to an RDBMS, two advanced types of graphs are introduced: hypergraphs
and nested graphs.
Chapter 5 XML Databases (page 69) expounds the basics of XML (like XML docu-
ments and schemas, and numbering schemes) and surveys XML query languages.
Then, the chapter shifts to the issue of storing XML in an RDBMS. Finally, the chap-
ter describes the core concepts of native XML storage (like indexing, storage man-
agement and concurrency control).
Chapter 6 Key-value Stores and Document Databases (page 105) puts forward the
simple data structure of key-value pairs and introduces the map-reduce concept
as a pattern for parallelized processing of key-value pairs. Next, as a form of
nested key-value pairs, the JavaScript Object Notation (JSON) is introduced. JSON
Schema and Representational State Transfer are further topics of this chapter.
Chapter 7 Column Stores (page 143) outlines the column-wise storage of tabular
data (in contrast to row-wise storage). Next, the chapter delineates several ways
for compressed storage of data to achieve a more compact representation based
on the fact that data in a column is usually more uniform than data in a row. Lastly,
column striping is introduced as a recent methodology to convert nested records
into a columnar representation.
Chapter 8 Extensible Record Stores (page 161) describes a flexible multidimen-
sional data model based on column families. The surveyed database technologies
also include ordered storage and versioning. After defining the logical model, the
chapter explains the core concepts of the storage structures used on disk and the
ways to handle writes, reads and deletes with immutable data files. This also in-
cludes optimizations like indexing, compaction and Bloom filters.
Chapter 9 Object Databases (page 193) starts with a review of object-oriented no-
tions and concepts; this review gives particular focus to object identifiers, object
normalization and referential integrity. Next, several options for object-relational
mapping (ORM) – that is, how to store objects in an RDBMS – are discussed; the
ORM approach is exemplified with the Java Persistence API (JPA). The chapter
moves on to object-relational databases that offer object-oriented extensions in
addition to their basic RDBMS functionalities. Lastly, several issues of storing ob-
jects natively with an Object Database Management System (ODBMS) – like for
example, object persistence and reference management – are attended to.

Part III Distributed Data Management treats the core concepts of data management
when data are scaled out – that is, data are distributed in a network of database
servers.
Chapter 10 Distributed Database Systems (page 235) looks at the basics of data
distribution. Failures in distributed systems and requirements for distributed
database management systems are addressed.
Chapter 11 Data Fragmentation (page 245) targets ways to split data across a set
of servers – also known as partitioning or sharding. Sev-
eral fragmentation strategies for each of the different data models are discussed.
Special focus is given to consistent hashing.
Chapter 12 Replication And Synchronization (page 261) elucidates the background
on replication for the sake of increased availability and reliability of the database sys-
tems. Afterwards, replication-related issues like distributed concurrency control
and consensus protocols as well as hinted handoff and Merkle trees are discussed.
Chapter 13 Consistency (page 295) touches upon the topic of relaxing strong con-
sistency requirements known from RDBMSs into weaker forms of consistency.

Part IV Conclusion is the final part of this book.


Chapter 14 Further Database Technologies (page 311) gives a cursory overview
of related database topics that are out of the scope of this book. Among other
topics, it glimpses at data stream processing, in-memory databases and NewSQL
databases.
Chapter 15 Concluding Remarks (page 317) summarizes the main points of this
book and discusses approaches for database reengineering and data migration.
Lastly, it advocates the idea of polyglot architectures: for each of the different data
storage and processing tasks in an enterprise, users are free to choose a database
system that is most appropriate for one task while using different database sys-
tems for other tasks and lastly integrating these systems into a common storage
and processing architecture.
Contents
Preface | VII

Overview | IX

List of Figures | XIX

List of Tables | XXII

Part I: Introduction

1 Background | 3
1.1 Database Properties | 3
1.2 Database Components | 5
1.3 Database Design | 7
1.3.1 Entity-Relationship Model | 8
1.3.2 Unified Modeling Language | 11
1.4 Bibliographic Notes | 14

2 Relational Database Management Systems | 17


2.1 Relational Data Model | 17
2.1.1 Database and Relation Schemas | 17
2.1.2 Mapping ER Models to Schemas | 18
2.2 Normalization | 19
2.3 Referential Integrity | 20
2.4 Relational Query Languages | 22
2.5 Concurrency Management | 24
2.5.1 Transactions | 24
2.5.2 Concurrency Control | 26
2.6 Bibliographic Notes | 28

Part II: NOSQL And Non-Relational Databases

3 New Requirements, “Not only SQL” and the Cloud | 33


3.1 Weaknesses of the Relational Data Model | 33
3.1.1 Inadequate Representation of Data | 33
3.1.2 Semantic Overloading | 34
3.1.3 Weak Support for Recursion | 34
3.1.4 Homogeneity | 35
3.2 Weaknesses of RDBMSs | 36
3.3 New Data Management Challenges | 37
3.4 Bibliographic Notes | 39

4 Graph Databases | 41
4.1 Graphs and Graph Structures | 41
4.1.1 A Glimpse on Graph Theory | 42
4.1.2 Graph Traversal and Graph Problems | 44
4.2 Graph Data Structures | 45
4.2.1 Edge List | 46
4.2.2 Adjacency Matrix | 46
4.2.3 Incidence Matrix | 48
4.2.4 Adjacency List | 50
4.2.5 Incidence List | 51
4.3 The Property Graph Model | 53
4.4 Storing Property Graphs in Relational Tables | 56
4.5 Advanced Graph Models | 58
4.6 Implementations and Systems | 62
4.6.1 Apache TinkerPop | 62
4.6.2 Neo4J | 65
4.6.3 HyperGraphDB | 66
4.7 Bibliographic Notes | 68

5 XML Databases | 69
5.1 XML Background | 69
5.1.1 XML Documents | 69
5.1.2 Document Type Definition (DTD) | 71
5.1.3 XML Schema Definition (XSD) | 73
5.1.4 XML Parsers | 75
5.1.5 Tree Model of XML Documents | 76
5.1.6 Numbering Schemes | 78
5.2 XML Query Languages | 81
5.2.1 XPath | 81
5.2.2 XQuery | 82
5.2.3 XSLT | 83
5.3 Storing XML in Relational Databases | 84
5.3.1 SQL/XML | 84
5.3.2 Schema-Based Mapping | 86
5.3.3 Schemaless Mapping | 89
5.4 Native XML Storage | 90
5.4.1 XML Indexes | 90
5.4.2 Storage Management | 92


5.4.3 XML Concurrency Control | 97
5.5 Implementations and Systems | 100
5.5.1 eXistDB | 100
5.5.2 BaseX | 102
5.6 Bibliographic Notes | 104

6 Key-value Stores and Document Databases | 105


6.1 Key-Value Storage | 105
6.1.1 Map-Reduce | 106
6.2 Document Databases | 109
6.2.1 Java Script Object Notation | 110
6.2.2 JSON Schema | 112
6.2.3 Representational State Transfer | 116
6.3 Implementations and Systems | 118
6.3.1 Apache Hadoop MapReduce | 118
6.3.2 Apache Pig | 121
6.3.3 Apache Hive | 127
6.3.4 Apache Sqoop | 128
6.3.5 Riak | 129
6.3.6 Redis | 132
6.3.7 MongoDB | 133
6.3.8 CouchDB | 136
6.3.9 Couchbase | 139
6.4 Bibliographic Notes | 140

7 Column Stores | 143


7.1 Column-Wise Storage | 143
7.1.1 Column Compression | 144
7.1.2 Null Suppression | 149
7.2 Column striping | 151
7.3 Implementations and Systems | 158
7.3.1 MonetDB | 158
7.3.2 Apache Parquet | 158
7.4 Bibliographic Notes | 159

8 Extensible Record Stores | 161


8.1 Logical Data Model | 161
8.2 Physical storage | 166
8.2.1 Memtables and immutable sorted data files | 166
8.2.2 File format | 169
8.2.3 Redo logging | 171
8.2.4 Compaction | 173


8.2.5 Bloom filters | 175
8.3 Implementations and Systems | 181
8.3.1 Apache Cassandra | 181
8.3.2 Apache HBase | 185
8.3.3 Hypertable | 187
8.3.4 Apache Accumulo | 189
8.4 Bibliographic Notes | 191

9 Object Databases | 193


9.1 Object Orientation | 193
9.1.1 Object Identifiers | 194
9.1.2 Normalization for Objects | 196
9.1.3 Referential Integrity for Objects | 200
9.1.4 Object-Oriented Standards and Persistence Patterns | 200
9.2 Object-Relational Mapping | 202
9.2.1 Mapping Collection Attributes to Relations | 203
9.2.2 Mapping Reference Attributes to Relations | 204
9.2.3 Mapping Class Hierarchies to Relations | 204
9.2.4 Two-Level Storage | 208
9.3 Object Mapping APIs | 209
9.3.1 Java Persistence API (JPA) | 209
9.3.2 Apache Java Data Objects (JDO) | 215
9.4 Object-Relational Databases | 217
9.5 Object Databases | 222
9.5.1 Object Persistence | 223
9.5.2 Single-Level Storage | 224
9.5.3 Reference Management | 226
9.5.4 Pointer Swizzling | 226
9.6 Implementations and Systems | 229
9.6.1 DataNucleus | 229
9.6.2 ZooDB | 230
9.7 Bibliographic Notes | 232

Part III: Distributed Data Management

10 Distributed Database Systems | 235


10.1 Scaling horizontally | 235
10.2 Distribution Transparency | 236
10.3 Failures in Distributed Systems | 237
10.4 Epidemic Protocols and Gossip Communication | 239
10.4.1 Hash Trees | 241


10.4.2 Death Certificates | 243
10.5 Bibliographic Notes | 244

11 Data Fragmentation | 245


11.1 Properties and Types of Fragmentation | 245
11.2 Fragmentation Approaches | 249
11.2.1 Fragmentation for Relational Tables | 249
11.2.2 XML Fragmentation | 250
11.2.3 Graph Partitioning | 252
11.2.4 Sharding for Key-Based Stores | 253
11.2.5 Object Fragmentation | 254
11.3 Data Allocation | 255
11.3.1 Cost-based allocation | 256
11.3.2 Consistent Hashing | 257
11.4 Bibliographic Notes | 259

12 Replication And Synchronization | 261


12.1 Replication Models | 261
12.1.1 Master-Slave Replication | 262
12.1.2 Multi-Master Replication | 263
12.1.3 Replication Factor and the Data Replication Problem | 263
12.1.4 Hinted Handoff and Read Repair | 265
12.2 Distributed Concurrency Control | 266
12.2.1 Two-Phase Commit | 266
12.2.2 Paxos Algorithm | 268
12.2.3 Multiversion Concurrency Control | 276
12.3 Ordering of Events and Vector Clocks | 276
12.3.1 Scalar Clocks | 277
12.3.2 Concurrency and Clock Properties | 280
12.3.3 Vector Clocks | 281
12.3.4 Version Vectors | 284
12.3.5 Optimizations of Vector Clocks | 289
12.4 Bibliographic Notes | 293

13 Consistency | 295
13.1 Strong Consistency | 295
13.1.1 Write and Read Quorums | 298
13.1.2 Snapshot Isolation | 300
13.2 Weak Consistency | 302
13.2.1 Data-Centric Consistency Models | 303
13.2.2 Client-Centric Consistency Models | 305
13.3 Consistency Trade-offs | 306


13.4 Bibliographic Notes | 307

Part IV: Conclusion

14 Further Database Technologies | 311


14.1 Linked Data and RDF Data Management | 311
14.2 Data Stream Management | 312
14.3 Array Databases | 313
14.4 Geographic Information Systems | 314
14.5 In-Memory Databases | 315
14.6 NewSQL Databases | 315
14.7 Bibliographic Notes | 316

15 Concluding Remarks | 317


15.1 Database Reengineering | 317
15.2 Database Requirements | 318
15.3 Polyglot Database Architectures | 320
15.3.1 Polyglot Persistence | 320
15.3.2 Lambda Architecture | 322
15.3.3 Multi-Model Databases | 322
15.4 Implementations and Systems | 324
15.4.1 Apache Drill | 324
15.4.2 Apache Druid | 326
15.4.3 OrientDB | 327
15.4.4 ArangoDB | 330
15.5 Bibliographic Notes | 331

Bibliography | 333

Index | 347
List of Figures
1.1 Database management system and interacting components | 5
1.2 ER diagram | 11
1.3 UML diagram | 15

2.1 An algebra tree (left) and its optimization (right) | 24

3.1 Example for semantic overloading | 34

4.1 A social network as a graph | 41


4.2 Geographical data as a graph | 42
4.3 A property graph for a social network | 55
4.4 Violation of uniqueness of edge labels | 56
4.5 Two undirected hyperedges | 58
4.6 A directed hyperedge | 59
4.7 An oriented hyperedge | 60
4.8 A hypergraph with generalized hyperedge “Citizens” | 60
4.9 A nested graph | 62

5.1 Navigation in an XML tree | 77


5.2 XML tree | 78
5.3 XML tree with preorder numbering | 79
5.4 Pre/post numbering and pre/post plane | 79
5.5 DeweyID numbering | 80
5.6 Chained memory pages | 93
5.7 Chained memory pages with text extraction | 94
5.8 B-tree structure for node IDs in pages | 95
5.9 Page split due to node insertion | 96
5.10 Conflicting accesses in an XML tree | 98
5.11 Locks in an XML tree | 99

6.1 A map-reduce example | 107


6.2 A map-reduce-combine example | 109

7.1 Finite state machine for record assembly | 157

8.1 Writing to memory tables and data files | 167


8.2 Reading from memory tables and data files | 168
8.3 File format of data files | 170
8.4 Multilevel index in data files | 171
8.5 Write-ahead log on disk | 172


8.6 Compaction on disk | 173
8.7 Leveled compaction | 175
8.8 Bloom filter for a data file | 176
8.9 A Bloom filter of length m = 16 with three hash functions | 178
8.10 A partitioned Bloom filter with k = 4 and partition length m′ = 4 | 181

9.1 Generalization (left) versus abstraction (right) | 195


9.2 Unnormalized objects | 197
9.3 First object normal form | 198
9.4 Second object normal form | 198
9.5 Third object normal form | 199
9.6 Fourth object normal form | 200
9.7 Simple class hierarchy | 205
9.8 Resident Object Table (grey: resident, white: non-resident) | 227
9.9 Edge Marking (grey: resident, white: non-resident) | 228
9.10 Node Marking (grey: resident, white: non-resident) | 228

10.1 A hash tree for four messages | 242

11.1 XML fragmentation with shadow nodes | 252


11.2 Graph partitioning with shadow nodes and shadow edges | 253
11.3 Data allocation with consistent hashing | 257
11.4 Server removal with consistent hashing | 258
11.5 Server addition with consistent hashing | 259

12.1 Master-slave replication | 262


12.2 Master-slave replication with multiple records | 263
12.3 Multi-master replication | 263
12.4 Failure and recovery of a server | 264
12.5 Failure and recovery of two servers | 264
12.6 Two-phase commit: commit case | 267
12.7 Two-phase commit: abort case | 268
12.8 A basic Paxos run without failures | 270
12.9 A basic Paxos run with a failing leader | 272
12.10 A basic Paxos run with dueling proposers | 273
12.11 A basic Paxos run with a minority of failing acceptors | 274
12.12 A basic Paxos run with a majority of failing acceptors | 275
12.13 Lamport clock with two processes | 279
12.14 Lamport clock with three processes | 279
12.15 Lamport clock totally ordered by process identifiers | 280
12.16 Lamport clock with independent events | 281
12.17 Vector clock | 283


12.18 Vector clock with independent events | 284
12.19 Version vector synchronization with union merge | 287
12.20 Version vector synchronization with siblings | 288
12.21 Version vector with replica IDs and stale context | 291
12.22 Version vector with replica IDs and concurrent write | 292

13.1 Interfering operations at three replicas | 296


13.2 Serial execution at three replicas | 297
13.3 Read-one write-all quorum (left) and majority quorum (right) | 298

15.1 Polyglot persistence with integration layer | 321


15.2 Lambda architecture | 323
15.3 A multi-model database | 324
List of Tables
2.1 A relational table | 17
2.2 Unnormalized relational table | 20
2.3 Normalized relational table | 21

3.1 Base table for recursive query | 35


3.2 Result table for recursive query | 35

4.1 Node table and attribute table for a node type | 56


4.2 Edge table | 57
4.3 Attribute table for an edge type | 57
4.4 General attribute table | 57

5.1 Schema-based mapping | 88


5.2 Schemaless mapping | 89

7.1 Run-length encoding | 145


7.2 Bit-vector encoding | 145
7.3 Dictionary encoding | 146
7.4 Dictionary encoding for sequences | 146
7.5 Frame of reference encoding | 147
7.6 Frame of reference encoding with exception | 147
7.7 Differential encoding | 148
7.8 Differential encoding with exception | 148
7.9 Position list encoding | 150
7.10 Position bit-string encoding | 150
7.11 Position range encoding | 151
7.12 Column striping example | 157

8.1 Library tables revisited | 161


8.2 False positive probability for m = 4 · n | 180
8.3 False positive probability for m = 8 · n | 180

9.1 Unnormalized representation of collection attributes | 203


9.2 Normalized representation of collection attributes | 204
9.3 Collection attributes as sets | 219

11.1 Vertical fragmentation | 249


11.2 Horizontal fragmentation | 250
Part I: Introduction
1 Background
Database systems are fundamental for the information society. Every day, an ines-
timable amount of data is produced, collected, stored and processed: online shop-
ping, sending emails, using social media, or seeing your physician are just some of
the day-to-day activities that involve data management. A properly working database
management system is hence crucial for a smooth operation of these activities. In this
chapter, we introduce the principles and properties that a database system should ful-
fill. Database management systems and their components as well as data modeling are
the other two basic concepts treated in this chapter.

1.1 Database Properties

As data storage plays such a crucial role in most applications, database systems
should guarantee a correct and reliable execution in several use cases. From an ab-
stract perspective, we desire that a database system fulfill the following properties:
Data management. A database system not only stores data; it must also support
operations for retrieving, searching and updating data. To enable interoperability
with external applications, the database system must provide communication
interfaces or application programming interfaces for several communication
protocols or programming languages. A database system should also support
transactions: a transaction is a sequence of operations on data in a database that
must not be interrupted. In other words, the database executes operations within a
transaction according to the “all or nothing” principle: either all operations succeed
to their full extent or none of them is executed (and any subsequence of operations
that was already executed is undone); a small SQL sketch of this principle is shown
after this list.
Scalability. The amount of data processed daily with modern information tech-
nology is tremendous. Processing these data can only be achieved by distribu-
tion of data in a network of database servers and a high level of parallelization.
Database systems must flexibly react and adapt to a higher workload.
Heterogeneity. When collecting data or producing data (as output of some pro-
gram), these data are usually not tailored to being stored in a relational table for-
mat. While the data in relational format are called structured and have a fixed
schema which prescribes the structure of the data, data often come in different
formats. Data that have a more flexible structure than the table format are called
semi-structured; these can be tree-like structures (as used in XML documents) or –
more generally – graph structures. Furthermore, data can be entirely unstructured
(like arbitrary text documents).
Efficiency. The majority of database applications need fast database systems. On-
line shopping and web searches rely on high-performance search and retrieval
4 | 1 Background

operations. Likewise, other database operations like store and update must be
executed in a speedy fashion to ensure operability of database applications.
Persistence. The main purpose of a database system is to provide a long-term
storage facility for data. Some modern database applications (like data stream
processing) just require a kind of selective persistence: only some designated out-
put data have to be stored onto long-term storage devices, whereas the majority
of the data is processed in volatile main memory and discarded afterwards.
Reliability. Database systems must prevent data loss. Data stored in the database
system should not be distorted unintentionally: data integrity must be maintained
by the database system. Storing copies of data on other servers or storage media (a
mechanism called physical redundancy or replication) is crucial for data recovery
after a failure of a database server.
Consistency. The database system must do its best to ensure that no incorrect
or contradictory data persist in the system. This involves the automatic verifica-
tion of consistency constraints (data dependencies like primary key or foreign key
constraints) and the automatic update of distributed data copies (the replicas).
Non-redundancy. While physical redundancy is decisive for the reliability of a
database system, duplication of values inside the stored data sets (that is, logical
redundancy) should best be avoided. First of all, logical redundancy wastes space
on the storage media. Moreover, data sets with logical redundancy are prone to
different forms of anomalies that can lead to erroneous or inconsistent data. Nor-
malization is one way to transform data sets into a non-redundant format.
Multi-User Support. Modern database systems must support concurrent ac-
cesses by multiple users or applications. Those independent accesses should run
in isolation and not interfere with each other so that a user does not notice that
other users are accessing the database system at the same time. Another major
issue with multi-user support is the need for access control: data of one user
should be protected from unwanted accesses by other users. A simple strategy
for access control is to only allow users access to certain views on the data sets. A
well-defined authentication mechanism is crucial to implement access control.
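
As a small sketch of the “all or nothing” principle referred to above, consider a book
lending in the library example used throughout this book. The statements use generic
SQL; the BookLending table anticipates the schema developed in Chapter 2, and the
AvailableCopies column of the Book table is a hypothetical addition made only for this
example.

    START TRANSACTION;
    -- register the lending of book 1002 to reader 205
    INSERT INTO BookLending (BookID, ReaderID, ReturnDate)
      VALUES (1002, 205, DATE '2016-10-25');
    -- decrease the number of available copies (hypothetical column)
    UPDATE Book SET AvailableCopies = AvailableCopies - 1 WHERE BookID = 1002;
    COMMIT;  -- make both changes permanent; on a failure, ROLLBACK undoes both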

A database system should manage large amounts of heterogeneous data in an efficient, persistent,
reliable, consistent, non-redundant way for multiple users.

Database systems often do not satisfy all of these requirements, or satisfy them only
to a certain extent. When choosing a database system for a specific application, clarifying
all mandatory requirements and weighing the pros and cons of the available systems
is the first and foremost task.

[Figure: the database management system (DBMS) runs on a database server; it manages
stored data in main memory and uses the operating system and file system for disk
storage as well as network interfaces to communicate with external applications.]

Fig. 1.1. Database management system and interacting components

1.2 Database Components

The software component that is in charge of all database operations is the database
management system (DBMS). Several other systems and components interact with the
DBMS as shown in Figure 1.1. The DBMS relies on the operating system and the file sys-
tem of the database server to store the data on disk. The DBMS also relies on the oper-
ating system to be able to use the network interfaces for communication with external
applications or other database servers.
The low-level file system (or the operating system) has no knowledge of the internal
structure or meaning of the stored data; it just handles the stored data as arbi-
trary records. Hence, the purpose of the database management system is to provide
the users with a higher-level interface and more structured data storage and retrieval
operations. The DBMS operates on data in the main memory; more precisely it handles
data in a particular portion of the main memory (called page buffer) that is reserved for
the DBMS. The typical storage unit on disk is a “block” of data; often this data block is
called a memory page. The basic procedure of loading stored data from disk into main
memory consists of the following steps:
1. the DBMS receives a query or command accessing some page not contained in
the database buffer (a page fault occurs);
2. the DBMS locates a page on disk containing some of the relevant data (possibly
using indexes or “scanning” the table);
3. the DBMS copies this page into its page buffer;
4. as the page usually contains more data than needed by the query or command,
the DBMS locates the relevant values (for example, certain attributes of a tuple)
inside the page and processes them;
5. if data are modified by a command, the DBMS modifies the values inside the page
accordingly;
6. the DBMS eventually writes pages containing modified values back from the page
buffer onto disk.

Due to a different organization and size of main memory and disk storage, data man-
agement has to handle different kinds of addresses at which a page can be located. On
disk, each page has a physical disk address that consists of the disk’s name, the cylin-
der number, the track number and the number of the page inside the track. Records
inside the page can be addressed by an additional offset. Once a page is loaded into
main memory, it receives a physical main memory address. The main memory might
however be too small to hold all pages needed by an application. Virtual addresses
(also called logical addresses) can be used to make accesses from inside an applica-
tion independent from the actual organization of pages in memory or on disk. This
indirection must be handled with a look-up table that translates a virtual address of a
page into its current physical address (on disk or in main memory). Moreover, records
in pages can contain references (that is, pointers that contain an address) to records in
the same page or other pages. When using physical addresses for the pointers, pointer
swizzling is the process of converting the disk address of a pointer into a main memory
address when the referenced page is loaded into main memory. Hence, main mem-
ory management is the interface between the underlying file system and the database
management system.
While data storage management is the main task of a DBMS, data management
as a whole involves many more complex processes. The DBMS itself consists of sev-
eral subcomponents to execute these processes; the specific implementation of these
components may vary from database system to database system. Some important com-
ponents are the following:
Authentication Manager. Users have to provide an identification and a creden-
tial (like a user name and a password) when establishing a connection to the
database.
Query Parser. The query parser reads the user-supplied query string. It checks
whether the query string has a valid syntax. If so, the parser breaks the query up
into several commands that are needed internally to answer the query.
Authorization Controller. Based on the authenticated user identity and the ac-
cess privileges granted to the users by the database administrator, the authoriza-
tion controller checks whether the accessing user has sufficient privileges to exe-
cute the query.
Command Processor. All the subcommands (into which a user’s query is broken)
are executed by the command processor.
File Manager. The file manager is aware of all the resources (in particular, disk
space) that the database management system may use. With the help of the file
manager, the required data parts (the memory pages containing relevant data) are
located inside the database files stored on disk. When storing modified data back
to disk from the main memory, the file manager finds the correct disk location
for writing the data; the basic unit for memory-to-disk transfer is again a memory
page.
Buffer Manager. The buffer manager is in charge of loading the data into the main
memory and handling the data inside the main memory buffer. From the memory
pages inside the buffer it retrieves those values needed to execute the database op-
erations. The buffer manager also initiates the writing of modified memory pages
back to disk.
Transaction Manager. Multiple concurrent transactions must be executed by the
database system in parallel. The transaction manager takes care of a correct ex-
ecution of concurrent transactions. When transactions can acquire locks on data
(for exclusive access on the data), the transaction manager handles locking and
unlocking of data. The transaction manager also ensures that all transactions are
either committed (successfully completed) or rolled back (operations of a trans-
action executed so far are undone).
Scheduler. The scheduler orders read and write operations (of several concur-
rent transactions) in such a way that the operations from different transactions are
interleaved. One criterion for a good scheduler is serializability of the obtained or-
dering of operations; that is, a schedule that is equivalent (regarding values that
are read and written by each transaction) to a non-interleaved, serial execution of
the transactions. Variants of schedulers are locking schedulers (that include lock
and unlock operations on the data that are read and written) and non-locking
schedulers (that usually order operations depending on the start time of transac-
tions).
Recovery Manager. To prepare for the case of server failures, the recovery man-
ager can set up periodical backup copies of the database. It may also use transac-
tion logs to restart the database server into a consistent state after a failure.

1.3 Database Design

Database design is a phase before using a database system – even before deciding
which system to use. The design phase should clearly answer basic questions like:
Which data are relevant for the customers or the external applications? How should
these relevant data be stored in the database? What are the usual access patterns on
the stored data? For conventional database systems (with a more or less fixed data
schema) changing the schema on a running database system is complex and costly;
that is why a good database design is essential for these systems. Nevertheless, for
database systems with more flexible schemas (or no schema at all), the design phase
is important, too: identifying relationships in the data, grouping data that are often
accessed together, or choosing good values for row keys or column names are all ben-
eficial for a good performance of the database system. Hence, database design should
be done with due care and following design criteria like the following.
Completeness. All aspects of the information needed by the accessing applica-
tions should be covered.
Soundness. All information aspects and relationships between different aspects
should be modeled correctly.
Minimality. No unnecessary or logically redundant information should be stored;
in some situations however it might be beneficial to allow some form of logical
redundancy to obtain a better performance.
Readability. No complex encoding should be used to describe the information;
instead the chosen identifiers (like row keys or column names) should be self-
explanatory.
Modifiability. Changes in the structure of the stored data are likely to occur when
running a database system over a long time. While for schema-free database sys-
tems these changes have to be handled by the accessing applications, for database
systems with a fixed schema a “schema evolution” strategy has to be supported.
Modularity. The entire data set should be divided into subsets that form logically
coherent entities in order to simplify data management. A modular design is also
advantageous for easy changes of the schema.

There are several graphical languages for database design. We briefly review the
Entity-Relationship Model (ERM) and the Unified Modeling Language (UML). We in-
troduce the notation by using the example of a library: readers can borrow books from
the library. Other modeling strategies may also be used. For example, an XML document
can be pictured as a tree; graph structures for graph databases can be depicted by
nodes and edges each annotated with a set of properties. These modeling strategies
will be deferred to later sections of the book when the respective data models (XML or
graph data) are introduced.

1.3.1 Entity-Relationship Model

Entity-Relationship (ER) diagrams have a long history for conceptual modeling – that
is, grouping data into concepts and describing their semantics. In particular, ER mod-
els have been used in database design to specify which real-world concepts will be
represented in the database system, what properties of these concepts will be stored
and how different concepts relate to each other. We will introduce ER modeling with
the example of a library information system. The basic modeling elements of ER dia-
grams are:
Entities. Entities represent things or beings. They can range from physical objects
and non-physical concepts to roles of persons. They are drawn as rectangles with their
entity names written into the rectangle.

For our library example, we first of all need the two entities Reader and Book:

Reader Book

Relationships. Relationships describe associations between entities. Relationships
are diamond-shaped with links to the entities participating in the relationship. In our
example, BookLending is a relationship between readers and books:

BookLending

Reader Book

Attributes. Attributes describe properties of entities and relationships; they carry the
information that is relevant for each entity or relationship. Attributes have an oval
shape and are connected to the entity or relationship they belong to. A distinction is
made between single-valued, multi-valued and composite attributes.
Simple single-valued attributes can have a single value for the property; for ex-
ample, the title of a book:

Book Title

Multi-valued attributes can have a set of values for the property; for example, the set
of authors of a book:

Book Author

Composite attributes are attributes that consist of several subattributes; for example,
the publisher information of a book consists of the name of the publisher and the city
where the publisher’s office is located:

[Diagram: entity Book with the composite attribute Publisher, consisting of the subattributes Name and City]

Moreover, key attributes are those attributes the values of which serve as unique iden-
tifiers for the corresponding entity. Key attributes are indicated by underlining them.
For example, the identifier for each copy of a book is a unique value issued by the li-
brary (like the library signature of the book and a counter for the different copies of a
book):

Book BookID

Cardinalities. Relationships can come in different complexities. In the simplest case,
these relationships are binary (that is, a relationship between two entities). Then,
these binary relationships can be distinguished into 1:1, 1:n and n:m relationships:
– a 1:1 relationship links an instance of one entity to exactly one instance of the other
entity; an example is a marriage between two persons
– a 1:n relationship links an instance of one entity to multiple instances of the other
entity; for example, a book copy can only be lent to a single reader at a time, but
a reader can borrow multiple books at the same time
– an n:m relationship is an arbitrary relationship without any restriction on the car-
dinalities

Such cardinalities are annotated to the relationship links in the ER diagram. In our
example, we have the case of a 1:n relationship between books and readers.

[Diagram: the 1:n relationship BookLending between Reader (cardinality 1) and Book (cardinality n)]

The Enhanced Entity-Relationship Modeling (EERM) language offers some advanced
modeling elements. Most prominently, the “is-a” relationship is included in EERM to
express specializations of an entity. The “is-a” relationship is depicted by a triangle
pointing from the more specialized to the more general entity. For example, a novel
can be a specialization of a book:

Book

Novel

[ER diagram of the library example: entity Reader (key ReaderID, Name, Email) and entity
Book (key BookID, Title, Year, multi-valued Author, composite Publisher with Name and
City) are connected by the 1:n relationship BookLending with attribute ReturnDate.]

Fig. 1.2. ER diagram

Attributes of the more general entity will also be attributes of the more specialized
entity; in other words, attributes are inherited by the specialized entities.
The overall picture of our library example is shown in Figure 1.2. The entity Reader
is identified by the key attribute ReaderID (a unique value issued by the library) and
has a name and an email address as additional attributes. The entity Book is identified
by the BookID, has its title and its year of publication as single-valued attributes, its
list of authors as a multi-valued attribute and the publisher information as a composite
attribute. Books and readers are linked by a 1:n relationship which has the return date
for the book as an additional attribute.

1.3.2 Unified Modeling Language

The Unified Modeling Language (UML) is a widely adopted modeling language – in
particular in the object-oriented domain – and it is a standard of the Object Manage-
ment Group (OMG; see Section 9.1.4). As such it can not only model entities (also known
as classes) and their relationships (also known as associations) but also other object-
oriented concepts like methods, objects, activities and interactions.

Web resources:
– UML resource page: http://www.uml.org/
– specification: http://www.omg.org/spec/UML/

The UML standard consists of several diagram types that can each illustrate a different
aspect of the modeled application. These diagrams can specify the application struc-
ture (like class diagrams, object diagrams or component diagrams) or the application
behavior (like activity diagrams, use case diagrams or sequence diagrams). These di-
agrams can be used to model an application at different abstraction levels throughout
the entire design and implementation process.
From the database point of view, we will confine ourselves to the class diagram which
is used to express the general structure of the stored data and is hence closely related
to the Entity-Relationship diagram. We briefly review the most important notation el-
ements.
Classes, attributes and methods. Classes describe concepts or things and are hence
equivalent to entities of ER Modeling. A class is drawn as a rectangle that is split into
three parts. The upper part contains the class name, the middle part contains the at-
tributes (describing state), and the lower part contains method declarations (describ-
ing behavior). The Reader class might for example contain methods to borrow and
return a book (describing the behavior that a reader can have in the library) in addi-
tion to the attributes ID, name and email address (describing the state of each reader
object by the values that are stored in the attributes):

Reader
readerID
name
email
borrowBook()
returnBook()

Types and visibility. As UML is geared towards object-oriented software design, at-
tributes, parameters and return values can also be accompanied by a type declara-
tion. For example, while the readerID would be an integer, the other attributes would
be strings; the methods have the appropriate parameters of type Book (a user-defined
type) and the return value void (as we don’t expect any value to be returned by the
methods). Attributes and methods can also have a visibility denoting if they can be
accessed from other classes or only from within the same class. While + stands for
public access without any restriction, # stands for protected access only from the same
class or its subclasses, ~ stands for access from classes within the same package, and
- stands for private access only from within the same class.

Reader
- readerID: int
- name: String
- email: String
~ borrowBook(b: Book): void
~ returnBook(b: Book): void

To model multi-valued attributes, a collection type (like array or list) can be used. For
example, we can model the authors as a list of strings:

Book
bookID: int
title: String
year: int
authors: List<String>

Associations. Associations between classes are equivalent to relationships between
entities. In the simplest case of a binary association (that is, an association between
two classes), the association is drawn as a straight line between the classes. To model
composite attributes, an association to a new class for the composite attribute contain-
ing the subattributes is used:

[Diagram: class Book (bookID, title, year, authors) with an association to class Publisher (name, city)]

In more complex cases – for instance, when the association should have additional
attributes, or when an association links more than two classes – an association class
must be attached to the association. In the library example, we need an explicit asso-
ciation class to model the return date:

[Diagram: association between Reader and Book with the attached association class BookLending (returnDate)]

Advanced cases like directed associations, aggregation or composition may also be
used to express different semantics of an association. These kinds of associations have
their own notational elements.
Multiplicities. Similar to the cardinalities in ERM, we can specify complexities of an
association. These multiplicities are annotated on the endpoints of the association.
In general, arbitrary sequences or ranges of integers are allowed; a special symbol is
the asterisk * which stands for an arbitrary number. Again, we model the association
between books and readers in such a way that a book can only be lent to a single reader at a time,
but a reader can borrow multiple books at the same time:

[Diagram: association between Reader and Book with multiplicity 1 on the Reader side and * on the Book side]

Specialization. A specialization in UML is depicted by a triangular arrow tip pointing
from the subclass to the superclass. A subclass inherits from a superclass all attributes
and all method definitions; however, a subclass is free to override the inherited meth-
ods.

[Diagram: class Novel as a specialization (subclass) of class Book (bookID, title, year, authors)]

Interfaces and implementation. Interfaces prescribe attributes and methods for the
classes implementing them; methods can however only be declared in the interface but
must be defined in the implementing classes. Interfaces have their name written in
italics (and optionally have the stereotype «interface» written above their interface
name). The implementing classes are connected to it by a dashed line with a trian-
gular arrow tip. For example, the Reader class may implement a Person interface with a
name attribute:

[Diagram: interface Person (name) implemented by class Reader (readerID, email, borrowBook(), returnBook())]

The overall UML class diagram in Figure 1.3 is equivalent to the previous ER diagram
for our library example.
UML is particularly important for the design of object databases (that directly store
objects out of an object-oriented program). But due to the widespread use of UML in
software engineering, it also suggests itself as a general-purpose database design lan-
guage.

1.4 Bibliographic Notes

A wealth of text books is available on the principles of database management systems
and data modeling.

[Figure: UML class diagram of the library example with classes Reader (readerID, name,
email, borrowBook(), returnBook()), Book (bookID, title, year, authors), Publisher (name,
city) and the association class BookLending (returnDate), corresponding to the ER
diagram in Figure 1.2.]

Fig. 1.3. UML diagram

Profound text books with a focus on relational database management systems include
the books by Jukic [Juk13], Connolly and Begg [CB09] and
Garcia-Molina, Ullman and Widom [GMUW08].
ER diagrams have a long history for the design of relational databases and the
ER model has been unified by Chen in his influential article [Che76]. With a focus on
the theory of information system design, Olivé [Oli07] provides a number of UML exam-
ples; whereas Halpin and Morgan [HM10] cover conceptual modeling for relational
databases with both ER and UML diagrams. For a profound background on UML refer
to the text books by Booch, Rumbaugh and Jacobson [BRJ05] and Larman [Lar05]. Last
but not least, a general introduction to requirements engineering can be found in the
text book by van Lamsweerde [vL09].
2 Relational Database Management Systems
The relational data model is based on the concept of storing records of data as rows
inside tables. Each row represents an entity of the real world with table columns being
attributes (or properties) of interest of these entities. The relational data model has
been the predominant data model of database systems for several decades. Relational
Database Management Systems (RDBMSs) have been a commercial success since the
1980s. There are powerful systems on the market with lots of functionalities. These
systems also fulfill all the basic requirements for database systems as introduced in
Section 1.1. In the following sections, we briefly review the main concepts and termi-
nology of the relational data model.

2.1 Relational Data Model

The relational data model is based on some theoretical notions which will briefly be
introduced in the following section. Afterwards we present a way to map an ER model
to a database schema.

2.1.1 Database and Relation Schemas

A relational database consists of a set of tables. Each table has a predefined name (the
relation symbol) and a set of predefined column names (the attribute names). Each at-
tribute Ai ranges over a predefined domain dom(Ai) such that the values in the column
(of attribute Ai) can only come from this domain. A table is then filled row-wise with
values that represent the state of an entity; that is, the rows are tuples of values that
adhere to the predefined attribute domains as shown in Table 2.1. Each table hence
corresponds to the mathematical notion of a relation in the sense that the set of tuples
in a relation is a subset of the cartesian product of the attribute domains: if r is the
set of tuples in a table, then r ⊆ dom(A1) × ... × dom(An).
The definition of the attribute names Ai for the relation symbol R is called a re-
lation schema; the set of the relation schemas of all relation symbols in the database
is then called a database schema. That is, with the database schema we define which

Table 2.1. A relational table

Relation Symbol R Attribute A1 Attribute A2 Attribute A3


Tuple t1 → value value value
Tuple t2 → value value value

tables will be created in the database; and with each relation schema we define which
attributes are stored in each table.
In addition to the mere attribute definitions, each relation schema can have
intrarelational constraints, and the database schema can have interrelational con-
straints. These constraints describe which dependencies between the stored data ex-
ist; intrarelational constraints describe dependencies inside a single table, whereas
interrelational constraints describe dependencies between different tables. Database
constraints can be used to verify whether the data inserted into the table are semanti-
cally correct. Intrarelational constraints can for example be functional dependencies
– and in particular key constraints: the key attributes BookID and ReaderID in our ER
diagram will be keys in the corresponding database tables and hence serve as unique
identifiers for books and readers. Interrelational constraints can for example be in-
clusion dependencies – and in particular foreign key constraints: when using the
ID of a book in another table (for example a table for all the book lendings), we must
make sure that the ID is included in the Book table; in other words, readers can only
borrow books that are already registered in the Book table. Written more formally, we
define a table by assigning to its relation symbol Ri the set of its attributes Aij and the
set Σi of its intrarelational constraints.

Formal specification of a relation schema with intrarelational constraints: Ri = ({Ai1 . . . Aim}, Σi)

A database schema then consists of a database name D together with a set of relation
schemas Ri and a set Σ of interrelational constraints.

Formal specification of a database schema with interrelational constraints: D = ({R1 . . . Rn}, Σ)

2.1.2 Mapping ER Models to Schemas

With some simple steps, an ER diagram can be translated into a database schema:
Each entity name corresponds to a relation symbol. In our example, the entity Book
is mapped to the relation symbol Book. Entity attributes correspond to relation at-
tributes. In our example, the entity attributes BookID and Title will also be attributes
in the relation Book; hence they will be in the relation schema of relation Book.
However, the relational data model does not allow multi-valued and composite at-
tributes. In the case of multi-valued attributes, a new relation schema is created for
each multi-valued attribute, containing additional foreign keys (to connect the new rela-
tion schema to the original relation schema). In our example, the multi-valued at-
tribute Author must be translated into a new relation BookAuthors with attributes
BookID and Author and a foreign key constraint BookAuthors.BookID ⊆ Book.BookID.

Composite attributes (like Publisher) should usually be treated as single-valued at-
tributes. We have two options for doing this:
– by combining their subattributes into one value;
– or by only storing the subattributes (like City and Name) and disregarding the
composite attribute (like Publisher) altogether.

Relationships are also translated into a relation schema; for example, we have a
BookLending relation in our database with the attribute ReturnDate. In order to be
able to map the values from the entities connected by the relationship together, the re-
lation also contains the key attributes of the entities participating in the relationship.
That is why the BookLending relation also has a BookID attribute and a ReaderID at-
tribute with foreign key constraints on them. Note that this is the most general case
of mapping an arbitrary relationship; in more simple cases (like a 1:1 relationship) we
might also simply add the primary key of one entity as a foreign key to the other entity.
What we see in the end is that we can indeed easily map the conceptual model (the
ER diagram) for our library example into a relational database schema. The definitions
of the database schema and relation schemas are as follows:

Database schema:
Library = ({Book, BookAuthors, Reader, BookLending},
{BookAuthors.BookID ⊆ Book.BookID,
BookLending.BookID ⊆ Book.BookID,
BookLending.ReaderID ⊆ Reader.ReaderID})
Relation schemas:
Book = ({BookID, Title, Year}, {BookID → Title, Year})
BookAuthors = ({BookID, Author}, {})
Reader = ({ReaderID, Name, Email}, {ReaderID → Name, Email})
BookLending = ({BookID, ReaderID, ReturnDate},
{BookID, ReaderID → ReturnDate})
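
To make this mapping tangible, the following sketch creates the Library schema in an SQLite database via Python's built-in sqlite3 module. The table and column names follow the schemas above; the concrete data types (INTEGER, TEXT) and the choice of SQLite are assumptions made purely for illustration. The intrarelational key constraints become PRIMARY KEY clauses and the interrelational inclusion dependencies become REFERENCES clauses.

import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

conn.executescript("""
CREATE TABLE Book (
  BookID   INTEGER PRIMARY KEY,              -- key constraint: BookID -> Title, Year
  Title    TEXT,
  Year     INTEGER
);
CREATE TABLE BookAuthors (
  BookID   INTEGER REFERENCES Book(BookID),  -- inclusion dependency from Sigma
  Author   TEXT
);
CREATE TABLE Reader (
  ReaderID INTEGER PRIMARY KEY,              -- key constraint: ReaderID -> Name, Email
  Name     TEXT,
  Email    TEXT
);
CREATE TABLE BookLending (
  BookID     INTEGER REFERENCES Book(BookID),
  ReaderID   INTEGER REFERENCES Reader(ReaderID),
  ReturnDate TEXT,
  PRIMARY KEY (BookID, ReaderID)             -- key constraint: BookID, ReaderID -> ReturnDate
);
""")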

2.2 Normalization

Some database designs are problematic – for example, if tables contain too many
attributes, or tables combine the “wrong” attributes, or tables store data duplicates
(that is, when we have logical redundancy). Such problematic database designs entail
problems when inserting, deleting or updating values: these problems are known as
anomalies. Different types of anomalies exist:

Table 2.2. Unnormalized relational table

Library BookID Title ReaderID Name ReturnDate


1002 Introduction to DBS 205 Peter 25-10-2016
1004 Algorithms 207 Laura 31-10-2016
1006 Operating Systems 205 Peter 27-10-2016

Insertion anomaly: we need all attribute values before inserting a tuple (but
some may still be unknown)
Deletion anomaly: when deleting a tuple, information is lost that we still need
in the database
Update anomaly: When data are stored redundantly, values have to be changed
in more than one tuple (or even in more than one table)

Normalization results in a good distribution of the attributes among the tables and hence normaliza-
tion helps reduce anomalies.

The normalization steps depend on database constraints (in particular, functional de-
pendencies) in the data tables. For example, to obtain the so-called third normal form
(3NF) we have to remove all transitive functional dependencies from the tables. We
don’t go into detail here but discuss normalization only with our library example. As-
sume we would not have the information on books, readers, and book lendings in
separate tables, but all the data would be stored in one big single table together (for
sake of simplicity, we leave out the author information altogether). For two readers
and three books we would have a single table as shown in Table 2.2.
What we see is that the more books a reader has currently borrowed, the more
often his name appears in the table; and if we only want to change the information
belonging to a certain book, we would still have to read the whole row, which also con-
tains information on the reader. Due to these considerations, it is commonly agreed
that it is advantageous to store data in different tables and link them with foreign key
constraints (according to the schema developed in Section 2.1). A normalized version
of the Library table (in 3NF) hence looks as shown in Table 2.3.

Table 2.3. Normalized relational table

Book          BookID    Title
              1002      Introduction to DBS
              1004      Algorithms
              1006      Operating Systems

Reader        ReaderID  Name
              205       Peter
              207       Laura

BookLending   BookID    ReaderID   ReturnDate
              1002      205        25-10-2016
              1006      205        27-10-2016
              1004      207        31-10-2016

2.3 Referential Integrity

We have seen above that foreign key constraints are a special case of interrelational
constraints. Referential integrity means that values of the attributes that belong to the
foreign key indeed exist as values of the primary key in the referenced table – if there
is more than one option to choose a key, it suffices that the referenced attributes are a

candidate key. That is, in the referenced table, there must be some tuple to which the
foreign key belongs. In our example we stated the requirement that the BookID and
the ReaderID in the BookLending table indeed exist in the Book and Reader table,
respectively. We can optionally allow that the value of the foreign key is NULL (that is,
NULL in all the attributes that the foreign key is composed of).
Referential integrity must be ensured when inserting or updating tuples in the ref-
erencing table; but also deleting tuples from the referenced table as well as updating
the primary key (or candidate key) in the referenced table affects referential integrity.
We will discuss these cases with our library example:
Insert tuple into referencing table: Whenever we insert a tuple in a table that
has foreign key attributes, we must make sure that the values inserted into the
foreign key attributes are equal to values contained in the referenced primary key
or candidate key.
Update tuple in referencing table: The same applies when values of foreign keys
are updated.
Update referenced key: Whenever the referenced primary key (or candidate key)
is modified, all referencing foreign keys must also be updated. When the referenc-
ing foreign keys are referenced themselves by some other tuple, this referencing
tuple must also be updated; this is called a cascading update.
Delete tuple in referenced table: Deleting a tuple can violate referential integrity
whenever there are other tuples the foreign keys of which reference the primary (or
candidate) key of the deleted tuple. We could then either disallow the deletion of a
referenced tuple or impose a cascading deletion which also deletes all referencing
tuples. Alternatively, foreign keys can be set to a default value (if it is defined) or
to null (if this is allowed).
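
These reactions are usually declared together with the foreign key. The following sketch (again SQLite via Python's sqlite3 module; the ON DELETE/ON UPDATE actions are standard SQL, but their support differs between systems, and the concrete values are made up) lets a deletion of a book cascade to its lendings, while a deletion of a still-referenced reader is rejected:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Book   (BookID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Reader (ReaderID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE BookLending (
  BookID   INTEGER REFERENCES Book(BookID)
           ON DELETE CASCADE ON UPDATE CASCADE,   -- cascading deletion/update
  ReaderID INTEGER REFERENCES Reader(ReaderID)
           ON DELETE RESTRICT,                    -- disallow deleting referenced readers
  ReturnDate TEXT
);
INSERT INTO Book        VALUES (1002, 'Introduction to DBS');
INSERT INTO Reader      VALUES (205, 'Peter');
INSERT INTO BookLending VALUES (1002, 205, '2016-10-25');
""")

try:
    conn.execute("DELETE FROM Reader WHERE ReaderID = 205")  # still referenced by a lending
except sqlite3.IntegrityError as err:
    print("deletion rejected:", err)

conn.execute("DELETE FROM Book WHERE BookID = 1002")         # cascades to BookLending
print(conn.execute("SELECT COUNT(*) FROM BookLending").fetchone())  # -> (0,)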

2.4 Relational Query Languages

After having designed the relational database in a good way, how can data actually
be inserted into the database; and after that, how can information be retrieved from
the database? For data retrieval we might want to specify conditions to select relevant
tuples, combine values from different tables, or restrict tables to a subset of attributes.
The Structured Query Language (SQL) is the standardized language to communicate
with RDBMSs; it is a standardized language for data definition, data manipulation
and data querying. For example, you can create a database schema, create a table,
insert data into a table, delete data from a table, and query data with the well-known
declarative syntax. Some commonly used SQL statements are the following:
– CREATE SCHEMA ...
– CREATE TABLE ...
– INSERT INTO ... VALUES ...
– DELETE FROM ... WHERE ...
– UPDATE ... SET ... WHERE ...
– SELECT ... FROM ... WHERE ...
– SELECT ... FROM ... GROUP BY ...
– SELECT ... FROM ... ORDER BY ...
– SELECT COUNT(*) FROM ...
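
As a small sketch of how a few of these statements fit together (again SQLite accessed from Python; the values are the made-up library data from the previous sections, and dates are stored as ISO strings so that the comparison works lexicographically):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BookLending (BookID INTEGER, ReaderID INTEGER, ReturnDate TEXT)")

# INSERT INTO ... VALUES ...
lendings = [(1002, 205, '2016-10-25'), (1006, 205, '2016-10-27'), (1004, 207, '2016-10-31')]
conn.executemany("INSERT INTO BookLending VALUES (?, ?, ?)", lendings)

# SELECT ... FROM ... WHERE ...  (books to be returned before 29-10-2016)
print(conn.execute(
    "SELECT BookID FROM BookLending WHERE ReturnDate < '2016-10-29'").fetchall())
# -> [(1002,), (1006,)]

# SELECT ... FROM ... GROUP BY ...  (number of borrowed books per reader)
print(conn.execute(
    "SELECT ReaderID, COUNT(*) FROM BookLending GROUP BY ReaderID").fetchall())
# e.g. [(205, 2), (207, 1)]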

Other (more mathematical) ways to express queries on relational tables are the
logic-based relational calculus and the operator-based relational algebra. Typical rela-
tional algebra operators and examples for these are:
Projection π (restricting a table to some attributes). For example, IDs of readers cur-
rently having borrowed a book:

πReaderID (BookLending)

Selection σ (with condition on answer tuples). For example, all books to be returned
before 29-10-2016:

σ ReturnDate<29−10−2016 (BookLending)

Renaming ρ (giving a new name for an attribute). For example, rename ReturnDate
to DueDate:

ρDueDate←ReturnDate (BookLending)

Union ∪, Difference –, Intersection ∩ are operators that work only between tables
with identical relation schema.
Natural Join ⋈ (combining two tables on attributes with same name). For example,

any information on books currently lent out (the answer table has attributes BookID,
Title, Year, ReaderID, and ReturnDate):

Book ⋈ BookLending

Advanced Join Operators (like Theta Join, Equi-Join and Semijoin) are available for
more concise notation of queries.
Let us now look at the different ways to express queries in a comparative example
by asking a query over the BookLending table:

BookLending BookID ReaderID ReturnDate


1002 205 25-10-2016
1006 205 27-10-2016
1004 207 31-10-2016

A natural language version of our query would be

“Find all books that must be returned before 29-10-2016”.

In relational calculus we would express this query as a logical formula with variables
(that is, placeholder for BookID, ReaderID and ReturnDate) x, y, z:

Q = {(x, y, z) | BookLending(x, y, z) ∧ z < 29-10-2016}

In relational algebra we would use the selection operator σ:

σReturnDate<29-10-2016(BookLending)

And, last but not least in SQL:

SELECT BookID, ReaderID, ReturnDate FROM BookLending
WHERE ReturnDate < '29-10-2016'

One more thing to note about relational algebra is that an algebra query can be il-
lustrated by a tree: the inner nodes are the algebra operators and the leaf nodes are
the relation symbols. Such an operator tree nicely shows the order of evaluation. It is
helpful for optimization of queries (for example, smaller intermediate results with fewer
tuples and fewer attributes). As an example, consider the query for “names of readers
having borrowed a book that must be returned before 29-10-2016”:

π Name ( σ ReturnDate<29−10−2016 ( Reader ⋈ BookLending ) )

While the algebra tree in Figure 2.1 on the left-hand side is the one corresponding to
the query, an equivalent version of the query is shown in Figure 2.1 on the right-hand

[Fig. 2.1. An algebra tree (left) and its optimization (right): the left tree evaluates
π Name ( σ ReturnDate<29−10−2016 ( Reader ⋈ BookLending ) ), the right tree evaluates
π Name ( Reader ⋈ σ ReturnDate<29−10−2016 ( BookLending ) ).]

side. The first tree computes a join on the two tables Reader and BookLending in their
entirety, which results in a huge intermediate result. The second tree is better optimized in
the sense that from the BookLending table only the relevant rows are selected (the ones
with a matching return date) before it participates in the join operation; hence the
intermediate result has (potentially) fewer rows.

2.5 Concurrency Management

For an improved performance, several users or processes should be able to work with
the database system at the same time – that is, concurrently. Concurrency manage-
ment aims at making concurrent data accesses possible. To achieve this, relational
Database Management Systems usually have an extensive support for transactions
and provide a component that is in charge of controlling correct execution of con-
current transactions. We briefly review these two important concepts in this section.

2.5.1 Transactions

A transaction can be characterized as a sequence of read and write operations on a
database; this sequence of operations must be treated as a whole and cannot be in-
terrupted. A transaction is supposed to lead from one consistent database state to an-
other consistent database state. Concurrency management uses transactions to sup-
port the following properties of database systems:
Logical data integrity: Are the written values correct and final results of a com-
putation?
Physical data integrity & Recovery: How can correct values be restored after a
system crash?
Multi-user support: How can users concurrently operate on the same database
without interfering?

We briefly illustrate these three properties with the example of a bank transfer. Assume
we transfer an amount x of money from bank account K1 to another bank account K2 .
This transfer can be executed in two basic steps where each step consists of a read and
a write operation:
– Subtract x from K1 : Read(K1 ) and Write(K1 − x)
– Add x to K2 : Read(K2 ) and Write(K2 + x)

To achieve logical data integrity, both steps must be fully executed, otherwise one of
the following two errors occurs:
– amount x is lost (neither on K1 nor on K2 )
– amount x is extra (both on K1 and K2 )

Let us now assume that the bank transfer is defined as a transaction consisting of the
following sequence of operations:

T: Read(K1 ) Write(K1 − x) Read(K2 ) Write(K2 + x)

What happens if the database server crashes after the first write operation and the
second write operation cannot be executed anymore? In this case, the transaction T
is not finalized: it has not been committed. To achieve physical data integrity and as
part of the recovery management, the database system maintains a transaction log.
This transaction log records the state of each transaction: it stores which operation of
which transaction is currently being executed. After a system restart all operations of
uncommitted transactions have to be undone; in our example, the first write operation
has to be voided. The transaction log also has to take care of committed transactions: If
all results of a transaction have been computed (and the transaction has already been
committed) but disk writing is interrupted, after a system restart all affected compu-
tations of the transaction have to be redone and then stored to disk.
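
A minimal sketch of this “all or nothing” behavior in code, again using SQLite through Python's sqlite3 module (the Account table and its balances are made up for illustration): if the second write fails, the explicit rollback voids the partial transfer.

import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)   # manage transactions explicitly
conn.execute("CREATE TABLE Account (Nr TEXT PRIMARY KEY, Balance INTEGER)")
conn.execute("INSERT INTO Account VALUES ('K1', 100), ('K2', 50)")

def transfer(amount, source, target):
    conn.execute("BEGIN")
    try:
        conn.execute("UPDATE Account SET Balance = Balance - ? WHERE Nr = ?", (amount, source))
        conn.execute("UPDATE Account SET Balance = Balance + ? WHERE Nr = ?", (amount, target))
        conn.execute("COMMIT")        # both writes become durable together
    except Exception:
        conn.execute("ROLLBACK")      # undo the first write if the second one failed
        raise

transfer(30, 'K1', 'K2')
print(conn.execute("SELECT * FROM Account ORDER BY Nr").fetchall())
# -> [('K1', 70), ('K2', 80)]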
Let us now assume that we have multiple users of the database system working on
the same data. Assume one transaction (T1 ) transfers amount x from bank account
K1 to bank account K2 while concurrently another transaction (T2 ) transfers amount
y from bank account K1 to bank account K3 :

T1 : Read1 (K1 ) Write1 (K1 − x) Read1 (K2 ) Write1 (K2 + x)

T2 : Read2 (K1 ) Write2 (K1 − y) Read2 (K3 ) Write2 (K3 + y)

It is desirable to execute concurrent transactions in parallel by interleaving their oper-
ations: the advantage is that a user does not have to wait for the transaction of another
user to finish before the database system starts processing his transaction. However,
concurrency of transactions obviously opens space for several error cases: If opera-

tions are interleaved in a wrong order, the overall results might be incorrect. For ex-
ample, with the following interleaving, the final result is only K1 − x:

Read1 (K1 ) Read2 (K1 ) Write2 (K1 − y) Write1 (K1 − x) ...

As a second error case, with the following interleaving the final result is only K1 − y:

Read1 (K1 ) Read2 (K1 ) Write1 (K1 − x) Write2 (K1 − y) ...

We can observe that T1 and T2 are in conflict on K1 because both try to read from and
write to this account; they are however not in conflict on K2 and K3 . A correct order of
the operations would be the following where the final result is K1 − x − y:

Read1 (K1 ) Write1 (K1 − x) Read2 (K1 ) Write2 (K1 − y) ...

Most Relational Database Management Systems manage transactions according
to the following properties – the so-called ACID properties:
– Atomicity: Either execute all operations of a transaction or none of them; this is
the “all or nothing” principle.
– Consistency: After the transaction, all values in the database are correct; that is,
the database system has to find a correct order of operations and additionally has to check
database constraints (the data dependencies denoted Σ and Σ i in the database
schema in Section 2.1).
– Isolation: Concurrent transactions of different users do not interfere; again, the
database system has to find a correct order of operations to ensure this.
– Durability: All transaction results persist in the database even after a system
crash; in this case the database system uses the transaction log for recovery.

ACID properties refer to Atomicity, Consistency, Isolation, Durability.

Database systems adhering to the ACID properties are often called ACID-compliant.

2.5.2 Concurrency Control

There are two basic variants of concurrency control: optimistic and pessimistic con-
currency control. Optimistic concurrency control assumes that conflicting modifica-
tions happen only rarely and can be resolved after they have occurred. An example of
an optimistic concurrency control mechanism is snapshot-based multiversion concur-
rency control. A snapshot is a copy of the current state of the database. Each access
(or a sequence of accesses in a transaction) retrieves one such snapshot at the time

of access. Hence, different accesses (or different transactions) might work on differ-
ent copies of the data. For read-only accesses (that is, queries) this is no problem: they
can simply operate on their own snapshot. However, as soon as some data are updated,
concurrent transactions must be validated after they have finished. Some transactions
must then be undone (“rolled back”) and restarted if a conflict happened: for exam-
ple, if two transactions write the same data item, or if one transaction reads a stale, out-
dated copy of a data item that has been updated by another transaction in the meantime.
In contrast, pessimistic concurrency control avoids any conflicting modifications by
maintaining metadata (like locks or timestamps). Problems that can occur with pes-
simistic concurrency control are deadlocks (between two transactions) or starvation
(of a transaction that is deferred until other transactions finalize).
A correct ordering of operations in concurrent transactions by interleaving the op-
erations is also called a schedule. Note that although operations from different trans-
actions can be interleaved, inside each transaction the order of operations is fixed and
operations cannot be swapped. An important notion for a good schedule is serializ-
ability: An interleaved schedule is serializable if and only if it is equivalent to a serial
schedule without interleaving. Equivalence is here defined by the fact that the inter-
leaved schedule reads the same values and writes the same final results as the serial
schedule. The database component that is in charge of finding a good schedule is the
scheduler (that has already been briefly introduced in Section 1.2). The input of the
scheduler is a set of transactions to be executed concurrently; the operations of the
transactions, however, might not be fully known beforehand but come in dynamically
at runtime. The output of the scheduler is a serializable schedule of the transactions.
The two basic options that a scheduler has to find such a good schedule are:
– defer some operations until serializability can be achieved, or
– abort some transactions if serializability cannot be achieved.

Two common pessimistic scheduler algorithms are Two-Phase Locking (2PL) and
Timestamp Scheduling. We briefly review them here:
Two-Phase Locking (2PL): Lock and unlock operations on data items are added
to the schedule. There are two types of locks: the read locks and the read-write
locks. The read locks are non-exclusive in the sense that multiple transactions
can read the value of the data item concurrently. The read-write locks are exclu-
sive in the sense that only one transaction has the right to modify the value of the
locked data item; then no other transaction has the right to even read this item:
other transactions have to wait for the unlock operation on the data item. In the
two-phase locking approach inside a transaction there is a locking phase and an
unlocking phase. That is, all lock operations have to be executed before any un-
lock operation inside the transaction.
Timestamp Scheduler: With timestamp scheduling, each transaction has a
timestamp with its starting time S(T i ). Each data item A has two timestamps: The
write stamp W(A) is the timestamp of the last transaction with write operation on

A; the read stamp R(A) is the timestamp of the last transaction with read opera-
tion on A. Upon each read and write operation, the timestamp scheduler checks
these timestamps to decide whether the operation can be executed. To be more
specific, upon a read operation on item A in transaction T i , the write timestamp
has to be prior to the starting time of the transaction (that is, W(A) ≤ S(T i )); upon
a write operation on item A, the read timestamp has to be prior to the starting time
(that is, R(A) ≤ S(T i )); otherwise the whole transaction has to be aborted because
A has been accessed by another transaction started after T i in the meantime. T i
then has to be restarted at a later time. For a write operation, there is also the case
of dropping a write: if W(A) > S(T i ) a more recent transaction has already written
a value for A which must not be overwritten by T i .
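
The two timestamp checks can be summarized in a few lines. The following is a simplified sketch in plain Python (the class and method names are made up; real schedulers additionally handle commits, aborts of dependent transactions and restarts):

class Aborted(Exception):
    pass

class TimestampScheduler:
    def __init__(self):
        self.clock = 0
        self.read_stamp = {}    # R(A): start time of the last transaction that read A
        self.write_stamp = {}   # W(A): start time of the last transaction that wrote A

    def start(self):
        self.clock += 1
        return self.clock       # S(T_i): the transaction's start timestamp

    def read(self, t, item):
        if self.write_stamp.get(item, 0) > t:    # a younger transaction already wrote A
            raise Aborted(f"transaction with timestamp {t} must restart")
        self.read_stamp[item] = max(self.read_stamp.get(item, 0), t)

    def write(self, t, item):
        if self.read_stamp.get(item, 0) > t:     # a younger transaction already read A
            raise Aborted(f"transaction with timestamp {t} must restart")
        if self.write_stamp.get(item, 0) > t:    # a younger write exists: drop this write
            return
        self.write_stamp[item] = t

sched = TimestampScheduler()
t1, t2 = sched.start(), sched.start()
sched.read(t2, "K1")
sched.write(t2, "K1")
try:
    sched.write(t1, "K1")       # K1 was already read by the younger transaction: abort
except Aborted as err:
    print(err)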

2.6 Bibliographic Notes

The relational data model has its root in the seminal article by Codd [Cod70]. Full
coverage of all aspects of the relational data model and relational database manage-
ment systems is given in the text books by Jukic [Juk13], Connolly and Begg [CB09] and
Garcia-Molina, Ullman and Widom [GMUW08]. For a more theoretical background on
the relational model see the book by Date [Dat07]. Weikum and Vossen [WV01] give
particular focus to transactions and concurrency control.
The world’s leading vendors of commercial RDBMSs are Oracle (who now also
own MySQL), IBM (with the DB2 database), Microsoft (with their Access and SQLServer
products), Teradata, and Sybase (which is an SAP owned company). Several open-
source RDBMSs are also available and widely adopted: for example, PostgreSQL and
MariaDB. They are backed by highly experienced developers and an active community
support. MySQL (now being owned by Oracle) is available as an open-source “Com-
munity Edition”.

Web resources:
– ISO/IEC Standards catalogue (Information Technology – Data management and interchange):
http://www.iso.org/iso/home/store/catalogue_tc/catalogue_tc_browse.htm?commid=45342
– MariaDB: https://mariadb.org
– documentation page: https://mariadb.com/kb/en/mariadb/
– MySQL: http://www.mysql.com
– documentation page: http://dev.mysql.com/doc/
– PostgreSQL: http://www.postgresql.org
– documentation page: http://www.postgresql.org/docs/manuals/

The Structured Query Language (SQL) is the standardized language to communicate
with RDBMSs [Int11]. However, not all RDBMSs fully adhere to the SQL standard,
nor do they implement all features of the standard – which complicates the portability
of SQL code between the different RDBMS products. Several graphical
front-ends for designing and querying RDBMSs and management consoles are offered
by RDBMS vendors to facilitate the interaction with these database systems.
Part II: NOSQL And Non-Relational Databases
3 New Requirements, “Not only SQL” and the Cloud
Relational Database Management Systems (RDBMSs) have their strengths in the fact
that they are established and mature systems with a declarative standard language
(SQL) that offers complex operations (like joins). A further plus is that RDBMSs can
actively check database constraints and the relation schemas ensure that inserted val-
ues adhere to the predefined attribute domains. RDBMSs also have an extensive trans-
action support and an established authorization concept for multiple users; further-
more, they have active components (like triggers) that allow an RDBMS to automati-
cally react to data modifications. Nevertheless, the relational data model and RDBMSs
might not be ideal candidates for data storage in some applications. In this chapter,
we discuss weaknesses of the relational data model and conventional RDBMSs, and
give an overview of the new requirements and issues that modern data management
faces.

3.1 Weaknesses of the Relational Data Model

Undoubtedly, the relational model has its merits. It has a sound logical foundation and
allows database constraints to be expressed. For some data, however, the underlying table
structure is not a natural fit. The following weaknesses can become an issue when
using the relational data model.

3.1.1 Inadequate Representation of Data

Translating arbitrary data into the relational table format is not an easy task. Data
usually come in more complex formats like objects in object-oriented programs, XML
documents, or even unstructured text documents. “Squeezing” data into rows and
columns requires some careful thought and engineering effort and still might lead to
unnecessary storage overhead. We will see examples of this phenomenon when talk-
ing about ways to store objects or XML documents in an RDBMS (see Section 5.3). Ef-
ficiency of data retrieval might also suffer from the relational table representation as
data might have to be recombined from several tables. Another aspect is that due to
normalization (that avoids logical redundancy and hence anomalies of data manip-
ulations; see Section 2.2), data belonging to a single entity might end up in several
different tables. Retrieving data for an entity might hence imply the computation of
several table joins based on a set of foreign keys.

[Fig. 3.1. Example for semantic overloading: the entities Reader and Book connected by the
relationship BookLending.]

3.1.2 Semantic Overloading

Semantic overloading of the relational data model is another issue: as we have seen
in Section 2.1, entities as well as relationships between entities are both mapped to
relations in a database schema. In our library example the ER diagram contained the
entities Book and Reader as well as the relationship BookLending (see Figure 3.1).
In our final database schema, both the entities and the relationships were con-
tained as relation schemas. That is, we have no means to express the fact that rela-
tionships between entities are of a different nature than the entities themselves. As
we will see later, in graph databases, we have the concepts of nodes and edges; thus
it is possible to differentiate between entities themselves and the links between them.

3.1.3 Weak Support for Recursion

In the relational data model it is difficult to execute recursive queries that need to join
(and then to execute a union) over the same table several times. The purpose of a
recursive query is to compute the transitive closure of some table attributes.

The transitive closure of a relation R with attributes (A1 ,A2 ) defined on the same domain is the relation
R augmented with all tuples successfully deduced by transitivity; that is if (a,b) and (b,c) are tuples
of R, tuple (a,c) also is.

Although the relational algebra can be extended by a unary transitive closure opera-
tion to express recursion in a query, it is still costly to compute in a real RDBMS. The
following example should illustrate this difficulty.
Assume we have a relation schema Person with attributes ID, Name, and Child.
The attributes have the domains: dom(ID)=dom(Child): Integer, dom(Name): String.
A recursive query on this schema would be to get the IDs of all the descendants of a
person (not just the children). Our example table is filled as shown in Table 3.1. Exe-
cuting our recursive query for Alice (that is, getting the IDs of all descendants of Alice)
requires extensive use of the union and the join operator. What we expect as the result

Table 3.1. Base table for recursive query

Person ID Name Child


1 Alice 3
2 Bob 3
3 Charlene 5
4 David 5
5 Emily NULL

Table 3.2. Result table for recursive query

Descendant ancestorID descendantID


1 3
3 5
1 5
5 NULL
1 NULL

is the result shown in Table 3.2 (where the NULL entries are the termination condition;
we stop the recursion whenever there are no more children).
Expressing this query in SQL is quite complex. We have to recursively define the
Descendant table as follows: the Descendant table is repeatedly joined with the Person
table whenever a descendantID in the Descendant table occurs as a person’s ID in the
Person table; the intermediate results are “unioned” to the descendant table.

WITH RECURSIVE Descendant(ancestorID, descendantID) AS
(
  SELECT ID, Child FROM Person WHERE Name = 'Alice'
  UNION ALL
  SELECT p.ID, p.Child FROM Descendant d, Person p
  WHERE p.ID = d.descendantID
)
SELECT ancestorID, descendantID FROM Descendant

In XML or graph databases, recursion comes much more naturally and it can be im-
plemented with better performance.

3.1.4 Homogeneity

A further problem of the relational data model is that a table is a homogeneous data
structure. More precisely, the relational data model requires both horizontal and verti-

cal homogeneity. Horizontal homogeneity is the fact that all tuples have to range over
the same set of attributes; that is, all rows have a fixed uniform format defined by the
columns. Vertical homogeneity is the fact that values in one column have to come from
the same predefined attribute domain; mixing values from different domains in one
column is not allowed. What is more, only atomic values are allowed in table cells; set-
based or even more complex values are not supported in the conventional relational
data model.

3.2 Weaknesses of RDBMSs

Apart from the conceptual problems with the relational table format, the estab-
lished RDBMSs have some weaknesses with respect to features desired by advanced
applications:
Infrequent updates: It is usually conjectured that RDBMSs are designed for fre-
quent queries but very infrequent updates; hence in case of a frequently changing
data set, relational databases might be more inefficient than desirable.
SQL dialects: Although the standardization of SQL has its advantages, not all
RDBMSs fully support the standard and some deliberately use their own syntax;
all of this complicates portability of SQL code between different RDBMSs.
Restricted data types: RDBMSs can be considered quite inflexible regarding the
support of modern data types or formats. For example, although the XML data
type exists in SQL, other data formats (like for example JSON documents) can usu-
ally not be handled natively by an RDBMS.
Declarative access: Another argument against SQL is that queries are usually
declaratively expressed based on the (expected) content in the database tables;
that is, with SQL we retrieve data by specifying a set of desired attributes or com-
paring attribute values with another value. However, other data formats might re-
quire a non-declarative access. For data formats like (tree-like) XML documents or
graphs, a navigational access is usually better suited; that is, we navigate in the
data structure by – for instance – going from parents to children and checking
some conditions there. RDBMSs usually support one of the XML query languages
(like XPath or XQuery) to query XML documents, but operations for modification
of XML documents (like the XQuery Update Facility) are usually only poorly sup-
ported. Native XML databases with a focus on XML data management or docu-
ment databases for JSON documents might do a better job here.
Short-lived transactions: Transactions are well-supported by RDBMSs. The typ-
ical RDBMS transaction is however very short-lived. The implemented transac-
tion management mechanisms are usually not suited for long-term transactions.
However, the support for long-term transactions is in particular important for data
stream processing where queries are periodically executed on continuous streams
of data; for example, to obtain continuous analytics from sensors to automatically

monitor some processes. Stream Data Management System specialize in this field
of analytical services.
Lower throughput: When handling massive amounts of data, an RDBMS might
not achieve a sufficiently high data throughput.
Along the same lines, it has also been conjectured that RDBMSs
are poor at distributed management of data in a network of database servers.
Rigid schema: A further hindrance of RDBMSs is that, due to the fixed database
schema, schema evolution is poorly supported: Changes in the relation schemas
(like adding a new column to a table) are difficult and costly as they require a reor-
ganization of the data stored in the database system.
Non-versioned data: Versioning of data (keeping multiple versions of a record
with different timestamps) is another feature that is usually disregarded by con-
ventional RDBMSs.

3.3 New Data Management Challenges

Over the last decades, data processing has experienced some major shifts. Some of the
new challenges for database management are the following:
Complexity: Data are organized in complex structures – for example, a quite
novel complex structure is a social network: a network of people interconnected
with each other in arbitrary ways in a graph structure. Similarly, the Semantic
Web connects and describes resources in a graph structure. In other application
domains, like geographic information systems (GIS) or computer aided design
(CAD), data are represented as complex structures with lots of interrelated sub-
structures.
Schema independence: Schema independence means that documents can be
processed without a given schema definition. In other words, data can be struc-
tured in an arbitrary way without complying with any prescribed format. More-
over, data of the same type can be represented differently even in the same data
set; for example, address data could be either stored as one single string of char-
acters or could be split into different strings for street, house number, city and
zip code. This flexibility allows that data from different sources can be combined
in the same data set even if they are not formatted in the same way. Schema-
independent databases are also called schemaless.
Sparseness: If there is an (optional) schema for a data set, it may happen that a
lot of data items are not available: many values might just be unknown or non-
existent. If these missing data were represented as null values, they would un-
necessarily occupy storage space. It is hence better to simply ignore such values
in the data model so that they do not show up in the data set.
Self-descriptiveness: As a consequence on schema independence and sparse-
ness, metadata are attached to individual values in order to enable data process-

ing; these metadata describe the use and semantics of the values (like the name
part of a property or key-value pair or element and attribute names in XML). In
this way, data can be interpreted and processed directly and there is no need to
acquire metadata information (like schema data) from other sources. This is in
particular important when data come from unknown or unreachable sources.
Variability: Data are constantly changing: the database system has to handle
frequent data modifications in the form of insertions, updates and deletions. In
addition, the structure of the processed data might be frequently altered; the
database system hence has to either support schema evolution and adapt to the
changed structure, or the database system must be able to handle schemaless
data.
Scalability: Data are distributed on a huge number of interconnected servers:
for example, in online shopping systems, or when using cloud storage, data are
stored in huge server farms. The database user need not know on which server ex-
actly the data he wants to retrieve is stored; hence he must be able to interact
with the database system without being aware of the data distribution. Moreover,
the database system has to support flexible horizontal scaling: servers can leave
the network and new servers can enter the network on demand. The database
system has to dynamically adapt the distribution of data to the altered network
structure.
Volume: Large data volumes (“big data”) have to be processed: Database sys-
tems must provide high read and write throughput in order to provide the desired
availability and responsiveness of the system or even allow for real-time analytics;
data management has to be massively parallel in order to achieve this.

Non-relational databases have been developed as a reaction to these challenges and
new requirements. However, the employed data models and their underlying technol-
ogy already have quite a history of research and development: non-relational database
systems have been around for decades and database research has developed differ-
ent theories and systems in the non-relational area ever since. However, only recently
these systems have seen an upswing at the face of the changed data management re-
quirements. In addition, new database products have emerged (sometimes driven by
large companies) as flexible solutions for these requirements. Historically the term
“NoSQL” applied to database systems that offered query languages and access meth-
ods other than the standard SQL. More recently, “NOSQL” has come up to mean “Not
only SQL”; NOSQL is basically an umbrella term that covers database systems that
– have data models other than the conventional relational tables,
– support programmatic access to the database system or query languages other
than SQL (but might support SQL as well),
– can cope with schema evolution or can handle schemaless data,
– support data distribution in a server network by design,

– do not strictly adhere to the ACID properties (in particular in terms of consistency)
of conventional RDBMSs.

A NOSQL database system can have a non-relational data model, support non-standard query lan-
guages, support programmatic access, support schema evolution or schema independence, support
data distribution, or have a weak consistency concept.

At the bottom line, for some advanced applications, a specialized database system
with a focus on the requirements of these applications might perform better than a
conventional RDBMS. NOSQL symbolically stands for the revival, adoption and im-
provement of data models, query languages and network protocols for novel database
applications.

Web resources:
– Roberto V. Zicari’s Portal for Big Data, New Data Management Technologies and Data Science:
http://www.odbms.org/
– ODBMS Industry Watch: http://www.odbms.org/blog/
– Stefan Edlich’s Guide to the Non-Relational Universe: http://nosql-databases.org/
– Rahul Chaudhary’s NoSQL Weekly newsletter: http://www.nosqlweekly.com/

Cloud databases can in particular cater for new requirements in terms of scalability
and volume. Databases-as-a-service are offered by providers of cloud computing in-
frastructure as a remote storage platform. Pricing models are usually based on both
time of usage and volume of the data. A service level agreement (SLA) then regulates
minimum availability or security conditions of the hosted service. Cloud providers
make use of multitenancy to enable elasticity of their services: several groups of clients
access the same database platform; they might potentially be separated by having the
database run in different virtual machines. While cloud databases claim to reduce ad-
ministration overhead, using a database-as-a-service still requires management skills
(for example for modeling the data and setting up backups). Security and confiden-
tiality as well as legislation (for example in terms of data protection and privacy) are
major issues, in particular if data are hosted in countries with a different legislation
than the country of their origin.

3.4 Bibliographic Notes

Weaknesses of conventional RDBMSs have been discussed for some time; see for ex-
ample the book by Connolly and Begg [CB09] for some issues that have been reiter-
ated in this chapter. Undoubtedly, the data management landscape has changed in

the last decades towards large scale and distributed storage (including cloud comput-
ing). On this background, Agrawal, Das and Abbadi [ADA12] describe challenges that
arise with data management in cloud applications. With a special focus on database
applications, the book by Fowler and Sadalage [FS12] describes the NOSQL paradigm
and its basic principles and it gives hands-on experience by surveying some NOSQL
databases. The book by Redmond and Wilson [RW12] provides a great starting point to
get an overview of currently available NOSQL database systems of all kinds: starting
from PostgreSQL as an RDBMS, it moves on to a technical description of six NOSQL
databases.
4 Graph Databases
A graph is a structure that not only can represent data but also connections between
them; in particular, links between data items are explicitly represented in graphs. In
this chapter, we first of all establish the necessary theoretic background on graphs
and then shift to data structures that can be used to store graphs. Advanced graph
structures allow for a representation of data of an even higher complexity.

4.1 Graphs and Graph Structures

Graphs structure data into a set of data objects (which may have certain properties
and are equivalent to entities in the relational terminology) and a set of links between
these objects (which characterize the relationship of the objects). The data objects are
the nodes (also called vertices) of the graph and the links are the edges (also called
arcs). Recall that one criticism towards the relational data model was its semantic
overloading where entities and relationships are both represented as relational tables
(see Section 3.1). We now see that for the graph data model the distinction between
data objects (entities) and their relationships comes very naturally. Common applica-
tions for graph databases are hence domains where the links between data objects are
important and need to have their own semantics.

Graphs can store information in the nodes as well as on the edges.

For example, in a social network the nodes of a graph can store information on peo-
ple in the social network and edges can store their acquaintance or express sympathy
or antipathy (see Figure 4.1). Another typical application are geographic information
systems where nodes store information on geographical locations like cities and edges
store for example the distances between the locations (see Figure 4.2).

[Fig. 4.1. A social network as a graph: nodes for Bob (Age: 27), Alice (Age: 34) and Clare (Age: 29);
edges labelled “knows” connect Bob with Alice and Bob with Clare, and an edge labelled “dislikes”
connects Alice and Clare.]



[Fig. 4.2. Geographical data as a graph: nodes for Hannover (Population: 522K), Braunschweig
(Population: 248K) and Hildesheim (Population: 102K); edges labelled with distances: Hannover–
Braunschweig 65km, Hannover–Hildesheim 35km, Braunschweig–Hildesheim 45km.]

4.1.1 A Glimpse on Graph Theory

From a mathematical point of view, a graph G consists of a set of vertices (denoted
V) and a set of edges (denoted E); that is, G = (V, E). The edge set E is a set of pairs
of nodes; that is, each edge is represented by those two vertices that are connected
by the edge. An edge can be directed (with an arrow tip) or undirected. In case of a
directed edge the pair of nodes is ordered where the first node is the source node of the
edge and the last node is the target node; in the undirected case, order of the pair of
nodes does not matter because the edge can be traversed in both directions. A graph
is called undirected if it only has undirected edges, whereas a graph is called directed
(or a digraph) if it only has directed edges. Moreover, a graph is called multigraph if
it has a pair of nodes that is connected by more than one edge. Let us have a look at
some examples.
Simple undirected graph: For a simple undirected graph (without multiedges and
directed edges), each edge is represented by a set of vertices. That is, the edge set E
is a set of two-element subsets of V (like {v1 , v2 }). Note that in the set notation order
is irrelevant: that is, {v1 , v2 } and {v2 , v1 } are the same edge. For the cardinality of
the vertex set |V | = n there are at most n!/(2!·(n−2)!) = n(n−1)/2 edges (without self-loops
like {v1 , v1 }). Such a graph is called complete if E is the set of all n(n−1)/2 two-element
subsets of V. An example of a complete simple undirected graph with three vertices is the
following:

[Figure: complete undirected graph with vertices v1 , v2 , v3 and edges e1 , e2 , e3 .]

– the vertex set is V = {v1 , v2 , v3 }
– the edge set is E = {e1 , e2 , e3 } = {{v1 , v2 }, {v2 , v3 }, {v1 , v3 }}
– the cardinality of the vertex set is |V | = 3, and hence we have 3!/(2!·(3−2)!) =
(3·2·1)/(2·1·1) = 3 edges in a complete graph

Simple directed graph: For a simple directed graph, each edge is represented by an
ordered tuple of vertices like (v1 , v2 ), where v1 is the source node and v2 is the target
node. In this case, the order of the tuple denotes the direction of the edge: that is,
(v1 , v2 ) and (v2 , v1 ) are distinct edges. More generally, the edge set E is a subset of
the cartesian product V × V. For the cardinality of the vertex set |V | = n there are at
most n!/(n−2)! = n(n−1) edges (without self-loops like (v1 , v1 )). The graph is called complete
if E is the set of all n!/(n−2)! tuples in V × V. The graph is called oriented if it has no symmetric
pair of directed edges (no 2-cycles) – that is, there are no two edges where the source
node of the first edge is the target node of the second edge and the other way round.
An example of an oriented simple directed graph is the following:

[Figure: oriented directed graph with vertices v1 , v2 , v3 and directed edges e1 , e2 , e3 .]

– the vertex set is V = {v1 , v2 , v3 }
– the edge set is E = {e1 , e2 , e3 } = {(v1 , v2 ), (v2 , v3 ), (v1 , v3 )}
– this graph is not complete because 3!/(3−2)! = 6 but there are only three edges
– this graph is oriented because there are no backward edges

Undirected Multigraph: For an undirected multigraph, the edge set E is a multiset of
two-element subsets of V. That is, duplicate elements are allowed and hence multiple
edges between two vertices are possible; this set of vertices is then called multiedge.
Sometimes, cardinalities are written on a multiedge instead of drawing the edge mul-
tiple times. An example of an undirected multigraph is the following:

[Figure: undirected multigraph with vertices v1 , v2 , v3 and edges e1 , e2 , e3 , e4 (e3 and e4 connect v1 and v3 ).]

– the vertex set is V = {v1 , v2 , v3 }


– the edge set is E = {e1 , e2 , e3 , e4 } = {{v1 , v2 }, {v2 , v3 }, {v1 , v3 }, {v1 , v3 }} where
e3 and e4 are duplicates

Directed Multigraph: For a directed multigraph, the edge set E is a multiset of tuples
of the cartesian product V × V. Hence, multiple edges (with the same direction) be-
tween two vertices are possible. An example of a directed multigraph is the following:

[Figure: directed multigraph with vertices v1 , v2 , v3 and edges e1 , e2 , e3 , e4 (e3 and e4 both lead from v1 to v3 ).]

– the vertex set is V = {v1 , v2 , v3 }


– the edge set is E = {e1 , e2 , e3 , e4 } = {(v1 , v2 ), (v2 , v3 ), (v1 , v3 ), (v1 , v3 )} where e3
and e4 are duplicates.

Weighted graphs: With weighted graphs, each edge has a cost called weight and
written as w i . This cost can for example denote the distances between cities as seen
in Figure 4.2. An example of a weighted directed multigraph is the following:

[Figure: weighted directed multigraph with vertices v1 , v2 , v3 and weighted edges e1 : w1 , e2 : w2 , e3 : w3 , e4 : w4 .]

– the vertex set is V = {v1 , v2 , v3 }


– the set of edges with weights is E = {e1 : w1 , e2 : w2 , e3 : w3 , e4 : w4 } =
{(v1 , v2 ) : w1 , (v2 , v3 ) : w2 , (v1 , v3 ) : w3 , (v1 , v3 ) : w4 }

As we will see later, the prevailing data structure used in graph databases is a directed
multigraph (where information is stored in the nodes as well as on the edges).
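
The social network of Figure 4.1, for instance, can be held in such a structure. The following is a minimal Python sketch with plain dictionaries (the node identifiers, the edge directions and the helper function are assumptions for illustration; actual graph databases provide their own storage formats and query interfaces):

# nodes carry properties; each edge stores source, target and a label
nodes = {
    "alice": {"Name": "Alice", "Age": 34},
    "bob":   {"Name": "Bob",   "Age": 27},
    "clare": {"Name": "Clare", "Age": 29},
}
edges = [
    {"source": "bob",   "target": "alice", "label": "knows"},
    {"source": "bob",   "target": "clare", "label": "knows"},
    {"source": "alice", "target": "clare", "label": "dislikes"},
]

def outgoing(node_id, label=None):
    """All edges leaving node_id, optionally restricted to a label."""
    return [e for e in edges
            if e["source"] == node_id and (label is None or e["label"] == label)]

for e in outgoing("bob", "knows"):
    print(nodes[e["target"]]["Name"])     # Alice, Clare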

4.1.2 Graph Traversal and Graph Problems

A connection between two nodes consisting of intermediary nodes and the edges be-
tween them is called a path; a path where starting node and end node are the same
is called a cycle. The basic form of finding certain nodes in a graph is navigation in a
graph by following some path; in graph terminology, this is called the traversal of a
graph. More precisely, a traversal usually starts from a starting node (or a set of starting
nodes) and then proceeds along some of the edges towards adjacent (that is, neigh-
boring) nodes. A traversal may be full (that is, visiting each and every node in a graph)
or it may be partial. For a partial traversal, navigation may be restricted by a certain
depth of paths to be followed, or by only accessing nodes with certain properties.
The two simplest kinds of a full traversal are depth-first search and breadth-first
search, which both depend on a certain order of the adjacent nodes: From the starting

node, depth-first search follows the edge to the first adjacent node (in the given order)
and then proceeds towards the first adjacent node of the same; breadth-first search
visits all adjacent nodes of the starting node (in the given order) and then does the
same with the adjacent nodes of its first adjacent node, then of its second adjacent
node and so on. When traversing a graph, restrictions may be applied that have come
to be known as graph problems. For example, for full traversals, graph problems are:

Eulerian Path: a path that visits each edge exactly once; that is, starting node and
end node need not be identical, but each edge has to be traversed. In order to cover
all edges, it might be necessary to visit some nodes more than once. [Figure: example
graph with an Eulerian path along the edges e1 , . . . , e8 from a start node to an end node.]

Eulerian Cycle: a cycle that visits each edge exactly once; that is, starting node
and end node have to be the same and each edge has to be traversed.

Hamiltonian Path: a path that visits each vertex exactly once. In this case it might
happen that not all edges have to be traversed. [Figure: example graph with a
Hamiltonian path from a start node to an end node.]

Hamiltonian Cycle: a cycle that visits each vertex exactly once.

Spanning Tree: a subset of the edge set E that forms a tree (starting from a root
node) and visits each node of the graph. [Figure: example graph with a spanning
tree rooted at a node labelled root.]

Problems for partial traversals also exist; a common example is finding the shortest
path between two nodes. For weighted graphs, several variants of graph problems ex-
ist that aim at optimizing the weights; for example, finding a path between two nodes
with minimal cost (where the cost is the sum of all weights on the path) or finding a
spanning tree with minimal cost.
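
To illustrate the traversals described at the beginning of this section, the following plain Python sketch performs a breadth-first traversal over a graph stored as adjacency lists (the example graph and node names are made up; a depth-first traversal would use a stack instead of the queue):

from collections import deque

# adjacency lists: for each node the ordered list of its neighbors
graph = {
    "Alice": ["Bob", "Clare"],
    "Bob":   ["Alice", "Clare"],
    "Clare": ["Alice"],
}

def bfs(graph, start):
    """Visit every node reachable from start in breadth-first order."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:       # follow edges to adjacent nodes
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

print(bfs(graph, "Alice"))    # ['Alice', 'Bob', 'Clare']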

4.2 Graph Data Structures

When computing with graphs, the information on nodes and their connecting edges
has to be represented and stored in an appropriate data structure. Two important
terms in this field are adjacency and incidence. In directed graphs, one must fur-

ther differentiate incidence with respect to incoming and outgoing edges: one node
is the source node (for which the edge is an outgoing edge) while the other one is
the target node of the edge (for which the edge is an incoming edge). Other terms for
source node and target node are tail node and head node, respectively.

Two nodes are adjacent if they are neighbors (that is, there is an edge between them). An edge is
incident to a node, if it is connected to the node; if the edge is directed, it is positively incident to its
source node and negatively incident to its target node. A node is incident to an edge, if it is connected
to the edge.

In general, for the representation of edges, we have the following choices where each
representation has its own advantages and disadvantages.

4.2.1 Edge List

The graph can be stored according to the mathematical definition as a set of nodes V
and a set (or list) of edges E. Depending on the type of graph the edge set will be im-
plemented as a set of sets (for undirected graphs), a set of tuples (for directed graphs),
as well as a normal set (for simple graphs) or a multiset (for multigraphs). As soon as
a node is created it can simply be added to the node set V; when an edge is created, it
is simply added to the edge set E. The same applies to deletions by simply deleting the
removed node or edge from the respective set. The edge set performs well when one
wants to retrieve all edges of the graph at once and it incurs no storage overhead (that
is, it stores only existence of edges but not the absence of edges). However, the edge
list representation is inefficient in most use cases: for example, looking for one par-
ticular edge or getting all neighbors of a given node requires iterating over the entire
edge list, which quickly becomes infeasible for larger edge sets.
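
A minimal Python sketch of the edge list representation for the directed multigraph example from Section 4.1.1: inserting nodes and edges is trivial, but every neighbor lookup has to scan the complete edge list.

# directed multigraph as node set plus edge list (tuples of source and target)
V = {"v1", "v2", "v3"}
E = [("v1", "v2"), ("v2", "v3"), ("v1", "v3"), ("v1", "v3")]

def neighbors(node):
    # every lookup walks the entire edge list: O(|E|)
    return [target for (source, target) in E if source == node]

E.append(("v3", "v1"))            # inserting an edge is cheap
print(neighbors("v1"))            # ['v2', 'v3', 'v3']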

4.2.2 Adjacency Matrix

For cardinality |V | = n the adjacency matrix is an n×n matrix where rows and columns
denote all the vertices v1 . . . v n . For edgeless graphs the matrix contains only 0s; when
edges exist in the graph some matrix cells are filled with 1s as follows.
For a simple undirected graph, if an edge {v i , v j } between v i and v j exists, write 1
into the matrix cell where v i is the row and v j is the column and another 1 where v i is
the column and v j is the row; that is the matrix is symmetric. For the example graph,
the adjacency matrix is:

      v1   v2   v3
v1     0    1    1
v2     1    0    1
v3     1    1    0

Note that in practice a symmetric matrix would be stored as a triangular matrix by
skipping all entries above the diagonal. For an undirected graph with loops, for a loop
{v i , v i } write 2 into the appropriate matrix cell on the diagonal (because all edges are
counted symmetrically).
For a simple directed graph, if an edge (v i , v j ) from v i to v j exists, write 1 into the
matrix cell where v i is the column and v j is the row; that is, the matrix is asymmetric.
This is also the case for the example graph:

      v1   v2   v3
v1     0    0    0
v2     1    0    0
v3     1    1    0

For a directed graph with loops, for a loop (v i , v i ) write 1 into the appropriate matrix
cell on the diagonal (because all edges are counted asymmetrically).
For an undirected multigraph, if k edges {v i , v j } between v i and v j exist, write k
into the matrix cell where v i is the row and v j is the column and another k where v i is
the column and v j is the row; again the matrix is symmetric because edges are undi-
rected.

      v1   v2   v3
v1     0    1    2
v2     1    0    1
v3     2    1    0

For an undirected multigraph with loops, for k loops {v i , v i } write 2 · k into the appro-
priate matrix cell on the diagonal (because all edges are counted symmetrically).
For a directed multigraph, for k edges (v i , v j ) from v i to v j , write k into the matrix
where v i is the column and v j is the row (again we have an asymmetric matrix).

      v1   v2   v3
v1     0    0    0
v2     1    0    0
v3     2    1    0

For a directed multigraph with loops, for k loops (v i , v i ) write k into the appropriate
matrix cell on the diagonal (because all edges are counted asymmetrically).
The advantages of the adjacency matrix lie in a quick lookup of the existence of
a single edge (given its source node and its target node) by simply looking up the bit
value in the corresponding matrix cell as well as a quick insertion of a new edge (be-
tween two existing nodes) by just incrementing the bit in the matrix cell. The disad-
vantages are that adding a new node requires insertion of a new row and a new col-
umn and finding all neighbors results in a scan of the entire column. The main issue
with this matrix representation is that it has a high storage overhead: due to its size
|V | × |V | it soon becomes very large for a larger number of nodes. This is due to the fact
that it stores lots of unnecessary information – at least when there are lots of 0s (that
is, the matrix and hence the graph are sparse). Thus, the matrix representation is usu-
ally only applicable for dense graphs; that is, for graphs where most of the edges are
present.
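
A Python sketch of the adjacency matrix for the directed multigraph example, following the convention above that the source node selects the column and the target node the row; checking a single edge is one cell access, but the matrix always occupies |V| × |V| cells regardless of how many edges exist.

V = ["v1", "v2", "v3"]
index = {v: i for i, v in enumerate(V)}

# |V| x |V| matrix, initially all zeros (an edgeless graph)
matrix = [[0] * len(V) for _ in V]

def add_edge(source, target):
    # convention from above: source selects the column, target the row
    matrix[index[target]][index[source]] += 1

for s, t in [("v1", "v2"), ("v2", "v3"), ("v1", "v3"), ("v1", "v3")]:
    add_edge(s, t)

print(matrix[index["v3"]][index["v1"]])   # 2: two parallel edges from v1 to v3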

4.2.3 Incidence Matrix

For cardinality |V | = n and |E| = m the incidence matrix is an n × m matrix where rows
denote vertices v1 . . . v n ; and columns denote edges e1 . . . e m . For edgeless graphs the
matrix contains no columns; as soon as edges are present, columns are created and
filled as follows.
For a simple undirected graph if edge e i is connected (that is, incident) to v j , write
1 for column e i and row v j . For example:

      e1   e2   e3
v1     1    0    1
v2     1    1    0
v3     0    1    1

For an undirected graph with loops, for a loop e_i = {v_j, v_j} write 2 for column e_i and row v_j.
For a simple directed graph, the source node has to be distinguished from the target node of an edge. We use -1 for the source node (v_i) and 1 for the target node (v_j); that is, for an edge e_k = (v_i, v_j), write -1 into the matrix for column e_k and row v_i, and 1 for column e_k and row v_j. For example:

       e1   e2   e3
  v1   -1    0   -1
  v2    1   -1    0
  v3    0    1    1

For a directed graph with loops, for a loop e_k = (v_i, v_i) write 2 for column e_k and row v_i.
For an undirected multigraph the same procedure applies as for an undirected simple graph because each edge (even if being part of a multiedge) has its own edge identifier. For example:

       e1   e2   e3   e4
  v1    1    0    1    1
  v2    1    1    0    0
  v3    0    1    1    1

For a directed multigraph the same procedure applies as for a directed simple graph because each edge (even if being part of a multiedge) has its own edge identifier. For example:

       e1   e2   e3   e4
  v1   -1    0   -1   -1
  v2    1   -1    0    0
  v3    0    1    1    1

One advantage of the incidence matrix is that only existing edges are stored; that is,
there is no column with only 0 entries. The disadvantages of the incidence matrix are
similar to the adjacency matrix. Insertions of new vertices and edges are costly be-
cause they require addition of a row or a column, respectively. Determining all neigh-
bors for one vertex requires scanning the entire row and for each non-zero entry (1 in
the undirected and -1 in the directed case) looking up where the other end point (that
is, the other 1 entry) is located in the same column.
Note that checking the existence of an edge (for a given source node and target
node) is more involved for the incidence matrix than for the adjacency matrix: we have
to check whether there is a column with appropriate non-zero entries for the source
node’s row and the target node’s row. In particular, the n×m matrix is storage intensive
for a larger number of edges and vertices. And what makes incidence matrices even
worse is the fact that if there are many vertices, there are lots of 0s in the columns
because usually there will be only two non-zero entries in each column – as a side
note it might however be mentioned that hyperedges (see Section 4.5) can be stored
by having more than two non-zero entries in the edge’s column.
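The following minimal Java sketch (illustrative only; the class and method names are hypothetical) makes the last point tangible: checking the existence of an edge between two given vertices has to scan all columns of the incidence matrix for one that is incident to both end points.

public class IncidenceMatrixGraph {
    // matrix[v][e] = 1 if vertex v is incident to edge e (undirected case)
    private final int[][] matrix;

    public IncidenceMatrixGraph(int numberOfVertices, int numberOfEdges) {
        this.matrix = new int[numberOfVertices][numberOfEdges];
    }

    // register the undirected edge with index e between vertices v and w
    public void setEdge(int e, int v, int w) {
        matrix[v][e] = 1;
        matrix[w][e] = 1;
    }

    // O(|E|): scan all columns for one with non-zero entries in both rows
    public boolean hasEdge(int v, int w) {
        for (int e = 0; e < matrix[v].length; e++) {
            if (matrix[v][e] == 1 && matrix[w][e] == 1) return true;
        }
        return false;
    }
}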

4.2.4 Adjacency List

With an adjacency list, one stores the vertex set V and for each vertex one stores a
linked list of neighboring (that is, adjacent) vertices.
For a simple undirected graph, each edge is stored in the adjacency list of both its
vertices; that is, one node on an edge is contained in the adjacency list of the other
node. For example:

  v1 : [ v2, v3 ]
  v2 : [ v1, v3 ]
  v3 : [ v1, v2 ]

For a simple directed graph, the adjacency list stores only outgoing edges; that is, the
adjacency list of one node is only filled when this node is a source node of some edge
and it then contains only the target nodes of such edges. For example:

  v1 : [ v2, v3 ]
  v2 : [ v3 ]
  v3 : [ ]

For multigraphs, in both the directed as well as the undirected case, nodes can occur
multiple times in an adjacency list – depending on the amount of multiedges between
the two nodes. For example:

For the undirected multigraph:

  v1 : [ v2, v3, v3 ]
  v2 : [ v1, v3 ]
  v3 : [ v1, v1, v2 ]

For the directed multigraph:

  v1 : [ v2, v3, v3 ]
  v2 : [ v3 ]
  v3 : [ ]

With the adjacency list, we have the advantage of a flexible data structure: new vertices can be quickly inserted (by just adding them to V) and, similarly, edges can be quickly inserted (by appending a node to the appropriate adjacency list). Furthermore, we have a quick lookup of all neighboring vertices of one vertex (by just returning its adjacency list) and no storage overhead occurs (because only relevant information is stored). However, these advantages come at the cost of some disadvantages – in particular, with respect to the runtime behavior. For example, checking the existence of a single edge (for a given source node and target node) requires a full scan of the adjacency list of the source node.
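As an illustration, a minimal Java sketch of an adjacency list for a directed (multi)graph could look as follows (the class is hypothetical and not part of any graph database API):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdjacencyListGraph {
    // each vertex is mapped to the list of its (outgoing) neighbors
    private final Map<String, List<String>> adjacency = new HashMap<>();

    // quick insertion of a vertex: just add it to the vertex set
    public void addVertex(String v) {
        adjacency.putIfAbsent(v, new ArrayList<>());
    }

    // quick insertion of an edge: append the target to the source's list
    public void addEdge(String source, String target) {
        adjacency.get(source).add(target);
    }

    // quick lookup of all neighbors: just return the adjacency list
    public List<String> neighbors(String v) {
        return adjacency.get(v);
    }

    // checking the existence of a single edge requires a full scan
    // of the source node's adjacency list
    public boolean hasEdge(String source, String target) {
        return adjacency.get(source).contains(target);
    }
}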

4.2.5 Incidence List

With the incidence list, each edge has its representation as an individual data object
(instead of being only implicit as is the case with edges in the adjacency list); this al-
lows for storing additional information on the edge in the edge object (in our example,
the additional information is the name of the edge). More precisely, with an incidence
list, you store the vertex set V and for each vertex you store a linked list of incident
edges. When the edge is directed, the edge object contains information on its source
node and its target node (like a tuple (v i , v j )) and the edge can only be traversed from
source to target; in the undirected case no difference is made between source and tar-
get nodes: they are stored as a set (like the set {v i , v j }) and the edge can be traversed
in both directions.
For a simple undirected graph, each undirected edge is contained in the incidence
lists of its two connected nodes.

  v1 : [ e1 = {v1, v2},  e3 = {v1, v3} ]
  v2 : [ e1 = {v1, v2},  e2 = {v2, v3} ]
  v3 : [ e3 = {v1, v3},  e2 = {v2, v3} ]

For a simple directed graph, it suffices to store only outgoing edges in the incidence list
as long as only a forward traversal of the edges is needed. However, for some queries
on graphs sometimes a backward traversal of an edge is necessary. An example for
a backward traversal in a social network with directed edges would be: “Find all my
friends who like me”. In this case, the “like”-edge has to be traversed from the target
(“me”) towards the sources (the “friends”). In this case, it is advantageous to store
all incident edges in a node’s incidence list to allow for both forward traversal of the
outgoing edges and backward traversal of the incoming edges. Indeed, we could even
have two incidence lists: one with outgoing edges for the forward traversal, and an-
other one for incoming edges for the backward traversal.

In our example we only show the incidence lists for a forward traversal:

  v1 : [ e1 = (v1, v2),  e3 = (v1, v3) ]
  v2 : [ e2 = (v2, v3) ]
  v3 : [ ]

For multigraphs, in both the directed as well as the undirected case, each edge has its
own identity (even when being part of a multiedge as is the case for e3 and e4 ) and
hence is stored separately. For example, in the undirected case, each edge contains
again pointers to the incident nodes:

  v1 : [ e1 = {v1, v2},  e3 = {v1, v3},  e4 = {v1, v3} ]
  v2 : [ e1 = {v1, v2},  e2 = {v2, v3} ]
  v3 : [ e3 = {v1, v3},  e4 = {v1, v3},  e2 = {v2, v3} ]

In the directed case, each edge would have a pointer to its source node as well as one
pointer to its target node:

  v1 : [ e1 = (v1, v2),  e3 = (v1, v3),  e4 = (v1, v3) ]
  v2 : [ e2 = (v2, v3) ]
  v3 : [ ]

In practical implementations, the incidence list would be stored inside a node object
as a collection of pointers to incident edge objects – potentially – in the directed case –
one collection for incoming and one collection for outgoing edges to allow for both for-
ward and backward traversal. This means that – in contrast to the above illustrations
– no duplicate edge objects (the ones with identical names) occur. Each edge object
would in turn contain a collection of pointers to its incident nodes – in the undirected
case; or alternatively – in the directed case – one pointer to the source node of the
edge (“positive incidence”) and one to the target node (“negative incidence”).
Similar to the adjacency list, the incidence list is a flexible data structure for stor-
ing graphs. In addition, because incidence lists treat edges as individual data objects,
information can be stored in an edge. This is important in most practical cases, for
example with the property graph model discussed in the following section.

4.3 The Property Graph Model

When it comes to using graphs for data storage, the basic storage structure usually is
a directed multigraph. However, some extensions to the mathematical graph defini-
tion are due: we must be able to store information inside the nodes as well as along
the edges. More formally, the graph structures described in Section 4.1.1 consider only
one kind of edge and one kind of node and are hence called single-relational graphs.
However for most practical purposes, we must be able to distinguish different kinds
of nodes (for example, Person nodes or City nodes) as well as different kinds of edges.
This is achieved by so-called multi-relational graphs where types are introduced for
nodes and for edges. A type first of all has a name like “Person” for a node type or
“likes” for an edge type. Each node is labeled with the name of a node type and each
edge is labeled with the name of an edge type; that is, the node label denotes the cor-
responding node type, and the edge label denotes the edge type. Apart from the name,
a type defines attributes for the corresponding nodes and edges. More precisely, an
attribute definition must contain a name for the attribute (like “Age” for the “Person”
node type) and it must specify a domain of values over which the attribute may range
(for example, the “Age” attribute should have values from the domain of the integers).
For a node or an edge inside the graph, their attributes are usually written as the at-
tribute name (like “Age”) and an attribute value taken from the attribute domain (like
32) separated by a colon; such name:value-pairs describe properties of a node or edge
and this is where the term property graph stems from. A general restriction for edges
in property graphs is that edge labels between any two nodes in a graph should be
unique: that is, between two nodes there must not be two edges of the same type. An
edge type can furthermore restrict the node types allowed for its source nodes or target
nodes. For example, the source node of a “likes”-edge might only be a node of type
“Person”. Last but not least, each node (and each edge, respectively) in a graph usu-
ally has a system-defined unique identifier that facilitates the internal handling of the
nodes and edges.

A property graph is a labeled and attributed directed multigraph with identifiers



To formalize property graphs a bit more, we can say that a property graph is a labeled and attributed directed multigraph with identifiers; that is, a property graph G can be defined as G = (V, E, L_V, L_E, ID) with the following components:
– V is the set of nodes.
– E is the set of edges.
– L_V is the set of node labels (that is, the type names for nodes) such that to each label l ∈ L_V we can assign a set of attribute definitions. That is, for a given label l the node type definition is t = (l, A) where A is the set of attribute definitions for the node type. Each attribute definition a ∈ A specifies an attribute name and a domain: a = (attributename, domain).
– L_E is the set of edge labels (that is, type names for edges) such that to each label l′ ∈ L_E we can assign a set of attribute definitions as well as restrictions for the types of the source and target nodes; that is, for a given edge label l′ the type definition is t′ = (l′, A, sourcetypes, targettypes) where the attribute definitions are analogous to the ones for the node types. Additionally, the sets sourcetypes and targettypes each contain the allowed node labels from L_V: sourcetypes ⊆ L_V as well as targettypes ⊆ L_V.
– The set ID is the set of identifiers that can uniquely be assigned to nodes and edges.

A specific node v ∈ V then has the following elements: v = (id, l, P) where id ∈ ID and l ∈ L_V is a node label. Moreover, P is a set of properties; that is, each property p ∈ P is a name:value-pair such that the name of the property corresponds to an attribute name that has been defined for the node type and the value is a valid value taken from the domain of the attribute. Note that properties are optional: not all attributes of the type's attribute definitions must be present in P. Similarly, an edge e ∈ E has the following elements: e = (id, l′, P, source, target) where l′ is an edge label from L_E and the properties in P correspond to attribute definitions of this edge type. Additionally, the source node and the target node have to comply with the restrictions given in sourcetypes and targettypes in the edge type definition; more precisely, the node source has to be the ID of a node with a label l′′ such that l′′ ∈ sourcetypes and analogously for the target node.
Let us now look at a small example of how to define a social network as a property graph (see Figure 4.3). We only have one type for nodes (label "Person") but two types for edges (label "knows" and label "dislikes"). The "Person" type as well as the "knows" type each have some extra attribute definitions. Our graph is G = (V, E, L_V, L_E, ID) where
– the node set is V = {v_1, v_2, v_3}
– the edge set is E = {e_1, e_2, e_3}
– the node labels are L_V = {Person}
– the node type definitions are t_Person = (Person, A_Person) where the attribute definitions are A_Person = {(Name, String), (Age, Integer)}.
– the edge labels are L_E = {knows, dislikes}

Fig. 4.3. A property graph for a social network: three "Person" nodes Alice (Id 1, Age 34), Bob (Id 2, Age 27) and Charlene (Id 3, Age 29), connected by a "knows" edge from Alice to Bob (Id 4, since: 31-21-2009), a "knows" edge from Bob to Charlene (Id 5, since: 10-04-2011), and a "dislikes" edge from Alice to Charlene (Id 6)

– the edge type definitions are t_knows = (knows, A_knows, {Person}, {Person}) and t_dislikes = (dislikes, ∅, {Person}, {Person}) where the attribute definitions are A_knows = {(since, Date)}; the dislikes type does not have attributes so its attribute definitions are the empty set ∅. Note that the source node and target node restrictions require that both edge types can only be used between nodes of type Person.
– the ID set is ID = {1, 2, 3, 4, 5, 6}

The specific nodes and edges of our graph can now be noted down as follows:
– v_1 = (1, Person, {Name : Alice, Age : 34})
– v_2 = (2, Person, {Name : Bob, Age : 27})
– v_3 = (3, Person, {Name : Charlene, Age : 29})
– e_1 = (4, knows, {since : 31-21-2009}, 1, 2)
– e_2 = (5, knows, {since : 10-04-2011}, 2, 3)
– e_3 = (6, dislikes, ∅, 1, 3)
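The following minimal Java sketch (purely illustrative; it is not the API of any particular graph database and it omits the type checking of attribute definitions, source/target restrictions and label uniqueness) shows how the nodes and edges listed above could be represented as objects carrying an identifier, a label and a property map:

import java.util.HashMap;
import java.util.Map;

public class PropertyGraphSketch {

    static class Node {
        final int id;
        final String label;
        final Map<String, Object> properties = new HashMap<>();
        Node(int id, String label) { this.id = id; this.label = label; }
    }

    static class Edge {
        final int id;
        final String label;
        final int source, target; // IDs of the source and target nodes
        final Map<String, Object> properties = new HashMap<>();
        Edge(int id, String label, int source, int target) {
            this.id = id; this.label = label;
            this.source = source; this.target = target;
        }
    }

    public static void main(String[] args) {
        Node alice = new Node(1, "Person");
        alice.properties.put("Name", "Alice");
        alice.properties.put("Age", 34);
        Node bob = new Node(2, "Person");
        bob.properties.put("Name", "Bob");
        bob.properties.put("Age", 27);
        Edge knows = new Edge(4, "knows", 1, 2);
        knows.properties.put("since", "31-21-2009");
        System.out.println(alice.properties + " -" + knows.label + "-> " + bob.properties);
    }
}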

Some further remarks on edge labels are due. In the example, we have already seen
that the restrictions for source and target nodes explicitly require that edges of a cer-
tain type are allowed only between nodes of certain types: like the “knows” and “dis-
likes” edges may only occur between two nodes of type “Person”. Additionally, prop-
erty graphs usually have the implicit requirement of uniqueness of edge labels de-
scribed previously: there may never be two edges with the same label between two
nodes. In other words, a multiedge between two nodes is only allowed whenever the
individual edges in the multiedge have different labels. Continuing our example, the
property graph in Figure 4.4 violates the uniqueness property, because the edge with
ID 7 has label “dislikes” which has already been used by the edge with ID 6.

Fig. 4.4. Violation of uniqueness of edge labels: the social network graph of Figure 4.3 with an additional edge (Id 7, label "dislikes") between Alice and Charlene; this edge is not allowed because the edge with Id 6 already carries the label "dislikes" between these two nodes

Table 4.1. Node table and attribute table for a node type

Nodes              NodeID   NodeLabel
                   1        Person
                   2        Person
                   3        Person

PersonAttributes   NodeID   Name       Age
                   1        Alice      34
                   2        Bob        27
                   3        Charlene   29

4.4 Storing Property Graphs in Relational Tables

It might be necessary to use a legacy relational database management system to store property graphs in tables. When using relational tables as a storage for property
graphs, the most flexible mapping is to have one base table for the nodes, one base
table for the edges and auxiliary tables to store the attribute information of nodes and
edges. Hence, first of all we have a node table to store the node IDs and the node labels
(see Table 4.1). To store the node attributes, for each node type, we have an auxiliary
table with one column for each attribute. The domain of the attribute definition of
the node is then also the domain of the corresponding column in the table (see also
Table 4.1).
Similarly, we store the edges in an edge table with its ID and label as well as the IDs
of the source and target nodes (see Table 4.2). For each edge type, we have an auxiliary
attribute table as was the case for the node attributes. In our example, the edge type
“dislikes” does not have attribute definitions, and hence we confine ourselves to the
attribute table for the edge type “knows” (see Table 4.3).

Table 4.2. Edge table

Edges   EdgeID   EdgeLabel   Source   Target
        4        knows       1        2
        5        knows       2        3
        6        dislikes    1        3

Table 4.3. Attribute table for an edge type

KnowsAttributes   EdgeID   Since
                  4        31-21-2009
                  5        10-04-2011

With many different node and edge types and the corresponding attribute tables for each type, we may end up with very many tables. For each type, several different attribute tables have to be accessed to recombine the set of all properties belonging to the type. To reduce the number of different attribute tables, we might store all attributes in a single attribute table irrespective of the node or edge type that an attribute belongs to. That is, we map the attributes to one auxiliary table that stores the attribute names and values as properties in a single key column and a single value column. With an ID column (that contains the node ID or edge ID), the attributes are linked to the corresponding node or edge (see Table 4.4). While this is now a single table, it will however become very large. Moreover, we lose control over the domains of the attributes: the properties table might now contain arbitrary strings and we cannot automatically check (with the RDBMS) whether the value of a certain entry has the correct domain – for example, whether the age value is indeed an integer.

Table 4.4. General attribute table

Attributes   ID   PropertyKey   PropertyValue
             1    Name          Alice
             1    Age           34
             2    Name          Bob
             2    Age           27
             3    Name          Charlene
             3    Age           29
             4    since         31-21-2009
             5    since         10-04-2011

Fig. 4.5. Two undirected hyperedges: a "French class" hyperedge and a "Spanish class" hyperedge, each grouping a subset of the person nodes Alice, Bob, Charlene, David and Emily

4.5 Advanced Graph Models

The basic property graph model allows for flexible data storage and data access with
multiple relations (that is, multiple edge and node types). Nevertheless, for some ap-
plications more advanced graph models might be necessary. Hence, extensions to the
basic graph structures described in Section 4.1.1 can be considered. More formally, in
addition to using multiple relations in a graph (which are represented by using types
in the property graph in Section 4.3), we now want to generalize the concepts of nodes
and (binary) edges. A generalization of a node can for example group a set of nodes
into a new more abstract node. A generalization of an edge is for example able to ex-
press n-ary relations; that is, relations between more than two nodes. Two advanced
graph models supporting these generalizations are hypergraphs and nested graphs.

Hypergraph: A hypergraph is a graph with hyperedges. A hyperedge is the generalization of a normal binary edge as follows. In an undirected graph G = (V, E), a binary edge e ∈ E corresponds to a two-element subset of the node set V (see Section 4.1.1); that is, e = {v_i, v_j} where {v_i, v_j} ⊆ V. An undirected hyperedge generalizes this definition by allowing not just two-element subsets but subsets with an arbitrary number of elements: an undirected hyperedge is written as e = {v_i, . . . , v_j} where {v_i, . . . , v_j} ⊆ V is a subset of the node set of arbitrary cardinality. See Figure 4.5 for an example graph where the node set consists of several person nodes. Undirected hyperedges can now group these people into subsets and define their relationship; in the example graph we have relationships by common language classes: one undirected hyperedge for participants of a French class and one undirected hyperedge for participants of a Spanish class.
Recall that for a directed graph, an edge was a 2-tuple (that is, an ordered pair of nodes). Similar to the binary case, a directed hyperedge is a 2-tuple – but it is a tuple of two sets of nodes (instead of two single nodes). More precisely, a directed hyperedge is a tuple e = ({v_i, . . . , v_j}, {v_k, . . . , v_l}) where both {v_i, . . . , v_j} ⊆ V and {v_k, . . . , v_l} ⊆ V are of arbitrary cardinality. The first set {v_i, . . . , v_j} is the set of source nodes of the hyperedge; this set is hence called the source set (or alternatively tail set). The second set {v_k, . . . , v_l} is the set of target nodes of the hyperedge; this set is hence called the target set (or alternatively head set). With a directed hyperedge we can for example express that one group of people pays a visit to another group of people (but not the other way round). This can also be shown graphically as in Figure 4.6: each node of the source set is connected to each node of the target set via a common edge with an arrow tip.

Fig. 4.6. A directed hyperedge: a "visit" edge connecting a source set of person nodes to a target set of person nodes
There is also a second form of hyperedge that uses a tuple representation: the oriented hyperedge. An oriented hyperedge is a tuple of nodes of arbitrary length; that is, an n-tuple where n is an arbitrary natural number. Most notably, the order inside the tuple is important (whereas order does not matter in the set-based representations – that is, inside an undirected hyperedge or inside the source set and the target set of a directed hyperedge). Due to this, in an oriented hyperedge one node may also occur twice (or even more often) at different positions of the tuple. More formally, an oriented hyperedge is a tuple e = (v_i, . . . , v_j) ∈ V × . . . × V of arbitrary length. With an oriented hyperedge we can express that some nodes have a relationship based on a certain order. For example, we can say "a person buys a book in a book store" (which is different from the nonsensical statement that "a book store buys a person in a book"). An oriented hyperedge can be depicted by drawing arrows from the edge to the nodes and numbering the arrows according to the order in the tuple; see Figure 4.7.
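A minimal Java sketch (illustrative; it uses Java records, and the concrete node names as well as the assignment of persons to the source and target sets are arbitrary) can make the difference between the three kinds of hyperedges explicit:

import java.util.List;
import java.util.Set;

public class HyperedgeExamples {
    // an undirected hyperedge is a set of nodes of arbitrary cardinality
    record UndirectedHyperedge(Set<String> nodes) { }

    // a directed hyperedge is a pair of a source set and a target set
    record DirectedHyperedge(Set<String> sourceSet, Set<String> targetSet) { }

    // an oriented hyperedge is a tuple of nodes: the order matters and
    // a node may occur more than once
    record OrientedHyperedge(List<String> nodes) { }

    public static void main(String[] args) {
        DirectedHyperedge visit = new DirectedHyperedge(
                Set.of("Alice", "Bob"), Set.of("Charlene", "David", "Emily"));
        OrientedHyperedge buys = new OrientedHyperedge(
                List.of("Alice", "Book", "Bookstore"));
        System.out.println(visit + " " + buys);
    }
}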
After seeing these simple forms of hyperedges, things can get even more complicated: a generalized hyperedge not only groups together a set of nodes: it can also group together a set of nodes and edges. More formally, a generalized undirected hyperedge is a set e = {a_i, . . . , a_j} of arbitrary cardinality where the elements a_i and a_j are either nodes or edges; that is, a_i, a_j ∈ V ∪ E. Moreover, the notion of a generalized undirected hyperedge also covers the case that a hyperedge is a set of nodes, simple edges and even other hyperedges. Hence, we get a very flexible data model, where edges can contain other edges up to an arbitrary depth. This definition can be extended to the directed and the ordered case, too.

Fig. 4.7. An oriented hyperedge: a "buys" edge whose arrows, numbered 1, 2 and 3, point to Alice, a book and a book store, respectively

Fig. 4.8. A hypergraph with generalized hyperedge "Citizens": "Person" nodes (Alice, Bob) and "City" nodes (Hannover, Hildesheim) are connected by "knows", "citizen" and "connectedTo" edges; the generalized hyperedge "Citizens" (Id 7) groups the two edges labeled "citizen"

Hyperedges are helpful when data
stored in the contained edges and attached nodes have to be combined or compared ef-
ficiently. As an example, consider the graph in Figure 4.8: The generalized hyperedge
“Citizens” combines edges with “Label: citizen”. With this generalized hyperedge, the
citizens of a city can be identified (and iterated over) a lot faster than checking all inci-
dent edges for the matching label “citizen” and discarding edges with any other label
like for example “connectedTo”. Note that in this example, we grouped together only
edges of the same type. This need not necessarily be the case: In general, generalized
hyperedges can contain arbitrary edges.
According to the kind of hyperedges chosen, advanced notions of adjacency of
two nodes or of a path in the hypergraph can be defined. Lastly, what remains to
be added is that the incidence matrix as well as the incidence list representation for
graphs can both be extended to represent hyperedges, because edges have an explicit
identity (which is not the case for the adjacency matrix and the adjacency list).

Nested graph: A nested graph consists of hypernodes that can be used to represent
complex objects [PL94]. Hypernodes generalize simple nodes because hypernodes can
encapsulate entire subgraphs that can themselves be nested. Due to the recursive def-
inition of a hypernode (that may itself contain graphs with hypernodes), the depth of
nesting is theoretically unrestricted. A nested graph may even contain cycles; that is,
one hypernode may be contained in a second node (even at a deeper level of nesting)
and vice versa. More formally,
– the set P of primitive nodes contains keys and values of key-value pairs (like name
and Alice);
– the set I is a set of identifiers;
– a nested graph is defined by choosing an identifier G ∈ I and then assigning to
it a set of hypernodes and binary edges G = (N, E) such that each n ∈ N is either
a primitive node or another identifier G′ (that is, N ⊆ P ∪ I); in other words, a
complex hypernode G′ ∈ N is itself a nested graph.
– the edge set E ⊆ N × N consists of binary edges between hypernodes – however restricted in such a way that an edge e ∈ E can map a key to a value or to an identifier.

We illustrate a nested graph by an example family hierarchy where parents' hypernodes contain the hypernodes of their children.
A textual description of the family hierarchy would hence be such that we have the
identifiers I = {1, 2, 3, 4, 5, 6}, and the set of primitive nodes P = {label, name, child,
root,Person, Family, Alice, Bob, Charlene, David, Emily, null}. For each identifier we
define the tuple (N, E) as follows:
– 1=({2,6,label,root,Family}, {label→ Family,root→2,root→ 6})
– 2=({3,label,name,child,Alice,Person}, {label→ Person,name→ Alice,child→ 3})
– 3=({4,label,name,child,Charlene,Person},
{label→ Person,name→ Charlene,child→ 4})
– 4=({label,name,child,Emily,Person,null},
{label→ Person,name→ Emily,child→ null})
– 5=({4,label,name,child,David,Person},
{label→ Person,name→ David,child→ 4})
– 6=({5,label,name,child,Bob,Person}, {label→ Person,name→ Bob,child→ 5})
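To make the recursive structure tangible, here is a minimal Java sketch (illustrative only, not the formal model itself) in which a hypernode is a nested graph whose node set may contain primitive values as well as identifiers of other hypernodes:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NestedGraphSketch {
    // all hypernodes of the database, indexed by their identifiers
    static final Map<Integer, NestedGraphSketch> hypernodes = new HashMap<>();

    // N: primitive nodes (keys/values) and identifiers of other hypernodes
    final Set<Object> nodes = new HashSet<>();
    // E: edges mapping a key to a value or to an identifier
    final List<Object[]> edges = new ArrayList<>();

    void addEdge(String key, Object valueOrIdentifier) {
        nodes.add(key);
        nodes.add(valueOrIdentifier);
        edges.add(new Object[] { key, valueOrIdentifier });
    }

    public static void main(String[] args) {
        NestedGraphSketch emily = new NestedGraphSketch();    // identifier 4
        hypernodes.put(4, emily);
        emily.addEdge("label", "Person");
        emily.addEdge("name", "Emily");
        emily.addEdge("child", null);

        NestedGraphSketch charlene = new NestedGraphSketch(); // identifier 3
        hypernodes.put(3, charlene);
        charlene.addEdge("label", "Person");
        charlene.addEdge("name", "Charlene");
        charlene.addEdge("child", 4); // nesting: refers to hypernode 4 (Emily)
    }
}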

Figure 4.9 contains an illustration of the family hierarchy with the hypernode labeled
“family” encapsulating the entire hierarchy.

Fig. 4.9. A nested graph: the hypernode with Id 1 (Label: Family) encapsulates the "Person" hypernodes for Alice, Charlene, Emily, David and Bob (Ids 2 to 6)

4.6 Implementations and Systems

Graph databases in practice have to handle large and highly interconnected graphs
efficiently. They rely on a good storage management (either a “native” graph format
with an appropriate buffer manager or a high-level graph format mapped to a low-level
external database system); building indexes over paths in the graph and over values
in node attributes is crucial for an efficient query handling. In this section, we give a
brief overview of existing open source graph databases and graph processing tools.
The multi-model databases OrientDB and ArangoDB both offer a graph API; they are
surveyed in Section 15.4.

4.6.1 Apache TinkerPop

The TinkerPop graph processing stack offers a set of open source graph management modules. Among others, these modules cover a basic graph data structure, a graph query processor as well as several algorithms to traverse graphs.

Web resources:
– Apache Tinkerpop: http://tinkerpop.incubator.apache.org/
– documentation page: http://tinkerpop.incubator.apache.org/docs/
– GitHub repository: https://github.com/apache/incubator-tinkerpop

While it does not offer persistence features itself, TinkerPop supports connections to a variety of graph databases and can be used as their programming or query interface. Initially, the TinkerPop stack was strictly divided into several modules, each with a different purpose:

– Blueprints: The basic property graph API.
– Pipes: A data flow framework that allows for lazy graph traversing.
– Gremlin: A graph traversal language for graph querying, analysis and manipula-
tion.
– Frames: An object-to-graph mapper that turns vertices and edges into objects and
relations (and vice versa).
– Furnace: A package with implementations of property graph algorithms.
– Rexster: A RESTful graph server.

In the current version (TinkerPop3) these modules do not have such strict boundaries any longer and they have been combined into the general Gremlin frame-
work. Due to this, Blueprints is now referred to as the Gremlin Structure API, while
Pipes, Frames and Furnace can now be accessed by the interfaces GraphTraversal,
Traversal, GraphComputer and VertexProgram; Rexster is now termed GremlinServer.
TinkerPop3 puts more emphasis on the distinction between the structure of the graph
versus the processing of the graph. The TinkerPop graph model is a property graph
that supports labels, identifiers, and properties for both vertices and edges; in par-
ticular, properties for vertices can also be nested: a vertex property might contain a
collection of subproperties.
The main interfaces used in the TinkerPop3 structure API are:
– Graph that consists of a set of edges and vertices and provides access to database
functionality.
– Element that encapsulates a collection of properties as well as a label (that is used
to denote the type of this element).
– Vertex that is a subinterface of the Element interface and extends it by maintain-
ing sets of incoming and outgoing edges.
– Edge that is a subinterface of the Element interface and extends it by maintaining
its adjacent vertices: one incoming Vertex and one outgoing vertex.
– Property that represents a basic key-value pair where the key part is a String ob-
ject and the value part is an arbitrary Java object. Generics can be used to restrict
the allowed values to certain classes V such that Property<V> only allows objects
of the class V as values.
– VertexProperty that extends the basic Property class to be nested in the sense
that in addition to one basic key-value pair it also maintains a collection of key-
value pairs.

With these components of the TinkerPop3 structure API a graph can be created. Grem-
lin offers APIs for Java and Groovy and – in addition – a command line interface called
Gremlin Console that interprets Groovy syntax.
A new graph is created by calling the method open of a class implementing the Graph interface. For example, the predefined class TinkerGraph is an in-memory implementation of the Graph interface. New vertices and edges can be added to the

graph with the addVertex method (on a graph object) and addEdge method (on a
vertex object) as well as new properties can be added to an element by calling the
property method. The addVertex method accepts several properties as parameters.
The addEdge method accepts an edge label, the target node of the edge as well as sev-
eral properties as parameters. Note that properties consist of two parameters: a String
object as the key (for example, "name") at one position and the value object (for ex-
ample, "Alice") at the next position in the parameter list. A vertex label and a vertex
identifier as well as an edge identifier can also be explicitly added as parameters.
In Gremlin-Java, a small TinkerPop graph with two vertices and an edge with label
"knows" is created as follows.

Graph g = TinkerGraph.open();
Vertex alice = g.addVertex("name", "Alice");
alice.property("age", 34);
Vertex bob = g.addVertex();
alice.addEdge("knows", bob, "knows_since", 2010);

The TinkerPop3 Graph process API creates traversals in the graph. The starting point of
the traversal can be set by calling either GraphTraversalSource.V() (to start with a
set of vertices) or GraphTraversalSource.E() (to start with a set of edges). The return
type is GraphTraversal. With this GraphTraversal object, several steps in the graph
can be executed. These steps are implemented by a concatenation of method calls
by using the dot operator; steps are hence combined by so-called method chaining.
Each of the calls returns a GraphTraversal object.
For example, to find the names of Alice’s acquaintances in the graph object g from
above, we first of all start a traversal, search for a node that has the property name set
to Alice, traverse along the outgoing edges labeled "knows" and retrieve the values
of their "name" properties.

g.traversal().V().has("name","Alice").out("knows").values("name");

Indexing is supported by TinkerPop such that an index can be created for properties. For example, for the name property of a vertex, an index can be created by calling the createIndex method (offered by the TinkerGraph implementation) in Gremlin-Java.

((TinkerGraph) g).createIndex("name", Vertex.class);

The same graph can be created in Gremlin-Groovy; these commands can be directly entered into the command line interface Gremlin Console.

g = TinkerGraph.open()
alice = g.addVertex('name', 'Alice');
alice.property('age', 34);
bob = g.addVertex();
alice.addEdge('knows', bob, 'knows_since', 2010);

Querying and indexing in Gremlin-Groovy proceed similarly to the Gremlin-Java case.

g.createIndex('name', Vertex.class)
gt = g.traversal(standard());
gt.V().has('name', 'Alice').out('knows').values('name');

With the GraphReader and GraphWriter interfaces, TinkerPop provides input and output features for graphs based on textual or binary representations. Supported output formats are GraphML (an XML format describing the graph as nested <edge> and <node> elements which, however, only supports primitive values for properties), GraphSON (a JSON-based format) and the binary Gremlin-Kryo (Gryo) format.

4.6.2 Neo4J

Neo4J is a graph database that uses the property graph data structure; in Neo4J termi-
nology, edges are called “relationships”. Neo4J supports transactions with assurance
of the ACID properties. Indexing for properties of nodes and edges is based on the
Apache Lucene indexing technology. Property names have to be strings. Property val-
ues can be strings, booleans, or several numerical types; Neo4J also supports arrays
of these primitive types for set-based values.

Web resources:
– Neo4J: http://neo4j.com/
– documentation page: http://neo4j.com/docs/stable/
– GitHub repository: https://github.com/neo4j/neo4j

The graph manipulation API of Neo4J offers a GraphDatabaseService interface (for managing nodes and relationships) and a PropertyContainer interface (for managing
properties). Operations on the Neo4J database have to be encapsulated in transac-
tions. If a transaction comes to a successful conclusion, it is finished by persisting the
changes into the database; otherwise, if the transaction fails at some point and raises
an exception, the previous operations inside the transaction are rolled back. A sample
code snippet for the creation of two nodes and a relationship is the following.

GraphDatabaseService db = ...
Transaction tx = db.beginTx();
try {
    Node alice = db.createNode();
    alice.setProperty("name", "Alice");
    alice.setProperty("age", 34);
    Node bob = db.createNode();
    // a relationship always needs a relationship type (here: "knows")
    Relationship edge = alice.createRelationshipTo(bob,
        DynamicRelationshipType.withName("knows"));
    edge.setProperty("knows_since", 2010);
    tx.success();
} catch (Exception e) {
    tx.failure();
} finally {
    tx.finish();
}

Neo4J offers a declarative query language called Cypher. With Cypher expressions you
can search for matching nodes or traverse the graph along edges. The main Cypher
syntax elements are
– START to specify starting nodes in the graph
– MATCH to specify the traversal that should be executed in the query; in the MATCH statement, --> can be used to denote an edge and additional requirements on edge labels can be written in square brackets: like -[:knows]-> to follow an edge with the label "knows"
– WHERE to specify additional filters
– RETURN to specify the return value

As an example for a Cypher query in a social network, we can look up the node with
the value “Alice” for its “name” attribute in an index (called “people_idx”), take the
returned node as the starting node, and then traverse the graph along the edges with
the label “knows” to return the adjacent persons as follows:

START alice = node:people_idx(name = "Alice")
MATCH (alice)-[:knows]->(aperson)
RETURN aperson

4.6.3 HyperGraphDB

HyperGraphDB is based on the data model of a generalized hypergraph (that is, a hypergraph where hyperedges may themselves contain hyperedges). Hence, Hyper-
GraphDB offers an advanced graph data model in contrast to the simpler property
graphs covered by other graph databases. Internally, HypergraphDB has a high-level
representation of the graph which is mapped to a low-level “primitive” storage layer;
this layer contains two key-value stores to store nodes, edges, and their values; Hyper-
graphDB relies on a running BerkeleyDB as its storage engine.

Web resources:
– HyperGraphDB: http://www.hypergraphdb.org/
– documentation page: http://www.hypergraphdb.org/learn
– GitHub repository: https://github.com/hypergraphdb/hypergraphdb

HyperGraphDB maps Java types into internal types. By constructing a new HyperGraph object and then calling its add method, a Java object (like a String) can be stored in the database; the method call returns an HGHandle object which acts like a pointer to the object in the database.

HyperGraph graph = new HyperGraph("/.../...");
HGHandle alicehandle = graph.add("Alice");
graph.close();

With a handle, the stored object can be retrieved from the database system:

String s = (String) graph.get(alicehandle);

On the other hand, the class HGQuery.hg provides factory methods to access the
database. All objects of a type can be retrieved as a list from the database by calling
the getAll method:

for (Object s : hg.getAll(graph, hg.type(String.class)))
    System.out.println(s);

In addition, selection conditions can be defined by instances of the HGQueryCondition class. All nodes and edges are subsumed under the term atom (implemented by
the interface of HGAtom). Edges are implemented as hyperedges as they may contain
an arbitrary number of other atoms (including other edges). Different edge types are
predefined in HypergraphDB:
– HGPlainLink: An edge between atoms without any values assigned to it.
– HGValueLink: An edge that carries a Java object with properties of the edge; effec-
tively turning the edge into a typed edge.
– HGRel: A labeled edge that restricts the types of the atoms contained in the edge.

As a simple example consider the following two String nodes in the graph that are
connected by an edge with a String property:

HGHandle alicehandle = graph.add("Alice");
HGHandle bobhandle = graph.add("Bob");
HGValueLink link = new HGValueLink("knows", alicehandle, bobhandle);
graph.add(link); // store the link itself as an atom in the database

HyperGraphDB allows users to implement custom data types (classes) and store instances of these classes in the database. These classes need to have an empty constructor as well as the appropriate getter and setter methods for their fields. In order to add the custom type to HyperGraphDB, the custom type has to be accompanied by a type object (implementing the HGAtomType interface) that describes the custom type; it is responsible for handling database interactions for the custom type – in particular, preparing a handle for it.
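As an illustration, such a custom class could be a plain JavaBean-style class (the Person class below is hypothetical and not part of HyperGraphDB); once the type is known to the database, instances can be stored and loaded with the add and get methods shown above:

// hypothetical JavaBean-style custom type with an empty constructor
// and getters/setters for all of its fields
public class Person {
    private String name;
    private int age;

    public Person() { }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}

// storing and loading an instance (sketch):
// Person alice = new Person();
// alice.setName("Alice");
// alice.setAge(34);
// HGHandle handle = graph.add(alice);
// Person loaded = (Person) graph.get(handle);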

4.7 Bibliographic Notes

The foundations of graph theory have been covered in many textbooks like [Die12,
CZ12].
A formal treatment of properties of graph query languages can be found in [Woo12]
while a complexity analysis of graph queries is given in [BB13]. Hypergraphs and their
applications have further been studied in [GLP93, BWK07]. A graph-based data model
by using hypernodes was introduced in [LL95].
A general overview of graph database technology with a focus on Neo4J is given
in [RWE13]. Various performance comparisons of several graph databases have been
made for example in [DSUBGV+10, DS12, DSMBMM+11, KSM13b, JV13, CAH12, HP13].
5 XML Databases
The Extensible Markup Language (XML) was defined by the WWW Consortium (W3C)
as a document markup language. Related standards (in particular, XML Schema,
XQuery and XSLT) are maintained by the W3C, too.

Web resources:
– XML Technology: http://www.w3.org/standards/xml/

In recent years, XML has become a major format for data exchange, flexibly structured documents, or configuration files. This created a demand for persistently storing XML documents in database systems that respect the special structure of XML documents and support efficient data retrieval on them. Although the JavaScript Object Notation (JSON) has established itself as the more readable successor of XML, several legacy systems still rely on data in XML format. Due to its mature and standardized format, management of XML data is still a current topic. In this chapter we introduce the basics of XML and discuss several options for storing XML documents in a database.

5.1 XML Background

An XML document consists of tags that structure the document. As this structure can
be less rigid than the conventional relational table model, XML documents can be
used to store data in a so-called semi-structured format. In general, due to well-
formedness conditions for XML documents each XML document can be represented in
a tree shape. In addition, validity can be checked based on some schema definitions:
their use is an optional feature of XML that facilitates document exchange and pro-
cessing. There are two forms of schema definitions for XML widely used today which
we will briefly survey in the following subsections: Document Type Definitions and
XML Schema. Several query or transformation languages can be used to process XML
documents. We focus here on the most relevant background on XML technology; other
introductory as well as advanced literature on XML is widely available.

5.1.1 XML Documents

As already mentioned, XML documents are structured by tags; a tag is basically a label
which spans a section of the document. More precisely, such a section of the document
is called an XML element, and it begins with a start tag and ends with an end tag.
The start tag contains the name of the element (surrounded by “<” and “>”); the end

tag additionally contains a slash sign “/”. For example, let us store all data of a hotel
reservation system in an XML document. For each hotel, we would need an element
that stores the location of a hotel; hence, the name of this element would be “location”.
With the corresponding start and end tags, the element for a hotel located in Newtown
would look like this:

<location> Newtown </location>

In this example, the location element contains the text “Newtown”. Apart from text,
an element can also contain other elements, which are then called subelements of
this element; in other words, elements can be nested. For example, to model a hotel
we can have a hotel element with contains as subelements a name element, a location
element and a room price element:

<hotel>
<name> Palace Hotel </name>
<location> Newtown </location>
<roomprice> 65 Euro </roomprice>
</hotel>

An element may contain several subelements with the same name; for example, we
might have an element with the name reservationsystem as the top-level element (the
so-called root element) and it might have several hotel elements as subelements:

<reservationsystem>
<hotel>
<name> Palace Hotel </name>
<location> Newtown </location>
<roomprice> 65 Euro </roomprice>
</hotel>
<hotel>
<name> Eden Hotel </name>
<location> Newtown </location>
<roomprice> 75 Euro </roomprice>
</hotel>
</reservationsystem>

Elements may also be empty (they neither contain text nor subelements). Moreover,
an XML element can have XML attributes. An attribute is an alternative to subelements:
they are another way of assigning information to an element. Attribute names and
their values are written inside the start tag of the element: the attribute name is fol-
lowed by an equality sign (=) and the attribute value inside quotation marks ("). An
attribute name can only occur once inside an element; this is in contrast to subele-
ments which can occur multiple times inside the same element. In our example, we
add an ID attribute to the hotel element:

<reservationsystem>
<hotel hotelID="h1">
<name> Palace Hotel </name>
<location> Newtown </location>
<roomprice> 65 Euro </roomprice>
</hotel>
<hotel hotelID="h2">
<name> Eden Hotel </name>
<location> Newtown </location>
<roomprice> 75 Euro </roomprice>
</hotel>
</reservationsystem>

An XML document might contain additional information that facilitates handling of the document; version declarations, entity declarations (which are shortcuts for often repeated texts), as well as comments or processing instructions are possible. For example, an XML version declaration can optionally be included at the start of an XML document as follows: <?xml version="1.0"?>. Moreover, XML namespaces (XMLNS) are useful to have a unique naming of elements: because tag names can be used with a different meaning in two different XML documents, exchanging these documents or processing these documents together may cause confusion. This confusion can be avoided
when each element name is prefixed by a unique namespace identifier; the names-
pace itself is usually a Uniform Resource Identifier (URI) which denotes the scope and
origin of the document.

5.1.2 Document Type Definition (DTD)

Document Type Definitions (DTDs) are the simplest way to specify a schema for XML
documents. DTDs have their own syntax (that is, they are not written in XML). The
two basic components of a DTD are element definitions and attribute definitions; their
syntax is as follows:

<!ELEMENT element (subelements-specification) >
<!ATTLIST element (attributes-specification) >

An element definition defines which element names can occur in an XML document,
how elements can be combined with other ones, and how elements can be nested.
Subelements can be specified as
– names of elements,
– #PCDATA (“parsed character data”; which mostly means arbitrary text – but parsed
character data may also contain markup elements),
– EMPTY (no subelements), or
– ANY (any element defined in the DTD can be a subelement).

Note that DTDs do not offer different data types for the content of an element: every
text value inside an element is simply a string (#PCDATA). A subelement specification
may contain regular expressions to further specify occurrences or combinations of el-
ements. Commas between element names indicate that the elements must occur as a sequence in the specified order. Moreover, the fol-
lowing notation can be used to quantify the occurrences of elements: | denotes al-
ternatives, + denotes 1 or more occurrences, * denotes 0 or more occurrences, and ?
denotes 0 or 1 occurrences; if no such quantifying notation is given for an element
name, the element must occur exactly once. In our example reservation system we
can specify in a DTD that an arbitrary amount of hotel elements can be contained in
the document, but hotel name, location and room price should be specified in a fixed
order; the room price is optional. Example:

<!ELEMENT reservationsystem (hotel*)>
<!ELEMENT hotel (name, location, roomprice?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT location (#PCDATA)>
<!ELEMENT roomprice (#PCDATA)>

An attribute specification determines, which attributes are available for the specified
element. It can also specify what values the attribute may have. The attribute specifi-
cation contains the name of the attribute and it defines its content type. The content
type may be “character data” #CDATA for arbitrary text, or a fixed enumeration of val-
ues that the attribute may hold, or IDs or IDREFs. IDs are identifiers for elements: an ID
associates a unique value with the individual element; and IDREF is an ID reference (or
IDREFS for multiple references): the value of a reference must correspond to the value
of the ID attribute for some element in the same document. Note that each element
can have at most one attribute of type ID and the ID attribute value of each individual
element in an XML document must be distinct to act as a unique identifier. Optional
specifications for attributes are whether the attribute is mandatory (#REQUIRED) or op-
tional (#IMPLIED); or whether it has a default value. Hence, to extend our example, we
can specify the hotel ID to be a mandatory ID attribute of the hotel element:

<!ATTLIST hotel hotelID ID #REQUIRED>

In order to illustrate ID references, assume that we have another element called booking which stores the booking of a particular hotel for a client.

<!ELEMENT reservationsystem (hotel*, booking*)>
...
<!ELEMENT booking (client)>
<!ELEMENT client (#PCDATA)>
<!ATTLIST booking
bookingID ID #REQUIRED
hotelbooked IDREF #REQUIRED>

Each booking should have its own ID as an attribute, but we can specify the booked
hotel by an ID reference as an attribute in the booking element.
Finally, the element and attribute definitions should be surrounded by a DOCTYPE
top-level declaration which states the name of the root element. For example, for our
root element reservationsystem:

<!DOCTYPE reservationsystem[
<!ELEMENT reservationsystem (hotel*, booking*)>
...
]>

While DTDs are quite easily readable, they have some limitations. In particular, they
only offer limited data typing: All values are strings, and one cannot differentiate
strings from other basic types like integers, reals, etc. Moreover note that IDs and
IDREFs are untyped: in our example, the ID reference hotelbooked can refer to an-
other booking ID instead of referring to a hotel ID which would be nonsensical.

5.1.3 XML Schema Definition (XSD)

An XML Schema definition is an XML document; it can hence directly be processed by the applications that process the other (non-schema) XML documents. XML Schema
is a much more sophisticated schema language than DTD. It allows for data typing
with basic system-defined types (we can define elements to contain integers, deci-
mals, dates, strings, or list types) and even user-defined types including inheritance.
XML Schema also supports constraints on the number of occurrences of subelements,
on minimum or maximum values and an advanced ID referencing. But this increased
expressiveness comes at the cost of less readability: XML Schema definitions are sig-
nificantly more complicated than DTDs. To start with, there is a standard URI that
defines all the necessary components (like elements and attributes); this URI can be

used in a namespace throughout an XML Schema definition. In our examples, we call this namespace xsd:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

Elements that do not contain other elements or attributes are of type simpleType and
can be defined as having a basic type like string or decimal.

<xsd:element name="name" type="xsd:string"/>
<xsd:element name="location" type="xsd:string"/>
<xsd:element name="roomprice" type="xsd:decimal"/>

Elements that contain subelements are of type complexType. The list of subelements
of a complex type element is defined inside a sequence element.

<xsd:element name="hotel">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="name" />
<xsd:element name="location" />
<xsd:element name="roomprice" />
...
</xsd:sequence>
</xsd:complexType>
</xsd:element>

With XML Schema definitions, the number of occurrences (the cardinality) of subele-
ments can be defined. The cardinality of an element is represented by minOccurs and
maxOccurs in the element definition. When the element is optional, its minOccurs
property should be set to 0. When there is no restriction on the maximum number of
occurrences, maxOccurs should be set to "unbounded". In our case, there should be
exactly one hotel name and location but the room price is optional:

<xsd:element name="hotel">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="name" minOccurs="1" maxOccurs="1"/>
<xsd:element name="location" minOccurs="1" maxOccurs="1"/>
<xsd:element name="roomprice" minOccurs="0" maxOccurs="1"/>
...
</xsd:sequence>
</xsd:complexType>
</xsd:element>

Attributes must be defined after the corresponding element definitions; for example:

<xsd:element name="hotel">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="name" minOccurs="1" maxOccurs="1"/>
<xsd:element name="location" minOccurs="1" maxOccurs="1"/>
<xsd:element name="roomprice" minOccurs="0" maxOccurs="1"/>
...
</xsd:sequence>
<xsd:attribute name="hotelID" type="xsd:string"/>
</xsd:complexType>
</xsd:element>

XML Schema offers user-defined types which can be reused throughout the XML
Schema definition. For example, we can define a new HotelType by giving the com-
plex type definition the appropriate name:

<xsd:complexType name="HotelType">
<xsd:sequence>
<xsd:element ref="name" minOccurs="1" maxOccurs="1"/>
...
</xsd:sequence>
</xsd:complexType>

User-defined types can then be used as types for elements:

<xsd:element name="hotel" type="HotelType"/>

5.1.4 XML Parsers

Programs that process inputs in XML format are usually called XML parsers. An in-
dispensable syntactic requirement for an XML document is the well-formedness of
the document: an XML document must have a certain structure because otherwise a
parser will fail to process it. The main aspect of well-formedness is proper nesting of
elements: every start tag must have a unique matching end tag, inside the scope of
the same parent element. An example for proper nesting is when the start and end
tag for element name are both inside the element hotel: <hotel> <name> </name>
</hotel>. An example for an improper nesting is when the end tag of element name
is outside of the element hotel: <hotel> <name> </hotel> </name>; this improper
nesting would cause an error when processing the document. Moreover, a well-formed
XML document should always have a unique top-most element (which surrounds all other elements of the document); this top-most element is called the root element
of the XML document. Other conditions for well-formedness include for example the
correct syntax of attributes or comments.
A validating XML parser not only checks if an XML document is well-formed but also checks its validity: that is, that it conforms to a Document Type Definition (DTD)
or XML Schema definition.
There are two basic application programming interfaces (APIs) an XML parser can
offer:
Simple API for XML (SAX): A SAX parser reads in an XML document as a stream
of data. When reading a part of the document (for example, a start tag), the SAX
parser creates an event (for example, a start-element event). An application us-
ing the SAX API can then provide event handlers for these parsing events. A SAX
parser processes an XML document in a read-once fashion which makes it fast.
On the downside, however, one cannot navigate in the document or modify it.
Document Object Model (DOM): A DOM parser transforms the XML document
into a tree representation. The DOM API provides several functions for traversing
or updating the DOM tree. The Java DOM API provides a Node class with meth-
ods for navigating in the tree (for example, getParentNode(), getFirstChild(), get-
NextSibling(), getAttribute()), reading text from a text node (getData()), or search-
ing in the tree (for example, getElementsByTagName()).
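As an illustration of the DOM API, the following minimal Java sketch (the file name reservationsystem.xml is only an assumption) parses the hotel document from Section 5.1.1 into a DOM tree with the standard javax.xml.parsers classes and prints the text content of all name elements:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomExample {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // parse the XML document into an in-memory DOM tree
        Document doc = builder.parse("reservationsystem.xml");
        // search the tree for all elements with tag name "name"
        NodeList names = doc.getElementsByTagName("name");
        for (int i = 0; i < names.getLength(); i++) {
            System.out.println(names.item(i).getTextContent());
        }
    }
}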

5.1.5 Tree Model of XML Documents

XML documents have a structure that resembles a tree. The root element is the root
node of the tree; the subelements of the root element are the child nodes of the root
node; and so on. Subelements nested inside the same element are child nodes of the
same parent node; they are on the same level in the tree and are called siblings. All
nodes on the path from a node to the root node are called ancestors of this node;
whereas all nodes in the subtree starting from a node are called descendants of this
node. The nodes for the elements are usually constructed in the order in which they
appear in the XML document (the document order). For each node in the tree one can
hence also differentiate between the preceding nodes and the following nodes (in
the document order). Note that preceding nodes are disjoint from ancestors and fol-
lowing nodes are disjoint from descendants: only those nodes that occur before a node
in the document but that are not ancestors of the node are preceding nodes for the
node; only those nodes that occur after a node in the document but that are not de-
scendants of the node are following nodes for the node. When moving along from
node to node inside the tree (“navigating” the tree), the node where one is currently
located is called the context node or self node. Figure 5.1 shows the different classes
of nodes from the perspective of the context node self.

Fig. 5.1. Navigation in an XML tree (the context node self and its parent, child, sibling, ancestor, descendant, preceding and following nodes)

There are two more kinds of nodes: text nodes and attribute nodes. When an element
contains text, this text is positioned as a child node of the element node in a special
text node. When an element contains an attribute, an attribute node is created for
it as a child node of the element; in contrast to text nodes of an element, the text
value of an attribute is stored (together with the attribute name) in the attribute node –
there is no separate text node for attributes. Figure 5.2 shows the representation of the
following example XML document; element and text nodes are drawn as rectangles
whereas attributes nodes are drawn as dashed ellipses:

<reservationsystem>
<hotel hotelID="h1">
<name> Palace Hotel </name>
<location> Newtown </location>
<roomprice> 65 Euro </roomprice>
</hotel>
</reservationsystem>

Note that this tree-model view of XML documents does not take ID references into
account: as an ID reference attribute basically forms a link to a different subtree iden-
tified by the matching ID attribute, the data model is no longer a tree. Hence, more
generally an XML document can be seen as a directed graph; this graph may even
contain cycles – for example, in case we have a bidirectional ID-based reference be-
tween two elements with appropriate ID attributes.

Fig. 5.2. XML tree (root element reservationsystem; element hotel with attribute node hotelID = h1, element nodes name, location and roomprice, and their text nodes Palace Hotel, Newtown and 65 Euro)

5.1.6 Numbering Schemes

Numbering schemes (also called labeling schemes) are important when working
with XML documents. A numbering scheme assigns each node of an XML tree a unique
identifier (a label or node ID which is usually a number). The simplest way of num-
bering an XML tree is doing a preorder traversal of the tree and simply increasing a
counter for each node. Preorder traversal means that the root node is numbered as the
first node before numbering any other node; and this is done recursively for all child
nodes. The preorder numbering for our example is shown in Figure 5.3.
Another way of traversing a tree is the postorder traversal. The root node is num-
bered last after all child nodes have been numbered; this way, the postorder number-
ing starts with the left-most leaf node as the first node with the lowest number while
the root node receives the highest number.
Both preorder and postorder traversal can be combined in the so-called pre/post
numbering. This combination of the preorder and postorder numbering has some
advantages for navigating in the tree: by comparing the preorder and postorder num-
bering of the context node self with the preorder and postorder numbering of another
node w, we can easily determine whether the node w is an ancestor, a descendant, a
preceding or a following node of self.

Navigation rules with pre/post numbering:

1. w ancestor if pre(w) < pre(self) and post(w) > post(self)
2. w descendant if pre(w) > pre(self) and post(w) < post(self)
3. w preceding if pre(w) < pre(self) and post(w) < post(self)
4. w following if pre(w) > pre(self) and post(w) > post(self)
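
These four rules can be implemented directly. The following minimal Java sketch assumes
that the pre and post numbers of each node are available; the TreeNode record is purely
illustrative and not part of any concrete XML database API.

// Sketch of the pre/post navigation rules; TreeNode is an illustrative record.
record TreeNode(int pre, int post) {}

class PrePostRelations {
    static boolean isAncestor(TreeNode w, TreeNode self) {
        return w.pre() < self.pre() && w.post() > self.post();
    }
    static boolean isDescendant(TreeNode w, TreeNode self) {
        return w.pre() > self.pre() && w.post() < self.post();
    }
    static boolean isPreceding(TreeNode w, TreeNode self) {
        return w.pre() < self.pre() && w.post() < self.post();
    }
    static boolean isFollowing(TreeNode w, TreeNode self) {
        return w.pre() > self.pre() && w.post() > self.post();
    }

    public static void main(String[] args) {
        TreeNode self = new TreeNode(4, 4);
        TreeNode w = new TreeNode(1, 7);
        // true: w has a smaller pre value and a larger post value than self
        System.out.println(isAncestor(w, self));
    }
}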

Fig. 5.3. XML tree with preorder numbering (reservationsystem = 0, hotel = 1, hotelID = 2, name = 3, Palace Hotel = 4, location = 5, Newtown = 6, roomprice = 7, 65 Euro = 8)

Fig. 5.4. Pre/post numbering and pre/post plane (example tree with pre/post values a = 1/7, b = 2/2, c = 3/1, d = 4/4, e = 5/3, f = 6/6, g = 7/5, plotted with the preorder number on the x-axis and the postorder number on the y-axis)

These rules can conveniently be visualized by drawing a diagram where the x-axis is
the preorder numbering and the y-axis is the postorder numbering; this diagram is
called the pre/post diagram. Nodes are positioned in the plane according to their
pre/post numbering. In Figure 5.4 an example tree is numbered in preorder as well as
postorder (here the numbering starts with 1). The pre/post plane of the tree contains
all nodes of the tree. Assuming that node d is the context node, all ancestors of d are
located in the upper left corner (in this case, node a); all descendants are located in
the lower right corner (in this case, node e); all preceding nodes are located in the
lower left corner (in this case, nodes b and c); all following nodes are located in the
upper right corner (in this case, nodes f and g).
The pre/post numbering has some advantages regarding fast query processing
and compact storage. But it also has a major disadvantage: Modifications in the tree

Fig. 5.5. DeweyID numbering (node a = 1; its children b = 1.1, e = 1.2, g = 1.3; their children c = 1.1.1, d = 1.1.2, f = 1.2.1, h = 1.3.1, i = 1.3.2)

(inserting or deleting nodes) are costly because the numbering of several nodes must
be altered; this is known as renumbering (or relabeling). One extension of the basic
pre/post numbering scheme adds containment information: one can decide more eas-
ily whether a node is a descendant of another node. The preorder numbering is stored
for each node plus the information of the range of the node; that is, the node ID of the
last descendant (the one with the highest node ID) incremented by 1. This encoding
clearly makes it easier to determine whether a node is a descendant of some other node
because its ID is contained in the range of each of its ancestor nodes. Hence, these ex-
tensions run under the name of range-based or interval-based numbering schemes.
Another extension adds level information: for each node its level in the tree is stored
(where the root node is at level 0, its children at level 1 and so on). This way, sib-
lings can be easily determined because they are on the same level. However both ap-
proaches suffer from the renumbering problem, too.
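
For example, in the tree of Figure 5.3 the hotel node has preorder number 1 and its last
descendant has node ID 8, so its range ends at 9; any node whose ID lies strictly between
1 and 9 is a descendant of hotel. A minimal sketch of such a range-based descendant test
(with an illustrative RangeNode record) could look as follows:

// Range-based descendant test: each node stores its preorder number and the end of
// its range (node ID of its last descendant incremented by 1); names are illustrative.
record RangeNode(int pre, int rangeEnd) {}

class RangeCheck {
    static boolean isDescendant(RangeNode w, RangeNode ancestor) {
        return w.pre() > ancestor.pre() && w.pre() < ancestor.rangeEnd();
    }

    public static void main(String[] args) {
        RangeNode hotel = new RangeNode(1, 9);    // last descendant has ID 8
        RangeNode location = new RangeNode(5, 7); // last descendant has ID 6
        System.out.println(isDescendant(location, hotel)); // prints true
    }
}
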
Ultimately, what we need for an efficient storage and management of XML doc-
uments is a modification-friendly numbering scheme. More precisely, a numbering
scheme should only renumber a minimal number of nodes in the existing tree when
a new node is inserted in the tree or an existing node is deleted. Prefix numbering
schemes are one step towards modification friendliness: numbering occurs level-wise,
and when inserting a node, only siblings of the new node and their subtrees have to
be renumbered. The simplest prefix numbering scheme is called DeweyID. Each level
of the tree has its own counter (starting from 1); to a level counter, the counters of all
ancestors on the path to the root node are prepended (where the dot ‘.’ is used as a sep-
arator). This way it is easy to see on which level a node is located and which node
is its parent node. That is, the root node always has node ID 1; the root’s
first child has node ID 1.1; the root’s second child has node ID 1.2; and so on. Figure 5.5
shows an example tree with DeweyID numbering.
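
A hedged sketch of how DeweyIDs can be handled programmatically follows; it treats
IDs simply as strings such as "1.3.2", and the class is illustrative only.

// DeweyIDs as plain strings: the parent ID is obtained by cutting off the last
// counter, and the level equals the number of counters minus one (root at level 0).
class DeweyId {
    static String parent(String id) {
        int cut = id.lastIndexOf('.');
        return cut < 0 ? null : id.substring(0, cut); // the root "1" has no parent
    }
    static int level(String id) {
        return id.split("\\.").length - 1;
    }

    public static void main(String[] args) {
        System.out.println(parent("1.3.2")); // prints 1.3
        System.out.println(level("1.3.2"));  // prints 2
    }
}
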
OrdPath [OOP+04] is an extension of DeweyID where the initial numbering only
consists of positive odd integers. Even numbers and negative numbers serve for later
insertions into an existing tree. For example, let the initial encoding contain two sib-
lings numbered 1.3 and 1.5; inserting a new sibling in between the two would result
in a node ID 1.4.1 for this new sibling. Hence, numbering in OrdPath is a bit ambigu-
ous because although 1.3 and 1.4.1 are on the same level, their numbering depth is
different; this makes comparison of node IDs more difficult.
A further step to avoid renumbering is to use binary strings (instead of numbers)
for node IDs as in [LLH08]. These binary strings are then compared lexicographically
(instead of numerically). The following property makes binary strings beneficial for
numbering schemes: between any two binary strings ending in ‘1’ there is another bi-
nary string which is lexicographically between those two; in particular, there is
one such binary string that also ends in ‘1’. Hence, with binary strings to encode
node IDs modifications in the tree can be easily handled because renumbering is re-
duced. For example, assume a modification wants to insert a new node between a node
with ID 4 and a node with ID 5. With integer representation of 4 and 5, it is impossi-
ble to find an integer between 4 and 5; hence renumbering would be necessary to do
the insertion. However, when representing node labels as binary strings, we can find
a binary string in between the two: for example for a node ID 0011 and a node ID 01,
a binary string which is lexicographically in between the two would be 00111. As the
new number again ends with ‘1’, further insertion would be possible without renum-
bering. More formally, lexicographical order of binary strings is defined by comparing
the two strings from left to right: If one string is a prefix of the other, the prefix is the
smaller one (for example, 01 is smaller than 011); if (after a common prefix) one string
has a 0 where the other has a 1, the one with the zero is smaller (for example, 0011 is
smaller than 01).
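
A minimal Java sketch of this idea follows. Java’s String.compareTo already implements
the lexicographic order just described, and the between method shows one possible way
(not necessarily the exact algorithm of [LLH08]) to compute a new label strictly between
two given labels that again ends in ‘1’.

class BinaryStringLabels {
    // lexicographic comparison: prefixes are smaller, '0' is smaller than '1'
    static boolean lessThan(String a, String b) {
        return a.compareTo(b) < 0;
    }

    // returns a label c with a < c < b that again ends in '1' (assumes a < b and
    // that both labels end in '1'); this is only an illustrative construction
    static String between(String a, String b) {
        if (!lessThan(a, b)) throw new IllegalArgumentException("requires a < b");
        if (a.length() >= b.length()) {
            // a and b differ at a position where a has '0' and b has '1',
            // so appending '1' to a stays below b
            return a + "1";
        } else {
            // replacing the final '1' of b by "01" yields a label that is
            // smaller than b but still larger than a
            return b.substring(0, b.length() - 1) + "01";
        }
    }

    public static void main(String[] args) {
        System.out.println(between("0011", "01")); // prints 00111
    }
}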

5.2 XML Query Languages

Due to the tree-like nature of XML documents, query languages for XML need to offer
functionalities to navigate the tree. We briefly survey XPath, XQuery and XSLT which
are all mature and established XML query and processing languages.

5.2.1 XPath

XPath is a language for navigating an XML document; modifications of the document
are not possible. An XPath query consists of a location path, that is, a concatenation of
navigation steps in the tree. A single step has three components – axis, node test and
predicate – written in the following notation: axis::nodetest[predicate]. The axis
specifies the direction of the step (for example, going from the current context node
to a child node); the node test defines the node type or name that the accessed node
should have; and the (optional) predicate can be evaluated to true or false and acts as
filter for the resulting set of nodes by only considering those nodes in the final result
set for which the predicate is true. For example, we could ask for the ID attributes of
all those hotels that have the text “Newtown” in their location element.

/descendant-or-self::hotel[self::node()/child::location/
child::text()="Newtown"]/attribute::hotelID

As XPath queries in full notation are hard to read, several ways to shorten the expres-
sions have been defined. The following query is equivalent to the previous one but
uses the shorter notation.

//hotel[./location="Newtown"]/@hotelID

Note that ‘//’ stands for the descendant-or-self axis, ‘.’ stands for the self axis, ‘@’
stands for the attribute axis, and the child axis as well as the node test node() and
text() can simply be dropped.

5.2.2 XQuery

XQuery as a query language has some declarative language features: the for, let,
where, order by and return clauses; an XQuery query is hence also called a FLWOR ex-
pression. With a for clause, we can iterate with a variable over a set of values; with
a let clause, we can assign a value to a variable; with a where clause we can specify
a selection condition to filter out some parts of the document; with the order by clause
we can sort the result; and with the return clause we define the structure of the query
result. XQuery uses XPath expressions to specify navigational paths in the XML doc-
ument. In contrast to XPath, XQuery can modify the structure of an XML document.
For example, an XQuery expression can add elements or attributes to the query result
(which were not present in the input XML document). This is achieved by using these
elements or attributes in the XQuery expression. For example, for our reservation sys-
tem example, we could load the XML document from a file, assign the document to a
variable, iterate over the hotel names with a for loop and then output only the hotel
names in an XML document with the new elements Hotels and Hotelname.

<Hotels> {
let $b := doc("reservationsystem.xml")
for $a in $b//hotel/name
return
<Hotelname> { $a } </Hotelname>
} </Hotels>

From a database point of view, the XQuery Update Facility (XQUF) is an important fea-
ture that allows for efficient modifications (in particular, insertions and deletions of
elements) inside an XML document without the need to load the whole document into
main memory and then storing it back to disk after modifications; most interestingly,
existing elements can be updated without changing their node IDs. As a simple exam-
ple, let us change the value of the name element of the hotel node with hotelID set to
h1:

let $c := doc("reservationsystem.xml")//hotel[@hotelID = 'h1']/name
return replace value of node $c with ('City Residence')

XQuery offers lots of other useful features like user-defined functions. In addition,
XQuery Full Text (XQFT) offers the capability of full-text search: for example, XQFT
queries can ask for substrings, or use language-specific features and stemming of
words.

5.2.3 XSLT

XSLT stands for XSL transformation; the eXtensible Stylesheet Language (XSL) is com-
monly used to define how the content of an XML documents should be formatted and
displayed. XSLT as a subset of XSL is used to define how the structure of the input
XML document can be transformed into a differently structured XML document or
even other data formats like HTML or other textual formats. XSLT supports features
like recursion or sorting and hence qualifies itself as a general-purpose transforma-
tion tool for different output formats.
XSLT expressions consist of transformation rules that are called templates; in-
side a template, XPath expressions can be used to navigate to the matching elements.
These matching elements in the XML document are processed according to the actions
specified inside the XSL template element. For example, an XSL value-of element se-
lects (and outputs) the value of the specified element. In contrast to XQuery, elements
that do not match any template are also processed; that is, their attributes and text
element contents are part of the output document. To avoid this, by using the asterisk
* as the match path in the XSL template, a default behavior can be specified. In other
words, this default XSL template is used to match all the remaining elements that do
not match any other template; by leaving this template empty, the remaining elements
will be ignored and do not show up in the output. Any text or tag in the XSL stylesheet
that is not in the xsl namespace will be output as is; this way we can easily introduce
new elements in the output document. As a simple example of an XSLT template with
match and select part, let us just output the names of all hotels of our reservationsys-
tem XML document under a new root element Hotels: we put each hotel name inside
a new Hotelname element and ignore all other elements. We define the string xsl as
the namespace for the XSL definitions:

<Hotels>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/reservationsystem/hotel">
<Hotelname>
<xsl:value-of select="name"/>
</Hotelname>
</xsl:template>
<xsl:template match="*"/>
</xsl:stylesheet>
</Hotels>

5.3 Storing XML in Relational Databases

A relational database system might be used to store XML documents due to several
reasons (for example, maturity and well-documented behavior of RDBMSs, licens-
ing issues or supporting legacy applications). In this section, we survey three op-
tions for storing XML in a SQL database: (i) the SQL/XML data type for XML data, (ii)
schema-based mappings, or (iii) schemaless (also called schema-oblivious or schema-
independent) mapping. What we will see in this section is that it is possible to map an
XML document to a set of relational tables – as long as the XML document is simply-
structured (not too many levels in the tree, not too many optional subelements). How-
ever in the general case – for complex hierarchical, sparse and maybe recursive XML
documents – this mapping procedure will in general be a tedious process. What is
more, joining several tables (using foreign keys) to recompose the original XML docu-
ment results in performance penalties during query processing. And lastly, document
order is important for XML documents (it may be defined in the XML schema defi-
nition) and must be preserved in query results. The relational data model however
sees data as an unordered set of tuples and hence document order of XML documents
mapped to a relational table can only be preserved by storing additional ordering in-
formation in the table.

5.3.1 SQL/XML

Several relational database management systems support the XML-related standard
of SQL, SQL/XML for short. The main feature is that a data type called XML can now
be used to store an entire XML document (or parts of an XML document) as an at-
tribute of a relational table. However, the internal storage structure, the extent to
which SQL/XML is supported, and the types of XML query languages supported (in
particular, efficient updates with XQUF; see Section 5.2.2) all depend on the RDBMS
provider. Internally, an RDBMS can store XML data in a binary format containing the
entire document, or in decomposed format distributing the XML content to various
tables.
As an example for XML as the data type of a column, consider the case that we
store hotel information in a table with a single XML column; each hotel is stored as a
separate row in the table:

CREATE TABLE HotelInfos (
  hotelData XML
);
INSERT INTO HotelInfos VALUES (
  XML( <HOTEL HotelID="h1">
         <NAME>Palace Hotel</NAME>
         <LOCATION>Newtown</LOCATION>
         <ROOMPRICE>69</ROOMPRICE>
       </HOTEL>
  )
);

With the XMLTABLE function a stored XML document can be transformed into rela-
tional table format; column names can be specified in the function call to map XML
elements (or attributes) to the relational output format. In our example the following
statement reads the XML document in the column hotelData from the HotelInfos ta-
ble and maps the subelements (as well as the ID attribute) of each hotel element to a
row and then returns all these values as a new table called Xtable with four columns.

SELECT Xtable.* FROM HotelInfos,
  XMLTABLE('/HOTEL' PASSING HotelInfos.hotelData
    COLUMNS
      "hotelid" VARCHAR(100) PATH '@HotelID',
      "hotelname" VARCHAR(100) PATH 'NAME',
      "hotellocation" VARCHAR(100) PATH 'LOCATION',
      "hotelprice" INTEGER PATH 'ROOMPRICE'
  ) AS Xtable

In the opposite direction, the SQL/XML standard also defines some functions that can
be used to generate an XML document from data stored in relational tables. Construc-
tors for elements and attributes can be used to obtain an XML fragment with a correct
syntax. As an example, assume that we have a table that stores data on persons; the
Person table has as columns PersonID, LastName and FirstName. We can access the
values in these columns and create an XML element named Personname that contains
a concatenation of first and last name (separated by a blank space ’ ’); the PersonID
is added as an attribute of that element.

SELECT XMLELEMENT ( "Personname",
    XMLATTRIBUTES (p.PersonID AS "pid"),
    p.FirstName || ' ' || p.LastName
  )
FROM Person p

Assume that a person named Alice Smith is stored in the table with PersonID 1; the
output would then contain the XML element for this person:

<Personname pid='1'>Alice Smith</Personname>

Several such function calls can be nested to obtain a nested XML document.

5.3.2 Schema-Based Mapping

The schema-based mapping of an XML document to relational database tables first of
all processes the XML schema definition: from the given DTD or XML Schema defi-
nitions, a database schema is derived that contains the definition of database tables
holding the necessary elements and attributes; this is called the schema mapping
step. The different tables are linked by foreign key constraints. For example, the table
for the XML root element is the base table that contains a primary key – the primary
key can later on be used as a foreign key in other tables. Next, subelements must be
accounted for in the relational schema. First of all, subelements that can be repeated
multiple times inside the same parent element (that is, their number of occurrences is
greater than 1) should be mapped to their own tables; here we indeed need a foreign
key column to map the subelement to the matching parent element (based on the pri-
mary key of the parent table). Moreover, in case the subelements contain subelements
of their own, the node ID of each element should again be stored as a primary key in a
separate column. In contrast, subelements that can occur at most once can be stored
as a column in the same table as their parent; this process is called inlining. If this
subelement is optional, the content of the column will be null for every parent element
which does not contain the subelement. Hence in the relational schema definition,
these columns should be nullable to explicitly allow the use of null values, whereas –
the other way round – required elements and attributes should be not null. Note that
in general also the data type has to be mapped: If the XML schema definition contains
information on data types, then the data type of the corresponding column must be
set to a matching type; if no data type information is given, a default data type for
the column must be chosen (for example, a variable-length character string like
VARCHAR(255)). Moreover, default values for XML attributes must be mapped into default
values for the corresponding columns. So far, with this kind of schema-based map-
ping, still some information of the original XML document is lost. First and foremost,
the XML document order is not fully mapped. While the child-parent relationship is
generally maintained because of the foreign key constraints, the sibling order is disre-
garded. This is why it is not clear for two subelements that belong to the same parent,
which of them comes first and which last. If sibling order is important, it can be en-
coded by additionally storing the node IDs of subelements themselves in a separate
column instead of just storing the parent ID as a foreign key. Similarly, XML ID refer-
ence attributes must be mapped to foreign key constraints to match the correct XML
ID attribute.
We will now look at our example DTD and map it to the corresponding SQL CREATE
TABLE statements; while we do consider an ID reference attribute, we refrain from en-
coding the sibling order (and hence leave out the additional ID attributes).

<!ELEMENT reservationsystem (hotel*, booking*)>
<!ELEMENT hotel (name, location, roomprice?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT location (#PCDATA)>
<!ELEMENT roomprice (#PCDATA)>
<!ATTLIST hotel
hotelID ID #REQUIRED>
<!ELEMENT booking (client)>
<!ELEMENT client (#PCDATA)>
<!ATTLIST booking
bookingID ID #REQUIRED
hotelbooked IDREF #REQUIRED>

First of all we see that the root element has two subelements (hotel and booking)
with multiple occurrences; this is why the subelements will each be stored in their
own table with a foreign key to the root element table. Next, we decide how to store
subelements of the hotel element. Inside the hotel element, all elements only occur
at most once; hence the subelements of hotel are all inlined elements in the same
table. The hotelID attribute is defined with a unique not null constraint. Lastly, we
decide how to store subelements of the booking element. Inside the booking element
the ID reference to the hotel must be modeled by a foreign key to the hotelID. The
resulting SQL statements are:

CREATE TABLE Reservationsystem (Id INTEGER PRIMARY KEY);

CREATE TABLE Hotel (HotelID VARCHAR(10) UNIQUE NOT NULL,
                    Name VARCHAR(20) NOT NULL,
                    Location VARCHAR(20) NOT NULL,
                    Roomprice NUMERIC NULL,
                    ParentID INTEGER REFERENCES Reservationsystem (Id));

CREATE TABLE Booking (BookingID VARCHAR(10) UNIQUE NOT NULL,
                      Hotelbooked VARCHAR(10) REFERENCES Hotel (HotelID),
                      Client VARCHAR(20) NOT NULL,
                      ParentID INTEGER REFERENCES Reservationsystem (Id));

We reinforce the point that we do not consider sibling order in this example as other-
wise node ID columns would have to be added for all elements.
After the definition of the relational schema, the content of the XML document is
parsed, tuples are constructed out of it and these tuples are stored in the correspond-
ing tables; this step is called data mapping. As primary keys for the mapped data, a
node ID based on a numbering scheme (see Section 5.1.5) can be used. When looking
at the following example XML document

<reservationsystem>
<hotel hotelID=’h1’>
<name> City Residence </name>
<location> Newtown </location>
</hotel>
<booking bookingID=’b1’ hotelbooked=’h1’>
<client> M. Mayer </client>
</booking>
</reservationsystem>

the data mapping results in the three tables shown in Table 5.1.

Table 5.1. Schema-based mapping

Reservationsystem:  Id
                    0

Booking:  BookingID  Hotelbooked  Client    ParentID
          b1         h1           M. Mayer  0

Hotel:  HotelID  Name            Location  Roomprice  ParentID
        h1       City Residence  Newtown   NULL       0

When evaluating an XML query, the XPath or XQuery expressions have to be translated
into SQL statements in a so-called query mapping step. The SQL query has to contain
joins based on the foreign keys established in the relational schema to recombine the
values from the relational tuples into an XML (sub-)tree. As the tables contain only the
data content of the XML document, the XML structure information for the query result
(element and attribute names, tree-like structure) must be obtained from the external
XML schema definition. As already mentioned, the three mapping steps (schema map-
ping, data mapping, and query mapping) may turn out to be prohibitively expensive
for large, possibly sparse XML documents with a complex hierarchical structure.
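
As an illustration of the query mapping step, a request like “the names of all hotels
booked by client M. Mayer” has to be answered by a join over the foreign keys of the
tables defined above. The following hedged JDBC sketch shows such a translated query;
the JDBC URL is only a placeholder and would have to be replaced by the connection
string of a concrete RDBMS.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SchemaBasedQueryMapping {
    public static void main(String[] args) throws SQLException {
        // join Booking and Hotel via the Hotelbooked foreign key
        String sql = "SELECT h.Name FROM Booking b "
                   + "JOIN Hotel h ON b.Hotelbooked = h.HotelID "
                   + "WHERE b.Client = ?";
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:exampledb://localhost/reservations"); // placeholder URL
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, "M. Mayer");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("Name"));
                }
            }
        }
    }
}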

5.3.3 Schemaless Mapping

The schemaless mapping of an XML document to relational database tables does not
look at an XML schema definition; instead, it uses a fixed generic database schema.
Basically, the schemaless mapping creates a tuple out of each XML node (element or
attribute). More precisely for each node the node type (root node, element node, text
node or attribute node), the node name (element name or attribute name), and the
node data (for attribute nodes or text nodes) are stored in separate columns. Not all
nodes have values for all columns: an element node has no data, and a text node has
no name. Hence, the resulting table contains several empty fields (that is, null values).
The tree-like structure of the XML document is maintained by storing node IDs (based
on a numbering scheme as discussed in Section 5.1.5): The database schema contains
a column for the node ID of the component itself and a column for the node ID of
the parent node (the parent ID). As an example consider the XML tree with preorder
numbering in Figure 5.3. It will be mapped to the table shown in Table 5.2.
XML queries (in particular, the XPath components) have to be translated into SQL
queries including self-joins on the generic table. For example, to evaluate a child axis,

Table 5.2. Schemaless mapping

nodeID  nodeType   nodeName           nodeData      parentID
0       root       reservationsystem  NULL          NULL
1       element    hotel              NULL          0
2       attribute  hotelID            h1            1
3       element    name               NULL          1
4       text       NULL               Palace Hotel  3
5       element    location           NULL          1
6       text       NULL               Newtown       5
7       element    roomprice          NULL          1
8       text       NULL               65 Euro       7

the node ID of the context node must be matched with the parent ID of the child node.
Several other pieces of information may be added to the generic schema to speed up query pro-
cessing; for example, the full path (from the XML root) of each element may be stored
to make XPath evaluation more efficient. An obvious disadvantage of the schemaless
mapping is that for a large XML document the mapping results in one large relational
table containing many null values and requiring costly self-joins for query processing.
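
To make the translation idea concrete, the following hedged Java sketch builds the SQL
self-join for a single XPath child step over the generic table of Table 5.2; the table name
XmlNodes is an assumption, since the generic table is not named in the text.

class SchemalessQueryMapping {
    // e.g. translateChildStep("hotel", "name") yields a self-join that finds all
    // name element nodes that are children of hotel element nodes
    static String translateChildStep(String parentName, String childName) {
        // string concatenation is used only for illustration;
        // a real system would use bind parameters
        return "SELECT c.nodeID "
             + "FROM XmlNodes p JOIN XmlNodes c ON c.parentID = p.nodeID "
             + "WHERE p.nodeName = '" + parentName + "' "
             + "AND c.nodeName = '" + childName + "'";
    }

    public static void main(String[] args) {
        System.out.println(translateChildStep("hotel", "name"));
    }
}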

5.4 Native XML Storage

Storing XML documents in a format that makes processing it (querying and updating
it) efficient is often called native XML storage; database systems that use such an ef-
ficient storage format (instead of relational tables) are called native XML databases.
Such native XML databases should provide all the features of a general DBMS (see Sec-
tion 1.1). An additional requirement for native XML storage is round-tripping: when
an XML document is stored and later on retrieved, its original structure must be pre-
served including the document order of all its elements and attributes. Moreover, re-
garding database access from external programs, some standards have been defined
to facilitate access and interoperation with native XML databases: for example, XQJ is
the standard XQuery API for Java which allows establishing a connection to an XML
database from a Java program. Native XML storage heavily relies on additional data
structures that provide extra information about the XML document (like for exam-
ple different kinds of indexes). Moreover, an advanced management of the memory
pages is necessary to split large XML document to multiple pages while still allowing
for efficient navigation without wasting too much memory space. Both, memory man-
agement and additional data structures, are the main reason for the efficiency gain
of native XML databases in comparison to other (non-native) storage approaches. We
survey index variants and native storage mechanisms in the following subsections.

5.4.1 XML Indexes

Indexes are crucial for an accelerated evaluation of XML queries. Some queries can
even be answered by only consulting an index; that is, the stored XML document it-
self need not be loaded and accessed. Indexes that are particularly useful for XML
processing can roughly be categorized into the following types:
Value indexes: A value index maps each content value of attributes and text
nodes to a list of node IDs of matching nodes; that is, node IDs of those attribute
and text nodes that contain the value. In particular, evaluation of predicates is
much faster with a value index. For the example XPath query in Section 5.2.1, the
predicate [./location="Newtown"] can be evaluated by a lookup in the value
index resulting in the set of matching node IDs. To answer the example query
//hotel[./location="Newtown"]/@hotelID, starting from the resulting node
IDs, the XML document must only be traversed upwards to get the matching loca-
tion and hotel elements and lastly the corresponding hotelID attribute. Without
a value index the whole XML document would have to be traversed in order to
check all text nodes of location elements for a match.
Name indexes: A name index maps all element and attribute names to the cor-
responding node IDs. For example, finding all hotel element nodes anywhere in
the document (written as //hotel) can quickly be answered by a lookup of the
element name “hotel” in the name index.
Path indexes: A path index maps all unique paths in the XML document to node
IDs that match the given path. For example, a mapping to all location element
node IDs would be maintained for /reservationsystem/hotel/location. Sev-
eral extensions of path indexes have been considered: For one, a path index is
sometimes called summary index, because some of the paths might coincide and
can be summarized into a common representation. Instead of mapping paths to
node IDs, the path index might also introduce a dedicated path ID for each in-
dexed path; this path ID can then be used in combination with other indexes. For
a more efficient evaluation of some queries, an index can be built for reversed
paths. A reversed path does not contain the description of the path from the root
down to a node but instead in reversed order: starting from the node and then
concatenating all element names on the upward path up to the root node. For
example, queries containing a descendant-or-self axis (like //hotel/location)
can be answered quickly by looking up in the reversed path index all entries for
the reversed path starting with location/hotel/. More generally, a path index
may contain additional information for each path that helps answer queries solely
with the index – that is, without consulting the XML document. For example, for
a small range of short values, the index could directly store the content value of
each indexed path ending with an element or attribute node. Storing frequencies
of those values allows for answering count queries. If the range of values is too
large to store all of them, the index could just store the minimum and maximum
values in order to quickly determine that queries asking for values outside the
minimum-maximum range have no match.
Join indexes: In an XPath expression, several subexpressions might have to be
evaluated independently. In particular, a structural join is the evaluation of a
descendant-or-self axis (abbreviated as //) where an arbitrary amount of levels
can lie between the ancestor and the descendant. For example, the XPath ex-
pression //hotel//location searches for any location elements under any hotel
elements in the XML tree: the hotel element might be arbitrarily far away from the
root element and the location element might be arbitrarily far away from a hotel
element (but inside the subtree starting from a hotel element). A simple structural
join algorithm would then have to find all the hotel elements and all the location
elements (maybe by using a name index), and then try to find pairs of hotel and
location elements for which there is an ancestor-descendant path between them.
Several other types of joins might be computed during query evaluation; some in-
dex structures (for example, tree-based or stack-based) help evaluate those joins
more efficiently.
Type indexes: With an XML Schema definition, user-defined types can be speci-
fied; indexes may be built for each such type to be able to compare and find ele-
ment nodes of this type more efficiently.
Word-break indexes: A word-break index splits the whole XML document into
separate words according to whitespace boundaries (and other special characters
like <). This allows for substring (“contains”) queries of whole words.
Full-text indexes: A full-text index is an advanced value index that can han-
dle imprecise queries (like for example known from information retrieval). It also
indexes content values of attribute or text nodes. Each string value is split into
separate tokens (for example, words or n-grams). In addition, the full-text index
could provide other options like stemming and language-specific capabilities to-
gether with scoring of tokens and other advanced features. Full-text indexes are
extremely helpful to quickly evaluate XQuery Full Text (XQFT) queries.

Each such index type provides a sorted maintenance of the indexed values and is
usually implemented with specialized data structures (like variants of a so-called B-
tree) that allow for a fast search in the index. Whenever the XML document is modified,
the indexes have to be updated. Both sorting and updating of indexes are costly proce-
dures that must be implemented with care. For example, upon deletion of an element
node, the element’s node ID must be deleted from a name index and any text node of
this element must be deleted from a value index; in a path index, the element node
must be removed from each corresponding path entry and any additional information
(like frequencies of values) has to be modified accordingly.
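
The following in-memory Java sketch illustrates the basic shape of a value index and a
name index and how the entries of a deleted node are removed; class and method names
are illustrative only, and real systems use persistent B-tree-like structures instead of
hash maps.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SimpleXmlIndexes {
    // value index: content value of text/attribute nodes -> node IDs
    private final Map<String, Set<Integer>> valueIndex = new HashMap<>();
    // name index: element/attribute name -> node IDs
    private final Map<String, Set<Integer>> nameIndex = new HashMap<>();

    void indexName(String name, int nodeId) {
        nameIndex.computeIfAbsent(name, k -> new TreeSet<>()).add(nodeId);
    }
    void indexValue(String value, int nodeId) {
        valueIndex.computeIfAbsent(value, k -> new TreeSet<>()).add(nodeId);
    }
    Set<Integer> lookupByName(String name) {
        return nameIndex.getOrDefault(name, Set.of());
    }
    Set<Integer> lookupByValue(String value) {
        return valueIndex.getOrDefault(value, Set.of());
    }
    // on deletion of a node, its entries must be removed from all indexes
    void removeNode(int nodeId) {
        nameIndex.values().forEach(ids -> ids.remove(nodeId));
        valueIndex.values().forEach(ids -> ids.remove(nodeId));
    }
}
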
While indexes greatly improve query processing, they come at the cost of addi-
tional management and storage requirements. That is, important points to consider
are, for instance, update cost and index size. Which type of index is useful for a cer-
tain XML document highly depends on the typical query pattern used – for example,
if it is mostly navigational access or full text search.

5.4.2 Storage Management

Similar to relational databases, XML databases have to manage secondary disk space
for persistent storage of XML documents. XML documents have to be stored in memory
pages (see Section 1.2), because memory pages are the storage unit of secondary disk
storage. Hence, as soon as an XML document is larger than a single memory page, the
parts of the document have to be distributed among different pages. Loading (parts
of) the XML document into main memory and storing pages back to disk is done page-

Fig. 5.6. Chained memory pages (the tree of Figure 5.3 stored in document order: page p1 holds nodes 0 and 1, p2 holds nodes 2 to 4, p3 holds nodes 5 and 6, p4 holds nodes 7 and 8; each page links to the next one)

wise and must be efficiently handled. Some techniques for an improved page handling
are the following:
Ordered storage: The XML document is stored into a page where the nodes follow
each other in document order; that is, starting from the root node, the document
is consecutively stored in memory up until the last leaf node while adding to each
node a node ID according to some numbering scheme. An example can be seen
in Figure 5.6: The XML tree from Figure 5.3 is stored in document order. That is,
each node ID is stored followed by a description of the node (the element or at-
tribute name and/or its value); all node IDs occur in increasing order. When one
page is full, a new page is allocated and a link is added from the full page to the
new page. In other words, the pages for one document form a linked list; they are
called chained pages. By following these links, a forward traversal of the entire
document can be done. In the example, page p1 has a link to the next page p2 and
so on until the last page p4 is reached. When pages are chained in both directions
(one link to the following and one link to the preceding page), both backward and
forward traversal are possible.
Clustered storage: The clustered storage groups (that is, “clusters”) elements ac-
cording to their position in the XML document; each cluster is then stored in a
page. This is advantageous when the elements inside a cluster are often processed
together. The exact document order can then be reconstructed by the stored node
IDs. For example, a cluster can be built for those elements that are 1) in the same
level of the XML tree and 2) have the same element name – this approach was im-
plemented in the Sedna XML DBMS [TSK+10]. Queries that have to retrieve nodes
of the same name can then be processed quickly because only the page with the
appropriate cluster has to be loaded. Chained pages can be used for clusters larger
than one memory page.
Subtree extraction: An alternative storage organization is to clip out subtrees of
the XML tree and store each such subtree in a separate page. The root node and
the remainder of the XML document (the elements that are not part of an extracted
subtree) are stored together in one page; links to the appropriate subtree pages are

Fig. 5.7. Chained memory pages with text extraction (as in Figure 5.6, but the text value Palace Hotel of node 4 is extracted into an extra page p5 and only referenced from page p2)

stored there, too. The advantage is that the entire subtree can be processed with-
out loading any other pages, because all subelements are already loaded together
with root of the subtree.
Text extraction: Variable-length text values can appear in XML documents, in
particular, in text nodes and attribute values. To achieve an efficient parsing of
the XML document, it makes sense to store the structural information for an XML
document (tree structure of elements and attributes) separate from the content
(text node values and attribute values). In particular, large text values can be ex-
tracted into a new memory page and only a link to this value is stored in the corre-
sponding position (attribute node or text node) of the document. In the example
in Figure 5.7, the value of the text node with node ID 4 is extracted into page p5;
only a pointer to this page (depicted by the @-sign) is stored together with node ID
4 in the referencing page p2. The extracted text might itself span multiple pages
that can be chained together.
Dictionary: Due to self-descriptiveness, XML documents have often-repeated text
parts – like element names that occur multiple times in the same document. A
dictionary can be used to store XML documents in a more compact format: each
repetitive text part is replaced by a short unique internal placeholder; the dictio-
nary maps each placeholder to the replaced text part and must be stored together
with the document. When processing the document, one level of indirection (one
look-up in the dictionary) is necessary for re-inserting the replaced text parts in
place of the placeholder.
B-tree: An XML document that is stored in ordered storage (with elements sorted
according to the document) can be accessed more efficiently by using a form of a
B-tree. A B-tree is a balanced tree – which is also widely used as an efficient index
structure; in general, a B-tree efficiently locates a single element in a sorted set
of elements. Hence, a B-tree can sort the node IDs of the first node in each page;
for a given node ID the page in which the node is stored can be located quickly by
following the correct path in the B-tree. Coming back to the example, we see in
Figure 5.6 that the first node ID in page p1 is node ID 0, in page p2 the first node
ID is 2, in page p3 the first node ID is 5, and in page p4 the first node ID is 7. The

Fig. 5.8. B-tree structure for node IDs in pages (root node with key 5; child nodes with keys 2 and 7; leaf entries pointing to pages p1 to p4)

corresponding B-tree that sorts these IDs is shown in Figure 5.8. The root node
contains the median ID 5. By going to its left child, we find the pages for node IDs
between 0 and 4 (that is, less than 5); by going to its right child, we find the pages
for node IDs between 5 and 7 (that is, greater than or equal to 5). In the next level
of the B-tree we can quickly determine that nodes with IDs 0 and 1 are stored in
page p1, nodes with IDs 2 to 4 are stored in page p2, nodes with IDs 5 and 6 are
stored in page p3, and lastly nodes with IDs greater or equal to 7 are stored in page
p4.
Indirection table vs direct referencing: Usually, references to nodes are done
by their node ID. For example, determining a parent, a child or a sibling of a node
requires calculation of their node IDs based on the chosen numbering scheme
and then determining their memory address (for example by using an indirection
table mapping node IDs to memory addresses); this is called indirect ID-based ref-
erencing. A quicker traversal which requires no lookup in an indirection table can
be achieved by storing inside a node description not only the element name or
value of this node but also a set of memory addresses (that is, direct pointers) to
parent, child nodes and sibling, respectively. Although direct pointers allow for a
much quicker traversal of the XML document, this comes at the cost of more intri-
cate maintenance of these direct pointers: in case the memory address changes,
all pointers have to be adjusted in all referencing nodes. In particular, this is crit-
ical when links to parent nodes are stored as direct pointers: when the memory
address of the parent nodes changes, the pointers inside all children nodes have
to be modified accordingly.

When it comes to modifying an XML document, the storage organization in the mem-
ory pages is affected. When a new node is inserted, it may often happen that the new
node does not fit into the page where it belongs into. In this case, a page split has to
be carried out as depicted in Figure 5.9: while some part of the old data remains in the
original page pi, a new page pj is allocated to store the remainder of the old data; the
new data can then be inserted at the correct position. When pages are chained, a link
from the original page pi to the new page pj has to be added, and the link to the page
succeeding the original page (say, pi+1) has to be moved to the new page pj.
Replacing a text node value by a longer value may also cause a page split: because
the new text is longer, the following page content has to be moved and possibly does

Fig. 5.9. Page split due to node insertion (page pi originally holds data1 to data4; after the split, pi holds data1, the new data and data2, while the new page pj holds data3 and data4)

not fit on the page any longer. In contrast, a deletion of a node causes a fragmentation
of a block: free space remains where the description of the deleted node is removed.
A reorganization of pages might be advantageous from time to time to defragment
the memory pages. Ideally, memory pages should be filled to nearly 100% even under
frequent modifications.
Renumbering of nodes (due to insertion or deletion of nodes) usually affects all
other referencing nodes when node referencing is implemented by an indirection table
– as well as any metadata that use node IDs, like the B-tree for efficient page localization.
That is why usually native storage relies on another level of indirection: nodes have
an internal identifier which is different from the node ID assigned by some number-
ing scheme, because the node ID might change due to renumbering after inserting or
deleting nodes from the document tree. The node IDs (assigned by some numbering
scheme) are thus used only externally for a higher-level view of the XML document;
for referencing a node on the memory-page-level, however, an internal ID is used for
each node. The internal ID is assigned to a node uniquely; it is immutable and hence
will never change during the existence of the node (even though the external node ID
may change due to renumbering) and it will not be reused by any other node. This con-
cept is called identifier stability. Page management, node referencing as well as all
indexes can use the internal IDs and are not affected by renumbering issues. To link
the external node ID and the internal ID, a mapping between external and internal ID
must be maintained in a second indirection table; that is, all accesses on the docu-
ment which use an external node ID must be accompanied by looking up the internal
ID in the indirection table before further processing the access. On average it might be
better to accept this additional table lookup for the sake of a more efficient handling
of renumbering.
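
A hedged sketch of this double indirection is shown below: external node IDs (assigned
by the numbering scheme) map to immutable internal IDs, which in turn map to memory
locations (here simply page numbers); all names are illustrative.

import java.util.HashMap;
import java.util.Map;

class NodeIdIndirection {
    private final Map<String, Long> externalToInternal = new HashMap<>();
    private final Map<Long, Integer> internalToPage = new HashMap<>();
    private long nextInternalId = 0;

    long register(String externalId, int page) {
        long internal = nextInternalId++; // internal IDs are stable and never reused
        externalToInternal.put(externalId, internal);
        internalToPage.put(internal, page);
        return internal;
    }
    // renumbering only touches the external-to-internal mapping;
    // pages and indexes that use internal IDs remain unaffected
    void renumber(String oldExternalId, String newExternalId) {
        externalToInternal.put(newExternalId, externalToInternal.remove(oldExternalId));
    }
    // every access by external node ID needs one additional lookup
    Integer pageOf(String externalId) {
        return internalToPage.get(externalToInternal.get(externalId));
    }
}
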
Another issue is storage efficiency of node IDs: care must be taken that node IDs
do not get too long, because this would increase storage requirements and would com-
plicate node ID management. In general, it is easier to handle fixed-length node IDs than
node IDs with varying length; when encoding variably-sized node IDs, the length of
each node ID has to be stored in each memory page, too. The problem that a node ID re-
quires more storage space than actually planned is termed identifier overflow. An ID
overflow is in particular a problem of prefix-based numbering schemes like DeweyIDs
because node IDs encode level information and hence are longer for nodes at deeper lev-
els. This problem is exacerbated in OrdPath (see Section 5.1.6): OrdPath uses even
numbers as markers for an insertion of a node in between two odd numbers which
accounts for why OrdPath IDs become exceedingly lengthy after a sequence of node
insertions.

5.4.3 XML Concurrency Control

Concurrency of transactions on an XML document is decisive for an efficient and re-
sponsive management of large XML documents. Isolation is a crucial property that
has to be ensured when an XML document is accessed by concurrent transactions:
the different transactions should not interfere with each other. Concurrent accesses
by different users (or transactions) on a single XML document should be supported
by XML databases in a way similar to how concurrent accesses on relational tables
are supported by RDBMSs (see Section 2.5). Concurrency is no problem when the XML
document is never modified, that is, when there are only read accesses (that is, queries)
to the document. As soon as modifying accesses (that is, updates) are allowed, concur-
rency control is needed. Updates can consist of editing, inserting, moving or deleting
nodes in the XML tree. Editing a node means changing the name of an element or an
attribute node or changing the content value of a text or attribute node. This kind of
editing affects other users working concurrently on the same node: when one user ed-
its a node which another user is reading, the edited node might not match the second
user’s query anymore; when two users edit at basically the same time, it is not obvious
which of the two modifications should be persisted. Inserting a node can happen any-
where in the tree. The easiest case is adding a leaf node: adding a node in between two
nodes as a new sibling; but even in this simple case, the insertion might be in conflict
with a read access. See Figure 5.10 for an example: when a read access iterates over all
child nodes of node x while concurrently a new child node y is inserted, the inserted
child y is not covered by the iteration.
More generally, insertions of internal nodes (that is, non-leaf nodes) can be done
along any path in the tree; this case is more complicated as it would require renum-
bering of several affected nodes. Even a new root node can be added which would re-
quire a renumbering of the whole XML tree; from a concurrency control point of view,
all other users working concurrently on the XML document would be affected by such
an insertion. Moving nodes happens when a subtree is moved to a new position in the
XML tree; the subtree itself is, however, not modified. Deleting nodes is also an intricate
problem: when one user deletes a node, another user might currently be reading the
deleted node; see Figure 5.10 for an illustration where a query reads the child nodes
of x while concurrently this node is deleted. Similar to node insertions, node dele-
tions can also lead to renumbering of parts of the tree and hence affect other users
working concurrently on the tree. Particularly important for XML concurrency control

Fig. 5.10. Conflicting accesses in an XML tree (a: conflict between insertion and query; b: conflict between deletion and query)

is support for multi-lingual accesses on the same XML document. That is, if an XML
database supports different access methods, concurrent execution should be possi-
ble independent of the access method used – for example, one XQuery execution in
parallel with a DOM-based access.
As can be seen from these examples, concurrency control is more intricate for tree-
shaped XML documents than for the flat relational data model. Concurrency control
on XML documents can be optimistic or pessimistic as for relational database systems
(see Section 2.5). One form of pessimistic concurrency control is lock-based concur-
rency control. As XML documents correspond to a tree-like structure, XML locks are
substantially different from the read-only locks and read-write locks on data items
known from relational concurrency control (as for example the 2PL protocol). Differ-
ent locking protocols for XML have been proposed; these can further be divided into
node-locking and path-index-locking protocols.
Node-locking protocols put locks on single nodes or subsets of nodes in the XML
tree.
Path-index-locking protocols put locks on path indexes that are maintained to
speed up query evaluation.

Let us have a closer look at node-locking protocols; they might implement some of the
following kinds of locks.
Node locks: Node locks are locks on individual nodes in the tree. Node locks can
have different subtypes depending on whether the content of the node is accessed
(read, modified, inserted, deleted) or whether merely a node test for existence (for
example existence of an element with a certain name) is executed. It enhances
concurrency to distinguish such access modes, that is, have different locks de-
pending on whether the node content is accessed, or whether merely its existence
is checked with a node test. A node lock is a lock of finest granularity; that is, there
is no smaller unit on which a lock can be put.

Fig. 5.11. Locks in an XML tree (a node lock on a single node, a child lock covering a node and its direct children, and a subtree lock covering a whole subtree)

Subtree locks: To lock all descendants of a node, subtree locks can be used. That
is, entire subtrees can be locked by putting a lock on the root of the subtree. This
avoids the need to put a lock on every single node in the subtree. Hence, a subtree
lock provides coarser granularity than a node lock.
Child locks: A child lock locks a node together with all its direct child nodes. This
is helpful when a node is accessed followed by an iteration over all direct children
of the node.
Edge locks: Edge locks are locks for navigation in the tree along the axis like an-
cestor, descendant, following, or preceding.

These locks can be shared (to allow for concurrent reads) or exclusive (to allow for
modifications without interfering with other accesses). When using a locking-based
protocol, a compatibility matrix then shows which locks are compatible with other
locks – and the other way round: which lock combinations are prohibited and hence
which accesses have to be postponed until conflicting locks are released. Intention
locks are a further special kind of locks: when an exclusive lock is put on a node, all
ancestor nodes of this node receive an intention exclusive lock to denote the fact that
a modification will take place in a subtree; this is necessary to avoid conflicting locks
with coarser granularity than the one used for the exclusive lock. In a similar way,
intention shared locks can be defined.
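
As an illustration, the following Java sketch encodes the classical multi-granularity
compatibility matrix for shared (S), exclusive (X), intention shared (IS) and intention
exclusive (IX) locks; concrete XML locking protocols define their own lock modes and
compatibilities, so this matrix is only an example.

enum LockMode { IS, IX, S, X }

class LockCompatibility {
    // rows are the held lock, columns the requested lock, in the order IS, IX, S, X
    private static final boolean[][] COMPATIBLE = {
        { true,  true,  true,  false }, // IS
        { true,  true,  false, false }, // IX
        { true,  false, true,  false }, // S
        { false, false, false, false }  // X
    };

    static boolean compatible(LockMode held, LockMode requested) {
        return COMPATIBLE[held.ordinal()][requested.ordinal()];
    }

    public static void main(String[] args) {
        // a shared lock is compatible with an intention shared lock ...
        System.out.println(compatible(LockMode.S, LockMode.IS)); // prints true
        // ... but not with an exclusive lock
        System.out.println(compatible(LockMode.S, LockMode.X));  // prints false
    }
}
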
Moreover, it is important to handle locks at a fine granularity: the fewer nodes are in
the scope of a lock, the more concurrent transactions are possible. The process of lock
escalation allows for a flexible adaptation of the granularity: for example, if many
single nodes inside a subtree are already locked by one transaction, it is usually more
efficient to obtain a single lock for the whole subtree (instead of managing all the indi-
vidual node locks). And similarly it might be better to lock the whole document when
several subtrees inside the document are already locked by the same transaction.

5.5 Implementations and Systems

Several tools for editing and processing XML data are available. From the database
perspective two main open source systems are eXistDB and BaseX.

5.5.1 eXistDB

The eXist database is a Java-based open source XML database.

Web resources:
– eXistDB: http://exist-db.org/
– documentation page: http://exist-db.org/exist/apps/doc/
– GitHub repository: https://github.com/eXist-db/exist

Internally it uses a tree representation of the XML document with a numbering scheme
that virtually expands the tree into a complete tree such that not all node IDs corre-
spond to existing nodes; in this way, there is room for additions in the tree using the
virtual node IDs. A system lookup table maps node IDs to physical addresses. eXistDB
backs its search functionality with several index structures; the following indexes
are supported:
– structural index: it maps element and attribute names (more precisely their in-
ternal qualified names) inside a collection to document IDs and node IDs. This
index is used whenever an element name or attribute name is looked up in an
XPath query.
– full-text index: the full-text index uses Apache Lucene that by default splits text
around whitespaces as well as tags; it can be defined on an individual element or
attribute name or on an entire path.
– n-gram index: it indexes n-grams of text data: data strings are split into overlap-
ping sequences of n (by default three) characters. This index enables efficient sub-
string searches.
– spatial index: an experimental spatial index enables searches in geometries de-
scribed in the Geography Markup Language (GML).
– range index: it indexes text nodes and attributes based on the data type of the
stored value; this makes comparison operations more efficient. The range index
is also based on Apache Lucene.

Settings for indexing are maintained in configuration files. Recently, eXistDB has
added support for arrays (and in particular nested arrays) in response to the XQuery
3.1 Candidate Recommendation. In particular, JSON documents can be represented as
a mixture of maps (containing key-value pairs) and nested arrays; for example:

let $persons := [
map {
"firstName": "Alice",
"lastName" : "Smith",
"age": 31,
"address": [
map {
"street": "Main Street",
"number": 12,
"city": "Newtown",
"zip": 31141
}
]
}
]

eXistDB offers several user APIs:


– RESTful API: XQuery and XPath queries are specified using a GET request with
optional parameters; for example, for finding the last names of all persons with
first name Alice:

http://localhost:8080/exist/db/personcollection?
_query=//lastname[firstname=%22Alice%22]

Stored XQueries can be called in a GET or PUT request; the queries will then be
executed on the server side.
– XML:DB API: Java applications can use the XML:DB API to interact with eXist.
First of all, a database driver has to be used to instantiate a new database object.
This database object has to be registered with the database manager class, which
in turn can be used to open a collection. Afterwards, XML documents can be re-
trieved as resources from the database:

Collection col = null;
XMLResource res = null;
try {
    Class driver = Class.forName("org.exist.xmldb.DatabaseImpl");
    Database existdb = (Database) driver.newInstance();
    DatabaseManager.registerDatabase(existdb);
    col = DatabaseManager.getCollection("xmldb:"
        + "exist://localhost:8080/exist/db/personcollection");
    res = (XMLResource) col.getResource("persons");
    System.out.println(res.getContent());
}
catch (Exception e) {
    e.printStackTrace();
}
finally {
    if (res != null) {
        try { ((EXistResource) res).freeResources(); }
        catch (XMLDBException xe) {}
    }
    if (col != null) {
        try { col.close(); }
        catch (XMLDBException xe) {}
    }
}

– XML-RPC API: The XML Remote Procedure Call API can be used with an object of
type XmlRpcClient that can execute methods on the database server.
– SOAP API: eXistDB uses the Apache Axis SOAP toolkit in a servlet to offer the SOAP
interface.

5.5.2 BaseX

BaseX is a native Java-based XML database that stores XML documents in a tabular
representation. BaseX supports full-text search with additional scoring functions and
thesauri as well as fuzzy text search based on the Levenshtein distance between words.

Web resources:
– BaseX: http://basex.org/
– documentation page: http://docs.basex.org/
– GitHub repository: https://github.com/BaseXdb

BaseX implements the Pre/Dist/Size encoding of each node [KGGS15] to facilitate nav-
igation in the XML tree: the Pre value is the preorder number of the node, the Dist
value is the difference between the Pre value of the node and that of its parent node,
and the Size value is the number of descendant nodes plus the node itself. This means
that the preorder number of the parent node can be derived from the preorder number of the
context node and its Dist value:

pre(parent) = pre(self) − dist(self)

The preorder numbers of all descendants of the self node lie between pre(self) + 1
and pre(self) + size(self) − 1.
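For illustration, the following minimal Java sketch (not BaseX code; the dist and size arrays are assumed to hold the Dist and Size values indexed by preorder number) derives parents and descendants from this encoding:

// pre(parent) = pre(self) - dist(self)
static int parent(int pre, int[] dist) {
    return pre - dist[pre];
}

// descendants of a node occupy the preorder range [pre + 1, pre + size(pre) - 1]
static boolean isDescendant(int pre, int candidate, int[] size) {
    return candidate > pre && candidate <= pre + size[pre] - 1;
}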
Several serialization methods are offered including methods to serialize XML doc-
uments into CSV or JSON format as well as to import these formats. Regarding the
handling of JSON documents, BaseX offers several transformation options:
– direct conversion: The direct conversion results in a lossless transformation of
JSON documents into XML and back. It creates a <json> element as the root. For
each key part (of a JSON object) a new element is created. Attributes called type
are added to specify that an element corresponds to a certain JSON type (string,
number, boolean, null, object or array). Arrays have an empty key that is repre-
sented by an element with the name _ (underscore).
– attributes conversion: The attributes conversion is another lossless transforma-
tion. It creates a <json> element as the root. Objects (key-value pairs) are con-
verted into a <pair> element in which the key part is stored as an attribute called
name. Array entries are represented as several <item> elements. The value parts
are all stored as separate text nodes again with an attribute type specifying the
JSON type.
– other conversion types are map conversion (transforms a JSON document into an
XQuery map), basic and JsonML.

Several language bindings as well as a REST API, an XQJ API and a XML:DB API (only
in standalone mode) are offered. For example, the BaseXClient class is offered in the
Java API with which a session can be created and queries can be sent to the database.
The Geo Module can handle data according to the Open Geospatial Consortium (OGC)
Simple Feature (SF) data model.
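For illustration, a session can be opened with the BaseXClient example class that is distributed with BaseX (the host, port and credentials below are assumptions for a default local installation):

BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
try {
    // run an XQuery expression via the XQUERY command and print its result
    System.out.println(session.execute("XQUERY //person/lastname/text()"));
} finally {
    session.close();
}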
Indexes available in BaseX are:
– name index: containing all the names of elements and attributes.
– path index: containing all distinct paths in the XML documents which is useful
when optimizing queries.
– resource/document index: referencing the Pre values of all document nodes,
hence facilitating access to specific documents.
– text index: containing all text nodes and hence used in evaluating equality tests
or range queries on text values.
– attribute index: containing all attribute texts and hence speeding up equality tests
or range queries on attribute values.
– full-text index: containing all text values and used during the evaluation of
contains text queries.

5.6 Bibliographic Notes

XML is an established standard of the World Wide Web Consortium [Gro08]. Due to its
long-standing nature, most relational database systems offer the XML data type and
support at least one XML query language. Mappings of XML data to relational tables
(or from XML schemas to relational schemas) have been proposed and analyzed in
several approaches; some of them can be found in [STZ+99, LC01, TVB+02, ACL+07].
Several native XML databases are available; yet they differ in the offered features
like support for XML query languages, support for XML updates, or indexing tech-
niques.
Numbering schemes and the renumbering effects of updates have been formal-
ized in a variety of prior work for example in [FM12] or using the pre/post numbering in
[Gru02]. The Ordpath numbering was introduced in [OOP+04]. A recent overview and
an order-centric comparison is given in [XLW12]. Notably, there are only a few number-
ing schemes that avoid renumbering of nodes in case of updates: the work of Li, Ling
and Hu [LLH08] proposes a certain binary encoding for node IDs together with lex-
icographic ordering; whereas O’Connor and Roantree [OR13] introduce a numbering
scheme based on the Fibonacci numbers.
Different XML concurrency control approaches including locking have been stud-
ied and surveyed for example in [JC06, PK07, BHH09, SLJ12]. Query optimization for
XML queries is another well-studied topic. In particular, different indexing mecha-
nisms can be used to speed up XML processing; see for example [WH10].
6 Key-value Stores and Document Databases
This chapter covers key-value stores and document databases. Key-value stores are
specialized for the efficient storage of simple key-value pairs. Parallel processing
of key-value pairs has been popularized with the Map-Reduce paradigm. The JavaScript
Object Notation (JSON) is a textual format of nested key-value pairs. Document
databases use JSON as their main data format. Last but not least, JSON is often used
as a format for payload data in REST-based APIs.

6.1 Key-Value Storage

A key-value pair is a tuple of two strings ⟨key, value⟩. A key-value store stores such key-
value pairs. The key is the identifier and has to be unique. You can retrieve a value from
the store by simply specifying the key; and you can delete a key-value pair by speci-
fying the key. A key-value store is the prototype of a schemaless database system:
you can put arbitrary key-value pairs into the store and no restrictions are enforced
on the format or structure of the value; that is, the value string is never interpreted or
modified by the key-value store. Hence, a key-value store basically only offers three
operations: writing (“putting”) a key-value pair into the store, reading (“getting”) a
value from the store for a given key, and deleting a key-value pair for a given key.

store.put(key, value)
value = store.get(key)
store.delete(key)
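For illustration, these three operations can be mimicked by a minimal in-memory Java class (an illustrative sketch only; a real key-value store adds persistence, distribution and replication):

import java.util.concurrent.ConcurrentHashMap;

public class SimpleKeyValueStore {
    // the value part is stored as an uninterpreted string
    private final ConcurrentHashMap<String, String> data = new ConcurrentHashMap<>();

    public void put(String key, String value) { data.put(key, value); } // write a pair
    public String get(String key) { return data.get(key); }             // read a value by key
    public void delete(String key) { data.remove(key); }                // delete a pair by key
}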

With this simple interface, values cannot be searched and there is no advanced query
language; if a combination or aggregation of several key-value pairs is needed, the
accessing application is responsible for combining the corresponding key-value pairs
into more complex objects.

The characteristic feature of a key-value store is that it is “simple but quick”: Data are stored in a
simple key-value structure and the key-value store is ignorant of the content of the value part.

An advantage of this simple format is that data can easily be distributed among several
database servers; hence, key-value stores are good for “data-intensive” applications.
Typical applications for key-value stores are session management (where the session
ID is the unique key) or shopping carts (where the user ID is the key). A generic ap-
plication for the key-value pair data format is the map-reduce framework discussed in
the next section.

In practice, in most key-value stores, values are allowed to have other data types than
just strings. For example, a value can be a collection like a list or an array of atomic
values; some stores also offer advanced search and indexing features. Some key-value stores
also support data formats like XML or JSON – and hence are very close to document
databases (see Section 6.2).

6.1.1 Map-Reduce

Under the name “map-reduce”, a framework has become widely known that greatly sim-
plifies the distributed processing of data in key-value format.

The basic elements of map-reduce are four functions that operate on key-value pairs (or on key-value
pairs where the value is actually a list of values): split, map, shuffle and reduce.

While split and shuffle are more or less generic functions that can have the same
implementation for all applications, the other two – map and reduce – are highly
application-dependent and have to be implemented by the user of the map-reduce
framework. Map and reduce are executed by several worker processes running on sev-
eral servers; one of the workers is the master who assigns new map or reduce tasks to
idle workers. Roughly, the four basic steps proceed as follows:
1. split input key-value pairs into disjoint subsets and assign each subset to a
worker process;
2. let workers compute the map function on each of its input splits that outputs in-
termediate key-value pairs;
3. group all intermediate values by key and assign (that is, shuffle) each group to a
worker;
4. reduce values of each group (usually one key-value pair for each group) and re-
turn the result.

A typical illustrative example for a map-reduce application is counting occurrences of
words in a document (see Figure 6.1). The input is a document consisting of several
sentences; counting the words consists of four steps:
1. the split function splits the document into sentences; each sentence is assigned
to a worker process;
2. the worker thread starts a map function for each sentence; it parses a sentence
and for each word wordi , the worker thread emits a key-value pair (wordi , 1) that
denotes that the worker has encountered wordi once; these intermediate results
are stored locally on the worker’s machine;
3. during the shuffle phase, local intermediate results are read and grouped by
words; the 1-values for each word are concatenated into a list: that is, for each
word there is a key-value pair where the word wordi is the key and the value is

Fig. 6.1. A map-reduce example (word count: split assigns sentences to workers, map emits (word, 1) pairs, shuffle groups the pairs per word, and reduce sums them per word)

a list of 1s corresponding to individual occurrences of the word in all sentences:
(wordi, (1, ..., 1)); then, each word is assigned to a worker process;
4. the worker thread starts a reduce function for each word that calculates the total
number of occurrences by summing the 1s; as the final result it outputs the key-
value pair (wordi, sumi).

More formally, the signatures for the four functions can be defined as follows:
– split: input → list(key1, value1); that is, split maps some input text to a list of key-
value pairs – for example, to a list where the sentence number is the key and the
sentence content is the value.
– map: (key1, value1) → list(key2, value2); that is, map processes one key-value pair
and maps it to a list of key-value pairs; the new key key2 (for example, the word
wordi) usually differs from the old key key1 (for example, the sentence number).
– shuffle: list(key2, value2) → (key2, list(value2)); that is, shuffle groups the individ-
ual key-value pairs by key and appends to each key a list that is a concatenation
of the values of the individual pairs.
– reduce: (key2, list(value2)) → (key3, value3); that is, reduce aggregates a list of val-
ues into a single one; the keys key2 and key3 can be identical (as for example,
wordi) and value3 is calculated from the list of values list(value2) (in our example,
by summation).
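To make this data flow concrete, the following minimal single-process Java sketch (illustrative only; not a distributed implementation, and all class and method names are made up for this example) runs the four phases on the word-count example:

import java.util.*;
import java.util.stream.Collectors;

public class WordCountSketch {

    // split: input text -> list of (sentence number, sentence) pairs
    static List<Map.Entry<Integer, String>> split(String input) {
        List<Map.Entry<Integer, String>> pairs = new ArrayList<>();
        String[] sentences = input.split("\\.");
        for (int i = 0; i < sentences.length; i++) {
            pairs.add(Map.entry(i + 1, sentences[i].trim()));
        }
        return pairs;
    }

    // map: (sentence number, sentence) -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(Map.Entry<Integer, String> input) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : input.getValue().split("\\s+")) {
            if (!word.isEmpty()) { out.add(Map.entry(word, 1)); }
        }
        return out;
    }

    // shuffle: group all intermediate (word, 1) pairs by word
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // reduce: (word, list of 1s) -> (word, total count)
    static Map.Entry<String, Integer> reduce(String word, List<Integer> ones) {
        return Map.entry(word, ones.stream().mapToInt(Integer::intValue).sum());
    }

    public static void main(String[] args) {
        String document = "data management is a key task. data stores are at the core of data management";
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (Map.Entry<Integer, String> sentence : split(document)) {
            intermediate.addAll(map(sentence));       // in a cluster, executed by parallel map workers
        }
        shuffle(intermediate).forEach((word, ones) ->
            System.out.println(reduce(word, ones)));  // in a cluster, one reduce task per word
    }
}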

Lastly, we highlight some features and optimizations of the basic map-reduce setting:
Parallelization: What makes map-reduce a good fit for processing large data sets
is that the map as well as the reduce task can be run in parallel by different concur-
rent worker processes and even on multiple servers. This can be done as long as
all map and reduce tasks are totally independent of one another; in our example,
a map process can be executed on a sentence without requiring any input from
any other map process. It is crucial to have a master process to coordinate the par-
allelization. The master keeps track of worker processes and it assigns map and
reduce processes to idle workers. The master must also handle failures of worker
processes: the master checks on a regular basis if a worker process is still alive;
if a worker does not respond to this check, the master has to assign the process
running on the failed worker to another one. If the failed process was a map task,
the master must notify reduce tasks that want to read the output from the failed
worker of the new location of the map output.
Partitioning: Usually, there are more reduce tasks to be executed than workers
available. That is, each worker has to execute several reduce tasks on a set of differ-
ent keys key2 . In our example, several different words will be mapped to a worker
to execute the reduce task for each word. The subset of keys assigned to the same
worker is called a partition. A good default partitioning can be achieved by using
a hash function on the keys (which results in a number for each key) and then
using the modulo function to obtain the number of a worker. In other words, if
we have R workers that can accept reduce tasks, then for each key we can com-
pute hash(key) mod R: the hash function maps the key to a number and mod
splits the key space into R partitions called buckets. The user can influence the
partitioning by specifying a customized partitioning function.
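A minimal Java sketch of this default assignment (illustrative only; the method name is made up):

// assign an intermediate key to one of R workers that accept reduce tasks
static int partition(String key, int numReduceWorkers) {
    // Math.floorMod avoids negative bucket numbers for negative hash codes
    return Math.floorMod(key.hashCode(), numReduceWorkers);
}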
Combination: Instead of locally storing lots of intermediate results of the map
processes which later on have to be shuffled to other workers over the network, an
additional combine task can be run locally on each worker after the map phase.
This combine function is similar to the reduce function as it groups the interme-
diate results by key and combines their values. Hence, we have fewer intermedi-
ate results that have to be shuffled. In our example, the combine task can even
be identical to the reduce task and results in intermediate word counts for each
worker (see Figure 6.2).
Data Locality: Transmitting data to a worker over the network is costly. To avoid
excessive transmissions, the master can take the location of data into account
before assigning a task to a worker. For example, if a server already has copies of
some sentences, map tasks for these sentences should be assigned to a worker on
the same server.
Incremental Map-Reduce: Input data might be generated dynamically over a
longer period of time. To improve evaluation of such data, the four steps can be
interleaved and final results be obtained incrementally. That is, map tasks can be
started before the entire input data have been read and reduce tasks can be started

Fig. 6.2. A map-reduce-combine example (as in Figure 6.1, but each worker locally combines its (word, 1) pairs into partial counts before the shuffle phase)

before all map tasks have finished. The reduce output data might then be used as
input for other shuffle and reduce tasks until the final result has been computed.
Incremental map-reduce is more involved than the basic case and might even be
impossible to implement in some application scenarios. In our word count ex-
ample, however, incremental map-reduce can indeed be used because taking the
sum over the word occurrences is a simple, non-decreasing function.

6.2 Document Databases

Document databases store data in a semi-structured and nested text format like XML
documents or JSON documents (the definition of JSON is the topic of Section 6.2.1).
Each such document is usually identified by a unique identifier. Hence, document
databases are related to key-value stores in that they store data under a unique key.
However, in contrast to key-value stores, the value portion is not seen as an arbitrary
string but instead it is treated as a document structured according to the text format
chosen. In particular, a document can be nested: for example, an XML element can
contain other XML elements inside; similarly for JSON, a key-value pair may be the
value of another key-value pair.

6.2.1 JavaScript Object Notation

The JavaScript Object Notation (JSON, [ECM13]) is a human-readable text format for
data structures and was standardized by Ecma International (European association
for standardizing information and communication systems). It has its origins in the
JavaScript language.

Web resources:
– JavaScript Object Notation: http://www.json.org/

As already mentioned, a JSON document is basically a nesting of key-value pairs. In
JSON, key and value are separated by a colon ‘:’. JSON uses curly braces ({ and }) to
structure the document. Any data enclosed in curly braces is referred to as a JSON
object; inside the braces, a JSON object contains a set of key-value pairs separated by
commas. In the JSON format, while the key portion is always a string, a value can be
one of the following basic types:
– Number (including signed and floating point numbers)
– Unicode String
– Boolean (true or false)
– Array (an ordered list of values using square brackets)
– Object (an unordered set of key-value pairs using curly braces)
– null

A simple example for a JSON description of a Person object is the following:

{
"firstName": "Alice",
"lastName" : "Smith",
"age" : 31
}

Because the value of a key can itself be an object (that is, a set of key-value pairs), JSON
objects can be embedded in another object. For example, the address of a person can
be embedded in the Person object:

{
"firstName": "Alice",
"lastName" : "Smith",
"age" : 31,
"address" :
{
"street" : "Main Street",


"number" : 12,
"city" : "Newtown",
"zip" : 31141
}
}

When adding an ordered list of telephone numbers, we can use an array:

{
"firstName": "Alice",
"lastName" : "Smith",
"age" : 31,
"address" :
{
"street" : "Main Street",
"number" : 12,
"city" : "Newtown",
"zip" : 31141
} ,
"telephone": [935279,908077,278784]
}

The JSON format itself does not include syntax elements to specify references from one
JSON document to another one (like foreign keys in the relational case); nor references
inside the same JSON document (like ID attributes in XML documents). Some docu-
ment databases (and JSON processing tools) support ID-based referencing: a JSON ob-
ject can be given an explicit ID key which can then be referenced by a specific reference
key inside another object. Referential integrity (that is, ensuring that such references
always point to existing objects) can then be checked, too. We could, for example, add
an "id" key for our person object and set a unique value for it; this ID can then be
referenced by other objects.

{
"id" : "person2039457849",
"firstName": "Alice",
"lastName" : "Smith",
"age" : 31,
"address" :
{
"street" : "Main Street",
"number" : 12,
"city" : "Newtown",
"zip" : 31141
} ,
"telephone": [935279,908077,278784]
}

In fact, this kind of ID key is often used in document databases to store a document
identifier. When storing JSON documents in a document database, each JSON docu-
ment (that is, the top-level object) is assigned a unique value for its id key. In some
document databases, the document ID is system-generated; in others, a unique ID has
to be specified by the database user who is inserting the document.
Similar to XML navigation, JSON can be traversed by navigational (or path-based)
access; that is, by specifying a path along the keys of nested key-value pairs in the
current JSON object. For example, navigating to Alice’s street would result in the path
"address"."street". Cross-referencing between different objects can be made by
mixing ID-based referencing (to access the referenced object) and path-based refer-
encing (to navigate in the referenced object).
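For illustration, such path-based access can be implemented with any JSON processing library that exposes a document as a tree of nodes; the following sketch uses the Jackson library (an assumption, not a requirement of any particular document database):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonPathNavigation {
    public static void main(String[] args) throws Exception {
        String json = "{\"firstName\":\"Alice\",\"lastName\":\"Smith\","
                + "\"address\":{\"street\":\"Main Street\",\"city\":\"Newtown\"}}";
        JsonNode root = new ObjectMapper().readTree(json);   // parse into a node tree
        // navigate along the path "address"."street"
        System.out.println(root.path("address").path("street").asText()); // prints Main Street
    }
}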
There are several encodings for JSON that transform a JSON object into a format
that can be transmitted over the network or stored on disk more efficiently. For ex-
ample, the binary JSON (BSON) format stores the length of each object; embedded
objects can hence be skipped without actually reading them and searching for the
closing curly brace belonging to this object. In other words, BSON documents can be
traversed faster by skipping irrelevant nested objects.

6.2.2 JSON Schema

A schema specification for JSON is available as an Internet Draft of the Internet En-
gineering Task Force (IETF). With a JSON Schema definition, JSON documents can be
checked for validity according to the provided schema definition.

Web resources:
– JSON Schema: http://json-schema.org/
– JSON Schema generator: http://jsonschema.net/

The JSON Schema specification is similar in spirit to the XML Schema specification
(see Section 5.1.3). In particular, a JSON Schema document is a JSON document con-
taining a JSON object on its top level. Due to nesting, the top-level schema can contain
subschemas of arbitrary depth. The schema definitions are self-describing: they are
based on keywords as defined in the draft schema specification. We briefly describe the
most important keywords and expressions.

The $schema keyword: The $schema keyword usually points to the specification:

{"$schema": "http://json-schema.org/schema#"}

and defines the specification of JSON Schema that should be applied when using
the schema document.
The title and description keywords: The title and description keywords give users
the opportunity to provide useful information on the content and purpose of the
schema definition; they are purely informative and hence are ignored while checking
validity of a document.
The properties keyword: The properties keyword describes the properties (key-value
pairs) inside a JSON object by specifying their name part (a string) and further
restrictions for each property.
The type keyword: The type keyword restricts the type of a property.

{"type": "string"}

defines a string enclosed in quotation marks like “hello”;

{"type": "number"}

defines an integer or float like 4 or 4.5;

{"type": "object"}

defines a JSON object enclosed in braces { and }; moreover, mixed type definitions are
possible:

{ "type": ["number", "string"] }

accepts both strings and numbers.
Restrictions for string types: More restrictions for string types can be defined; for
example, restricting the length of a string:

{
"type": "string",
"minLength": 5,
"maxLength": 10
}

accepts strings with at least 5 and at most 10 characters. Regular expressions can also
be used to confine the set of allowed strings:

{
"type": "string",
"pattern": "^([A-Z])*$"
}

accepts for example only strings consisting of capital letters. Lastly, an enumeration
of allowed values can be specified:

{ "type": "string",
"enum": ["Apple", "Banana", "Orange"]
}

Restrictions for numeric types: More restrictions for numeric types can be used to
set the minimum and maximum allowed value, define whether this minimum and
maximum are included or excluded in the definition and whether the number should
be a multiple of some other number:

{
"type": "number",
"minimum": 10,
"maximum": 100,
"exclusiveMaximum": true,
"exclusiveMinimum": true,
"multipleOf": 10
}

Nested schemas: As already mentioned, a definition of a JSON object usually con-
tains properties with names and associated data types; in this way, nested schemas
are defined where one object consists of one or more subschemas. Other restrictions
for object types allow one to specify that no additional properties are allowed and that
some properties are required; for example, to define an address object consisting of
three required properties (street, number and city), one optional property (zip) and no
other properties allowed:

{
"type": "object",
"properties": {
"street": { "type": "string" },
"number": { "type": "number" },
"city": { "type": "string" },
"zip": { "type": "number" }
},
"additionalProperties": false,

"required": ["street", "number", "city"]


}

The array type: The array type is used to define an array property; the items in an
array can also be restricted in their type and a minimum number of items can be spec-
ified:

{
"type": "array",
"items": {
"type": "number"
},
"minItems": 1
}

In addition, arrays can be restricted in their length by disallowing additional items.
For example, consider the following array of length 2 containing exactly one number
and one string:

{
"type": "array",
"items": [
{
"type": "number"
},
{
"type": "string"
}
],
"additionalItems": false
}

Reusing subschemas: To avoid repetitions of definitions across several JSON schemas
and allow for a modular schema structure, the $ref keyword can be used to refer to
a subschema that is defined elsewhere. For example, if a location should be specified
by a geographical coordinate consisting of a longitude and latitude, this definition
can be reused by referring to a predefined specification as follows:

"location": {
"$ref": "http://json-schema.org/geo"
}

Combining subschemas: The following three keywords can be used to flexibly com-
bine several subschemas. These subschemas have to be defined in an array. allOf
means that all of the subschemas must be complied with, anyOf means that at least one
of the subschemas must be complied with, and oneOf means that exactly one of the sub-
schemas must be complied with. For example, we can define that a property should
either be a string or a number by using the oneOf keyword:

{
"oneOf": [
{ "type": "string" },
{ "type": "number" }
]
}

Additionally, the not keyword means that the document must not comply with a single
specified subschema. For example, to accept anything that is not a number:

{
"not":
{ "type": "number" }
}

6.2.3 Representational State Transfer

Several document databases and key-value stores offer a web-based access method
where JSON documents are used to represent the content of messages to and from the
database server. The MIME media type for JSON text is “application/json”.
Web-based access methods are often designed to follow an architectural style
for computer networks called representational state transfer [Fie00] and hence are
usually called RESTful APIs. A key feature of RESTful APIs is that resources are lo-
cated by uniform resource identifiers (URIs). A REST architecture should have the
following properties:
Client-server architecture: User-side operations (data consumption and pro-
cessing) are separated from data storage operations.
Stateless communication: Request processing on the server side must not rely on
context information that the server has stored for this client in earlier requests. This
means, in particular, that session information should be stored on the client side. This
relieves the server of storing client-related information across several requests of
an individual user; and it allows for sending different requests of the same user to
different servers in a distributed system. On the downside it implies that commu-
nication overhead is increased because the client has to send context information
with each request.

Cacheable information: In order to reduce network communication, some data
might be stored locally at the client side in a cache.
Uniform interface: Access is not application-specific but is based on a standard-
ized data format and transmission method across applications.
Layered system: Applications should be organized in layers with different re-
sponsibilities and functionalities such that only neighboring layers interact. This
form of abstraction reduces complexity in the individual layers. Messages pass
through different layers while being processed.
Code on demand: Clients can download additional scripts from the servers to
extend their functionality.

When implementing a REST architecture with HTTP, several HTTP methods as defined
in RFC 2616 can be used to interact with the database server. The behavior of the meth-
ods depends on whether the method is called for a single JSON document identifier (a
document URI) or an identifier of set of documents (a collection URI).
GET: The GET method is used to retrieve data from a server, hence the GET method
is usually employed to request data records from the database server: the GET
method on a collection identifier (URI) retrieves a listing of all documents con-
tained in the collection; the GET method on a single document identifier (URI) re-
trieves this document. The GET method does not produce side-effects and is called
nullipotent or safe.
POST: The POST method sends data to the server with which a new resource (with
a new URI) is created. When the POST method is executed on a collection URI,
then a new document as contained in the payload of the POST message is cre-
ated in the collection. The creation of the document is a side-effect on server side;
hence the method is not nullipotent. Moreover, the method is not idempotent: re-
peated applications of the same POST request each result in the creation of a new
resource.
PUT: The PUT method acts as an update or upsert operation. When executed on a
collection URI, the collection is replaced with an entirely new one; when executed
on a document URI, the document is replaced by the payload document (if the
document previously existed) or it is newly created. The PUT method is idempo-
tent: a repeated execution of the same PUT request results in an identical system
state; that is, a document is only created once and any subsequent PUT operation
with an identical document does not change the document.
DELETE: The DELETE method deletes an entire collection (in case of a collection
URI) or the specified document (in case of a document URI). The DELETE method
is idempotent because once deleted another DELETE request on the same resource
does not change the system state.
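For illustration, the following Java sketch issues such requests with the standard java.net.http client (available since Java 11); the host, port and document URI are assumptions and depend on the concrete database server:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String docUri = "http://localhost:8080/db/persons/person1";   // hypothetical document URI
        String json = "{\"firstName\":\"Alice\",\"lastName\":\"Smith\",\"age\":31}";

        // PUT: create or replace the document identified by the document URI (idempotent)
        HttpRequest put = HttpRequest.newBuilder(URI.create(docUri))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json)).build();
        client.send(put, HttpResponse.BodyHandlers.ofString());

        // GET: retrieve the document (nullipotent/safe)
        HttpRequest get = HttpRequest.newBuilder(URI.create(docUri)).GET().build();
        System.out.println(client.send(get, HttpResponse.BodyHandlers.ofString()).body());

        // DELETE: delete the document (idempotent)
        HttpRequest delete = HttpRequest.newBuilder(URI.create(docUri)).DELETE().build();
        client.send(delete, HttpResponse.BodyHandlers.ofString());
    }
}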

6.3 Implementations and Systems

We briefly survey a MapReduce framework as well as popular key-value stores and
document databases.

6.3.1 Apache Hadoop MapReduce

The open source map-reduce implementation hosted by Apache is called Hadoop.

Web resources:
– Apache Hadoop: http://hadoop.apache.org/
– documentation page: http://hadoop.apache.org/docs/stable/
– GitHub repository: https://github.com/apache/hadoop

The entire Hadoop ecosystem consists of several modules which we briefly describe.
HDFS: Hadoop MapReduce runs on the Hadoop Distributed File System (HDFS). In an
HDFS installation, a NameNode is responsible for managing metadata and handling
modification requests. A regular checkpoint (a snapshot of the current state of the file
system) is created as a backup copy; all operations in between two checkpoints are
stored in a log file which is merged with the last checkpoint in case of a NameNode
restart.
Several DataNodes store the actual data files. Data files are write-once and will
not be modified once they have been written and hence are only available for read
accesses. Each data file is split into several blocks of the same size. Data integrity is
ensured by computing a checksum of each block once a file is created. DataNodes
regularly send so-called heartbeat messages to the NameNode. If no heartbeat is re-
ceived for a certain period of time the NameNode assumes that the DataNode failed.
The DataNodes are assigned to racks in an HDFS cluster. Rack awareness is used to re-
duce the amount of inter-rack communication between nodes; that is, whenever one
node writes to another node, these should at best be placed on the same rack. Replicas
of each data file (more precisely of the blocks inside each data file) are maintained on
different DataNodes. To achieve better fault tolerance, at least one of the replicas must
be assigned to a DataNode residing on a different rack. A balancer tool can be used to
reconfigure the distribution of data among the DataNodes.
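For illustration, files can be written to and listed in HDFS via the Hadoop FileSystem API; a minimal sketch (file names and paths are assumptions):

// uses org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.{FileSystem, FileStatus, Path}
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);               // connects to the configured NameNode
fs.copyFromLocalFile(new Path("mydocument.txt"),    // local source file
        new Path("/data/mydocument.txt"));          // destination path in HDFS
for (FileStatus status : fs.listStatus(new Path("/data"))) {
    System.out.println(status.getPath() + " " + status.getLen());
}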
MapReduce: In Hadoop, a JobTracker assigns map and reduce tasks to TaskTrackers;
the TaskTrackers execute these tasks at the local DataNode. The map functionality
is implemented by a class that extends the Hadoop Mapper class and hence has to
implement its map method. The map method operates on a key object, a value object
as well as a Context object; the context stores the key-value pairs resulting from the
map executions by internally calling a RecordWriter.
As a simple example, the Apache Hadoop MapReduce Tutorial shows a Mapper
class that accepts key-value pairs as input: the key is of type Object and the value is of
type Text. It tokenizes the value and outputs key-value pairs where the key is of type
Text and the value is of type IntWritable.

public class TokenCounterMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

Similarly, the reduce functionality is obtained by extending the Reducer class and im-
plementing its reduce method. This reduce method accepts a key object, a list of value
objects, as well as a context object as parameters. In the Apache Hadoop MapReduce
Tutorial a simple Reducer reads the intermediate results (the key value-pairs where
the key is a Text object and the value is an IntWritable object obtained by map exe-
cutions), groups them by key, sums up all the values belonging to a single key by
iterating over them and then outputs key-value pairs where the key is a Text object
and the value is an IntWritable object to the context.

public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
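To run the mapper and reducer on a cluster, a driver program configures and submits a job; the following sketch follows the structure of the Apache Hadoop MapReduce tutorial (the class name WordCount and the use of command line arguments for the input and output paths are assumptions):

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenCounterMapper.class);
job.setCombinerClass(IntSumReducer.class);    // optional local combination before the shuffle
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);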

Tez: The Apache Tez framework extends the basic MapReduce by allowing execution
of tasks which can be arranged in an arbitrary directed acyclic graph such that results
of one task can be used as input of other tasks. In this way, Tez can model arbitrary
flows of data between different tasks and offers more flexibility during execution as
well as improved performance.
Ambari: Apache Ambari is a tool for installing, configuring and managing a Hadoop
cluster. It runs on several operating systems, offers a RESTful API and a dashboard.
It is designed to be fault-tolerant where a regular heartbeat message of each agent is
used to determine whether a node is still alive.
Avro: Apache Avro is a serialization framework that stores data in a compact binary
format. The Avro format can be used as a data format for MapReduce jobs. It also sup-
ports remote procedure calls. An Avro schema describes the serialized data; each Avro
schema is represented as a JSON document and produced during the serialization pro-
cess. The schema is needed for a successful deserialization.
YARN: The basic job scheduling functionality is provided by Apache YARN: A global
ResourceManager coordinates the scheduling of tasks together with a local NodeM-
anager for each node; the NodeManager communicates the node’s status to the Re-
sourceManager that bases its task scheduling decisions on this per-node information.
YARN offers a REST-based API with which jobs can be controlled.
ZooKeeper: Apache ZooKeeper is a coordination service for distributed systems.
ZooKeeper stores data (like configuration files or status reports) in-memory in a hier-
archy of so-called znodes inside a namespace; each znode can contain both data as
well as child znodes. In addition, ZooKeeper maintains transaction logs and periodical
snapshots on disk storage.
Flume: The main purpose of Apache Flume is log data aggregation from a set of dif-
ferent distributed data sources. In addition, it can handle other event data (network
traffic, social media or email data). Each Flume event consists of a binary payload and
optional description attributes. Flume supports several data encodings and formats
like Avro and Thrift. Event data are read from a source, processed by an agent, and
output to a sink. Apache Flume allows for chaining of agents (so-called multi-agent
flow), consolidation in a so-called second tier agent, as well as multiplexing (sending
output to several different sinks and storage destinations).
Spark: Apache Spark provides a data flow programming model on top of Hadoop
MapReduce that reduces the number of disk accesses (and hence reduces latency)
when running iterative and interactive MapReduce jobs on the same data set. Spark
employs the notion of resilient distributed datasets (RDDs) as described in [ZCD+12];
RDDs ensure that datasets can be reconstructed from existing data in case of failure by
keeping track of lineage. With these datasets, data can be reliably distributed among
several database servers and processed in parallel.

6.3.2 Apache Pig

Apache Pig is a framework that helps users express parallel execution of data analytics
tasks. Its language component is called Pig Latin [ORS+08].

Web resources:
– Apache Pig: http://pig.apache.org/
– documentation page: http://pig.apache.org/docs/
– GitHub repository: https://github.com/apache/pig

Pig supports a nested, hierarchical data model (for example, tuples inside tuples) and
offers several collection data types. A data processing task is specified as a sequence
of operations (instead of a single query statement). Pig Latin combines declarative ex-
pressions with procedural operations where intermediate result sets can be assigned
to variables. Pig Latin’s data types are the following:
Atom: Atoms contain simple atomic values which can be of type chararray for
strings, as well as int, long, float, double, bigdecimal, or biginteger for nu-
meric types; in addition, bytearray, boolean and datetime are supported as
atom types.
Tuple: A collection of type tuple combines several elements where each element
(also called field) can have a different data type. The type and name for each field
can be specified in the tuple schema so that field values can be addressed by their
field names. This schema definition is only optional; an alternative to addressing
fields by name is hence addressing fields by their position where the position is
specified by the $ sign – for example, $0 corresponds to the first field of a tuple.
Bag: A bag in Pig Latin is a multiset of tuples. The tuples in a bag might all have
different schemas and hence each tuple might contain a different number of fields
as well as differently typed fields.
Map: A map is basically a set of key value pairs where the key part is an atomic
value that is mapped to a value of arbitrary type (including tuple, map and bag
types of arbitrary schemas). Looking up a value by its key is done using the #
symbol – for example, writing #'name' would return the value that is stored under
the key 'name'.

Data processing tasks are specified by a sequence of operators. Some available oper-
ators are briefly surveyed next.
LOAD: With the LOAD command the programmer specifies a text file to be read in for
further processing. An optional schema definition can be specified by the AS state-
ment. For example, a file containing name, age and address information of persons
(one person per line) can be read in and converted into one tuple for each person as
follows:

input = LOAD 'person.txt' AS (name,age,address);

Explicit typing and nesting are also possible; additionally, a user-defined conversion
routine can be specified with the USING statement. For example, a JSON file (where a
person is represented by a name, an age and a nested address element) can be read in
by the JsonLoader as follows:

input = LOAD 'person.json' USING JsonLoader('name:chararray,
    age:int,
    address:(
        number:int,
        street:chararray,
        zip:int,
        city:chararray)');

Once the file is read in, it is represented as a bag of tuples in Pig: a line in the file
corresponds to a tuple in the bag.
STORE: The STORE statement saves a bag of tuples to a file. It is only then, when the
physical query plan of all the preceding commands (that is, all bags that the output
tuples depend on) is generated and optimized. This is called lazy execution.
DUMP: The DUMP statement prints out the contents of a bag of tuples.
FOREACH: The FOREACH statement is used to iterate over the tuples in a bag. The
GENERATE statement then produces the output tuples by specifying a transforma-
tion for each input tuple. To allow for parallelization (by assigning tuples to different
servers), the GENERATE statement should only process individual input tuples. More-
over, a flattening command can convert a tuple containing a bag into several different
output tuples each containing one member of the bag. For example, assume we store
for each person a tuple with his/her name at the first position (written as $0) as well
as the names of his/her children as a bag at the second position (written as $1), then
flattening will turn each of these tuples into several tuples (depending on the number
of children):

input={('alice',{'charlene','emily'}),('bob',{'david','emily'})};
output = FOREACH input GENERATE $0, FLATTEN($1);
6.3 Implementations and Systems | 123

In this case the output will consist of four tuples as follows:

DUMP output;
('alice','charlene')
('alice','emily')
('bob','david')
('bob','emily')

FILTER BY: The FILTER BY statement retains those tuples that comply with the spec-
ified condition. The condition can contain comparisons (like ==, !=, < or >), or string
pattern matching (matches with a regular expression); several conditions can be com-
bined by using logical connectives like AND, OR and NOT. For example, from the follow-
ing input we only want to retain the tuples with a number at their first position ($0):

input={ (1,'abc'), ('b','def') };
output=FILTER input BY ($0 MATCHES '[0-9]+');

In this case only the first tuple remains in the output:

DUMP output;
(1,'abc')

GROUP BY: The GROUP statement groups tuples by a single identifier (a field in the
input tuple) and then generates a bag with all the tuples with the same identifier value:

input={ (1,'abc'),(2,'def'),(1,'ghi'),(1,'jkl'),(2,'mno') };
output = GROUP input BY $0;
DUMP output;
(1,{(1,'abc'),(1,'ghi'),(1,'jkl')})
(2,{(2,'def'),(2,'mno')})

COGROUP BY: The COGROUP statement does a grouping over several inputs (several
relations) with different schemas. As in the GROUP statement, an identifier is selected;
this identifier should appear in all inputs. For each input, a separate bag of tuples is
created; these bags are then combined into a tuple for each unique identifier value:

input1 = { (1,'abc'),(2,'def'),(1,'ghi'),(1,'jkl'),(2,'mno') };
input2 = { (1,123),(2,234),(1,345),(2,456) };
output = COGROUP input1 BY $0, input2 BY $0;
DUMP output;
(1, {(1,'abc'),(1,'ghi'),(1,'jkl')}, {(1,123),(1,345)})
(2, {(2,'def'),(2,'mno')}, {(2,234),(2,456)})

JOIN BY: The JOIN operator constructs several flat tuples by combining values from
tuples coming from different inputs that share an identical identifier value.

input1 = { (1,'abc'),(2,'def'),(1,'ghi'),(1,'jkl'),(2,'mno') };
input2 = { (1,123),(2,234),(1,345),(2,456) };
output = JOIN input1 BY $0, input2 BY $0;
DUMP output;
(1,'abc',1,123)
(1,'ghi',1,123)
(1,'jkl',1,123)
(1,'abc',1,345)
(1,'ghi',1,345)
(1,'jkl',1,345)
(2,'def',2,234)
(2,'mno',2,234)
(2,'def',2,456)
(2,'mno',2,456)

As a simple example we look at a Pig Latin program to count words in a document.
We load a text file consisting of several lines, such that we have a bag of tuples where
each tuple contains a single line. The TOKENIZE function splits each line (referred to
by $0) into words and outputs the words of each line as a bag of strings. Hence to
obtain an individual tuple for each word, flattening of this bag is needed. Next, we
group the flattened tuples by word (one group for each word); the grouping results
in tuples where the first position ($0) is filled with each unique word and the second
position ($1) contains a bag with repetitions of the word according to the occurrences
in the document.

myinput = LOAD 'mydocument.txt';
mywordbags = FOREACH myinput GENERATE TOKENIZE($0);
mywords = FOREACH mywordbags GENERATE FLATTEN($0);
mywordgroups = GROUP mywords BY $0;
mycount = FOREACH mywordgroups GENERATE $0,COUNT($1);
STORE mycount INTO 'mycounts.txt';

To illustrate this, assume that mydocument.txt contains the following text:

data management is a key task in modern business
data stores are at the core of modern data management

Then the following values are generated for the variables. The variable myinput
contains a bag of tuples as can be seen in the DUMP output.

DUMP myinput;
(data management is a key task in modern business)
(data stores are at the core of modern data management)

The variable mywordbags contains a bag of tuples where each tuple consists of a bag
of tuples containing the words of each line.

DUMP mywordbags;
({(data),(management),(is),(a),(key),(task),(in),(modern),
(business)}),
({(data), (stores), (are), (at), (the), (core), (of), (modern),
(data), (management)})

We next get rid of the inner bag by flattening; the variable mywords then consists of a
bag of tuples where each tuple corresponds to an occurrence of a word:

DUMP mywords;
(data)
(management)
(is)
(a)
(key)
(task)
(in)
(modern)
(business)
(data)
(stores)
(are)
(at)
(the)
(core)
(of)
(modern)
(data)
(management)

The next step – grouping – produces a tuple for each word where the first position
is filled with the word and the second position contains a bag of occurrences of this
word.

DUMP mywordgroups;
(data, {data, data, data})
(is, {is})
(a, {a})
(management, {management, management})
(key, {key})
(task, {task})
(in, {in})
(modern, {modern, modern})
(business, {business})
(stores, {stores})
(are, {are})
(at, {at})
(the, {the})
(core, {core})
(of, {of})

Finally, we generate the output tuples containing each word and the count of its oc-
currences.

DUMP mycount;
(data, 3)
(is, 1)
(a, 1)
(management, 2)
(key, 1)
(task, 1)
(in, 1)
(modern, 2)
(business, 1)
(stores, 1)
(are, 1)
(at, 1)
(the, 1)
(core, 1)
(of, 1)

Using positional access (with the $ sign) is a good option for schema-agnostic data
handling; the specification of a schema as well as assigning names to tuple positions
however allows for explicit typing and improves readability of the code.

A version of the word count example that uses schema information and named tuple
fields is the following:

myinput = LOAD 'mydocument.txt' AS (line:chararray);
mywordbags = FOREACH myinput GENERATE TOKENIZE(line) AS wordbag;
mywords = FOREACH mywordbags GENERATE FLATTEN(wordbag) AS word;
mywordgroups = GROUP mywords BY word;
mycount = FOREACH mywordgroups GENERATE group,COUNT(mywords);
STORE mycount INTO 'mycounts.txt';

Note that the position that is used for grouping in the GROUP operation implicitly gets
assigned the name group in the output (that is, in mywordgroups) and can later on be
referenced by this name in the GENERATE statement; the second position is implicitly
named after the bag of tuples that is grouped – mywords in our case – so that this name
can be used later on in the COUNT operation.

6.3.3 Apache Hive

Apache Hive is a querying and data management layer on top of distributed data stor-
age. The original approach is described in a research paper [TSJ+09].

Web resources:
– Apache Hive: http://hive.apache.org/
– documentation page: https://cwiki.apache.org/confluence/display/Hive/
– GitHub repository: https://github.com/apache/hive

The basic data model in Hive corresponds to relational tables; it however extends the
relational model into nested tables: apart from simple data types for columns, col-
lection types (array, map, struct and union) are supported, too. Tables are serialized
and stored as files for example in HDFS. Serialization can be customized by specifying
so-called serdes (serialization and deserialization functions). Its SQL-like language is
called HiveQL. HiveQL queries are compiled into Hadoop MapReduce tasks (or Tez or
Spark tasks).
Considering the word count example, we can read our input document into a
table that contains one row for each line of the input document. Next, a table con-
taining one row for each occurrence of a word is created: the split function turns each
input line into an array of words (split around blanks); the explode function flattens
the arrays and hence converts them into separate rows. Lastly, grouping by word and
counting its occurrences gives the final word count for each word:

CREATE TABLE myinput (line STRING);
LOAD DATA INPATH 'mydocument.txt' OVERWRITE INTO TABLE myinput;
CREATE TABLE mywords AS SELECT explode(split(line, ' '))
    AS word FROM myinput;
SELECT word, count(*) AS count FROM mywords GROUP BY word ORDER BY count;

6.3.4 Apache Sqoop

Sqoop is a tool that can import data from relational tables into Hadoop MapReduce. It
reads the relational data in parallel and stores the data as multiple text files in Hadoop.

Web resources:
– Apache Sqoop: http://sqoop.apache.org/
– documentation page: http://sqoop.apache.org/docs/
– ASF git repository: https://github.com/apache/sqoop

The number of parallel tasks can be specified by the user; the default number of paral-
lel tasks is four. For parallelization, a splitting column is used to partition the input
table such that the input partitions can be handled by parallel tasks; by default, the
primary key column (if available) is used as a splitting column. Sqoop will retrieve the
minimum and maximum value as the range of the splitting column. Sqoop will then
split the range into equally-sized subranges and produce as many partitions as paral-
lel tasks are required – hence by default four partitions. Note that this can lead to an
unbalanced partitioning in case that not all subranges are equally populated. In this
case it is recommended to choose a different splitting column.
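The following Java sketch illustrates the described range splitting (it is not Sqoop's actual implementation; the method name and signature are made up):

// split the [min, max] range of the splitting column into numTasks subranges
static long[][] splitRange(long min, long max, int numTasks) {
    long[][] bounds = new long[numTasks][2];
    long size = (max - min + 1) / numTasks;
    for (int i = 0; i < numTasks; i++) {
        bounds[i][0] = min + (long) i * size;                                       // lower bound
        bounds[i][1] = (i == numTasks - 1) ? max : min + (long) (i + 1) * size - 1; // upper bound
    }
    return bounds;
}
// example: splitRange(1, 1000, 4) yields [1,250], [251,500], [501,750], [751,1000]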
Sqoop uses Java Database Connectivity (JDBC) to connect to a relational database
management system; it can hence run with any JDBC-compliant database that offers
a JDBC driver. For example, with the Sqoop command line interface, the columns
personid, lastname and firstname are imported from table PERSON in a PostgreSQL
database called mydb as follows:

sqoop import --connect jdbc:postgresql://localhost/mydb
    --table PERSON --columns "personid,firstname,lastname"

Tabular data are read in row-by-row using the database schema definition. Sqoop gen-
erates a Java class that can parse data from a single table row. This includes a process
of type mapping: SQL types are mapped to either the corresponding Java or Hive types
or a user-defined mapping process is executed. In the simplest case, Sqoop produces
a comma-separated text file: each table row is represented by a single line in the text
file. However, text delimiters can be configured to be other characters than commas.
A binary format called SequenceFile is also supported.
As an alternative to importing relational data into HDFS files, the input rows can
also be imported into HBase (see Section 8.3.2) or Accumulo (see Section 8.3.4). Sqoop
generates a put operation (in HBase) or a Mutation operation (in Accumulo) for each
table row. The HBase table and column family must be created first before running
Sqoop.
After processing data with MapReduce, the results can be stored back to the rela-
tional database management system. This is done by generating several SQL INSERT
statements and adding rows to an existing table; alternatively Sqoop can be config-
ured to update rows in an existing table. For example, on the command line one can
specify to export all data in the file resultdata in the results directory into a table EM-
PLOYEES as follows:

sqoop export --connect jdbc:postgresql://localhost/mydb
    --table EMPLOYEES --export-dir /results/resultdata

6.3.5 Riak

Riak is a key-value store with several advanced features. It groups key-value pairs
called Riak objects into logical units called buckets. Buckets can be configured by
defining a bucket type (that describes a specific configuration for a bucket) or by set-
ting its bucket properties.

Web resources:
– Riak-KV: http://basho.com/products/riak-kv/
– documentation page: http://docs.basho.com/
– GitHub repository: https://github.com/basho

Riak comes with several options to configure a storage backend: Bitcask, LevelDB,
Memory and Multi (multiple backends within a single Riak cluster for different buck-
ets).
Riak offers a REST-based API and a PBC-API based on protocol buffers as well as
several language-specific client libraries. For example, a Java client object can be ob-
tained by connecting to a node (the localhost) that hosts a Riak cluster:

RiakNode node = new RiakNode.Builder()
    .withRemoteAddress("127.0.0.1").withRemotePort(10017).build();
RiakCluster cluster = new RiakCluster.Builder(node).build();
RiakClient client = new RiakClient(cluster);

A bucket in Riak is identified by its bucket type (for example “default”) and a bucket
name (for example “persons”); in Java a Namespace object encapsulates this infor-
mation. Accessing a certain key-value pair in Riak requires the creation of a Location
object for the key in the given namespace (for example “id1”):

Namespace ns = new Namespace("default", "persons");


Location location = new Location(ns, "person1");

Simple Riak objects can be created to store arbitrary binary values (for example “al-
ice”) under a key (for the given location):

RiakObject riakObject = new RiakObject();


riakObject.setValue(BinaryValue.create("alice"));
StoreValue store = new StoreValue.Builder(riakObject)
.withLocation(location).build();
client.execute(store);

For reading the value of a key (for a given location), a FetchValue object can be used:

FetchValue fv = new FetchValue.Builder(location).build();


FetchValue.Response response = client.execute(fv);
RiakObject obj = response.getValue(RiakObject.class);

Riak implements some data types known as convergent replicated data types (CRDTs)
that facilitate conflict handling upon concurrent modification in particular if these
modifications are commutative (that is, can be applied in arbitrary order). Currently
supported CRDTs are flag (a boolean type with the values enable or disable), reg-
ister (storing binary values), counter (an integer value for counting cardinalities but
not necessarily unique across distributed database servers), sets, and maps (a map
contains fields that may hold any data type and can even be nested).
Before storing the CRDTs, it is necessary to create a new bucket type that sets the
datatype property to the CRDT (for example, map) and then to activate the bucket type:

riak-admin bucket-type create personmaps


’{"props":{"datatype":"map"}}’
riak-admin bucket-type activate personmaps

For example, we can use the person bucket to store a map under key “alice” contain-
ing three registers (for the map keys “firstname”, “lastname” and “age”). In Java, a
map would then be stored as follows (for a given namespace consisting of bucket type
“personmaps” and bucket name “person” and location for key “alice”) by creating a
MapUpdate as well as three RegisterUpdates:

Namespace ns = new Namespace("personmaps", "person");


Location location = new Location(ns, "alice");
RegisterUpdate ru1 = new RegisterUpdate(BinaryValue.create("Alice"));
RegisterUpdate ru2 = new RegisterUpdate(BinaryValue.create("Smith"));
RegisterUpdate ru3 = new RegisterUpdate(BinaryValue.create("31"));
MapUpdate mu = new MapUpdate();
mu.update("firstname", ru1);
mu.update("lastname", ru2);
mu.update("age", ru3);
UpdateMap update = new UpdateMap.Builder(location, mu).build();
client.execute(update);
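
As another illustration of the Riak data types, a counter can be incremented and read back
in a similar fashion. The following sketch assumes that a bucket type named "counters"
(with the datatype property set to counter) has been created and activated, and that the
bucket name "bookstats" and the key "lentbooks" are chosen for this example:

Namespace cns = new Namespace("counters", "bookstats");
Location counterLocation = new Location(cns, "lentbooks");
CounterUpdate cu = new CounterUpdate(1);
UpdateCounter updateCounter = new UpdateCounter.Builder(counterLocation, cu).build();
client.execute(updateCounter);
FetchCounter fetchCounter = new FetchCounter.Builder(counterLocation).build();
RiakCounter counter = client.execute(fetchCounter).getDatatype();
Long value = counter.view(); // the current counter value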

Riak’s search functionality is implemented based on Apache Solr in a subproject called
Yokozuna. Search indexes are maintained in Solr and updated as soon as data
affected by the index change. Indexes can cover several buckets. To set up indexing
for a bucket, an index has to be associated with that bucket by setting the bucket property
search_index to the index name. A so-called extractor has to parse the Riak objects
in the buckets to make them accessible to the index. In the simplest case, the object
is parsed as plain text. Built-in extractors are available for JSON and XML that flatten
the nested document structures as well as for the Riak data types; custom extractors
can also be implemented. Indexes require schema information that assigns a type to
each field name to be indexed.
For example, when indexing a person JSON document in an index person_idx, a
search request can be issued to return information on people with lastname Smith:

"$RIAK_HOST/search/query/person_idx?wt=json&q=name_s:Smith"

The returned document contains information about all the documents found (includ-
ing the bucket type _yz_rt, the bucket name _yz_rb and the key _yz_rk matching
the search). Range queries and boolean connectors are supported, too, as well as ad-
vanced search constructs for nested data.
Riak implements dotted version vectors (see Section 12.3.5) to support concur-
rency and synchronization. This reduces the number of siblings as compared to con-
ventional version vectors but requires a coordinator node for each write process. The
current version vector is returned in an answer to a read request and must be included
in a write request as the context to enable conflict resolution or sibling creation on the
database side.
Lastly, on the database side write operations can be checked: pre-commit hooks are
validations executed before a write takes place and they can lead to a rejection of a
write; post-commit hooks are processes executed after a successful write. The commit
hooks can be specified in the bucket type.
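
As a sketch of such a configuration (the Erlang module validate_person and its function
check are purely illustrative assumptions), a pre-commit hook could be attached to a
bucket type via its precommit property:

riak-admin bucket-type update personmaps
'{"props":{"precommit":[{"mod":"validate_person","fun":"check"}]}}'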

6.3.6 Redis

Redis is an advanced in-memory key-value store that offers a command line interface
called redis-cli.

Web resources:
– Redis: http://redis.io/
– documentation page: http://redis.io/documentation
– GitHub repository: https://github.com/antirez/redis

Redis supports several data types and data structures. In particular, it supports the
following data types:
– string: strings are used as the basic data type for keys and simple values. New
key-value pairs can be added and retrieved by the SET and GET commands – or
the MSET and MGET commands for multiple key-value pairs; for example:
MSET firstname Alice lastname Smith age 34
MGET firstname lastname age
– linked lists: a sequence of string elements where new elements can be added at
the beginning (that is, at the head on the left with the LPUSH command) or at
the end (that is, at the tail on the right with the RPUSH command). The LRANGE
command returns a subrange of the elements by defining a start position and an
end position; positions can be counted from the head (starting with 0) or from the
tail (starting with -1). The LPOP and RPOP commands remove an element from
the head and tail, respectively, and return it. As a simple example, consider the
addition of three elements A, B, C to a list and then printing out the entire list
(ranging from position 0 to -1) and then removing one element from the tail and
one from the head:

RPUSH mylinkedlist A
RPUSH mylinkedlist B
LPUSH mylinkedlist C
LRANGE mylinkedlist 0 -1
1) "C"
2) "A"
3) "B"

RPOP mylinkedlist
LPOP mylinkedlist
LRANGE mylinkedlist 0 -1
1) "A"

– unsorted set: an unsorted set of unique string elements. The SADD command adds
elements to the set and the SMEMBERS command returns all elements in the set.
– sorted set: a set where each element has an assigned score (a float) and elements
are sorted according to the score. If scores are identical for two elements, lexico-
graphic ordering is applied to these elements. The ZADD command adds an ele-
ment and its score to the list. For example, persons can be maintained in order
sorted by their ages:
ZADD persons 34 "Alice"
ZADD persons 47 "Bob"
ZADD persons 21 "Charlene"
When printing out the entire set (with ZRANGE starting from the head at position
0 up to the tail at position -1) we obtain:
ZRANGE persons 0 -1
1) "Charlene"
2) "Alice"
3) "Bob"
– hash: a map that maps keys to values. Each hash has a name and can hold arbitrar-
ily many key-value pairs. The HMSET command inserts values into a hash (with
the provided name) and the HMGET command returns individual values for keys
in the hash:
HMSET person1 firstname Alice lastname Smith age 34
HMGET person1 lastname
– bit array: A bitstring for which individual bits can be set and retrieved by the
SETBIT and GETBIT commands. Several other bitwise operations are provided.
– hyperloglog: a probabilistic data structure with which the cardinality of a set can
be estimated.

Redis supports data distribution in a cluster as well as transactions with optimistic
locking. With the EVAL command, Lua scripts can be executed.
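
For example, an optimistically locked transaction on the sorted set from above can be
guarded by a WATCH on the key (the EXEC is aborted if another client modifies the key in
the meantime), and a small Lua script can be evaluated on the server:

WATCH persons
MULTI
ZADD persons 52 "Dave"
EXEC

EVAL "return redis.call('ZCARD', KEYS[1])" 1 persons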

6.3.7 MongoDB

MongoDB is a document database with BSON as its storage format. It comes with its
command line interface called mongo shell.

Web resources:
– MongoDB: https://www.mongodb.org/
– documentation page: http://docs.mongodb.org/
– GitHub repository: https://github.com/mongodb
– BSON specification: http://bsonspec.org/

The db.collection.insert() method adds a new document into a collection. Each docu-
ment has an _id field that must be the first field in the document. To create a collection
named persons containing a document with the properties firstname, lastname and
age, the following command is sufficient:

db.persons.insert(
{
firstname: "Alice",
lastname: "Smith",
age: 34
}
)

Each insertion returns a WriteResult object with status information. For example, after
inserting one document, the following object is returned:

WriteResult({ "nInserted" : 1 })

Instead of a single document, an array of documents can be passed to the insert
method to insert multiple documents.
The find method with an empty parameter list returns all documents in a collection:

db.persons.find()

A query document can be passed as a parameter to specify selection conditions; for ex-
ample, equality conditions on fields:

db.persons.find({age: 34})

Other comparison operators can be specified by appropriate expressions; for example,
less than:

db.persons.find({age: {$lt: 34}})

An AND connector is represented by a comma:

db.persons.find({age: 34, firstname: "Alice"})

An OR connector is represented by an $or expression operating on an array of query
objects:

db.persons.find({$or: [{age: 34},{firstname: "Alice"}]})



Nested fields have to be queried with the dot notation. The same applies to positions
in arrays.
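
For example (assuming, hypothetically, that person documents contain a nested address
subdocument and an array field phones), a nested field and an array position can be
queried as follows:

db.persons.find({"address.city": "Newtown"})
db.persons.find({"phones.0": "555-1234"})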
Using the update method, a document can be replaced by the document speci-
fied in the second parameter. The first parameter of the update method specifies the
matching documents (like persons with first name Alice); the second parameter con-
tains the new values for the document; further conditions can be set (for example, an
upsert will insert a new document when the match condition does not apply to any
document):

db.persons.update(
{ firstname: "Alice" },
{
firstname: "Alice",
lastname: "Miller",
age: 31
},
{ upsert: true }
)

An update method with the $set expression allows for modifications of individual fields
(without affecting other fields in the document); again, the first parameter of the up-
date method specifies the matching documents (like persons with first name Alice):

db.persons.update(
{ firstname: "Alice" },
{
$set: {
lastname: "Miller"
}
}
)

Several aggregation operators are supported. MongoDB implements an aggregation
pipeline with which documents can be transformed in a multi-step process to yield a
final aggregated result. The different aggregation stages are specified in an array and
then passed to the aggregate method. For example, documents in a collection can be
grouped by a field (like lastname where each unique value of lastname is used as an id
of one output document), and then values can be summed (like the age). The output
contains one document for each group containing the aggregation result:

db.persons.aggregate( [
{ $group: { _id: "$lastname", agesum: { $sum: "$age" } } } ] )

MongoDB supports a specification for references between documents. These may be
implemented by so-called DBRefs that contain the document identifier of the refer-
enced document; cross-collection and even cross-database references are possible by
specifying the full identifier (database name, collection name and document id) in the
DBRef subobject. That is, a DBRef looks like this:

{"$ref": "collection1", "$id": ObjectId("89aba98c00a"), "$db": "db2"}

However, these references are only a notational format: There is no means to auto-
matically follow and resolve these references in a query to obtain the referenced doc-
uments (or parts thereof). Instead, the referenced ID value has to be read from one
document and with this ID a second query has to be formulated to retrieve the data
from the referenced document.
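
As a sketch of this two-step process in the mongo shell (the field employer holding a DBRef
is a hypothetical example), the reference information is read from the first document and
then used in a second query against the referenced database and collection:

var person = db.persons.findOne({ firstname: "Alice" })
var ref = person.employer
db.getSiblingDB(ref["$db"]).getCollection(ref["$ref"]).findOne({ _id: ref["$id"] })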
In MongoDB, indexes are defined at the collection level. It supports indexes for
individual fields as well as compound indexes and multikey indexes. Compound in-
dexes contain multiple fields of a document in conjunction so that queries that specify
conditions on exactly these fields can be improved; sort order of values can be spec-
ified to be ascending (denoted by 1) or descending (denoted by -1). The createIndex
method is used to create an index on certain fields; for example a compound index on
the lastname and the firstname both ascending:

db.persons.createIndex( { "lastname": 1, "firstname": 1 } )

A multikey index applies to values in an array. It means that MongoDB creates an index
entry for each value in the array as soon as an index is created for the field holding the
array. For arrays containing nested documents however, the dot notation has to be
used to add the nested fields to the index.
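
For example (the array fields phones and addresses are hypothetical), a multikey index on
a simple array field and an index on a field inside nested array documents can be created
as follows:

db.persons.createIndex( { "phones": 1 } )
db.persons.createIndex( { "addresses.city": 1 } )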

6.3.8 CouchDB

CouchDB is an Erlang-based document database. It stores JSON documents in so-
called databases and supports multi-version concurrency control.

Web resources:
– CouchDB: http://couchdb.apache.org/
– documentation page: http://docs.couchdb.org/
– GitHub repository: https://github.com/apache/couchdb

CouchDB exposes conflict handling to the user: a user has to submit the most recent
revision number of a document when writing data to it. Otherwise the user will be
notified of a conflict and his modifications are rejected; he has to manually resolve
the conflict by editing the most recent version again.
Each CouchDB document has an _id and a _rev field. For example, a person doc-
ument might look as follows:

{
"_id":"5B6CAB...",
"_rev":"C7654...",

"firstname":"Alice",
"lastname":"Smith",
"age":"34"
}

CouchDB’s retrieval process heavily relies on views. They are computed on the stored
data dynamically at runtime in a Map-Reduce fashion. CouchDB views are themselves
defined in JSON documents called design documents. A design document defines the
map and reduce functions and may look like this:

{
"_id": "_design/myapplication",
"_rev": "7D17...",
"views": {
"myview": {
"map": "function(doc) { ... }",
"reduce": "function(keys, values) { ... }"
}
}
}

The view functionality inside a design document is defined by a JavaScript function
that maps a single input document to zero, one or more rows of the output view con-
sisting of a key and a value. The emit function is called to produce such an output row.
For example, to retrieve persons by lastname, the lastname has to be used as a key in
the output view; the if statement ensures that values indeed exist for the requested
fields (documents without the fields lastname and age do not produce output rows):

function(doc) {
if(doc.lastname && doc.age) {
emit(doc.lastname, doc.age);
}
}

Executing the map function on person documents results in a new document consist-
ing of row count, offset and an array of row definitions; each row consists of id (referring
to the original document) as well as key (to be used for querying) and value:

{"total_rows":3,"offset":0,"rows":[
{"id":"5B6CAB...","key":"Smith","value":34},
{...},
...
]}

CouchDB allows arbitrary JSON structures as keys of the output views; the key values
are then used to collate and hence sort the output rows. For example, when using an
array as the key, the output documents will first be sorted by the first element, then by
the second and so on. For example, if we had several persons with identical last name
and we wanted to sort them by age in descending order, the age could be emitted as
the second component of the key (the emitted value can then be null if we are not
interested in any further information on the persons):

function(doc) {
if(doc.lastname && doc.age) {
emit([doc.lastname, doc.age],null);
}
}

If multiple view functions are defined in the same design document, they form a so-
called view group. The views are indexed to improve performance of view computa-
tion. These indexes are updated whenever data is modified in the database. More pre-
cisely, upon a read the index will be refreshed by only considering those documents
that were changed since the last refresh.
Reduce functions corresponding to range queries on keys (of the rows emitted by
the map functions) can be executed on the indexes to retrieve the final result. CouchDB
has three built-in reduce functions written in Erlang and executed more efficiently
in CouchDB: _sum, _count and _stats.
Querying the view corresponds to sending a GET request to the database referring to
the appropriate design document and the view inside it; selection conditions on the
view’s key can be appended:

/mydb/_design/myapplication/_view/myview?key="Smith"

Range queries can be issued by providing a start and an end key:

/mydb/_design/myapplication/_view/myview
?startkey="Miller"&endkey="Smith"

6.3.9 Couchbase

Couchbase is a document database that merges features of CouchDB and Membase. It
stores JSON documents (and other formats like binary or serialized data) in buckets.

Web resources:
– Couchbase: http://www.couchbase.com
– documentation page: http://docs.couchbase.com/
– GitHub repository: https://github.com/couchbase

In contrast to CouchDB, each document is stored under a unique key (which serves
as the document id) and is accompanied by some metadata; the metadata include the
document id, a check-and-set (CAS) value for optimistic concurrency control, a time-
to-live value to automatically expire some documents, or information regarding the
document type. Couchbase comes with a command line interface, a REST API and a
SQL-like query language called N1QL.
Similar to CouchDB, one way to interact with Couchbase is by defining and using
views. Views consist of a map and a reduce part and are defined in design documents.
The map function accepts a document and the metadata associated with the document. In
particular, if searching by document ID is required, a view has to be created that con-
tains the document IDs as the row keys:

function(doc,meta){
emit(meta.id,null);
}

As a N1QL command the primary index on a bucket called persons can be created as
follows:

CREATE PRIMARY INDEX ON persons

Several language bindings and software development kits are available. In the Java
Couchbase SDK, for example, a cluster object is obtained from the localhost in which
a bucket is opened for further interaction.

Cluster cluster = CouchbaseCluster.create();


Bucket bucket = cluster.openBucket("persons");

Storing an object with the Java API requires creation of a new JSON document and fill-
ing it with key-value pairs; calling the create method assigns a unique ID (like “alice1”)
to the document.

For example:

JsonObject alice = JsonObject.empty()


.put("firstname", "Alice")
.put("lastname", "Smith")
.put("age", 31);
JsonDocument stored =
bucket.upsert(JsonDocument.create("alice1", alice));

Querying a view in the Java SDK requires a ViewQuery object based on the design
document name as the first parameter and the name of the view to be executed as the
second parameter. In Java, a view execution returns a ViewResult object that contains
output rows.

ViewQuery query = ViewQuery.from("mydesigndoc", "myview");


ViewResult result = bucket.query(query);
for (ViewRow row : result) {
System.out.println(row);
}

The Java SDK also accepts queries expressed in N1QL:

QueryResult queryResult =
bucket.query(Query.simple("SELECT * FROM persons"
+ " WHERE lastname = ’Smith’"));

or the more type-safe query methods acting as a domain-specific language (DSL) wrap-
per for N1QL:

Statement select = select("*").from("person")


.where(x("lastname").eq(s("Smith")));
QueryResult query = bucket.query(select);

The Java SDK offers several advanced features like prepared statements.

6.4 Bibliographic Notes

Due to their simple data structure, key-value stores are widely used and the variety of
key-value stores available on the market is immense. They differ in the functionality
they provide, for example, in terms of expressiveness of their query language, replica-
tion scheme and version control. Amazon’s description of its Dynamo system [DHJ+ 07]
has popularized several of the underlying technologies.

The Map-Reduce paradigm has received considerable attention in the research com-
munity since it was introduced in the seminal article [DG04]; afterwards it has pro-
duced a series of discussions, improvements and benchmark results such as
[PPR+ 09, ABPA+ 09, DG10, CCA+ 10, DQRJ+ 10, FTD+ 12]. Lee et al. [LLC+ 11] give an ex-
tensive survey of existing Map-Reduce tools and techniques; they discuss pros and
cons of Map-Reduce and recommend it as a complement to DBMSs when processing
data in parallel. A discussion of Map-Reduce and related techniques is given by Lin
[Lin12].
JSON [ECM13] as a compact human-readable text format has been widely adopted
and several databases use JSON as their primary storage format – in particular, several
open source document databases like ArangoDB, CouchDB, Couchbase, MongoDB,
OrientDB, RavenDB or RethinkDB.
7 Column Stores
A row store is a row-oriented relational database as it was reviewed in Section 2.1. That
is, data are stored in tables and on disk the data in a row are stored consecutively. This
row-wise storage is currently used in most RDBMSs. In contrast, a column store is a
column-oriented relational database. That is, data are stored in tables but on disk data
in a column are stored consecutively.
This chapter discusses column stores and shows that they work very well for cer-
tain queries – for example queries that can be executed on compressed data. In such
cases, column stores have the advantage of both a compact storage as well as efficient
query execution. Conversion of nested records into columnar representation is a fur-
ther topic of this chapter.

7.1 Column-Wise Storage

Column stores have been around and used since the 1970s but they had less (com-
mercial) success than row stores. We take up our tiny library example to illustrate the
differences:

BookLending BookID ReaderID ReturnDate


1002 205 25-10-2016
1006 207 31-10-2016

The storage order in a row store would be:

1002,205,25-10-2016,1006,207,31-10-2016

Whereas the storage order in a column store would be:

1002,1006,205,207,25-10-2016,31-10-2016

Due to their tabular relational storage, SQL is understood by all column stores as their
common standardized query language. For some applications or query workloads, col-
umn stores score better than row stores, while for others the opposite is the case. Ad-
vantages of column stores are for example:
Buffer management: Only columns (attributes) that are needed are read from
disk into main memory, because a single memory page ideally contains all values
of a column; in contrast, in a row store, a memory page might contain also other
attributes than the ones needed and hence data is fetched unnecessarily.
Homogeneity: Values in a column have the same type (that is, the values come
from the same attribute domain). This is why they can be compressed better when
stored consecutively; this will be detailed in the following section. In contrast, in
a row store, values from different attribute domains are mixed in a row (“tuple”)
and hence cannot be compressed well.
Data locality: Iterating or aggregating over values in a column can be done
quickly, because they are stored consecutively. For example, summing up all
values in a column, finding the average or maximum of a column can be done
efficiently because of better locality of these data in a column store. In contrast, in
a row store, when iterating over a column, the values have to be read and picked
out from different tuples.
Column insertion: Adding new columns to a table is easy because they just can
be appended to the existing ones. In contrast, in a row store, storage reorganiza-
tion is necessary to append a new column value to each tuple.

Disadvantages of column stores lie in the following areas:


Tuple reconstruction: Combining values from several columns is costly because
“tuple reconstruction” has to be performed: the column store has to identify which
values in the columns belong to the same tuple. In contrast, in a row store, tuple
reconstruction is not necessary because values of a tuple are stored consecutively.
Tuple insertion: Inserting a new tuple is costly: new values have to be added to
all columns of the table. In contrast, in a row store, the new tuple is appended to
the existing ones and the new values are stored consecutively.

7.1.1 Column Compression

Values in a column range over the same attribute domain; that is, they have the same
data type. On top of that, columns may contain lots of repetitions of individual values
or sequences of values. These are two reasons why compression can be more effective
on columns (than on rows). Hence, storage space needed in a column store may be less
than storage space needed in a row store with the same data. We survey five options
for simple yet effective data compression (so-called encodings) for columns.

Run-length encoding: The run-length of a value denotes how many repetitions of
the value are stored consecutively. Instead of storing many consecutive repetitions of
a value, we can store the value together with its starting row and its run-length (that is,
the number of repetitions). That is, if in our table rows 5 to 8 have the value 300,
we have 4 consecutive repetitions and hence write (300,5,4) because this run of value
300 starts in row 5 and has a length of 4. This encoding is most efficient for long runs
of repetitive values. As an example, let us have a look at the column ReaderID (see
Table 7.1); as readers can have several books at the same time, consecutive repetitions
of the same reader ID are possible. In this case, when using the run-length encoding,
we get a smaller representation of the same information.

Table 7.1. Run-length encoding

ReaderID (original)   ReaderID encoded as (value,start,length)
205                   (205,1,3)
205                   (207,4,4)
205                   (205,8,2)
207                   (587,10,2)
207
207
207
205
205
587
587
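
To make the encoding step concrete, the following minimal Java sketch (an illustration that
is not tied to any particular column store) computes the (value, start, length) runs for a
column given as a list of strings; applied to the ReaderID column of Table 7.1 it yields
(205,1,3), (207,4,4), (205,8,2) and (587,10,2):

import java.util.ArrayList;
import java.util.List;

public class RunLengthEncoder {
    // one run: the repeated value, its starting row (1-based) and the run length
    public static class Run {
        public final String value; public final int start; public final int length;
        Run(String value, int start, int length) {
            this.value = value; this.start = start; this.length = length;
        }
        public String toString() { return "(" + value + "," + start + "," + length + ")"; }
    }

    public static List<Run> encode(List<String> column) {
        List<Run> runs = new ArrayList<>();
        int start = 1; // row number of the first element of the current run
        for (int i = 1; i <= column.size(); i++) {
            // close the current run when the column ends or the value changes
            if (i == column.size() || !column.get(i).equals(column.get(start - 1))) {
                runs.add(new Run(column.get(start - 1), start, i - start + 1));
                start = i + 1;
            }
        }
        return runs;
    }
}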

Bit-vector encoding: For each value in the column we create a bit vector with one bit
for each row. In the bit vector for a given value, if the bit for a row is 1, then
the column contains this value in that row; otherwise it does not. Table 7.2 shows an example for
the ReaderID column. This encoding is most efficient for relatively few distinct values
and hence relatively few bit vectors.

Table 7.2. Bit-vector encoding

ReaderID 205 207 587


205 1 0 0
205 1 0 0
205 1 0 0
207 0 1 0
207 0 1 0
207 0 1 0
207 0 1 0
205 1 0 0
205 1 0 0
587 0 0 1
587 0 0 1

Table 7.3. Dictionary encoding

ReaderID (original)   ReaderID (encoded)
205                   1
205                   1
205                   1
207                   2
207                   2
207                   2
207                   2
205                   1
205                   1
587                   3
587                   3

Dictionary: 1: 205, 2: 207, 3: 587

Table 7.4. Dictionary encoding for sequences

BookID (original)   BookID (encoded)
1002                1
1010                2
1004                1
1008
1006
1002
1010
1004

Dictionary: 1: (1002,1010,1004), 2: (1008,1006)

Dictionary encoding: We replace long values by shorter placeholders and maintain
a dictionary to map the placeholders back to the original values. Table 7.3 shows an
example for the ReaderID column.
In some cases, we can create a dictionary not only for single values but even
for frequent sequences of values. As an example on the BookID column, assume that
several sets of books are usually read together: books with IDs 1002, 1010 and 1004
would often occur together as well as books with IDs 1008 and 1006 would often occur
together. Then our dictionary could group these IDs together and they can be replaced
by a single placeholder (see Table 7.4).

Frame of reference encoding: For the range of values stored in a column, one value
that lies in the middle of this range is chosen as the reference value. For all other values
we only store the offset from the reference value; this offset should be smaller than

Table 7.5. Frame of reference encoding (reference value: 25-10-2016)

ReturnDate (original)   ReturnDate (encoded)
22-10-2016              −3
27-10-2016              +2
25-10-2016               0
22-10-2016              −3
28-10-2016              +3
26-10-2016              +1
21-10-2016              −4
25-10-2016               0

Table 7.6. Frame of reference encoding with exception (reference value: 25-10-2016)

ReturnDate (original)   ReturnDate (encoded)
22-10-2016              −3
27-10-2016              +2
25-10-2016               0
22-10-2016              −3
28-10-2016              +3
14-10-2016              # 14-10-2016
26-10-2016              +1
21-10-2016              −4
25-10-2016               0

the original value. Usually a fixed size of some bits is chosen for the offset. Table 7.5
shows an example for the ReturnDate column.
In case the offset for a value exceeds the fixed offset size, the original value has
to be stored as an exception; to distinguish these exceptions from regular offsets, a special
marker (like #) is used, followed by the original value. For example (see Table 7.6), the
offset between 25-10-2016 and 14-10-2016 might be too large, and we have to store 14-
10-2016 as an exception when it is inserted into the column.
This encoding is only applicable to numerical data; it is most efficient for values
with a small variance around the chosen reference value.

Table 7.7. Differential encoding

ReturnDate (original)   ReturnDate (encoded)
22-10-2016 22-10-2016
27-10-2016 +5
25-10-2016 −2
22-10-2016 −3
28-10-2016 +6
26-10-2016 −2
21-10-2016 −5
25-10-2016 +4

Table 7.8. Differential encoding with exception

ReturnDate (original)   ReturnDate (encoded)
22-10-2016 22-10-2016
27-10-2016 +5
25-10-2016 −2
22-10-2016 −3
28-10-2016 +6
14-10-2016 # 14-10-2016
26-10-2016 −2
21-10-2016 −5
25-10-2016 +4

Differential encoding: Similar to the frame of reference encoding, an offset is stored
instead of the entire value. This time the offset is the difference between the value itself
and the value in the preceding row. Table 7.7 shows an example for the ReturnDate
column.
Again the offset should not exceed a fixed size. As soon as the offset gets too large,
the current value is stored as an exception. In our example (see Table 7.8), this is again
the case when inserting the date 14-10-2016.
This encoding is only applicable to numerical data; it works best if (at least some)
data are stored in a sorted way – that is, in decreasing or increasing order.

Note that with each encoding we have to maintain the order inside the column as oth-
erwise tuple reconstruction would be impossible; that is, after encoding a column we
cannot merge entries nor can we swap the order of the entries.
All these encodings cater for a more compact storage of the data. The price to be
paid, however, is the extra runtime needed to compute the encoding as well as the
runtime needed to decode the data whenever we want to execute a query on them.
Fortunately, this decoding step might be avoided for certain classes of queries; that is,
some queries can actually be executed on the compressed data so that no decompres-
sion step is needed. Let us illustrate such a case with the following example from our
library database:

BookLending BookID ReaderID ReturnDate


1002 205 25-10-2016
1004 205 20-10-2016
1008 205 27-10-2016
1002 207 25-11-2016
1006 207 31-10-2016
1010 205 25-10-2016

Now assume that the ReaderID is compressed with run-length encoding. That is, the
column ReaderID is stored as a sequence as follows:

((205, 1, 3), (207, 4, 2), (205, 6, 1)).

An example query that can be answered on the compressed format is: “How many
books does each reader have?”. Which would be in SQL:

SELECT ReaderID, COUNT(*) FROM BookLending GROUP BY ReaderID

To answer this, the column store does not have to decompress the column into the
entire original column with 6 rows. Instead it just returns (the sum of) the run-lengths
for each ReaderID value. Hence, the result for reader 205 is 3+1=4 and the result for
reader 207 is 2:

Result: { (205, 4), (207, 2) }
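
The following small Java sketch (an illustration, not actual column store code) shows this
evaluation on the encoded column: each run is represented as an int array {value, start,
length} and the result is obtained by summing the run lengths per ReaderID value:

import java.util.LinkedHashMap;
import java.util.Map;

public class CompressedGroupCount {
    public static Map<Integer, Integer> countPerReader(int[][] runs) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (int[] run : runs) {
            counts.merge(run[0], run[2], Integer::sum); // sum the run lengths per value
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] readerRuns = { {205, 1, 3}, {207, 4, 2}, {205, 6, 1} };
        System.out.println(countPerReader(readerRuns)); // prints {205=4, 207=2}
    }
}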

7.1.2 Null Suppression

Sparse columns are columns that contain many NULL values, that is, many concrete
values are missing. A more compact format of a sparse column can be achieved by not
storing (that is, “suppressing”) these null values. Yet, some additional information
is needed to distinguish the non-null entries from the null entries. We survey the
options analyzed in [Aba07].

Position list: This is a list that stores only the non-null positions (the number of the
row and its value) but discards any null values. As metadata the total number of rows
and the number of non-null positions are stored. See Table 7.9 for an example.

Table 7.9. Position list encoding

Name (original)   Position list
NULL              2: Alice
Alice             3: Bob
Bob               4: Charlene
Charlene          8: David
NULL              9: Emily
NULL
NULL              Name metadata:
David             total number of rows: 10
Emily             number of non-null values: 5
NULL
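
As a small illustration of the position list technique (plain Java, independent of any specific
system), null suppression keeps only "row: value" entries for the non-null rows and records
the total number of rows as metadata; for the Name column of Table 7.9 this yields the
entries 2: Alice, 3: Bob, 4: Charlene, 8: David and 9: Emily with 10 total rows:

import java.util.ArrayList;
import java.util.List;

public class PositionListEncoder {
    public static class PositionList {
        public final List<String> entries = new ArrayList<>(); // "row: value" for non-null rows
        public int totalRows;                                  // metadata: total number of rows
    }

    public static PositionList encode(String[] column) {
        PositionList pl = new PositionList();
        pl.totalRows = column.length;
        for (int row = 1; row <= column.length; row++) {
            if (column[row - 1] != null) { // null values are suppressed
                pl.entries.add(row + ": " + column[row - 1]);
            }
        }
        return pl;
    }
}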

Table 7.10. Position bit-string encoding

Name (original)   Bit-string
NULL              0
Alice             1
Bob               1
Charlene          1
NULL              0
NULL              0
NULL              0
David             1
Emily             1
NULL              0

Name values: (Alice, Bob, Charlene, David, Emily)

Position bit-string: This is a bit-string for the column where non-null positions are
set to 1 but null positions are set to 0. The bit-string is accompanied by a list of the
non-null values in the order of their appearance in the column. See Table 7.10 for an
example.

Position range: If there are sequences of non-null values in a column, the range of
this sequence can be stored together with the list of values in the sequence. As meta-
data the total number of rows and the number of non-null positions are stored. See
Table 7.11 for an example.

These three encodings suppress nulls and hence reduce the size of the data set. How-
ever, internally, query evaluation in the column store must be adapted to the null sup-
pression technique applied.

Table 7.11. Position range encoding

Name (original)   Position ranges
NULL              2 to 4: (Alice, Bob, Charlene)
Alice             8 to 9: (David, Emily)
Bob
Charlene          Name metadata:
NULL              total number of rows: 10
NULL              number of non-null values: 5
NULL
David
Emily
NULL

7.2 Column striping

Google’s Dremel system [MGL+ 10] allows interoperability of different data stores. As
a common data format for the different stores it uses a columnar representation of
nested data. A schema for a nested record is defined as follows. A data element τ can
either be of an atomic type or it can be a nested record. An atomic type can for example
be an integer, a floating-point number, or a string; the set of all atomic values is called
the domain dom. A nested record consists of multiple fields where each field has a
name A_i as well as a value that is again a data element τ (that is, it can have an
atomic type or be another nested record). Fields can have multiplicities meaning that
a field can be optional (written as ? and denoting that there can be 0 or 1 occurrences
of this field) or that it can be repeated (written as * and denoting that there can be 0
or more occurrences of this field). Otherwise the field is required (that is, the field has
to occur exactly once). Nesting is expressed by defining one field to consist of a group
of subfields. Hence a data record τ can be recursively defined as

τ = dom | ⟨A_1 : τ[*|?], . . . , A_n : τ[*|?]⟩

As an example, we consider several data records describing hotels. Each hotel has a
required ID, as well as required hotel information (a group of subelements) consisting
of a required name as well as a group of subelements for the address (containing a
city) and a group of subelements for the roomprice (consisting of a price for a single
room and a price for a double room). While the address group is optional the city is
required; that is, whenever there is an address element it must contain a city element.
The optional room price group for a hotel comprises an optional price for a single
room as well as an optional price for a double room. Lastly, information on several
staff members can occur repeatedly and each staff description contains a language
item that can be repeated. Each hotel is represented in a document by a nested data
record that complies with the following schema:

document hotelrecord {
required string hotelID;
required group hotel {
required string name;
optional group address {
required string city;
}
optional group roomprice {
optional integer single;
optional integer double;
}
}
repeated group staff {
repeated string language;
}
}

Note that each such document (complying with the above schema) has a tree-shaped
structure where only the leaf nodes carry values. The process of column striping de-
composes a document into a set of columns: one column for each unique path (from
the root to a leaf) in the document. For the example data record, column striping
would result in one column for hotelID, one column for hotel.name, one column
for hotel.address.city, one column for hotel.roomprice.single, one column for
hotel.roomprice.double, and one column for staff.language.
For each unique path, the values coming from different documents (in the exam-
ple, different hotel documents) are written to the same column in order to be able to
answer analytical and aggregation queries over all documents efficiently. However we
need some metadata to recombine an entire document as well as to be able to query
it (and return the relevant parts of each document). These metadata are called the
repetition level (to handle repetitions of paths) and the definition level (to handle
non-existent paths of optional or repeated fields).

The repetition level for a path denotes at which level in the path the last repetition occurred. Only
repeated fields in the path are counted for the repetition level.

The repetition level can range from 0 to the length of the path: 0 means that no repeti-
tion has occurred so far, whereas the path length denotes that the entire path occurred
for the previous field and is now repeated with the current field.

The definition level for a path denotes the maximum level in the path for which a value exist. Only
non-required fields are counted for the definition level.

To save storage space, only levels of non-required (that is, optional and repeated)
fields are counted for the definition level; that is, required fields are ignored. The def-
inition level can range from 0 to the level of the last optional or repeated field of the
path. Some extra null values have to be inserted in case no value for the entire path
but only for a prefix of the path exists. We derive the repetition and definition levels
for an example.
For the example data record, we store six columns. For these six columns the po-
tential definition and repetition levels can be obtained as follows:
– hotelID: Both the repetition and the definition level are 0 because the field is
required.
– hotel.name: Both the repetition and the definition level are 0 because all fields
in the path are required.
– hotel.address.city: The repetition level is 0 because no repeated fields are in
the path. If the city field is undefined (that is, null), the address field must also
be null (because otherwise the required city field would be defined); hence, the
definition level will be 0 because the optional address field is not present. If in
contrast the city field is defined, then the definition level is 1 because the optional
address field is present.
– hotel.roomprice.single: The repetition level is 0 because no repeated fields
are in the path. If the roomprice field is not present, the definition level is 0; if the
roomprice field is present, but the single field is undefined, then the definition
level is 1; in the third case, the single field is defined and hence also the roomprice
field is present such that the definition level is 2.
– hotel.roomprice.double: The repetition level is 0 because no repeated fields
are in the path. Similar to the above path, if the roomprice field is not present, the
definition level is 0; if the roomprice field is present, but the double field is un-
defined, then the definition level is 1; in the third case, the double field is defined
and hence also the roomprice field is present such that the definition level is 2.
– staff.language: Because the two fields in the path are repeated, both the repe-
tition level as well as the definition level are incremented, depending on whether
the fields are defined and considering the preceding path in the same document
to count the repetitions.
– If the staff element is not present at all, then the repetition level and the defi-
nition level are both 0.
– If the current path only contains a staff field (without any language field), then
the definition level is 1; the repetition level depends on the preceding path in
the same document: if in the preceding path a staff field is present, then the
repetition level is 1 otherwise it is 0.
– If the current path contains a language field (inside a staff field), the defini-
tion level is 2. If the preceding path in the same document contains a staff
field (without any language field), then the repetition level is 1; if the preceding
path in the same document contains a staff field and a language field, the cur-
rent and the preceding language field can either be subfields of the same staff
field (then the repetition level is 2) or the current and the preceding language
field are subfields of two different staff fields (then the repetition level is 1);
otherwise the repetition level is 0.

In the example, three different hotels are stored in the system each complying with the
above definition of a data record. Note that the nesting of the staff.language element
differs (in some cases the language element repeats at level 1 in other cases at level
2) and that some non-required fields (like roomprice and staff) are missing from some
records.

hotelID : ’h1’
hotel :
name : ’Palace Hotel’
staff :
language : ’English’
staff :
language : ’German’
language : ’English’
staff :
language : ’Spanish’
language : ’English’

hotelID : ’h2’
hotel :
name : ’Eden Hotel’
address :
city : ’Oldtown’
roomprice :
single : 85
double : 120

hotelID : ’h3’
hotel :
name : ’Leonardo Hotel’
address :
city : ’Newtown’
roomprice :
double : 100
staff :
language : ’English’
staff :
language : ’French’

Each document can be sequentially written as a set of key-value pairs; the number of
levels in each key depends on the level of nesting in the document. Our hotel docu-
ments can hence be written as the following sequence where null values are explicitly
added for non-existent fields:

hotelID : ’h1’,
hotel.name : ’Palace Hotel’,
hotel.address : null,
hotel.roomprice.single : null,
hotel.roomprice.double : null,
staff.language : ’English’,
staff.language : ’German’,
staff.language : ’English’,
staff.language : ’Spanish’,
staff.language : ’English’,
hotelID : ’h2’,
hotel.name : ’Eden Hotel’,
hotel.address.city : ’Oldtown’,
hotel.roomprice.single : 85,
hotel.roomprice.double : 120,
staff.language : null,
hotelID : ’h3’,
hotel.name : ’Leonardo Hotel’,
hotel.address.city : ’Newtown’,
hotel.roomprice.single : null,
hotel.roomprice.double : 100,
staff.language : ’English’,
staff.language : ’French’

The problem with this sequential representation is that the staff information is am-
biguous: some staff members speak two languages (so that the language element at
level 2 is repeated) while in other cases there are two staff members speaking one lan-
guage each (so that the staff element at level 1 is repeated). To avoid this ambiguity,
the repetition level for each field is added. Moreover definition levels are calculated to
retain the structure information for non-existent fields.

hotelID : ’h1’ (rep: 0, def: 0),


hotel.name : ’Palace Hotel’ (rep: 0, def: 0),
hotel.address : null (rep: 0, def: 0),
hotel.roomprice.single : null (rep: 0, def: 0),
hotel.roomprice.double : null (rep: 0, def: 0),
staff.language : ’English’ (rep: 0, def: 2),
staff.language : ’German’ (rep: 1, def: 2),
staff.language : ’English’ (rep: 2, def: 2),
staff.language : ’Spanish’ (rep: 1, def: 2),
staff.language : ’English’ (rep: 2, def: 2),
hotelID : ’h2’ (rep: 0, def: 0),
hotel.name : ’Eden Hotel’ (rep: 0, def: 0),
hotel.address.city : ’Oldtown’ (rep: 0, def: 1),
hotel.roomprice.single : 85 (rep: 0, def: 2),
hotel.roomprice.double : 120 (rep: 0, def: 2),
staff.language : null (rep: 0, def: 0),
hotelID : ’h3’ (rep: 0, def: 0),
hotel.name : ’Leonardo Hotel’ (rep: 0, def: 0),
hotel.address.city : ’Newtown’ (rep: 0, def: 1),
hotel.roomprice.single : null (rep: 0, def: 1),
hotel.roomprice.double : 100 (rep: 0, def: 2),
staff.language : ’English’ (rep: 0, def: 2),
staff.language : ’French’ (rep: 1, def: 2)

Finally all entries for each path are stored in a separate column table with the repe-
tition and definition levels attached. See Table 7.12 for the final representation of our
example.
To achieve a compact representation, for paths only containing required fields
(where repetition and definition levels are always 0) the repetition and definition lev-
els are not stored at all. A similar argument applies to the value null: it suffices to store
the definition and repetition level and represent the value null as an empty table cell.
While the column striping representation allows for a compact and flexible stor-
age of records, the query answering process needs to recombine the relevant data from
the appropriate column tables; this process is called record assembly. In order to an-
swer queries on the column-striped data records, finite state machines (FSM) are
defined that assemble the fields in the correct order and at the appropriate nesting
level. The advantage is that only fields affected by the query have to be read while
other fields (in particular in other columns) are not accessed.

Table 7.12. Column striping example

hotelID             hotel.name                  hotel.address.city
value    rep def    value             rep def   value      rep def
'h1'     0   0      'Palace Hotel'    0   0     null       0   0
'h2'     0   0      'Eden Hotel'      0   0     'Oldtown'  0   1
'h3'     0   0      'Leonardo Hotel'  0   0     'Newtown'  0   1

hotel.roomprice.single   hotel.roomprice.double   staff.language
value  rep def           value  rep def           value      rep def
null   0   0             null   0   0             'English'  0   2
85     0   2             120    0   2             'German'   1   2
null   0   1             100    0   2             'English'  2   2
                                                  'Spanish'  1   2
                                                  'English'  2   2
                                                  null       0   0
                                                  'English'  0   2
                                                  'French'   1   2

For example, to assemble only the hotelID as well as staff.language data, only
a few transition rules are needed as shown in Figure 7.1. We need rules that start the
record with the hotelID and then jump to the staff.language element until the end of
the record is reached.
These rules mean that the current row will be read and output with a nesting struc-
ture according to the definition level; after outputting the current row, we look at the
repetition level of the next row and apply the transition rule for which the repetition
level corresponds to the transition condition. For null values the definition levels are
interpreted to determine which prefix of the path was part of the document.

start hotelrecord → hotelID
hotelID → staff.language             (rep = 0)
staff.language → staff.language      (rep = 1, 2)
staff.language → end hotelrecord     (rep = 0)

Fig. 7.1. Finite state machine for record assembly



7.3 Implementations and Systems

In this section, we present a column store system and an open source implementation
of the column striping algorithm.

7.3.1 MonetDB

MonetDB is a column store database with more than two decades of development ex-
perience.

Web resources:
– MonetDB: https://www.monetdb.org/
– documentation page: https://www.monetdb.org/Documentation
– Mercurial repository: http://dev.monetdb.org/hg/MonetDB/

It stores values in so-called binary association tables (BATs) that consist of a head and
a tail column; the head column stores the “object identifier” of a value and the tail col-
umn stores the value itself. Internally, these columns are stored as memory-mapped
files that rely on the fact that the object identifiers are monotonically increasing and
dense numbers; values can then efficiently be looked up by their position in virtual
memory. For query execution, MonetDB relies on a specialized algebra (BAT algebra)
that processes in-memory arrays. Internally, these BAT algebra expressions are imple-
mented by a sequence of low-level instructions in the MonetDB Assembly Language
(MAL). Several optimizers are part of MonetDB that make execution of MAL instruc-
tions fast.

7.3.2 Apache Parquet

Apache Parquet implements the column striping approach explained in Section 7.2.

Web resources:
– Apache Parquet: http://parquet.apache.org/
– documentation page: http://parquet.apache.org/documentation/latest/
– GitHub repository: https://github.com/apache/parquet-mr

For efficient handling, the striped columns are divided into smaller chunks (column
chunks). Column chunks consist of several pages (the unit of access) where each page
stores definition levels, repetition levels and the values. Moreover, several column
chunks (column chunks coming from different striped columns) are grouped into row
groups; that is, a row group contains one column chunk for each column of the original
data and hence corresponds to a horizontal partitioning. Several row groups are stored
in a Parquet file together with some indexing information and a footer. Among other
information the footer stores offsets to the first data page of each column chunk and
the indexes. These metadata are stored in a footer to allow for single-pass writing: the
file is filled with row groups first and then the metadata are stored at the end. However,
when reading the data, the metadata have to be accessed first. More precisely, in the
footer there are file metadata (like version and schema information) as well as meta-
data for each column chunk in a page of the file. If the file metadata are corrupted, the
file cannot be read. Additionally each page contains metadata in a header. Based on
the metadata, those pages that are of no interest to the reader can be skipped. Parquet
supports run-length encoding as well as dictionaries but can be extended by custom
column compression or encoding schemes.

7.4 Bibliographic Notes

Column-oriented storage organization of relational data has been investigated since
the 1970s and has come to be known as the decomposition storage model [LS71,
CK85, KCJ+ 87]. Extending these foundational approaches, several column store sys-
tems are nowadays available. For example, the open source product MonetDB [IGN+ 12]
calls itself “the column store pioneer” and fully relies on column-wise storage of data.
Vectorwise [ZB12] offers several optimizations “beyond column stores”. The research
project C-Store [SAB+ 05] has been continued as the Vertica system [LFV+ 12]. Many
other commercial systems use column store technology for fast analytical queries. The
journal article [ABH+ 13] gives a profound overview of column store technology includ-
ing compression, late materialization and join optimizations. Compression methods
and their performance trade-offs are studied in several articles like [RH93, AMF06,
HRSD07, KGT+ 10, Pla11]. In addition, [SFKP13] also present indexing on columns. Pros
and cons of row stores versus column stores have been analyzed in different settings
– for example, [HD08] or [AMH08].
8 Extensible Record Stores
Extensible record stores are database systems akin to Google’s BigTable system. These
databases have tables as their basic data structure although with a highly flexible col-
umn management; they implement the concept of column families that act as con-
tainers for subsets of columns. Alternative names are tabular data stores, columnar
data stores, wide column stores or column family stores. To avoid confusion with the
column stores introduced in the previous chapter, we will stick with the more generic
name extensible record store here. Although extensible record stores reuse some ter-
minology of the relational data model, they are also akin to key-value stores because
they map a unique, multidimensional key to a value.

8.1 Logical Data Model

Extensible record stores bid farewell to the strict normalization paradigm known from
RDBMSs. Instead they encourage a certain amount of data duplication for the sake of
better query locality and more efficient query execution. That is, while the design in
the relational model is centered around entities (see the Entity-Relationship modeling
in Section 1.3.1) and later on normalization (see Section 2.2) is applied to obtain several
tables with less anomalies, for an extensible record store first of all a typical query
workload should be identified and data modeled around this workload accordingly.
As an illustration, let us revisit our library example with three tables similar to the
normalized tables in Section 2.2 (see Table 8.1).

Table 8.1. Library tables revisited

Book          BookID   Title         Author
              1002     Databases     Miller
              1004     Algorithms    Jacobs
              1006     Programming   Brown
              1008     SQL           Smith

Reader        ReaderID   Name
              205        Peter
              207        Laura

BookLending   BookID   ReaderID   ReturnDate
              1002     205        25-10-2016
              1006     205        27-10-2016
              1008     205        20-10-2016
              1004     207        31-10-2016
              1002     207        25-11-2016

Now let us assume that a typical query on this data set would be: “What are the names
of those readers who have to return books on or before 31-10-2016?”. When we store
these data in a row store, we would have to execute a join operation in order to recom-
bine the return date and the corresponding reader’s name. A row store would also
load more data than necessary into main memory when answering this query; more
precisely, it would also load the reader ID columns which are irrelevant in the query.
Similar overhead occurs when we store the data in a column store, because it would
have to execute a tuple reconstruction as well as a join on the reader ID: it has to com-
bine values from the Name column and the ReturnDate column and hence it has to
access and combine data from many data locations.
For all of these reasons, extensible record stores represent cells in the table as
key-value pairs: every column name is the key which is mapped to the column value
in a row. In other words, the table’s values are stored as a collection of key-value pairs
where the column name is repeated for every value in the column and rows are iden-
tified by a row key. Due to this repetition the general advice is to choose only short
column names as otherwise storage space would be increased significantly for larger
tables. On the other hand, the key-value pair representation makes the table structure
much more flexible: we are free to choose different column names in every row. For
example, we could decide to use the return date as the column name and map it to
the reader’s name; in other words, we combine the names and return dates in a single
key-value pair for each row. We can ignore the reader ID for the moment as we are not
interested in it in our example workload (it might be stored in another table together
with more data of the reader). To continue the example, our table would correspond
to the following collection of key-value pairs (let us ignore the BookID column for the
moment, too):

Title: Databases,    Author: Miller,   25-11-2016: Laura,   25-10-2016: Peter

Title: Algorithms,   Author: Jacobs,   20-10-2016: Peter

Title: Programming,  Author: Brown,    27-10-2016: Peter

Title: SQL,          Author: Smith,    31-10-2016: Laura

What extensible record stores do to organize these key-value pairs is adding another
structural dimension (apart from the usual two dimensions column and row): the col-
umn family. A column family groups columns that are often accessed simultaneously
in a typical query. The columns inside a column family are identified by a so-called
column qualifier. The full column name consists of the column family name and
the column qualifier. When processing such a query, only those column families are
fetched into main memory that contain the columns that are required by the query;
the remainder of a row need not be loaded. Column families hence provide data lo-
cality by storing data inside a column family together on disk. One further advantage
of column families is the dynamic addition of new columns: While column families
must be created before using them and hence are fixed for a table, inside a column
family, arbitrary columns can be added at runtime – and theoretically there can be
infinitely many columns in each column family. In our example it might make sense
to separate the bibliographic information of a book from the lending information.
So we introduce one column family called BookInfo and one column family called
LendingInfo:

                    BookInfo                                 LendingInfo

Title → Databases      Author → Miller       25-11-2016 → Laura    25-10-2016 → Peter

Title → Algorithms     Author → Jacobs       20-10-2016 → Peter

Title → Programming    Author → Brown        27-10-2016 → Peter

Title → SQL            Author → Smith        31-10-2016 → Laura

Lastly, the columns (that is, the key-value pairs) that belong to the same entity are
grouped together according to a unique identifier (the row key). In our example, the
entities are books and we use the book ID as the row key. Most importantly, the row
keys have to be unique for each entity inside each column family. However, there is
no way of specifying foreign key constraints between different column families and
no referential integrity is ensured. That is, an extensible record store has no means
of ensuring that for a book ID that appears as a row key in the LendingInfo column
family, there is also an entry for the same row key in the BookInfo column family.
Hence, all kinds of referential integrity checks have to be done by the application
using the extensible record store.

         BookInfo                                            LendingInfo

1002   Title → Databases      Author → Miller     1002   25-11-2016 → Laura    25-10-2016 → Peter

1004   Title → Algorithms     Author → Jacobs     1008   20-10-2016 → Peter

1006   Title → Programming    Author → Brown      1006   27-10-2016 → Peter

1008   Title → SQL            Author → Smith      1004   31-10-2016 → Laura

At this point we also see the flexibility of how columns can be added to rows:
one row can have different columns (with different column qualifiers like the due
dates) than other rows inside the same column family. This is exactly why extensi-
ble record stores are good at storing sparse data: in contrast to a relational table
that would record null values, an extensible record store just ignores values that
are not present and no null values are included in rows. All in all, the concatena-
tion of the row key, the column family name and the column qualifier identifies
a cell in the extensible record store; that is, the full key to access a column value
is of the form rowkey:columnfamily:columnqualifier. For example, the full key
1008.BookInfo.Author uniquely identifies the value “Smith”.
Extensible record stores offer the convenient feature of ordered storage: while
the relational data model is set-based and the order of the output basically depends
on the DBMS, extensible record stores sort the data internally. Inside a column fam-
ily, the rows are sorted by their row keys; and inside a row the columns are sorted by
their qualifiers. Some extensible record stores (like HBase) just represent row keys and
column qualifiers as byte arrays and hence order row keys and column qualifiers by
their binary representation. Other extensible record stores (like Cassandra) offer data
types for row keys and column qualifiers and hence sorting can be done according to
the data type and may differ from the binary order. In particular, for different column
families, different sort orders can be chosen. For example, the column qualifiers in the
BookInfo column family can be sorted alphabetically descending, while the column
qualifiers in LendingInfo can be sorted chronologically ascending, and the row keys
are ordered numerically ascending.

BookInfo (row keys numerically ascending, column qualifiers alphabetically descending):

1002   Title → Databases      Author → Miller
1004   Title → Algorithms     Author → Jacobs
1006   Title → Programming    Author → Brown
1008   Title → SQL            Author → Smith

LendingInfo (row keys numerically ascending, column qualifiers chronologically ascending):

1002   25-10-2016 → Peter     25-11-2016 → Laura
1004   31-10-2016 → Laura
1006   27-10-2016 → Peter
1008   20-10-2016 → Peter

With these ordering features, extensible record stores are particularly well-suited for
identifying contiguous sequences of columns – like the one we considered at the be-
ginning of this section: “What are the names of those readers who have to return books
on or before 31-10-2016?”. To answer this query we have to find out the columns with
column qualifiers less than or equal to 31-10-2016. Due to the ordering we know that
the matching columns are stored in a consecutive range and that we do not have to
search further once we have reached a column qualifier greater than 31-10-2016. In a
similar manner, row keys should be chosen in such a way that rows that are often ac-
cessed together have row keys that are close according to the chosen ordering – so that
they could be fetched from disk in the same slice of rows.
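As a minimal illustration (not any particular store's API), the ordered columns of a single LendingInfo row can be modeled with a sorted map; note that the due dates are written in ISO format here so that their binary order coincides with the chronological order:

import java.util.SortedMap;
import java.util.TreeMap;

// Minimal sketch: one row of the LendingInfo column family as a sorted map
// from column qualifier (the due date) to column value (the reader's name).
public class DueDateRangeScan {
    public static void main(String[] args) {
        TreeMap<String, String> lendingInfoRow = new TreeMap<>();
        lendingInfoRow.put("2016-10-25", "Peter");
        lendingInfoRow.put("2016-11-25", "Laura");

        // all columns with qualifiers less than or equal to 2016-10-31: because
        // the map is sorted, this is a contiguous prefix and the scan can stop
        // at the first qualifier greater than the bound
        SortedMap<String, String> due = lendingInfoRow.headMap("2016-10-31", true);
        due.forEach((date, reader) ->
            System.out.println(reader + " must return a book by " + date));
    }
}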
Last but not least, some extensible record stores add one more dimension to the
row-columnfamily-column space: time. Every insert or update of a value is accompa-
nied by a timestamp. This timestamp can be specified by the user in the insert or
update command; otherwise the current system time (in milliseconds) is used. Exten-
sible record stores hence provide an automatic versioning of column values. When
a read command without an explicit timestamp is issued on a cell, the most recent
version is returned; a user can however also specify a timestamp in a read command
to retrieve older versions. Versioning can further be configured by specifying a maxi-
mum threshold for the number of stored versions: the oldest version will automatically
be discarded once a new version is stored and the maximum value is exceeded. An-
other option for automatic discarding is to specify a time-to-live value for each cell:
when the specified time span has elapsed, the corresponding version of the cell is
deleted.
One last thing to mention is that extensible record stores usually do not make
a distinction between inserts and updates. Instead, a put command is provided that
checks whether there is an existing cell for the given key; if this is the case, the value
for the key will be updated; otherwise a new cell is inserted. This is why this operation
is sometimes called an upsert. With versioning enabled, the upsert also checks the
timestamp provided in the put command; the version of the cell with exactly the same
timestamp as provided in the put command is updated; if there is no such version,
a new version for the provided timestamp (that is different from the existing times-
tamps) is inserted.
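A minimal sketch of this upsert behavior on the version level (not any particular store's API; class and method names are illustrative):

import java.util.TreeMap;

// Minimal sketch: a versioned cell as a map from timestamp to value. put()
// behaves as an upsert per version, get() without a timestamp returns the
// most recent version.
public class VersionedCell {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    public void put(long timestamp, String value) {
        // overwrites the version with exactly this timestamp,
        // otherwise a new version is inserted
        versions.put(timestamp, value);
    }

    public String get() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    public String get(long timestamp) {
        return versions.get(timestamp);
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell();
        cell.put(1L, "Peter");
        cell.put(2L, "Laura");    // new version
        cell.put(2L, "Laura B."); // upsert: same timestamp, value replaced
        System.out.println(cell.get());   // most recent version: "Laura B."
        System.out.println(cell.get(1L)); // older version: "Peter"
    }
}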

8.2 Physical storage

Under the hood, extensible record stores use several techniques for efficient query
answering and recovery. We survey some of them in this section.

8.2.1 Memtables and immutable sorted data files

Extensible record stores implement a write-optimized storage model: all data records
written to the on-disk store will only be appended to the existing records. Once written,
these records are read-only and cannot be modified: they are immutable data files.
Any modification of a record must hence also be simulated by appending a new record
in the store.

Fig. 8.1. Writing to memory tables and data files (writes go to a memtable in main memory, which is flushed to immutable sorted files 1 to n on disk).

More specifically, writing to and reading from an extensible record store
comprises the following steps.
Memtable: The most recent writes are collected in a main memory table (a so-called
memtable) of fixed size. Usually there is one memtable per column family. A record in
the memtable is identified by its key (row key, column family, column qualifier and
timestamp in milliseconds). In case of an upsert, the value part of this record also
contains the new value that should be assigned to this key; in case of a deletion the
data part of the record is empty. Recall that the timestamp can be chosen by the user;
if no particular timestamp is specified in the put request, the current system time is
used by default.
Tombstones: Deletions are treated by writing a new record for a key. However, this
record has no value assigned to it; instead a delete marker (called tombstone) is at-
tached to the record. The tombstone masks all previous versions (with timestamps
prior to the timestamp of the tombstone) which will then be invisible to the user. Later
on however, newer versions for the same key can again be inserted and will be acces-
sible by the user. Several types of tombstones can be defined that differ in the scope
of records that they mark as deleted: a tombstone can for example mark as deleted
either a single version of a column (defined by its exact timestamp), an entire column
(with all its versions), or an entire column family (with all versions of all columns).
Sorted data files: Once the memtable is filled (at least up to a certain percentage), it
is written to disk (flushed) and an entirely new memtable is started. When flushing a
memtable, its records are sorted by key. Recall that in some record stores the sorting
order of the row key and column qualifiers can be configured; by default, binary or-
dering (byte order) is used. The flushed sorted data files on disk are immutable: they
are accessed when reading data but no writes are executed on them. Modifications
(upserts or deletions) for a given key will be contained in other sorted data files that
are flushed at a later point of time. The advantage of immutable data files is that buffer
management is a lot easier: there are no “dirty pages” in the page buffer that contain
modifications that have to be translated to writes on the on-disk records. Internally,
the sorted data files carry a sequence number to maintain the chronological order of
write operations which is important for the read process.
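A minimal sketch of the memtable and its flush just described (the key encoding, the flush threshold and the file naming are illustrative assumptions; timestamps are part of the string key and compared as text for simplicity):

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentSkipListMap;

// Minimal sketch: recent writes are collected in a sorted in-memory map and
// flushed to a new immutable sorted data file once a size threshold is reached.
public class Memtable {
    private final ConcurrentSkipListMap<String, String> records = new ConcurrentSkipListMap<>();
    private final int flushThreshold;
    private int fileSequenceNumber = 0;

    public Memtable(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void put(String rowKey, String family, String qualifier,
                    long timestamp, String value) throws IOException {
        // the key concatenates row key, column family, column qualifier and timestamp
        records.put(rowKey + ":" + family + ":" + qualifier + ":" + timestamp, value);
        if (records.size() >= flushThreshold) {
            flush();
        }
    }

    // write all records in key order to a new sorted data file and start over
    private void flush() throws IOException {
        Path file = Path.of("sorted-file-" + (++fileSequenceNumber) + ".txt");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
            records.forEach((key, value) -> out.println(key + "\t" + value));
        }
        records.clear();
    }
}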
Combine upon read: The downside of immutable data files is that they complicate
the read process: retrieving all the relevant data that match a user query requires
combining records from several on-disk data files and the memtable.

Fig. 8.2. Reading from memory tables and data files (a read combines records from the memtable and from all sorted files 1 to n on disk in a block buffer).

This combination
may affect records for different search keys that are spread out across several data
files; but it may also apply to records for the same key of which different versions ex-
ist in different data files. In other words, all sorted data files have to be searched for
records matching the read request (see Figure 8.2).
Roughly, there are two types of read requests: get (also called point query) and
scan (also called range query). A get request accesses a particular row (identified
by its row key). A scan iterates over a contiguous range of rows depending on some
condition on the row key; for example, a starting row key can be specified and then
the next consecutive 10 rows can be retrieved with a scan. Due to sorting of the data
files, scans for a contiguous range of keys can be done efficiently.
Once the row keys to be accessed are identified, the result can be restricted to a
subset of the columns inside each row (by specifying their column qualifiers). In some
extensible record stores a set of versions of a cell (by specifying its column qualifier
and a condition on the timestamp) can also be accessed. For example, for a particular
column qualifier and a user-defined threshold k, the k records with the k most recent
timestamps are returned – provided that at least k versions of the cell exist. Usually, a
range of timestamps can also be specified on a per-query basis; for example from 0 up
to a certain timestamp to retrieve the oldest records.
Combining records from various on-disk data files and the memtable and iden-
tifying the most recent version of a column is not trivial. One difficulty is that there
may be a clash of timestamps: several records for exactly the same key and version
(with identical row key, column family, column qualifier and timestamp) but differing
in their value portion may exist in the immutable data files. These clashes may occur
because the timestamp is part of the key: if the user specifies a key that already exists
in a data file, a new record with the same key is appended to the memtable and later on
flushed to a new data file. When reading this key, several values for exactly the same
key could be returned from different data files. It is however desirable to determine a
unique most recent value for each key. One way to handle such a clash is to use the
unique sequence number of the stored data files: the record for a key that is contained
in the data file with the highest sequence number is the most recently written one. A
further difficulty of reads is that the extensible record store has to interpret the time-
to-live (TTL) values of records as well as tombstones when retrieving and combining
data from multiple sorted data files. Because on-disk data files are immutable, records
with an expired TTL still remain in the store and hence have to be skipped when
combining records into the result set. In a similar manner, deletions mask all records
with earlier timestamps than the timestamp of the tombstone; these masked records
have to be filtered out of the result set.
Some extra information can be maintained for each data file to speed up the com-
bine process; for example, the range of row keys in the file or the minimum and maxi-
mum timestamp can be stored as metadata of the file. With the help of these metadata,
some data files may be excluded from the combine process straight away.
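A minimal sketch of this combine step (under the simplifying assumption that each data file is reduced to the newest record it holds per cell; the record type and method names are illustrative):

import java.util.List;
import java.util.Map;

// Minimal sketch of combine upon read: all sorted data files (and the memtable,
// which can be treated as the newest "file") are consulted for a cell; the
// version with the largest timestamp wins, and a clash of identical timestamps
// is resolved by the file sequence number. Tombstone and TTL handling are omitted.
public class CombineUponRead {
    record Version(long timestamp, int fileSequenceNumber, String value) {}

    static Version read(String cellKey, List<Map<String, Version>> files) {
        Version best = null;
        for (Map<String, Version> file : files) {
            Version candidate = file.get(cellKey);
            if (candidate == null) continue;
            if (best == null
                    || candidate.timestamp() > best.timestamp()
                    || (candidate.timestamp() == best.timestamp()
                        && candidate.fileSequenceNumber() > best.fileSequenceNumber())) {
                best = candidate;
            }
        }
        return best; // null if no data file contains the cell
    }
}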

8.2.2 File format

Extensible record stores store the data in the on-disk data files in a certain format
with the following properties:
Data blocks: An on-disk data file is composed of several data blocks. The block
size can usually be configured in the extensible record store settings. Moreover,
in some extensible record stores, a different block size can be used in each col-
umn family and hence the block size can be specified by the user when creating
the column family. A block may also span multiple conventional memory pages;
recall that the size of a memory page is usually fixed and dictated by the memory
buffers and the operating system. On top of this, memory management is even
more flexible in extensible record stores: the block size can also be exceeded by
some records in a data file. Indeed, if the size of a record is larger than the block
size, the record is nevertheless handled as one coherent block although it spans
several memory pages.
Key-value pairs: As shown in Figure 8.3, a data block may contain one or more
key-value pairs. Each key-value pair contains the entire key – that is, row key, col-
umn family, column qualifier and timestamp. This format hence is the foundation
for the flexibility of extensible record stores because no fixed database schema is
needed to interpret the data. This flexibility however comes at the price of repeti-
tious occurrences of portions of keys: when a row consists of several columns, the
row key is contained in every record for each of these columns; analogously, col-
umn families and column qualifiers are usually parts of keys in several different
records. For this reason, longer row keys, column families and column qualifiers
have a negative impact on storage consumption in extensible record stores. The
key is followed by type information: the type determines whether the record is a put
(in other words, an upsert) – in which case the new value is appended – or whether it
is a deletion – in which case the type also determines the scope of the deletion, such
as a single version, an entire column or an entire column family.

Fig. 8.3. File format of data files (a data file consists of data blocks, an index and a trailer; each data block contains key-value pairs, and each key-value pair consists of the key – row key, column family, column name and timestamp – followed by the type and the value).

In some extensible record
stores additionally the type of an increment is available: in this case the column is
called a counter column and during an insert a certain increment value is added
to the previous most recent value of the record.
Index: Obviously, data files usually contain records for several keys and may be-
come quite large. Reading in records in a data file sequentially is a very inefficient
method when searching for a single record for a given key in such a large data file.
In order to speed up the retrieval of records from data files, an index structure is
maintained at the end of each file. Indexing is done for row keys in a block-wise
manner; that is, the first row key on each block is inserted in the index. The re-
trieval process is then supported by the index as follows. When searching for a
given key in a data file, first of all the entire index of the data file is loaded into
main memory. As not all row keys are maintained in the index (only the first row
key in each block), the index has to return either the entry for the exact search
key (in which case the search key is the first key in a block), or the index entry for
the largest row key preceding the search key. In the former case when the exact
search key is contained in the index, the index entry is used to offset to the correct
block in the data file and load the block into memory to access the value for the
search key. In the latter case the exact search key is not found in the index and we
hence cannot be sure whether a record for the search key is contained in the data
file or not; due to this the index entry for the largest preceding row key is used to
offset into the data file, load the block into main memory and parse it sequentially
either until the exact search key is found and its record is accessed – or until the
end of the block is reached without finding a record for the search key (a minimal
sketch of this lookup follows after this list).
Fig. 8.4. Multilevel index in data files (each block of key-value pairs is followed by a leaf index; a root index at the end of the file points to the leaf indexes and is followed by the trailer).

Trailer: As the last component of a data file, a trailer contains management infor-
mation (for example, where the index starts; see Figure 8.3).
Multi-level index: Even though indexing is done at the block level and hence not
all row keys are maintained in the index, the single index at the end of each data
file might become quite large. This might slow down the read process because
the entire index has to be loaded into main memory before accessing any key-
value pairs. This is where multilevel indexes come to the rescue. A multilevel index
splits the single index into several sub-indexes: one sub-index (called leaf index)
is stored at the end of each block, and only a small super-index (called root index)
pointing to the sub-indexes is stored at the end of the data file. As leaf indexes are
contained in each block they now allow for a more fine-grained indexing: each
key inside a block can be indexed in the leaf index of the block such that its index
entry determines an offset into the block where the record for the key can be found.
This extended file format is sketched in Figure 8.4. For extremely large data files,
even more intermediate index levels are possible to help keep the root level small;
the root index entries then point to the intermediate indexes and the intermediate
indexes themselves point to leaf indexes.
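To make the block-level index lookup described above more concrete, the following minimal sketch (not any particular store's on-disk format) models the index as a sorted map from the first row key of each block to the block's position; for simplicity, blocks are kept in memory instead of being read from disk:

import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of a block-level index lookup: a floor lookup on the index
// yields the only block that can contain the search key, which is then scanned.
public class BlockIndexLookup {
    static String lookup(String searchKey,
                         TreeMap<String, Integer> blockIndex,
                         Map<Integer, TreeMap<String, String>> blocks) {
        // largest first-row-key that is less than or equal to the search key
        Map.Entry<String, Integer> entry = blockIndex.floorEntry(searchKey);
        if (entry == null) {
            return null; // search key precedes the first key of the file
        }
        TreeMap<String, String> block = blocks.get(entry.getValue());
        // sequential scan inside the block (here simply a map lookup)
        return block.get(searchKey);
    }
}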

Depending on the exact organization of the index and the on-disk management of
the data files (see the notion of compaction below) there are different implementa-
tions and manifestations of these data files. In particular, a form of data file is the
Log-Structured Merge Tree (LSM tree) or one of its variants like for example a Sorted
Array Merge Tree (SAMT) or a Cache-Oblivious Look-ahead Array (COLA). All these tree
structures have in common that they offer a better write throughput and also a more
efficient behavior for scans because the data in the leaf nodes are stored in
contiguous data blocks ordered by their keys (as opposed to scattered leaf nodes with
other tree structures like the B-tree).

8.2.3 Redo logging

The memtable is kept in volatile memory until it is eventually flushed to disk. Data are
only durable (ensured by backups and replication) when stored in the on-disk data
files and hence data that are contained in the memtable are vulnerable to failures of
the database server.

Fig. 8.5. Write-ahead log on disk (a write is (1) appended to the redo log on disk and (2) added to the memtable in main memory, which is later flushed to the sorted files).

For example, a crash of the server may wipe out the entire mem-
ory, or write errors may occur when flushing the memtable to disk. When restarting
the server (or when trying to rewrite the data to disk) the data in the memtable have
to be recovered. Recovery is achieved with the help of an on-disk log file that keeps
track of all records that are appended to the memtable but have not yet been flushed
to the disk. Note that this means that all data have to be written twice: once to the
log file and then to the memtable. Inside the log file, each record receives a log se-
quence number (LSN) that maintains the chronological order of write operations. Be-
cause data are stored to the on-disk log file before (that is, ahead of) appending them
to the memtable, this process is often called write-ahead logging (see Figure 8.5).
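A minimal sketch of this write-ahead logging scheme (log format and file name are illustrative assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.TreeMap;

// Minimal sketch of write-ahead logging: every record is first appended to an
// on-disk redo log together with a log sequence number (LSN) and only then
// added to the memtable in main memory.
public class WriteAheadLog {
    private final Path logFile = Path.of("redo.log");
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private long logSequenceNumber = 0;

    public void write(String cellKey, String value) throws IOException {
        String logRecord = (++logSequenceNumber) + "\t" + cellKey + "\t" + value
                + System.lineSeparator();
        // (1) append the record to the redo log on disk
        Files.writeString(logFile, logRecord,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        // (2) append the record to the memtable in main memory
        memtable.put(cellKey, value);
    }

    // redo recovery: re-execute all logged operations in LSN order
    public void recover() throws IOException {
        memtable.clear();
        for (String line : Files.readAllLines(logFile)) {
            String[] parts = line.split("\t");
            memtable.put(parts[1], parts[2]);
        }
    }
}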
One peculiarity of extensible record stores is that they assume redo-logging to
be sufficient: When restarting the database server after a failure the memtable is re-
constructed from the log file by re-executing all the operations in the log as ordered
by their LSNs. This restriction is justified by the fact that extensible record stores
support neither transactions that range over several rows nor long-lived transactions
which would have to be logged completely before appending their operations to the
memtable. Redo-logging such complex transactions would lead to a slow write
performance; moreover, undo-logging would be necessary when such a transaction
aborts prematurely: as the transaction did not complete, all operations that have been
executed by the transaction so far would have to be rolled back.
Although the logging mechanism adds an additional overhead to the write pro-
cess, it in fact improves overall write performance: while appending a record to the
log file in chronological order is fast, sorting the records by key is slower and can be
deferred until flushing the memtable; moreover, flushing the memtable corresponds
to a batch write of all records to sequential data blocks of a new data file – this is much
more efficient than inserting each record at its correct position one at a time. Last but
not least, writes to the same key can be coalesced and only the most recent record has
to be flushed to disk – in particular, no record for a key needs to be written to disk at all
if an upsert for the key is masked by a tombstone for the key in the memtable.
Fig. 8.6. Compaction on disk (a set of sorted files 1 to n is merged into a new, larger sorted file n+1).

8.2.4 Compaction

After some time, several flushes of memtables will have occurred and hence there will
be quite a lot of data files stored on disk. These data files will most probably contain
some outdated records: records whose time-to-live value has passed, records for which
more recent versions exist and for which the maximum number of stored versions is
exceeded, or records which are masked by a tombstone. Outdated records not only
unnecessarily occupy disk space, they also slow down read processes because they
have to be loaded and compared with other records in the combine process of data
retrieval (see Section 8.2.1). This is why a process called compaction was devised to
remove any unwanted records and merge a set of data files into a new one. As sketched
in Figure 8.6, a set of data files is chosen for compaction, their records are merged and
the result is written to a new larger data file (at a new location on disk); finally, the
small input data files can be deleted. More specifically, a minor compaction merges
only a small subset of all data files, whereas a major compaction merges all data files
into a single new one.
Several things have to be considered during compaction:
– The records of all key-value pairs in the data files have to be sorted by their keys
and hence reordering and restructuring of the index is necessary.
– At the same time, time-to-live values have to be interpreted so that expired records
can simply be ignored.
– If one of the data files contains a tombstone, all data that are masked by the tomb-
stone and have been written prior to the tombstone can be ignored. Note that
records that are masked by the tombstone but have been written after the insertion
of the tombstone (because they are contained in a more recent data file as identi-
fied by the data file sequence number) are handled differently: these records are
merged into the new data file but will still be masked by the tombstone if it is a mi-
nor compaction. Tombstones themselves can only be deleted during major com-
paction; this means that only after a major compaction will more recent records for a
key become visible, because they would previously be masked by the tombstone.
This somewhat incoherent behavior is usually chosen to simplify the compaction
process and the interpretation of tombstones during a read process. Other seman-
tics of deletions can be enforced but this would require data retrieval as well as
minor compactions to be more involved.
– In some extensible record stores, versioning settings are also enforced during
compaction: only a specified amount of versions for each key is kept at the max-
imum. For example, if the maximum amount of versions to be stored is set to
three, for each key the records with the three most recent timestamps are copied
to the compacted data file while all records with older timestamps are ignored.
– Last but not least, changing column family settings can be done during major
compaction: when settings (like the data type or sorting order of columns) have
been modified, the new settings will be applied during compaction to the records
in the older sorted data files. That is, after major compaction all records are consol-
idated to the new settings. This can be seen as an automatic support for schema
evolution.

Compaction is demanding with respect to disk space: sufficient disk capacity is
needed when smaller data files are merged into a new larger data file. While the
compaction is run, roughly twice the disk space as occupied by the smaller data files
is needed. The smaller data files will however be discarded after compaction (as soon
as all read processes on them have finished) effectively releasing the disk space.
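A minimal sketch of the merge step of a major compaction (under the simplifying assumptions that each data file holds at most one record per cell and only a single version is retained; TTL handling is omitted, and the record type is illustrative):

import java.util.List;
import java.util.TreeMap;

// Minimal sketch of a major compaction: data files (given from oldest to newest,
// as by their sequence numbers) are merged in key order; newer records override
// older ones and cells whose newest record is a tombstone are dropped entirely.
public class MajorCompaction {
    record Record(String value, boolean tombstone) {}

    static TreeMap<String, Record> compact(List<TreeMap<String, Record>> filesOldestFirst) {
        TreeMap<String, Record> merged = new TreeMap<>();
        for (TreeMap<String, Record> file : filesOldestFirst) {
            merged.putAll(file); // newer files override older records per key
        }
        // in a major compaction, tombstones themselves can be removed
        merged.values().removeIf(Record::tombstone);
        return merged; // would be written to a new, larger sorted data file
    }
}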
Several heuristics can be applied when choosing data files for minor compaction.
Oldest files first: Data files may be chronologically ordered (as by their data file
sequence number) and the oldest files with the lowest sequence number are cho-
sen for compaction; that is, older records migrate into larger compacted data files
first. If the data file sequence number is needed during data retrieval to avoid
timestamp clashes, only data files with a continuous range of sequence numbers
can be merged to maintain the global chronological order.
Small files first: In order to obtain a homogeneous set of data files, several similar
sized smaller files are merged into one larger file.
Compaction threshold: The user can configure a minimum number of files for
which a compaction is run. For a smaller number of files, compaction is deemed
unnecessary and inefficient.

Note that with these heuristics, the same record may be chosen for minor compaction
several times so that records unnecessarily often migrate from smaller files into
larger files. Furthermore, the size of the resulting compacted files cannot be con-
trolled. To avoid these issues, leveled compaction has proved to be advantageous.
With leveled compaction, the data files are organized into levels as shown in Fig-
ure 8.7. Each level contains a fixed amount of data files; data files in lower levels are
smaller than data files in higher levels. A flush always moves the memtable to a data
file in the lowest level L0.

Fig. 8.7. Leveled compaction (the memtable is flushed to a data file in level L0; merges move data from level L0 to L1, from L1 to L2, and so on).

Subsequent compaction steps move a record only from one
level to the next, so that the amount of merges for each record is bounded by the num-
ber of levels. It is also helpful for an efficient merge process to assign non-overlapping
key ranges to the data files inside each level; in this way the merge only involves one
of the data files in the next level while all other data files in the level remain unaffected.
Key ranges also improve the read process because not all data files have to be accessed
to search for a key.

8.2.5 Bloom filters

A Bloom filter is a probabilistic method to determine set membership for a given value.
To be more specific, for a given value c and a set S of values, the Bloom filter is a small
bit vector with which we can decide whether c is included in S without actually search-
ing for c in S – but it comes with a small probability of error. Hence, the following cases
of outcomes can be considered:
– True positive: A true positive means that the Bloom filter correctly reports a
match and confirms that c is an element of S; hence c ∈ S indeed holds.
– False positive: A false positive means that the Bloom filter wrongly reports a
match; that is, it assumes that c is included in S although it is not.
– True negative: A true negative means that the Bloom filter correctly reports a miss
and states that c is not an element of S; hence c ∉ S indeed holds.
– False negative: A false negative means that the Bloom filter erroneously reports
a miss saying that c is not included in S although in fact c ∈ S holds.

What makes Bloom filters a good choice for a quick membership pre-test is that only
one kind of error arises: False positives indeed happen with a certain probability.
Hence, when the Bloom filter reports a match (saying c is an element of S), we cannot
be sure if this is true: we have to actually search for c in S to verify whether c ∈ S holds
or whether the Bloom filter wrongly believed c to be an element of S although in fact
c ∉ S holds. In contrast, false negatives will never occur with a Bloom filter. In other
words, there will only be true negatives: when the Bloom filter decides that c ∉ S, we
can simply skip searching for c in S.

Fig. 8.8. Bloom filter for a data file (the Bloom filter is stored between the index and the trailer at the end of the data file).
For extensible record stores, Bloom filters can be maintained for all the row keys
in a data file with the following positive effect: When searching for a given query key in
the file, first the Bloom filter is accessed with the query key. If the Bloom filter reports
a miss, this will be a true negative; hence, we do not have to access any data (and not
even the index) inside this data file. In case the Bloom filter reports a match, we have to
load the index and search it for the query key (and potentially access the appropriate
block and scan it for the query key) to check whether the match was a true positive or
a false positive. In case of a false positive, the key will not be found in the data file. For
small data files, a single Bloom filter can be appended at the end of the data file. See
Figure 8.8 for an illustration with the trailer having a pointer to the start of the Bloom
filter entry. For larger files with lots of keys, a single Bloom filter will be large, too.
This will result in performance delays when searching for a row key in the data file. To
remedy this and in conjunction with an index, a Bloom filter can be broken into pieces
(similar to the multi-level index): a small Bloom filter “chunk” is maintained for the
keys in each block; the Bloom filter chunk is then queried for the existence of a key
before actually accessing data in the corresponding block. A further extension is to
not only have a Bloom filter for row keys but instead to maintain a Bloom filter for the
combination of row key and column qualifier. With such a row-key-column-qualifier
Bloom filter we only have to search the row for the given column qualifier in case of a
match; yet, in case the filter reports a miss, we do not access the data record at all.
More formally, a Bloom filter is a bit vector of a chosen length m with every po-
sition initialized to 0. The bit vector is accompanied by k different hash functions
$h_1, \ldots, h_k$ where each hash function is assumed to map an arbitrary value c randomly
to a number between 1 and m – that is, $h_i(c) \in \{1, \ldots, m\}$. More precisely, these
hash functions should not only map values randomly but also uniformly to the range
$1, \ldots, m$: for each number d between 1 and m, the probability that an input value c
is mapped to d should be equal to $\frac{1}{m}$; this can be written as $\mathrm{Prob}(h_i(c) = d) = \frac{1}{m}$. The
case that two different input values c and c′ are mapped to the same value (that is,
$h_i(c) = h_i(c')$) is called a collision. Collisions are the reason why false positives can
occur with Bloom filters as we will see shortly.

Let us now have a look at how Bloom filters improve query performance in extensible
record stores. When a record for a row key is inserted, we compute all k hash values
of the key and, for $h_i(key) = d$, the d-th bit in the Bloom filter is set to 1. See Figure 8.9
for a Bloom filter of length m = 16 that uses three hash functions. The Bloom filter is
initially all-zero. When the first record with key key1 is added, the three hash values
for key1 are calculated and the corresponding bits (1, 3 and 8) in the Bloom filter are
set to 1. The same steps are executed when the second record with key key2 is added;
this time the hash values of key2 result in the bits 6, 10 and 13 to be flipped. When
querying for a certain row key, the k hash functions are also applied to the query key.
All these k hash values are then compared with the Bloom filter: when for each $i =
1, \ldots, k$ and each hash value $h_i(query) = d'$, the $d'$-th bit of the Bloom filter is 1, then
we have a match. Note in Figure 8.9 that for query1 all three bits corresponding to
the query’s hash values in the Bloom filter are 1; in Figure 8.9 (d) this corresponds to
a true positive if indeed key1 = query1 . Due to collisions of the hash functions, this
match can however be a false positive: the matching bits could have been set by a set
of keys other than query1 which however happen to have one or more hash values
identical with query1 ; this is shown in Figure 8.9 (e). This is why we have to access the
appropriate data block in the data file and search for the query key there. On the other
hand, when for a query key one of its hash values corresponds to a 0 bit in the Bloom
filter, then we can be sure that this particular key so far has not been added to the data
file.
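A minimal Bloom filter sketch along these lines (the k hash functions are only simulated by differently seeded string hashes; a real implementation would use stronger hash functions):

import java.util.BitSet;

// Minimal Bloom filter with m bits and k simulated hash functions.
public class BloomFilter {
    private final BitSet bits;
    private final int m;
    private final int k;

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // i-th simulated hash function, mapped into {0, ..., m-1}
    private int hash(String key, int i) {
        return Math.floorMod(key.hashCode() * 31 + i * 0x9E3779B9, m);
    }

    public void add(String key) {
        for (int i = 0; i < k; i++) {
            bits.set(hash(key, i));
        }
    }

    // false means "definitely not contained" (true negative);
    // true means "possibly contained" (may be a false positive)
    public boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(hash(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(16, 3);
        filter.add("key1");
        filter.add("key2");
        System.out.println(filter.mightContain("key1")); // true
        System.out.println(filter.mightContain("key3")); // usually false
    }
}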
This kind of basic Bloom filter works nicely (and without false negatives) as long
as keys are only added to the store. To support deletions of keys, the basic Bloom filter
has to be extended. For example counting Bloom filters keep a counter of how many
times a bit was set to one: for an insertion the counter is incremented and for a deletion
the counter is decremented. Luckily, with extensible record stores, such extensions are
not needed: the data files are immutable and hence Bloom filters only have to be computed
once when creating a new data file – that is, when flushing the memtable or when
compacting a set of data files.
The probability of false positives for Bloom filter settings (given by the number n of
elements in the input set S, the length m of the Bloom filter bit vector and the amount k
of hash functions) can be approximated. With such an approximation, we can choose
parameters n, m and k such that the occurrence of false positives is minimized – and
hence we have a high accuracy of the Bloom filter. We reiterate a common approxima-
tion for the probability of false positives below. This approximation is only valid for
large Bloom filters (that is, large values of the bit vector length m). In other words, for
small Bloom filters, the expected amount of false positives can be significantly higher.
Fig. 8.9. A Bloom filter of length m = 16 with three hash functions: (a) the initially all-zero bit vector; (b) adding key1 sets the bits at key1's three hash positions; (c) adding key2 sets three further bits; (d) query2 yields a miss (a true negative) while query1 yields a match (a true positive if key1 = query1); (e) the same match for query1 is a false positive if key1 ≠ query1 and key2 ≠ query1, because the matching bits were set by other keys.

To obtain a lower bound for the false positive probability, recall that the probability
that hashing an input value with one hash function sets a certain bit to 1 is $\frac{1}{m}$ (due to
uniformity of the hash functions). By reversing this argument, the probability that a bit in
the bit vector is not set to 1 is
\[ 1 - \frac{1}{m} \]
After applying all k hash functions to an input value, the probability of a bit not being
set to 1 is
\[ \left(1 - \frac{1}{m}\right)^{k} \]
And after hashing all n input values this probability turns into
\[ \left(1 - \frac{1}{m}\right)^{k \cdot n} \]

Again by reversing the argument the probability that one bit is indeed set to 1 is:
\[ 1 - \left(1 - \frac{1}{m}\right)^{k \cdot n} \]

Given this probability, what is now the probability of a false positive? A false positive
happens when applying all k hash functions on the test value leads to k collisions. In
other words, the k bits in the bit vector that are computed for the test value are already
set to 1. The probability for this case happening is usually taken to be

\[ \left(1 - \left(1 - \frac{1}{m}\right)^{k \cdot n}\right)^{k} \tag{8.1} \]

This is a simplified lower bound which is not entirely correct; this approximation is
nevertheless used often and for sufficiently large values of m and relatively small val-
ues of k there is only a slight difference to the correct false positive rate (which is how-
ever derived by a much more complex formula).
In abstract terms, we learn the following rules of thumb: the more elements (n)
are added to a Bloom filter, the lower the accuracy; the more hash functions (k) are
used, the higher the accuracy up to an optimum number of hash functions (but more computation needs to be done); the larger
the bit vector (m) the higher the accuracy (but more space is used).
Equation 8.1 is then often approximated by using the Euler number e and the fact
that asymptotically (that is, for large m) the expression $\left(1 - \frac{1}{m}\right)^{m}$ tends towards $e^{-1}$:
\[ \left(1 - \left(1 - \frac{1}{m}\right)^{k \cdot n}\right)^{k} \approx \left(1 - e^{-\frac{k \cdot n}{m}}\right)^{k} \tag{8.2} \]

By doing some math with this formula, one can on the one hand derive the optimum
value for k – that is, the number of hash functions for which the false positive prob-
ability is minimized. This value is calculated to be around the fraction of m divided
by n multiplied with the natural logarithm of 2:
\[ k_{opt} \approx \frac{m}{n} \cdot \ln 2 \approx \frac{9m}{13n} \]
As the result of this calculation is usually not a natural number, we can choose the
next integer that is greater than or less than the result. For example, assume that
m = 16 and n = 4 (and hence m = 4n); then $k_{opt} \approx \frac{9 \cdot 16}{13 \cdot 4} \approx 2.77$. By inserting several
values for k into Equation 8.2, we see that indeed k = 3 has the lowest approximate
false positive probability, as shown in Table 8.2. In general, however, we see that
m = 4n is obviously not a good choice because the probabilities are much too high
for any practical application. Fortunately, these values quickly decrease for larger m.
For example, for m = 8n we see that $k_{opt} \approx 5.55$ and the approximate false positive
probability is a lot lower, as shown in Table 8.3.

Table 8.2. False positive probability for m = 4·n

m = 4·n    approximate false positive probability by Equation 8.2
k = 1      0.221 (22.1%)
k = 2      0.155 (15.5%)
k = 3      0.147 (14.7%)
k = 4      0.160 (16.0%)

Table 8.3. False positive probability for m = 8·n

m = 8·n    approximate false positive probability by Equation 8.2
k = 3      0.031 (3.1%)
k = 4      0.024 (2.4%)
k = 5      0.022 (2.2%)
k = 6      0.022 (2.2%)
k = 7      0.023 (2.3%)
On the other hand, we can also fix our desired false positive probability and then
calculate the size m of the Bloom filter that is needed to ensure the desired probability.
Indeed, in some extensible record stores, the desired probability can be configured by
the user; decreasing this probability then ultimately results in larger data file sizes
because the Bloom filters occupy more space.
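The following small calculation (a standalone sketch, not part of any record store) reproduces the approximation of Equation 8.2 and the optimum number of hash functions for the settings of Table 8.3:

// Evaluates (1 - e^(-kn/m))^k and k_opt = (m/n) * ln 2 for m = 8n.
public class BloomFilterMath {
    static double falsePositiveProbability(double m, double n, double k) {
        return Math.pow(1.0 - Math.exp(-k * n / m), k);
    }

    public static void main(String[] args) {
        double n = 4, m = 8 * n;                                    // m = 8n as in Table 8.3
        System.out.printf("k_opt = %.2f%n", (m / n) * Math.log(2)); // about 5.55
        for (int k = 3; k <= 7; k++) {
            System.out.printf("k = %d: %.3f%n", k,
                    falsePositiveProbability(m, n, k));             // 0.031, 0.024, 0.022, ...
        }
    }
}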
The bottleneck of practical Bloom filters is the abundant computation of hash val-
ues. To alleviate this problem a partition scheme for the Bloom filter can be used result-
ing in a partitioned Bloom filter (as introduced in [KM08]): two hash functions $h_1$
and $h_2$ can simulate the entire set of k hash functions. More precisely, k hash functions
$g_0, \ldots, g_{k-1}$ can be derived by letting
\[ g_i(x) = h_1(x) + i \cdot h_2(x) \]
for i ranging from 0 to k − 1. Asymptotically, for large values of m, the same approximation
for the false positive probability can be achieved as in Equation 8.2. Hence we
can get an (approximately) equally accurate Bloom filter by only computing two hash
functions instead of k. The partition scheme derives its name from the fact that the
bit vector is split into k different partitions, each of size $m' = \frac{m}{k}$. By further assuming
that $h_1$ and $h_2$ range over $[0, \ldots, m'-1]$ and computing the sum $g_i$ modulo $m'$,
the i-th hash value (that is, $g_i$) indeed occupies a bit in the i-th partition. Figure 8.10
shows an example with four partitions, each of length four.

Fig. 8.10. A partitioned Bloom filter with k = 4 and partition length m′ = 4 (each key sets one bit in each of the four partitions i = 0, . . . , 3).
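A minimal sketch of this double-hashing scheme (the two base hash functions $h_1$ and $h_2$ are only simulated here):

// Partitioned Bloom filter: g_i(x) = (h1(x) + i * h2(x)) mod m', where g_i
// addresses a bit inside the i-th partition of length m' = m / k.
public class PartitionedBloomFilter {
    private final boolean[] bits;
    private final int k;
    private final int partitionLength; // m' = m / k

    public PartitionedBloomFilter(int m, int k) {
        this.bits = new boolean[m];
        this.k = k;
        this.partitionLength = m / k;
    }

    private int h1(String key) { return Math.floorMod(key.hashCode(), partitionLength); }
    private int h2(String key) { return Math.floorMod(key.hashCode() * 31 + 17, partitionLength); }

    // the i-th derived hash function, placed in the i-th partition
    private int g(String key, int i) {
        return i * partitionLength + Math.floorMod(h1(key) + i * h2(key), partitionLength);
    }

    public void add(String key) {
        for (int i = 0; i < k; i++) bits[g(key, i)] = true;
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < k; i++) if (!bits[g(key, i)]) return false;
        return true;
    }
}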

8.3 Implementations and Systems

This section briefly surveys some features of currently available open source extensi-
ble record stores.

8.3.1 Apache Cassandra

Cassandra stores column families in a keyspace. The Cassandra query language (CQL)
can be used to interact with the database system and CQL commands can be input
into the CQL shell (cqlsh).

Web resources:
– Apache Cassandra: http://cassandra.apache.org/
– documentation page: http://docs.datastax.com/en/getting_started/
– GitHub repository: https://github.com/apache/cassandra

A keyspace can be created with certain settings for the replication.

CREATE KEYSPACE library
WITH REPLICATION = {
  'class' : 'SimpleStrategy',
  'replication_factor' : 3 };

The create table command creates a new column family in the keyspace (this com-
mand is identical to the create column family command). That is, in Cassandra the
term table corresponds to a column family – which is in contrast to other extensible
record stores that use the term table as a container for column families.

CREATE TABLE bookinfo (
  bookid int PRIMARY KEY,
  title text,
  author text
);

Insertions are done with the insert command:

INSERT INTO bookinfo (bookid, title, author)
VALUES (1002, 'Databases', 'Miller');

An index can be created on columns other than the primary key to enable filtering on
column values:

CREATE INDEX ON bookinfo (title);
CREATE INDEX ON bookinfo (author);

Each Cassandra column family (table) has a primary key. The primary key can be com-
posed of several columns (a so-called compound key) where the first component (col-
umn) of the primary key is called the partition key; on the partition key, the data is
split into partitions that will be distributed among the database servers. The other
components of the primary key are used to sort (cluster) the data inside a partition.
Even more sophisticated, the partition key itself (that is, the first component of the
primary key) can be composite and hence consist of a tuple of columns. In particular,
if the partitions defined by a simple partition key are too large, a composite partition
key can help split the data into smaller partitions.
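As a hedged sketch of such a compound primary key (issued here through the Java driver that is described further below; the booklending table and its columns are illustrative assumptions, not taken from the text), readerid could act as the partition key and returndate as a clustering column, so that all lendings of one reader end up in the same partition, sorted by return date:

Cluster cluster = Cluster.builder()
    .addContactPoint("localhost").build();
Session session = cluster.connect("library");
session.execute(
    "CREATE TABLE booklending ("+
    " readerid int,"+
    " returndate timestamp,"+
    " bookid int,"+
    " PRIMARY KEY (readerid, returndate)"+
    ");");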
Cassandra supports the collection types set, list and map for cell values; inter-
nally, each element of a collection is stored inside a separate column. For example, we
can create an attribute as a set of texts, and then update values in the set (where addition
(+) denotes insertion of an element and subtraction (-) denotes removal of an element). After
creating an index on its values, the set can be searched with the contains statement.
For the book example, we can change the author to allow a set of texts and then issue
a select statement to find those books where a certain author is contained in the set:

CREATE TABLE bookinfo (
  bookid int PRIMARY KEY,
  title text,
  author set<text>
);

INSERT INTO bookinfo (bookid, title, author)
VALUES (1002, 'Databases', {'Miller'});
UPDATE bookinfo SET author = author + {'Smith'} WHERE bookid = 1002;
CREATE INDEX ON bookinfo (author);
SELECT title, author FROM bookinfo WHERE author CONTAINS 'Miller';

Moreover, Cassandra allows the definition of nested tuples and user-defined types.


The Cassandra Java Driver comes with some features including asynchronous
communication, load balancing, and node discovery. In particular, a client applica-
tion can connect to a server (identified by its server name as a string) in a cluster.
A session object can be obtained from the cluster and CQL statements can then be
executed on the session object.

Cluster cluster = Cluster.builder()
    .addContactPoint("localhost").build();
Session session = cluster.connect();
session.execute(
    "CREATE TABLE bookinfo ("+
    " bookid int PRIMARY KEY,"+
    " title text,"+
    " author text"+
    ");");

The execute method returns a ResultSet object that can be used to access the results of a
selection query:

ResultSet results = session.execute("SELECT * FROM bookinfo "+
    "WHERE title = 'Databases';");
for (Row row : results) {
    System.out.println("Title: " + row.getString("title")+
        ", Author: "+row.getString("author"));
}

Instead of filling in values in the CQL query string, statements can be parameterized:
several “?” markers can be used in the CQL string. When executing the parameter-
ized statement, the markers are replaced by the remaining parameters of the execute
method in the order of occurrence:

session.execute("INSERT INTO bookinfo (bookid, title, author)"+
    " VALUES (?, ?, ?)", 1002, "Databases", "Miller");

Moreover, the session object has a prepare method to declare a prepared statement. A
prepared statement is a method that is registered and precompiled into an execution
plan at the server side such that an invocation of the statement only needs to pass
the name of the statement and parameters. The prepared statement can be combined
with the parameterization with “?” signs such that parameters passed to the prepared
statement are inserted into the prepared statement in the order of occurrence in the
invocation:

PreparedStatement prepStatement = getSession().prepare("INSERT"+
    " INTO bookinfo (bookid, title, author) VALUES (?, ?, ?);");

A BoundStatement object is then used to invoke a prepared statement and insert
(bind) parameters into it:

BoundStatement boundStatement = new BoundStatement(prepStatement);
getSession().execute(boundStatement.bind(1002, "Databases", "Miller"));

The query builder API might be even more convenient to use in a Java program since
no CQL statements must be written but instead a Select, Update, Insert or Delete
object can be used to represent a query:
– the QueryBuilder.select method returns a Select object.
– the QueryBuilder.insert method returns an Insert object.
– the QueryBuilder.update method returns an Update object.
– the QueryBuilder.delete method returns a Delete object.

Method chaining can be used to set more restrictions for each operation. The Select
object has for example a from method that defines the keyspace and the table (column
family) name and a where method that specifies restrictions on columns:

Select select = QueryBuilder.select().all()
    .distinct().from("library", "bookinfo")
    .where(eq("title", "Databases"));
ResultSet results = session.execute(select);

Similarly, the Insert object is created with the method insertInto and provides a value
method to add the data to be inserted into a column family:

Insert insert = QueryBuilder.insertInto("library", "bookinfo")
    .value("bookid", 1002)
    .value("title", "Databases")
    .value("author", "Miller");
ResultSet results = session.execute(insert);

A mapping manager in conjunction with annotations (similar to JPA and JDO as de-
scribed in Section 9.3) can be used to map Java objects to Cassandra tables.

8.3.2 Apache HBase

HBase stores tables in namespaces.

Web resources:
– Apache HBase: http://hbase.apache.org/
– documentation page: http://hbase.apache.org/book.html
– GitHub repository: https://github.com/apache/hbase

HBase offers a command line interface with its own commands like create, put and
get. The table name and the column family name have to be specified in the create
command; in our book example we create a table Book and a column family BookInfo:

create 'book', 'bookinfo'

With the put command we can add information for a book where the BookID is used
as the row key and the column name consists of the column family name (BookInfo)
and the column qualifier (Author and Title) separated by a colon:

put 'book', '1002', 'bookinfo:author', 'Miller'
put 'book', '1002', 'bookinfo:title', 'Databases'

With the get command, the values of all columns in a row (for a row key) can be ob-
tained:

get 'book', '1002'

With the scan command all rows in a table (or a subset of all rows) can be obtained:

scan 'book'

In the Java API, the Admin class is used to create a table (HTableDescriptor) and a
column family as well as a column inside the table (both with HColumnDescriptor).

TableName tableName = TableName.valueOf("book");
HTableDescriptor table = new HTableDescriptor(tableName);
table.addFamily(new HColumnDescriptor("bookinfo"));
Connection connection = ConnectionFactory.createConnection();
Admin admin = connection.getAdmin();
admin.createTable(table);
HColumnDescriptor newColumn = new HColumnDescriptor("author");
admin.addColumn(tableName, newColumn);

A Put object writes values to the database; note that it is more efficient to keep the col-
umn family name and the column qualifier in static byte arrays to avoid repetitive
conversions from String to byte array:

public static final byte[] CF = "bookinfo".getBytes();
public static final byte[] COL = "author".getBytes();
Put put = new Put("1002".getBytes());
put.add(CF, COL, "Miller".getBytes());
table.put(put);

Note that put works as an upsert: it adds a new row to a table (if the key is new to the
column family) or it updates an existing row (if the key already exists in the column
family). A Get and a Scan object read values from the database:

Get get = new Get("1002".getBytes());
get.addColumn(CF, COL);
Result result = table.get(get);
byte[] b = result.getValue(CF, COL);

Scan scan = new Scan();
scan.addColumn(CF, COL);
ResultScanner scanner = table.getScanner(scan);
try {
    for (Result r = scanner.next(); r != null; r = scanner.next()) {
        System.out.println(r.toString());
    }
} finally {
    scanner.close();
}
table.close();

So-called client request filters enforce conditions on a query. For example, a single
column value filter can implement an equality comparison, for instance searching for
an author called Miller:

public static final byte[] CF = "bookinfo".getBytes();
public static final byte[] COL = "author".getBytes();
SingleColumnValueFilter filter = new SingleColumnValueFilter(
    CF, COL, CompareOp.EQUAL, Bytes.toBytes("Miller") );
scan.setFilter(filter);
Other filter types include RegexStringComparator, SubstringComparator, ColumnPrefixFilter,
and ColumnRangeFilter.
Several settings can be configured by the user. For example, by changing the
value of HColumnDescriptor.DEFAULT_VERSIONS the maximum number of stored
versions for a cell can be altered. The default is 1, so that only one version of a cell will
be returned for a query. Any other versions than the most recent one can be physically
deleted during a compaction. In a get command, the number of returned versions as
well as the recency of returned versions can be configured via Get.setMaxVersions()
and Get.setTimeRange(). A minimum number of versions can also be specified that
is interpreted in combination with time-to-live values: the database must maintain
at least this minimum number of versions (the most recent ones according to their
timestamp) so that a cell might even be retained (and not deleted during compaction)
when its time-to-live value has expired. Compaction can be influenced by the ad-
ministrator, too; for example by setting the minimum and maximum sizes that a file
should have to be considered for compaction. Bloom filters are by default defined on
the row keys. However, they can be configured to work on a combination of row
key and column qualifier in a column family; this setting can be changed by calling
HColumnDescriptor.setBloomFilterType or setting the property on the command
line when creating a table; for example:

create 'book', {NAME => 'bookinfo', BLOOMFILTER => 'ROWCOL'}

The row+column Bloom filter provides a benefit when often accessing individual
columns – it is however not effective when data are only accessed by row key without
restricting the column qualifier. Due to the increased amount of keys to be main-
tained by the Bloom filter, its space demand increases, too. Other settings that can
be changed for Bloom filters are the error rate and the maximum number of keys per
Bloom filter.
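As a hedged sketch of the versioning settings mentioned above (building on the Get example earlier in this section; the concrete maximum timestamp is an arbitrary assumption):

Get get = new Get("1002".getBytes());
get.addColumn(CF, COL);
get.setMaxVersions(3);               // return at most the three most recent versions
get.setTimeRange(0, 1477958400000L); // only versions with timestamps in this range
Result result = table.get(get);
for (Cell cell : result.getColumnCells(CF, COL)) {
    System.out.println(Bytes.toString(CellUtil.cloneValue(cell)));
}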
HBase nicely integrates with the features offered by Hadoop MapReduce: HBase
can act as a data source and sink for MapReduce jobs. In this case, the Hadoop jobs
should be defined in subclasses of the TableMapper and TableReducer classes. In
addition, HBase can run code on the server side in so-called coprocessors, which let
developers and administrators add functionality.

8.3.3 Hypertable

Hypertable stores tables in a namespace. Hypertable's query language is called the
Hypertable Query Language (HQL), which is similar to SQL. The Hypertable shell is
Hypertable's command line interface. HQL queries can be interpreted by the Hypertable
shell, the Thrift API or the HqlInterpreter C++ class.

Web resources:
– Hypertable: http://hypertable.org/
– documentation page: http://hypertable.com/documentation/
– GitHub repository: https://github.com/hypertable/hypertable

Namespaces can be created with an HQL command and namespaces can be nested so
that a namespace can be a subnamespace of another namespace.

CREATE NAMESPACE "/mynamespace";
USE "/mynamespace";
CREATE NAMESPACE "subnamespace";

When creating a table, the column family names have to be specified:

CREATE TABLE book (bookinfo, lendinginfo);

Optionally, column families can be assigned to access groups; column families in the
same access group will be stored together on disk.
The insert command inserts data; several tuples can be specified in one command
and each tuple consists of a row key, the column name (consisting of column family
name and column qualifier) and a value – a timestamp can be added as an optional
property.

INSERT INTO book VALUES ("1002", "bookinfo:title", "Databases"),
    ("1002", "bookinfo:author", "Miller");

Selection statements can be specified on the column family level or at the column
level. When specifying the column family, the entire set of columns inside the family
is returned; for example:

SELECT bookinfo FROM book;

At the column level, all values in the column are returned:

SELECT bookinfo:title FROM book;

Moreover, conditions on the row key and conditions on the key (row key as well as
column name) to access a cell can be specified; for example:

SELECT * FROM book WHERE ROW = '1002';
SELECT * FROM book WHERE CELL = '1002','bookinfo:title';

Other comparison operators like < and <= can be used to specify ranges; regular ex-
pressions for string matching can be used, too. Hypertable supports indexes on cell
values as well as on column qualifiers. A cell value index improves searches with se-
lection conditions on cells whereas a column qualifier index improves searches with
conditions on column qualifiers.
Compaction can be scheduled by using the compact command. In Hypertable,
compaction can be executed on a single row key that can be specified in the command;
for example:

COMPACT book "1002"

8.3.4 Apache Accumulo

Apache Accumulo stores tables in namespaces. In order to switch between namespaces,
the namespace is prepended to the table name, separated by a dot.

Web resources:
– Apache Accumulo: http://accumulo.apache.org/
– documentation page: http://accumulo.apache.org/1.7/accumulo_user_manual.html
– GitHub repository: https://github.com/apache/accumulo

Accumulo offers a command line interface called Accumulo Shell. The shell can be
used for data management (creating tables, writing values, and scanning for values)
and for user management (creating users, granting privileges to users, and logging in
as a user). Some example commands are:

createtable book
insert 1002 bookinfo title "Databases"
scan

createuser alice
grant System.CREATE_TABLE -s -u alice
revoke System.CREATE_TABLE -s -u alice

user alice

Accumulo comes with an authentication and authorization framework for users. Ac-
cumulo requires an authentication token to be passed along with every request. In the
simplest case, this token is obtained by a PasswordToken object that represents the
password of the user; more sophisticated authentication mechanisms like Kerberos
can be included. A connection to the database is established by creating a ZooKeeperInstance (for a given instance name and a string containing one or more host names or IP addresses of the underlying ZooKeeper servers) and then passing the user name and the password token to its getConnector method.

Instance inst = new ZooKeeperInstance("db1", "localhost:2181");


Connector conn = inst
.getConnector("username", new PasswordToken("password"));

The connector object is the interface to operate with the database: it offers table oper-
ations (for example to create a table):

conn.tableOperations().create("book");

In addition, a BatchWriter object requires a Mutation object that stores the requested updates.

BatchWriter bw = connector.createBatchWriter(
"book", new BatchWriterConfig());
Mutation mut = new Mutation("1002".getBytes());
mut.put("bookinfo".getBytes(), "title".getBytes(),
System.currentTimeMillis(), "Databases".getBytes());
mut.put("bookinfo".getBytes(), "author".getBytes(),
System.currentTimeMillis(), "Miller".getBytes());
bw.addMutation(mut);
bw.flush();
bw.close();

A Scanner object can be configured (setting the range of rows and the columns to read)
to fetch data from the database:

Scanner scanner = conn.createScanner("book", Authorizations.EMPTY);


scanner.setRange(new Range(new Text("1002")));
scanner.fetchColumn(new IteratorSetting.Column("bookinfo","title"));
for (Entry<Key,Value> entry : scanner)
System.out.println("Key: " +
entry.getKey().toString() +
" Value: " +
entry.getValue().toString());

When using the authentication framework, the authentication token can be passed to
the databases as part of any command that is issued by the user. Accumulo’s autho-
rization system is based on security labels: each cell has a visibility assigned to it; and each user has a set of authorizations assigned. The visibility of a cell is set when it is written or updated in the Mutation object; for example, the visibility can be set to public:

mut.put("bookinfo".getBytes(),
"title".getBytes(),
new ColumnVisibility("public"),
System.currentTimeMillis(),
"Databases".getBytes());

When a scanner is created, it can get passed a set of authorizations that the user is
requesting to see. If the requested authorizations are not a subset of the authorizations
assigned to the user, execution will stop with an exception being thrown.

Authorizations auths = new Authorizations("public");


Scanner scan = conn.createScanner("book", auths);

Insertion constraints can be registered with a table. Such a constraint can restrict the
insert commands that are allowed for the table. Insertions that are prohibited accord-
ing to the specified constraints will then be rejected. Each constraint has to implement
the interface org.apache.accumulo.core.constraints.Constraint, and then be
deployed in a JAR file stored in the lib folder of the Accumulo installation and then
registered with the table with the constraint command on the command line.
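As an illustration, a minimal constraint might look as follows (a sketch assuming the Constraint interface of Accumulo 1.x; the violation code and the restriction to the bookinfo column family are made up for this example):

import java.util.Collections;
import java.util.List;
import org.apache.accumulo.core.constraints.Constraint;
import org.apache.accumulo.core.data.ColumnUpdate;
import org.apache.accumulo.core.data.Mutation;

public class BookinfoOnlyConstraint implements Constraint {
    private static final short WRONG_FAMILY = 1;

    @Override
    public String getViolationDescription(short violationCode) {
        return (violationCode == WRONG_FAMILY)
            ? "only the bookinfo column family may be written" : null;
    }

    @Override
    public List<Short> check(Environment env, Mutation mutation) {
        // reject mutations that touch any column family other than bookinfo
        for (ColumnUpdate update : mutation.getUpdates()) {
            if (!"bookinfo".equals(new String(update.getColumnFamily())))
                return Collections.singletonList(WRONG_FAMILY);
        }
        return null; // no violations
    }
}
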
Compaction can be configured in Accumulo by setting the compaction ratio and
the maximum number of files in a tablet (which is a set of data files on disk). The
compaction ratio looks at a set of files considered for compaction: the compaction
ratio (by default 3) is multiplied with the size of the largest file in the set; if this value
is smaller than the sum of the sizes of all files in the set, then compaction takes place.
This process starts by considering all files in a tablet as the compaction set and keeps
disregarding the largest file until the condition holds.
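The following minimal sketch (plain Java, not Accumulo code – just an illustration of the decision rule described above, with made-up method names) checks whether a set of file sizes triggers a compaction for a given ratio:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CompactionCheck {
    // returns true if, for the full set of files or for any subset obtained by
    // repeatedly disregarding the largest file, ratio * largest < total size holds
    static boolean shouldCompact(List<Long> fileSizes, double compactionRatio) {
        List<Long> candidates = new ArrayList<>(fileSizes);
        candidates.sort(Comparator.reverseOrder()); // largest file first
        while (!candidates.isEmpty()) {
            long largest = candidates.get(0);
            long total = candidates.stream().mapToLong(Long::longValue).sum();
            if (largest * compactionRatio < total)
                return true;          // this set of files will be compacted
            candidates.remove(0);     // disregard the largest file and retry
        }
        return false;
    }
}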

8.4 Bibliographic Notes

The Google system called BigTable is described in [CDG+ 06]; it can be seen as the foun-
dation of extensible record stores. The Cassandra system started as a project at Facebook and was described by Lakshman and Malik [LM10]. HBase is part of the Hadoop project
(which also includes a distributed file system and a map-reduce framework); an appli-
cation of Hadoop in a distributed setting at Facebook is detailed in [BGS+ 11]. A further
open source extensible record store is Accumulo [BAB+ 12].
Several experimental approaches for sorted and indexed storage structures
(including discussions of compaction strategies) can be found in [GHMT12, Spi12,
TBM+ 11, MA11, BFCF+ 07, BDF+ 10]; the log-structured merge tree in [OCGO96] can be
seen as the predecessor of these approaches. Write-ahead logging was recently an-
alyzed by Sears and Brewer [SB09]. Bloom filters are named after the author of the
seminal article [Blo70]. Lower and upper bounds for the false positive rate of Bloom filters are examined by Bose et al. [BGK+ 08]. Tarkoma, Rothenberg and Lagerspetz
[TRL12] provide a comprehensive overview of advanced Bloom filters (like counting
Bloom filters and deletable Bloom filters) and describe recent applications of Bloom
filters in distributed systems. Partitioning for Bloom filters was introduced by Kirsch
and Mitzenmacher [KM08].
9 Object Databases
Together with the advent of object-oriented programming languages came the prob-
lem of how to easily store objects persistently in a database system. We briefly review
the most important aspects of the object-oriented paradigm that affect the way ob-
jects can be stored in a database. On this basis, we discuss the storage of objects in re-
lational databases, describe the steps that object-relational databases made towards
object support, and lastly cover purely object-oriented database systems.

9.1 Object Orientation

The peculiarities of object-orientation are usually not entirely covered by conventional
database systems. The following object-oriented constructs give rise to difficulties for
object storage.
Classes and Objects: A class definition specifies a set of entities, or, more pre-
cisely, it specifies a type for a set of entities: that is, a class defines common fea-
tures of these entities. The concrete, uniquely identifiable entities inside a class
are then called objects or instances of the class.
Encapsulation: A class definition contains both attributes (also called variables)
and methods. The values stored in the attributes describe the current state of a
concrete object (equivalent to the attributes in an Entity-Relationship diagram).
Methods describe the behavior that all objects of a certain class have. In other words, an object encapsulates both state and behavior. Method calls are used to let objects communicate with each other and send messages between them: one object calls a method of another object so that it executes the operations defined in the method. (As an example, consider a Person class with attributes name and age and methods marry(Person p) and divorce().)
Information Hiding: As we have seen in Section 1.3.2, when modeling objects
with UML, attributes and methods can have different scopes of visibility.
It is common practice that attributes should not be accessible from outside: methods should protect attributes from direct access. Another effect of information hiding is that, as long as the external interface remains the same, the internal implementation of an object can change without a need to modify accessing objects. (In UML notation, the Person class could, for instance, declare the attributes name and age as private (–) and the methods marry(Person p) and divorce() as protected (#).)
Complex data types: Attributes can be either simple or complex. A simple at-
tribute has a system-defined primitive type (such as integer or string) and takes
on only a single such primitive value. A complex attribute can contain a collection
of primitive values, a reference to another (possibly user-defined) class, or even
a collection of references. An object that contains complex attributes is called a
complex object. A reference attribute corresponds to a relationship/association
between entities. The class definition of the referenced object can be defined
anonymously within the referencing class. Or alternatively, the referenced class
is an external class with its own identity; then, the reference attribute either
contains the object identifier of the referenced object or it contains the memory
address of the referenced object.
Specialization: A class can be defined to be a special case of a more general class:
The special cases are called “subclasses” and the more general cases are called
“superclasses”. Hence, defining a subclass for a class is called specialization; the
reverse – defining a superclass for a class – is called generalization. Subclasses
inherit all properties of their superclasses; however, subclasses can redefine (over-
ride) inherited methods. Subclasses can also extend the superclass definition by
their own attributes or methods. In object-oriented programs, usually objects of a
subclass can be substituted in for objects of a superclass; that is, objects of a sub-
class can be treated as objects of one of their superclasses. Due to this property
of specialization, another important object-oriented feature is dynamic binding:
while a method call is specified in a program, only at runtime the concrete sub-
class and hence the appropriate method implementation to be executed is deter-
mined.
Abstraction: Abstraction allows for separating external interfaces of an object
from the internal application details. As mentioned previously, classes imple-
menting an interface must implement all methods defined by the interface. A
comparison of generalization and abstraction is shown in the UML diagram in
Figure 9.1.

All these properties of object-oriented programs require that objects be managed in a
totally different manner than tuples in a relational database table. The term object-
relational impedance mismatch has been coined to describe the incompatibility of
object-oriented and relational paradigms.

9.1.1 Object Identifiers

The most distinctive feature of objects in an object-oriented program is that each object has its own object identifier (OID) which is assigned to the object at the time of creation. The OID is a value that is system-generated, unique to that object, invariant during the lifetime of the object, and independent of the values of the object’s attributes (that is, the object state).

Fig. 9.1. Generalization (left) versus abstraction (right)

As mentioned previously, references between objects can be implemented by assigning the reference attribute the OID of the referenced
object. With the given notion of OIDs, a clear distinction can be made with regard to
whether two objects are identical or whether they are equal:
Identity of objects: The definition of identity is simply based on the OID: Two objects are identical only if they have the same OID.
Equality of objects: Equality however is value-based: Two objects are equal when
the values inside their attributes coincide (that is, they have the same state inde-
pendent of their OIDs).

For complex objects, equality can be ambiguous: there is a further distinction into
shallow equality and deep equality: Shallow equality means that if the shallowly
equal objects reference other objects, it has to be checked whether all referenced ob-
jects are identical; in other words objects that are shallowly equal reference objects
with the same identifier. Deep equality is more complicated: If the deeply equal ob-
jects reference other objects, it has to be checked whether all referenced objects are
also deeply equal; that is, they have the same values in their attributes, and in the ref-
erenced objects all attributes have the same values, and all attributes in the objects
referenced by the referenced objects have the same values and so on.
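In Java, for instance, reference comparison with == plays the role of OID-based identity, while a value-based equals method corresponds to equality as defined above; a minimal sketch (with a simplified Person class) illustrates the difference:

public class Person {
    String name;
    int age;

    Person(String name, int age) { this.name = name; this.age = age; }

    // value-based equality: two Person objects are equal if their states coincide
    @Override public boolean equals(Object o) {
        if (!(o instanceof Person)) return false;
        Person p = (Person) o;
        return age == p.age && name.equals(p.name);
    }
    @Override public int hashCode() { return 31 * name.hashCode() + age; }

    public static void main(String[] args) {
        Person a = new Person("Alice", 31);
        Person b = new Person("Alice", 31);
        System.out.println(a == b);      // false: two distinct identities
        System.out.println(a.equals(b)); // true: equal states
    }
}
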
We can now see the following notable difference to the relational model: Identity
of tuples in a relational table is value-based instead of ID-based and hence identity
and equality coincide. In other words, two tuples inside one table are identical when
they have exactly the same values for each of their attributes. In addition to this, there
is no means of identifying a tuple other than by the unique values of its primary key.
From a data storage point of view, permanence of OIDs is an important issue: the
scope of validity of an OID must be larger for a database system than for a common,
short-lived application. The following scopes of OIDs are possible:
Intraprocedure: the OID is valid during execution of a single procedure; the ob-
ject identified by the OID exists only inside the procedure and hence the OID can
be reused once the procedure has finished.
Intraprogram: the OID is valid during the execution of a single application; when used in different applications, the same OID may reference totally different objects in one application and in the other.
Interprogram: the OID can be shared by several applications on the same ma-
chine, but the OID of an object might change when an application is restarted or
run on a different machine.
Persistent: the OID has a long-term validity and hence is persistent; an object
always has the same ID even when accessed by different applications on different
machines or in different executions of the same application.

In sum we see that persistent OIDs are needed when storing objects in database sys-
tems so that they can be loaded and reused by different applications.

9.1.2 Normalization for Objects

We have reviewed the process of normalization for the relational data model in Sec-
tion 2.2; its purpose is to obtain a good database schema without anomalies. For object
storage models it might be similarly beneficial to distribute attributes among differ-
ent classes as it might reduce interdependencies between complex objects. Generally
speaking, it makes sense to define normalization by object identifiers.

Normalization for objects demands that every attribute in an object depends on the object’s OID.

But, taking a closer look, the case of normalization is a bit more difficult for object
models than for the conventional relational data model: because object-orientation
also has the feature of specialization, object normalization is not only affected by com-
plex objects but also by class hierarchies. Moreover, methods (and method distribu-
tion among objects) can play a role in normalization. Normalization for objects has not
been as deeply analyzed as relational normalization; indeed, various proposals of ob-
ject normalization techniques exist. In this section we informally present four object
normal forms (ONFs). We illustrate these ONFs with a library example. The unnor-
malized form in Figure 9.2 consists of a Person class and a separate Reader class.
9.1 Object Orientation | 197

Reader
readerID
firstname
lastname
street
housenumber
Person city
firstname zip booktitle1
lastname bookauthor1
street bookduedate1
housenumber booktitle2
city bookauthor2
zip bookduedate2
booktitle3
bookauthor3
bookduedate3
setBookTitle()
setBookAuthor()
setDueDate()

Fig. 9.2. Unnormalized objects

The first ONF (1ONF) wants to avoid repetitions of attributes in a class. To achieve
1ONF, repetitive sequences of attributes (that represent a new type) are replaced by a
1:n-association to a new class representing the new type. Methods operating on the
outsourced attributes are also moved to the new class.

First Object Normal Form (1ONF): A class is in 1ONF when repetitive sequences of attributes (represent-
ing a new type) are extracted to their own class and replaced by a 1:n-association. Methods applicable
to the new class will be part of the new class.

In our example, we see that book information is repeated in the Reader class. Hence it makes sense to extract these repeating attributes into a separate Book class as in
Figure 9.3. All methods belonging to books are also moved to the new Book class.
The second ONF (2ONF) extracts information that is shared by objects of different
classes (or by different objects of the same class).

Second Object Normal Form (2ONF): A class is in 2ONF when it is in 1ONF and information that is shared
by multiple objects is extracted to its own class. The classes sharing the information are connected
to the new class by appropriate associations. Methods applicable to the new class will be part of the
new class.

Fig. 9.3. First object normal form

Fig. 9.4. Second object normal form

In our example, the information of the due date is shared by reader and book; as we
have seen previously, it should be extracted as an association class characterizing the
association between readers and books as in Figure 9.4. The method for setting the
due date is moved to the new class.
The third ONF (3ONF) is meant for ensuring cohesion: A class should not mix
responsibilities but instead should have only a single well-defined task.

Third Object Normal Form (3ONF): A class is in 3ONF when it is in 2ONF and when it encapsulates a
single well-defined, cohesive task. Other tasks have to be extracted into separate classes (together
with the methods for the tasks) and linked to the other class(es) by associations.

In our example we can see in Figure 9.5 that the address information can be extracted
into a separate address class. This has the advantage that the Address class can be
used for storing other addresses as well (not only addresses of readers) and the inter-
nal address format can be changed without modifying the accessing classes (for example, Reader). We can now also easily model that readers have multiple addresses (like home address, office address etc.).

Fig. 9.5. Third object normal form

The fourth ONF (4ONF) reduces duplication of attributes and methods by building
a class hierarchy: Some classes may be subclasses of other classes effectively inher-
iting their attributes and methods without a need for duplicating them. If necessary,
appropriate superclasses have to be newly created.

Fourth Object Normal Form (4ONF): A class is in 4ONF when it is in 3ONF and when duplicated at-
tributes and methods are extracted into a superclass (or an existing class is used as the superclass,
respectively). The class is linked to the superclass by an inheritance relation.

Obviously we can make Person the superclass of Reader as in Figure 9.6 and hence
avoid the duplicate declaration of the person attributes.
To sum up, normalization can help with the construction of a well-structured soft-
ware design. A word of warning should however be given: As is the case for relational
normalization, object normalization should be based on an assessment of application
requirements. A good software design should also take data accesses and data usage
into consideration. For example, if whenever a person object is accessed also his ad-
dress information is needed, then the address information could best be embedded
into the person object instead of extracting an address into an external class. From a
database perspective, retrieving one larger object (a person with embedded address)
is usually more efficient than retrieving two smaller separate objects (a person object
and its associated address object).

Fig. 9.6. Fourth object normal form

9.1.3 Referential Integrity for Objects

Referential integrity for objects is similar to referential integrity for the relational data
model. While referential integrity for the relational data model is based on foreign
keys (see Section 2.3), referential integrity for objects is based on object identifiers.

Referential Integrity: For each referenced OID in the system there should always be an object present
that corresponds to the OID.

This means that any referenced object must exist. In particular, dangling references
should be avoided: one should not delete a referenced object without informing the
referencing object. Hence, there is a need to maintain a backward (“inverse”) ref-
erence for each forward reference. More precisely, for complex objects, inverse at-
tributes can be used to maintain referential integrity. As an example, consider the
1:n-relationship from readers to books. To ensure referential integrity, deletion of an
object of class Reader must fail as long as it points to at least one object of class Book.
To delete an object of class Reader, first of all, the system has to modify the inverse
references to the Reader object from objects of class Book; modification can be setting
the references to another Reader object or to a null value.
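A minimal Java sketch (with simplified Reader and Book classes; the method names are illustrative) shows how such forward and inverse references can be kept consistent in application code:

import java.util.HashSet;
import java.util.Set;

class Book {
    String title;
    Reader borrower;                              // inverse reference to the reader
}

class Reader {
    String name;
    Set<Book> borrowedBooks = new HashSet<>();    // forward references to books

    // lending a book sets the forward and the inverse reference together
    void borrow(Book book) {
        borrowedBooks.add(book);
        book.borrower = this;
    }

    // before a Reader object may be deleted, the references from its books
    // must be reset so that no dangling references remain
    void returnAllBooks() {
        for (Book book : borrowedBooks)
            book.borrower = null;
        borrowedBooks.clear();
    }
}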

9.1.4 Object-Oriented Standards and Persistence Patterns

Accompanying the evolution of object-oriented programming languages, several
groups formed to promote the idea of object-orientation. A main goal was (and still
is) to develop standards for object models and query languages to improve the in-
teroperability between different object-oriented programming languages. The first
international group was the Object Management Group (OMG). It was founded in 1989
by several hundred partners including large software companies. Although it is not
an official standardization organization, it developed a reference object management
architecture that defined an object model and the Object Request Broker (ORB) as a
communication architecture for objects. The Common Object Request Broker Archi-
tecture (CORBA) is a reference architecture that uses an ORB as well as an interface
repository or a stub/skeleton approach to enable message passing between objects
that do not reside on the same server. The OMG also adopted the Unified Modeling
Language (UML) as the standard for object-oriented software development.

Web resources:
– OMG: http://www.omg.org/
– specifications: http://www.omg.org/spec/

The Object Data Management Group (ODMG) was founded by vendors of object database systems. It extended the OMG object model, leading to the ODMG object model. The ODMG also defined an Object Definition Language (ODL) and an Object Query Language (OQL). The ODMG officially ceased to exist in 2001, but some ODMG-compliant object database systems are still available; the OMG later announced plans to develop a “4th generation” standard for object databases by adopting the ODMG standard (version 3.0) sketched below. In more detail, the main components of the ODMG specification are
– The object model (OM) describes what an object is. It first of all distinguishes
between literals and objects. Literals are just constant values (they do not have
an identifier) and hence are immutable. There are atomic literals (like integers,
doubles, strings and enumerations), collection literals (set, bag, list, array and
dictionary), and structured literals (like date and time). In contrast, objects have
an identifier and are mutable (they can change their value); objects include user-
defined types and mutable versions of the above listed collections and structures.
The lifetime of an object can be either transient (it is just needed inside one method
call or inside one application and is destructed after termination of the method or
application), or persistent (it has to be persistently stored in a database).
– The object definition language (ODL) is used for specifying type definitions for in-
terfaces and classes. In particular, with the ODL one can define attributes (that
can only have literal values), relationships (that represent one or more referenced
objects and hence correspond to reference attributes or collections of references),
and method signatures (without the actual definition in the method bodies). ODL
definitions can be used to exchange objects between different programming lan-
guages.
– The object query language provides a SQL-like syntax to execute queries on ob-
jects in a database. It uses the well-known SELECT FROM WHERE clause, with
the difference that the SELECT clause can be used to execute methods on an ob-
ject, and path expressions can be used to follow relationships (that is, references)
inside an object. Flattening must be used to break a collection object into its con-
stituent objects to be able to process them further.

Apart from the comprehensive object technologies defined by OMG and ODMG, some more light-weight notations and patterns have become widely adopted. For example,
when it comes to include storage components in an object-oriented software design,
the design pattern of Data Access Object (DAO) is commonly used. It separates per-
sistence issues from the rest of the application, and hence is an abstraction of any
storage details (like database accesses). Roughly the DAO pattern works as follows:
for each application object that should be persisted to the database, there has to be
an accompanying DAO; the DAO executes all necessary operations to create, read, up-
date or delete the object in the database. Note however that transactions can (and
should) not be handled by each DAO individually but have to be managed globally for
the application.

Web resources:
– Oracle Core J2EE Patterns – Data Access Object:
http://www.oracle.com/technetwork/java/dataaccessobject-138824.html
– IBM developerWorks:
Sean Sullivan: Advanced DAO programming – Learn techniques for building better DAOs:
http://www.ibm.com/developerworks/java/library/j-dao/

9.2 Object-Relational Mapping

One approach to persistently store objects out of an object-oriented program is to use
a conventional RDBMS as the underlying storage engine. In this case, we have to map
each object to (one or more) tuples in (one or more) relational tables. In particular,
on the one hand, we have to write code that decomposes the object and stores the
attribute values into the appropriate tables; on the other hand, when retrieving the
object from storage we have to recombine the tuple values and reconstruct an object
out of them. We focus here on the problems that arise with complex objects (in partic-
ular, collection attributes and reference attributes) and specialization.

Table 9.1. Unnormalized representation of collection attributes

Person: ID  Name  Hobby  Child
1 Alice Swimming 3
1 Alice Hiking 3
2 Bob Football 3
2 Bob Cycling 3
1 Alice Swimming 6
1 Alice Hiking 6
2 Bob Football 6
2 Bob Cycling 6
3 Charlene Hiking 5
4 David Climbing 5
5 Emily Cycling NULL
6 Fred Swimming NULL

9.2.1 Mapping Collection Attributes to Relations

We start with the mapping of collection attributes: collections correspond to multi-
valued attributes (as introduced in Section 1.3.1). As already briefly described in Sec-
tion 2.1, multi-valued attributes are not allowed in the conventional relational model;
a fact that leads to difficulties when mapping the Entity-Relationship diagram to a
relational schema. To illustrate these difficulties even more we elaborate the exam-
ple of a Person table with an ID attribute as its key and other attributes for Name,
Hobby and Child information. Persons can have more than one hobby and more than
one child; that is, Hobby as well as Child should be multi-valued attributes. How-
ever we try to model these attributes in our relational table by simply duplicating
entries for Name, Hobby, Child in all necessary combinations. We assume the fol-
lowing domains for the (now single-valued) attributes: dom(ID)=dom(Child): Integer,
dom(Name)=dom(Hobby): String. When looking at Table 9.1, we notice a lot of redun-
dancy: Name, Hobby and Child have many duplicated entries.
Normalization (see Section 2.2) can come to the rescue: we obtain three tables (see
Table 9.2) – Name (N), Hobby (H) and Child (C) – where the ID is used as a key for the
Name table and a foreign key for the Hobby and Child table, respectively.
We see that we got rid of unnecessary redundancy: each combination of ID and hobby,
as well as ID and child, only occurs once. However, redundancy reduction comes at
the cost of more complex querying. In fact, SQL queries that have to combine names,
hobbies and children need lots of join operations – for example, the query “What are
the hobbies of Alice’s grandchildren?”:

SELECT H.Hobby FROM C C1, C C2, N, H WHERE N.Name = ’Alice’
AND N.ID = C1.ID AND C1.Child = C2.ID AND C2.Child = H.ID

Table 9.2. Normalized representation of collection attributes

N: ID  Name
   1   Alice
   2   Bob
   3   Charlene
   4   David
   5   Emily
   6   Fred

H: ID  Hobby
   1   Swimming
   1   Hiking
   2   Football
   2   Cycling
   3   Hiking
   4   Climbing
   5   Cycling
   6   Swimming

C: ID  Child
   1   3
   1   6
   2   3
   2   6
   3   5
   4   5
   5   NULL
   6   NULL

This query requires joins on all three tables including a self-join on the child table and is hence quite costly.
All in all we note that, while it is technically possible to store collections (and
hence multi-valued attributes) in a relational table, we forfeit performance of object
storage and object retrieval.

9.2.2 Mapping Reference Attributes to Relations

Another aspect of complex objects is that reference attributes have to be stored in the
database. Reference attributes represent relationships/associations between classes;
in particular, a reference attribute can point to an association class (see Section 1.3.2)
that is used to link two or more objects together possibly with additional attributes.
Reference attributes contain the OID of the referenced object. Hence in order to obtain
referential integrity for objects (see Section 9.1.3), we have to explicitly store the OID
of each object: each table representing a class has a separate column for the OID. The
OID column can then serve as a foreign key in the referencing class. Then we have to
ensure referential integrity in the relational tables as described in Section 2.3.
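Continuing the library example, a possible (purely illustrative) table layout stores an explicit OID column per class and uses it as a foreign key for the reference attribute:

CREATE TABLE Reader (
  OID      INTEGER PRIMARY KEY,   -- explicitly stored object identifier
  Name     VARCHAR(100)
);

CREATE TABLE Book (
  OID      INTEGER PRIMARY KEY,
  Title    VARCHAR(200),
  Borrower INTEGER REFERENCES Reader(OID)  -- reference attribute stored as a foreign key
);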

9.2.3 Mapping Class Hierarchies to Relations

The feature of specialization of classes implies that classes are effectively organized
in a class hierarchy. More and more attributes are implicitly added to objects of sub-
classes deeper in the hierarchy. An appropriate table structure (that is, a database
schema) has to be devised to store all the attribute values belonging to an object of
some class in the hierarchy. Here we only consider the case of one level of special-
ization; with increasing depth of specialization, however, the problem of storing at-
tributes of superclasses is aggravated. As an example for a simple class hierarchy, we
consider two subclasses (Student and Employee) of a Person class (see Figure 9.7).

Fig. 9.7. Simple class hierarchy: a Person class (name: String, age: int, marry(Person p), divorce()) with the subclasses Student (University: String, StudentID: int, study()) and Employee (company: String, hire(), fire())

To store these classes in a relational database, we have three options:

Store each class in a separate table: We store the attributes of the superclass in a
table separate from the attributes of the subclass. The superclass and the subclass in-
formation then has to be linked together by an ID. For our example this means that we
store general Person data (attributes Name and Age) in Person table; Employee data
(attribute Company) in an additional Employee table and use ID as a foreign key to
reference the Person table; as well as Student data (attributes University and Studen-
tID) in an additional Student table and use ID as a foreign key to reference the Person
table. More formally, we have the following relation schemas:
– Person({ID, Name, Age},{ID →{ID, Name, Age}})
– Employee({ID, Company},{ID →{ID, Company}})
– Student({ID, University, StudentID},{ID →{ID, University, StudentID}})

And the database schema:

D={{Person, Employee, Student},{Employee.ID⊆Person.ID,
Student.ID⊆Person.ID}}

With this way of storing a class hierarchy, we lose the distinction between what is the
subclass and what is the superclass. This semantics must then be built into the query
strings sent by the accessing applications. For example, applications have to decom-
pose objects to write the data into different tables: correct SQL insertion statements
have to be written depending on the subclass. For example, inserting an employee,
like

INSERT INTO Person VALUES (1,’Alice’,31)


INSERT INTO Employee VALUES (1,’ACME’)

is different from inserting a student, like

INSERT INTO Person VALUES (2,’Bob’,20)


INSERT INTO Student VALUES (2,’Uni’, 234797).

When retrieving objects from the database, the application programmer has to write
SQL queries to join the Person table with the right subclass table. Retrieving all names
of employees of ACME requires a join with Employee:

SELECT P.Name FROM Person P, Employee E
WHERE P.ID = E.ID AND E.Company = ’ACME’

Whereas, retrieving all names of Students of Uni requires a join with Student:

SELECT P.Name FROM Person P, Student S
WHERE P.ID = S.ID AND S.University = ’Uni’

Hence, storing each class in a separate table leads to duplication of code and handling
the different table names might cause inconsistencies. The advantage is that this stor-
age method reduces redundancy in the tables and maintains the superclass informa-
tion; for example, retrieving all names of all persons is easy:

SELECT P.Name FROM Person P

Store only the subclasses in tables: We store all the attributes of the subclass together with all inherited attributes in one table. For our example this means that we
don’t have a Person table but instead store all employee data (attributes ID, Name,
Age, Company) in the Employee table and all student data (attributes ID, Name, Age,
University and StudentID) in the Student table. That is, we have the following two
relation schemas each with key ID:
– Employee({ID, Name, Age, Company},{{ID} →{ID, Name, Age, Company}})
– Student({ID, Name, Age, University, StudentID},{{ID} →{ID, Name, Age, Univer-
sity, StudentID}})

And the database schema is simply composed of these two relation schemas:

D={{Employee, Student},{}}

Insertion and selection of values that only affect a single subclass has now become easier, because we just have to access the right subclass. Application logic must construct correct SQL statements depending on the subclass. Inserting an employee is now just a single SQL statement:

INSERT INTO Employee VALUES (1,’Alice’,31,’ACME’)

And the same applies to inserting a student:

INSERT INTO Student VALUES (2,’Bob’,20,’Uni’, 234797)



In the same vein, retrieving all names of employees of ACME does not require a join
with another table:

SELECT E.Name FROM Employee E WHERE E.Company = ’ACME’

and neither does retrieving all names of students of Uni:

SELECT S.Name FROM Student S WHERE S.University = ’Uni’.

Choosing the right table for storing an object is still the task of the accessing application. However, we completely lose the information of what is defined by the superclass (in our example, the attributes Name and Age). The semantics of the superclass must
be built into application logic: a UNION operation has to be executed on all subclasses
to produce the superclass. In our example, to get the names of all persons we have to
combine the information from the Employee and the Student table:

SELECT E.Name FROM Employee E UNION SELECT S.Name FROM Student S

Store all classes in a single table: We store all subclasses in one relation and hence
combine all attributes of all subclasses in this single relation. Doing this we cannot dif-
ferentiate between subclasses anymore: from the table structure it is not clear which
attribute belongs to which subclass. That is, again the accessing application has to
ensure correct handling of subclasses. An additional attribute “Type” that contains the class name of the appropriate subclass as its value can be introduced artificially to distinguish the subclasses. In our example, we only have a single Person table with all
attributes ID, Name, Age, Company, University and StudentID. The additional type
attribute ranges over {Employee, Student} and ID is the key attribute. The relation
schema of Person is hence:
– Person({ID, Name, Age, Company, University, StudentID, Type},{{ID} →{ID, Name,
Age, Company, University, StudentID, Type}})

Again each application must construct correct SQL statements depending on the subclass. The SQL statements differ depending on whether an employee is inserted:

INSERT INTO Person
VALUES (1,’Alice’,31,’ACME’, NULL, NULL, ’Employee’)

or a student is inserted:

INSERT INTO Person
VALUES (2,’Bob’,20, NULL, ’Uni’, 234797,’Student’)

What we see from this small example is that lots of unnecessary attributes are stored
for the objects; that is, the table stores lots of NULL values.
Retrieving data for a subclass hence requires checking the type attribute; for ex-
ample, all names of employees:

SELECT P.Name FROM Person P WHERE P.Type = ’Employee’

or, all names of Students:

SELECT P.Name FROM Person P WHERE P.Type = ’Student’

As the superclass information is now contained in the single table, retrieving infor-
mation of the superclass is easy; for example, retrieving the names of all persons:

SELECT P.Name FROM Person P

9.2.4 Two-Level Storage

When looking at storage management for object-relational mapping, the storage can
be divided into main memory as the first level and disk storage as the second level.
While the OOPL application storage model in main memory is object-oriented (divid-
ing data into objects, variables and the like), the database storage model on disk is
relational (handling data in terms of tables, tuples etc.). Data loading and storage is
hence more complex due to increased transformation efforts. In particular, there is a
separation in the main memory between the database page buffer and the OOPL ap-
plication cache (sometimes called the local object cache). Hence the following basic
steps are required to handle the two-level storage model (compared to Section 1.2):
1. the application needs to access some object the attribute values of which are
stored in an RDBMS; the application hence produces some query to access the
appropriate database table (or tables if the object values are spread over more
than one table);
2. the DBMS locates a page containing (some of) the values of the demanded object
on disk (possibly using indexes or “scanning” the table);
3. the DBMS copies this page into its page buffer;
4. as the page usually contains more data than currently needed, the DBMS locates
the relevant values (for example, certain attributes of a tuple) inside the page;
5. the application (possibly using a specialized database driver library) copies these
values in the application’s local object cache (potentially conversions of SQL data
into the OOPL data types are necessary);
6. the application reconstructs the object from the data values by instantiating a new
object and assigning the loaded values to its attributes;
7. the application can access and update the object’s attribute values in its local ob-
ject cache;
8. the application (again using the database driver library) transfers modified at-
tribute values from the local object cache to the appropriate page in the DBMS
page buffer (conversions of OOPL data types into SQL data types might be neces-
sary);
9. the DBMS eventually writes pages containing modified values back from the
DBMS page buffer onto disk.

What we see is that not only does the mapping of objects to tables (and the reconstruction of objects by reading values from different tables) involve some overhead; the storage management itself is also more complex due to the transformations and conversions necessary to handle data in the main memory.
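As an illustration of steps 1–7, the following minimal JDBC sketch (assuming a hypothetical PERSON table with columns ID, FIRST_NAME, LAST_NAME and AGE, a matching Person class, and a PostgreSQL connection URL) loads the attribute values of a single object and reconstructs the object in the application:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PersonLoader {
    public Person load(long id) throws Exception {
        // step 1: the application issues a query for the object's attribute values
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/library", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT FIRST_NAME, LAST_NAME, AGE FROM PERSON WHERE ID = ?")) {
            stmt.setLong(1, id);
            // steps 2-5: the DBMS locates the page and the driver copies the
            // relevant values into the application's local cache (the ResultSet)
            try (ResultSet rs = stmt.executeQuery()) {
                if (!rs.next()) return null;
                // step 6: reconstruct the object and assign the loaded values
                Person p = new Person();
                p.firstname = rs.getString("FIRST_NAME");
                p.lastname = rs.getString("LAST_NAME");
                p.age = rs.getInt("AGE");
                return p;   // step 7: the object can now be accessed and updated
            }
        }
    }
}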

9.3 Object Mapping APIs

As shown in the previous sections, mapping objects to relational tables requires quite
some data engineering. In particular, for the process of storing objects, the program-
mer has to manually connect to the RDBMS, store the object’s attribute values by
mapping them to tables while ensuring that referential integrity is maintained, and
additionally store class definitions (including method bodies). For the process of re-
trieving objects, the programmer has to connect to the RDBMS, potentially read in the
corresponding class definition, retrieve the object’s attribute values (potentially by
joining several tables) and also reconstruct all referenced objects. In contrast, object-
relational mapping (ORM) tools form an automatic bridge between the object-oriented
programming language (OOPL) and an RDBMS. In particular, there is no need to write
SQL code manually. This implies that the source code is more readable and also more portable between different RDBMSs – although an RDBMS-specific database driver is usually still required. ORM tools often already include optimizations like connection pooling and automatic maintenance of referential integrity.
ORM tools and APIs are available for a wide range of OOPLs. Standards for Java
object persistence are the Java Persistence API (JPA; see Section 9.3.1) as well as the
Java Data Objects (JDO) API which is covered in Section 9.3.2.

9.3.1 Java Persistence API (JPA)

We will have a closer look at the Java Persistence API; this API is defined in the
javax.persistence package of the Java language. The Java Persistence Query Lan-
guage (JPQL) is used to manage objects in storage. Additional metadata (for example
used to express relationships between classes) are used to map objects into database
tables; metadata can be expressed as annotations (markers starting with @ in the
source code) or configuration files (in XML).

Web resources:
– Oracle Java Platform, Enterprise Edition: http://docs.oracle.com/javaee/
– The Java EE Tutorial Part VIII Java Persistence API:
http://docs.oracle.com/javaee/7/tutorial/partpersist.htm

The main components of JPA are:

Entity Objects: Entity objects are objects that are to be stored in the database. The
class of an entity object has to implement the Serializable interface and has to be
marked with the annotation @Entity. This annotation causes the generation of a
database table – by default the table name is the class name; the default table name
can be changed with the @Table annotation. Column names of the table will be the
attributes of the class; a column name can be changed by explicitly specifying the an-
notation @Column. If some of the attributes should not be stored, they can be marked
with @Transient. An example for a Person class with three persistent attributes and
one transient attribute is as follows:

@Entity @Table(name="PERSON")
public class Person implements Serializable {
@Column(name="FIRST_NAME" )
String firstname;

@Column(name="LAST_NAME" )
String lastname;

@Column(name="AGE" )
int age;

@Transient private boolean hungry;


}

Note that it is more common to define columns by annotating the “getter” method for an attribute. For example, the firstname column might be defined by

@Column(name="FIRST_NAME")
public String getFirstName() {
return firstname;
}

Entity Lifecycle: A persistence context is a set of entity objects at runtime; all objects in a persistence context are mapped to the same database (“persistence unit”). The EntityManager API manages the persistence context; it also supports transactions that must be committed (see the “all or nothing” principle in Section 2.5); a minimal usage sketch follows the list of states below. An entity object
can have different states during its lifecycle:
– new: A new entity object is created but not managed (in a persistence context) nor
persisted (stored on disk)
– persist: An entity object is managed and will be persisted to the database on trans-
action commit
– remove: An entity object is removed from a persistence context and will be deleted
from the database on transaction commit.
– refresh: The state of an entity object is (re-)loaded from the database
– detach: When a transaction ends, the persistence context ceases to exist; that is,
the connection of the entity object to the database is lost and loading of any ref-
erenced objects (“lazy loading”) is impossible
– merge: The state of a detached entity object is merged back into a managed entity
object.
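A minimal usage sketch of these lifecycle operations (assuming a persistence unit named library configured in persistence.xml and the Person entity class from above) could look as follows:

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class LifecycleExample {
    public static void main(String[] args) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("library");
        EntityManager em = emf.createEntityManager();

        em.getTransaction().begin();
        Person p = new Person();    // state: new (neither managed nor persisted)
        em.persist(p);              // managed; will be stored on transaction commit
        em.getTransaction().commit();

        em.getTransaction().begin();
        em.refresh(p);              // (re-)load the state from the database
        em.remove(p);               // will be deleted from the database on commit
        em.getTransaction().commit();

        em.close();                 // remaining managed objects become detached
        emf.close();
    }
}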

Identity Attributes: In each class definition, a persistent identity attribute is required: it is mapped to the primary key of the database table corresponding to the class. There are
three ways to specify the identity attribute; it can be either
– a single attribute of simple type which is annotated with @Id, or
– a system-generated value by annotating the ID attribute with @GeneratedValue,
or
– spanning multiple attributes by adding a user-defined class for the ID which can
be an internal embedded class (annotated with @EmbeddedId), or an external in-
dependent class (by annotating the entity with @IdClass).

Note that the identity attribute must be defined in the top-most superclass of a class
hierarchy. As an example for an identity attribute that is system-generated and stored
in an identity column (when the row belonging to a Person object is inserted in the
table), consider the following Person class:

@Entity
public class Person implements Serializable {
@Id @GeneratedValue(strategy=GenerationType.IDENTITY)
protected Long id;
...
}

Embedding: Embedding allows for the attributes of an embedded class to be stored
in the same table as the embedding class. The reference attribute in the embedding
class is annotated with @Embedded; while the embedded class definition is annotated
with @Embeddable. For example, we could define an Address class that can be stored
(in embedded form) in the same table as the remaining person attributes.

@Entity public class Person implements Serializable {
@Id protected Long id;
...
@Embedded protected Address address;
}

@Embeddable public class Address {...}

Inheritance: JPA offers all three kinds of mapping a class hierarchy to relational tables (as described in Section 9.2.3). With the @Inheritance annotation we can define which of the strategies is used; a minimal sketch for the single-table case follows the list below.
– The first case (“Store each class in a separate table”) is declared with the anno-
tation @Inheritance(strategy=InheritanceType.JOINED): superclasses and
subclasses each are mapped to their own table and each table contains only those
columns defined by the attributes in the class; as mentioned previously, the ID at-
tribute has to be defined in the top-most superclass (and hence will be a column
in the corresponding table) and in all subclass tables the ID is used as a foreign
key to link rows belonging to the same object in the separate tables.
– The second case (“Store only the subclasses in tables”) is declared with the
@Inheritance(strategy=InheritanceType.TABLE_PER_CLASS) annotation:
In a table for a subclass, columns will be created for all attributes inherited from
any superclass; hence, we have duplication of all superclass attributes on sub-
class tables. Note however that concrete superclasses (those that are not abstract
classes) will get their own table, too: any concrete superclass can be instantiated
and its objects will be stored in the appropriate superclass table. If this is not
desired, we can annotate a superclass with @MappedSuperclass (instead of an-
notating it with @Entity): in this case no table for the superclass will be created
and no object of the superclass can be stored (only objects of its subclasses).
– The third case (“Store all classes in a single table”) is declared with the anno-
tation @Inheritance(strategy=InheritanceType.SINGLE_TABLE): All objects
of all superclasses and subclasses are stored together in one table that contains
columns for all attributes. As described in Section 9.2.3, we need an additional
type column to differentiate between objects of different class; this type column
is called discriminator column in JPA. We can give this discriminator column a
name with the @DiscriminatorColumn annotation: for example, for our person
hierarchy we can annotate the Person class with @DiscriminatorColumn(name =
"PERSON_TYPE"); each subclass will be annotated with a discriminator value: for
example, the Employee class can be annotated with its own discriminator value
like @DiscriminatorValue("employee") such that any employee in the table


will have the value employee in the PERSON_TYPE column.
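For instance, the single-table strategy for the person hierarchy from Section 9.2.3 could be declared as follows (a minimal sketch; methods and further annotations are omitted, and each class would go into its own source file):

import java.io.Serializable;
import javax.persistence.*;

@Entity
@Inheritance(strategy = InheritanceType.SINGLE_TABLE)
@DiscriminatorColumn(name = "PERSON_TYPE")
public class Person implements Serializable {
    @Id @GeneratedValue protected Long id;
    protected String name;
    protected int age;
}

@Entity
@DiscriminatorValue("employee")
public class Employee extends Person {
    protected String company;
}

@Entity
@DiscriminatorValue("student")
public class Student extends Person {
    protected String university;
    protected int studentID;
}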

Relationships: JPA allows for specifying cardinalities for relationships (as the ones described
in Section 1.3.1). Hence, an attribute that references an object of another class can
be annotated with either @OneToOne, @OneToMany (or its reverse @ManyToOne), or
@ManyToMany. A one-to-many relationship corresponds to a collection attribute (Set,
List, Map,...). A relationship can be unidirectional – which means we only have a
forward reference – or bidirectional – in this case we also have an inverse reference
that helps ensure referential integrity as described in Section 9.1.3. For bidirectional
relationships the forward reference is on the so-called owning side; the backward
(inverse) reference is on the owned side. The owning side is responsible for man-
aging the relationship; for example, maintaining the correct foreign key values that
participate in the relationship. In order to do so, the owned side has to annotate the
backward reference with the information which attribute on the owning side consti-
tutes the forward reference; this is done with the mappedBy statement. In our previous
library example we had the case of a one-to-many (1:n) relationship between a reader
and his loaned books. When implementing this as a bidirectional relationship with the
Book class as the owning side, and the Reader as the owned side, then the mappedBy
statement declares that the attribute borrower in the Book class is responsible for the
management of the relationship:

@Entity public class Book implements Serializable {
@Id protected Long BookId;
@ManyToOne protected Reader borrower;
}
@Entity public class Reader implements Serializable {
@Id protected Long ReaderId;
@OneToMany (mappedBy="borrower")
protected Set<Book> booksBorrowed = new HashSet();
}

Several other options are available that can be used to configure storage of relation-
ships in JPA. For example, relationships may be loaded as either eager or lazy. Lazy
loading means that loading of a referenced object is deferred until the object is actu-
ally accessed for the first time. Eager loading means that a reference object is loaded
when the referencing object is loaded. Another issue is cascading of operations to ref-
erenced objects; for example, we might configure that whenever an object is stored
(“persisted”), referenced objects are also persisted. These settings can be specified in-
dividually for each relationship; they could however also be configured globally in
the XML mapping file (for example, persistence by reachability means that all ref-
erenced objects are always persisted when the referencing object is persisted).

Java Persistence Query Language: The Java Persistence Query Language (JPQL) is a
SQL-like language that offers projection onto some attributes (in its SELECT clause),
explicit JOINS, subqueries, grouping (GROUP BY), as well as UPDATE and DELETE op-
erations.
For example, from our Person table with embedded Address information we can re-
trieve Alice Smith’s hometown and its ZIP code as follows:

SELECT p.address.city, p.address.zipcode FROM Person p
WHERE p.firstname=’Alice’ AND p.lastname=’Smith’

Query Objects: Query objects can be created by calling the createQuery() method
of the EntityManager API. Queries are processed by executing getResultList().
Named queries can be stored for reuse with different parameters; this can be done
by using the @NamedQuery annotation. Dynamic queries are specified at runtime;
their number of parameters can change and they can have named or positional pa-
rameters. For example, to get a list of persons for a given ZIP code we can define a
query method findByZipcode where the zip code can be input as a parameter:

// the entity manager is injected as a field of the enclosing class
@PersistenceContext EntityManager em;

public List findByZipcode(int zip) {
    Query query = em.createQuery("SELECT p FROM Person p" +
        " WHERE p.address.zipcode = :zipparameter");
    query.setParameter("zipparameter", zip);
    return query.getResultList();
}

When using a JPA-compliant ORM tool, several system-specific settings have to be
made; these are usually declared in an XML configuration file. For example, for the
Hibernate ORM tool, a configuration file can contain properties (like the driver class for the underlying RDBMS, the URL for the database connection, and a user-
name and password for the database connection), as well as mappings (like names of
classes we want to store in the database):

<hibernate-configuration>
<session-factory>
<property name="hibernate.connection.driver_class">
org.postgresql.Driver</property>
<property name="dialect">
org.hibernate.dialect.PostgreSQLDialect</property>
<mapping class="org.dbtech.Person"/>
</session-factory>
</hibernate-configuration>

9.3.2 Apache Java Data Objects (JDO)

The Java Data Objects API is meant to be highly independent of document formats or
data models of databases or any database-specific query languages. The main pur-
pose of JDO is to let Java programmers interact with any underlying database (or data
format) without using database-specific code.

Web resources:
– Apache JDO: http://db.apache.org/jdo
– specifications: http://db.apache.org/jdo/specifications.html
– Apache SVN repository: http://svn.apache.org/viewvc/db/jdo/

In JDO there are three types of classes:


– persistence capable classes that are mapped to the storage layer – they are anno-
tated as @PersistenceCapable;
– persistence aware classes that interact and modify persistence capable classes –
they are annotated as @PersistenceAware;
– normal classes that are totally unaware of any storage related issues and are not
stored themselves.

Persistence capable and persistence aware classes must be declared by either using an XML metadata file or by using annotations (for example, @PersistenceCapable). Persistence related operations are offered by the interface PersistenceManager. Field-level persistence modifiers can be persistent, transactional or none (in which case defaults or persistence by reachability are applied). In general, a (transient) object that is referenced by a persistent object in a persistent field will become persistent when the referencing object is stored, together with the closure of its references in the object graph. For fields of objects, there is a so-called default fetch group: depending on the type of the field, some fields are loaded by default whenever an object of a class is loaded (for example, simple types like numbers); arrays and other collection types, however, are not loaded by default.
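A minimal declaration of a persistence capable class via annotations might look as follows (a sketch using the javax.jdo.annotations package; the attribute names follow the Person example used before):

import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

@PersistenceCapable
public class Person {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)  // datastore-generated key
    private Long id;

    @Persistent private String firstName;
    @Persistent private String lastName;
    @Persistent private int age;

    public Person(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }
}
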
JDO supports two types of object identifiers: application identity is based on the
values of some fields in the object (these fields are then also called primary key); datas-
tore identity however considers an internal identifier that the programmer can neither
declare nor influence.
An object can hence be in one of several different states during its entity lifecycle:
Transient: An object that is newly created and is not yet or will never be persisted.
Persistent New: A newly created object that has been stored to the data store for
the first time.
Persistent Dirty: A persistent object that has been modified after it has been last
stored. Its state in memory is different from its state in the data store.

Persistent Clean: A persistent object that has not been modified after it has been
last stored and hence represents the same state in memory and in the data store.
Persistent Deleted: Any persistent object that is to be removed from the data
store.
Hollow: An object that is stored in the data store but not all its fields have been
loaded into memory.
Detached Clean: An in-memory object that is disconnected from its datastore rep-
resentation and has not been changed since it was detached.
Detached Dirty: An in-memory object that is disconnected from its datastore rep-
resentation but has been changed since it was detached.

Furthermore, modifications of an object can be part of a transaction or not. Depending on this, even more states of an object are possible by differentiating whether an object
is transactional or non-transactional. The persistence manager offers several meth-
ods to manage the different states of an object. For example, to store a newly created
object, makePersistent is called as follows:

PersistenceManagerFactory pmf =
    JDOHelper.getPersistenceManagerFactory(properties);
PersistenceManager pm = pmf.getPersistenceManager();
Transaction tx = pm.currentTransaction();
try {
tx.begin();
Person p = new Person("Alice","Smith",31);
pm.makePersistent(p);
tx.commit();
}
finally {
if (tx.isActive()){
tx.rollback();
}
}

An object can be retrieved from storage by using its identifier:

Object obj = pm.getObjectById(identity);

Moreover, the getExtent method returns a collection of all persisted objects of a given class, which can be iterated:

Extent e = pm.getExtent(Person.class, true);
Iterator iter = e.iterator();

As a query language the Java Data Objects Query Language (JDOQL) can be used. A
query object is created with the help of the persistence manager; a selection condition
can be passed as a parameter of a filter; and then the query can be executed.

Query q = pm.newQuery(Person.class);
q.setFilter("lastName == \"Smith\"");
List results = (List)q.execute();
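
Parameters and a result ordering can also be declared explicitly; the following lines are a hedged sketch that assumes the field names lastName and firstName of the Person example:

Query q = pm.newQuery(Person.class);
q.setFilter("lastName == lastNameParam");
q.declareParameters("String lastNameParam");
q.setOrdering("firstName ascending");
List results = (List) q.execute("Smith");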

9.4 Object-Relational Databases

Given the wide-spread use of relational database management systems on the one
hand and the predominance of object-oriented programming languages on the other
hand, most RDBMSs now include some object-oriented functionality. At the heart of
this, the SQL standard also includes some object-oriented extensions; the basic unit
of storage however is still a tuple inside a relation. RDBMSs supporting (part of) the
object-oriented extensions of the SQL standard have been termed object-relational
database management systems (ORDBMSs). Note however that (as with the purely relational SQL standard) no ORDBMS implements this standard to its full extent. We now briefly discuss some of these SQL extensions. The presented SQL no-
tation is just meant as an illustrative example: the specific object-relational syntax
differs very much between different object-relational database systems.

SQL Objects and Object Identifiers (OIDs): In the purely relational model, a relation
is a set of tuples (that is, rows). In the object-oriented extension, a relation can also
be a set of SQL objects. A SQL object is a tuple with an additional object identifier;
a SQL object is constructed by inserting values into a typed table (cf. the description
of user-defined types below). A tuple ranges as usual over a set of attributes (that is,
columns). Attribute domains can still be the usual SQL primitive (single-valued) types
such as CHAR(18), INTEGER, DECIMAL, BOOLEAN; in addition, more complex types
are allowed as detailed below.

Tuple Values: With tuple values, SQL can support composite attributes. Inside a
tuple, an attribute can have another tuple as its value. A tuple groups together several
attributes of (possibly) different domains. An example for a tuple value is an address
tuple inside a person tuple: while each person has first name, last name and address as its attributes, the address itself consists of street, house number, city and ZIP code.
In SQL, a subtuple can be constructed with the ROW constructor:

CREATE TABLE Person (
    Firstname CHAR(20),
    Lastname CHAR(20),
    Address ROW(
        Street CHAR(20),
        Number INTEGER,
        ZIP CHAR(5),
        City CHAR(20)
    )
)

In queries, components of a tuple value can be accessed using path expressions with
the dot operator multiple times. For example, the ZIP code in the address can be ac-
cessed via the address of a person:

SELECT P.Lastname FROM Person P WHERE P.Address.ZIP = '31141'

Insertion of new values can also be done using the ROW constructor:

INSERT INTO Person(Firstname,Lastname,Address)
VALUES ('Alice','Smith',ROW('Main Street',1,'31134','Newtown'))

Updating values can either be done using path expressions:

UPDATE Person SET Address.ZIP = '31134' WHERE Address.ZIP = '31141'

or using the ROW constructor:

UPDATE Person SET Address = ROW('Long Street',5,'31134','Newtown')
WHERE Address = ROW('Main Street',1,'31134','Newtown')
AND Lastname = 'Smith'

Collection Attributes: Collection attributes group values of the same type. In Sec-
tion 9.2.1 we saw that one complication involved in object-relational mappings is collection attributes (that is, multi-valued attributes in the conceptual model). Such
collections are now supported by the object-oriented extensions of SQL; in particu-
lar, multisets (called “bag” in the ODMG standard; see Section 9.1.4), and arrays. As
multisets are a generalization of sets, we can simulate the conventional sets by check-
ing for and disallowing duplication in multisets. With collection types we can indeed
avoid redundancy and normalization efforts for multi-valued attributes. For example,
the Person table in Section 9.2.1 suffers from duplications of values due to multiple
hobbies and multiple children. Turning the Hobby and Child attributes into collection
attributes avoids this duplication. More precisely, we let Person be a relation schema
with attributes ID, Name, Hobby, Child; and the attribute domains dom(ID): Integer,
dom(Name): String, dom(Child): set of Integers, dom(Hobby): set of Strings. The re-
sulting table (equivalent to the one from Section 9.2.1) is shown in Table 9.3.

Table 9.3. Collection attributes as sets

Person:
ID  Name      Hobby                 Child
1   Alice     {Swimming, Hiking}    {3,6}
2   Bob       {Football, Cycling}   {3,6}
3   Charlene  {Hiking}              {5}
4   David     {Climbing}            {5}
5   Emily     {Cycling}             {}
6   Fred      {Swimming}            {}

There are two operations that can be executed on tables with (multi-)set-valued
columns: nesting and unnesting. Nesting is the process of grouping values of a col-
umn from several rows into a single set in a single row. For example, converting the
Person table in Section 9.2.1 into Table 9.3 is done by nesting the Hobby attribute and
nesting the Child attribute. The reverse operation, unnesting, writes the values of a set-valued attribute into separate rows.
Let us have a closer look at another collection: the array. As it is a collection, an
array combines several values of the same type. However, an array has a fixed length
(that is, a maximum amount of values that can be stored in the array), and its values
can be accessed by an index number. The ARRAY constructor can be used to build
an array of elements by writing a pair of square brackets [ and ] and separating the
individual elements by commas. For example, an array of four String elements can be
constructed as ARRAY['Alice','Bob','Charlene','David']. Moreover, the ARRAY
constructor can build an array from a single column of a table. For example, an array
containing all names currently stored in the Person table: ARRAY(SELECT Name FROM
Person). When declaring an attribute in a table, we can set the length of the array
by writing the length value into square brackets. Let us assume that each person can
have up to three telephone numbers that are stored in an array:

CREATE TABLE Person(
    Telephone VARCHAR(13) [3],
    Name CHAR(20)
)

Insertion of values can be done using the ARRAY constructor:

INSERT INTO Person VALUES
    (ARRAY['935279','908077','278784'], 'Alice')

The selection of one of the telephone numbers can then be done by specifying the
index of the telephone number; for example:

SELECT Telephone[2] FROM Person WHERE Name = 'Alice'

User-Defined Types (UDTs): Close to the object-oriented concept of a class, SQL offers
user-defined types to structure data into reusable definitions. For example, we can
create a type for persons as follows:

CREATE TYPE PersonType AS (
    Firstname CHAR(20),
    Lastname CHAR(20),
    Address ROW(Street CHAR(20),
        Number INTEGER, ZIP CHAR(5), City CHAR(20))
)

SQL also supports inheritance for UDTs. For example, a new type StudentType inherits
first and last name and address information from PersonType if derived from Person-
Type with the key word UNDER:

CREATE TYPE StudentType UNDER PersonType AS (
    University CHAR(20),
    StudentID INTEGER
)

As another object-oriented concept, SQL supports methods for UDTs. The method sig-
nature (for example, the return value type) is declared independently from the actual
implementation. For example, we may declare the signature of a method called study:

METHOD study() RETURNS BOOLEAN

Later on we define the method explicitly to apply to the StudentType with a link to an
external implementation (for example in the language C):

CREATE METHOD study() FOR StudentType
LANGUAGE C
EXTERNAL NAME 'file:/home/admin/study'

Method definitions can also directly contain the method implementation written in
SQL. UDTs have two main application contexts in a SQL database:
1. A UDT can be used as a type for an attribute inside a table.
2. A UDT can be used as a type for a table.

As an example for the first application context, we create a new type for names and
use this NameType for the name attribute in the person table:

CREATE TYPE NameType AS (
    Firstname CHAR(20),
    Lastname CHAR(20)
)
CREATE TABLE Person (
    Name NameType,
    Age INTEGER
)

We can then use path expressions to query for typed attributes. For example, querying
the Person table for all firstnames of people with lastname Smith by descending into
the NameType object:

SELECT (P.Name).Firstname FROM Person P
WHERE (P.Name).Lastname = 'Smith'

As an example for the second application context, with the key word OF we can create
a table of type StudentType. That is, the table has the attributes defined by the type
(without the need to declare them again):

CREATE TABLE Student OF StudentType

Tables of a UDT are called typed tables; only rows (tuples) inside a typed table are SQL
objects and hence have an OID. Only tuples of typed tables can hence be referenced by
a reference attribute (see description of reference types below). Tuples in the following
untyped table Student1 are not SQL objects and hence do not have an OID:

CREATE TABLE Student1 (
    Firstname CHAR(20),
    Lastname CHAR(20),
    Address ROW(Street CHAR(20),
        Number INTEGER, ZIP CHAR(5), City CHAR(20)),
    University CHAR(20),
    StudentID INTEGER
)

In a similar vein, values for typed attributes (for example, the Name attribute of the
Person table) are not SQL objects.

References: Another feature of SQL objects is that they can be referenced by their OID
from other tables. Here it is important to note that every typed table has a so-called
self-referencing column that stores the OIDs of the tuples. When creating a table,
we can give the self-referencing column an explicit name (with the expression REF
IS) and also declare the OID values to be system-generated when creating a tuple. For
example, in the typed student table, we can call the self-referencing column studoid.

CREATE TABLE Student OF StudentType
    REF IS studoid SYSTEM GENERATED

Otherwise the self-referencing column exists but is unnamed. When declaring a refer-
ence attribute in another table, we can reference a typed table by declaring the refer-
enced type and – as the scope – the table from which referenced tuples can be chosen.

CREATE TABLE StudentRecord (
    Course CHAR(20),
    Mark DECIMAL(2,1),
    Testee REF(StudentType) SCOPE Student
)

The reference attribute hence stores the OID of a SQL object in a typed table; it is used
as a direct reference to a specific tuple in the typed table. To access attributes inside
the referenced tuple, a special dereference operator -> has to be used. For example, to
access the student ID of some student taking a database course referenced in a student
record:

SELECT R.Testee->StudentID FROM StudentRecord R
WHERE R.Course = 'Databases'

There are also other ways of dereferencing reference attributes specific to each object-
relational DBMS.
In summary, object-relational database systems support major object-oriented
features on top of conventional relational technology. The data model on disk is
still the relational table format and hence different from the object models used in
object-oriented programming languages. However, not all ORDBMSs support all the object-oriented SQL extensions and they each use a different syntax. Hence, code portability is limited when using object-oriented extensions of SQL.

9.5 Object Databases

Pure object database management systems (ODBMSs) use the same data model (that
is, a particular “object model”) as object-oriented programming languages (OOPL).
There is no need to map the objects into a different format – be it relational or some-
thing different; there might however be the need to map objects written in a particu-
lar OOPL into the object model used by the ODBMS. Nevertheless, there is no object-
relational impedance mismatch (see Section 9.1): due to a uniform data and storage
model an object database combines features of OOPLs and DBMSs.

An object database has to meet both object-oriented requirements (like complex objects, object iden-
tity, encapsulation, classes and UDTs, inheritance and polymorphism) and database requirements
(like persistence, disk-storage organization, concurrent transactions, recovery mechanisms, query
languages and database operations).

In this section we survey some strategies relevant for object storage.

9.5.1 Object Persistence

Different options for persistently storing objects could be used; some of them depend
on the object-oriented programming language (OOPL) used or even the system an
object-oriented application is run on. For one, a snapshot (or checkpoint) of the part
of the main memory that is occupied by the application can be stored to disk. The main
memory organization is however highly system-dependent and a checkpointed appli-
cation could then only be restarted by the system that created the snapshot. Hence,
checkpointing an application does not offer data independence: how the stored ob-
ject is accessed highly depends on the physical internal storage format; moreover,
from a logical point of view, a stored application can only be accessed in its entirety
and not be filtered to access only the relevant subset of objects inside the application.
Neither schema evolution nor versioning can be easily achieved with checkpointing,
and we cannot differentiate between transient attributes (which should not be stored)
and persistent attributes (which should be stored).
Serialization is the process of converting objects into a reusable and transportable
format. When serializing complex objects, the closure of the complex object is serial-
ized: a serialized deep copy of the object is created with all references followed and
serialized; for large objects with lots of deep references, (de-)serialization is a time-
consuming process. Object identity however is usually lost when objects are serial-
ized. This means that when two objects which reference the same object are serialized, the referenced object will appear in both deep copies; upon deserialization, the referenced object will be instantiated twice with different identities. Ensuring ob-
ject identity with serialization can only be achieved by influencing the serialization
and deserialization process and hence requires extra effort by the application pro-
grammer. Class information is usually not part of the serialization – only the object
state is serialized; for deserialization the class definition must be made available to
the deserialization process. Moreover, serialization exposes private attributes of an
object because they are fully accessible in the serialized object. Serialization has no
support for schema evolution as the process of deserialization will fail when the class definition used for deserialization differs from the one used for serialization. As serializa-
tion usually is OOPL-specific, using a serialized object with a different OOPL requires
transforming the serialized object into the required format by hand.
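
The loss of object identity can be observed with plain Java serialization; the following self-contained sketch (class names are illustrative and not tied to any particular system) serializes two objects that share a referenced object into separate streams and shows that the shared reference is duplicated after deserialization:

import java.io.*;

class Address implements Serializable { String city = "Newtown"; }

class Contact implements Serializable {
    Address address;
    Contact(Address address) { this.address = address; }
}

public class IdentityLossDemo {

    // serialize into a byte array and deserialize again (a separate stream per call)
    static Contact roundTrip(Contact c) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(c);
        out.flush();
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        return (Contact) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        Address shared = new Address();
        Contact a = new Contact(shared);
        Contact b = new Contact(shared);               // a and b share one Address

        Contact a2 = roundTrip(a);                     // two independent deep copies
        Contact b2 = roundTrip(b);

        System.out.println(a.address == b.address);    // true
        System.out.println(a2.address == b2.address);  // false: identity is lost
    }
}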
In contrast to the above, pure object databases store objects together with their
class definition (and hence together with their methods). Object identity and rela-
tionships between objects are preserved and, when objects are retrieved from the
database, the correct object graph is reconstructed.
There are roughly three different options to indicate that a particular object should
be persisted in the object database – some ODBMSs allow a combination of these
strategies:
Persistence by reachability: Some objects are denoted as root objects. All ob-
jects that are referenced by a root object (directly and indirectly) will be automat-
ically stored whenever the root object is stored. Restricting the level of references
followed up to a certain depth in the object graph is possible in some ODBMSs for
an improved storage performance.
Persistence by type: All objects of a user-defined class will automatically be per-
sisted. The class can be denoted as a persistent class by either inheriting from a
system-defined class (for example, a marker interface) or by using annotations (as
in JPA; see Section 9.3).
Persistence by instantiation: When an object is created it can be marked as per-
sistent.

ODBMSs handle objects very efficiently: for example, a query can navigate in the ob-
ject graph (that is, follow references between objects) until the relevant information is
reached and without actually loading unnecessary objects along the navigation path;
moreover, updated objects can be stored back to disk without actually persisting un-
modified objects in the object graphs – some ODBMSs allow the application program-
mer to mark updated objects as “dirty” to notify the ODBMS that this particular object
must be written to disk.

9.5.2 Single-Level Storage

While the object-relational mapping follows a two-level storage approach described in Section 9.2.4 (the OOPL object model in main memory, and the relational table model on
disk), an ODBMS follows a single-level storage approach, with a similar representation
of objects in main memory as well as on disk. In contrast to the conventional file man-
ager and buffer manager presented in Section 1.2, an ODBMS has a component called
object manager that loads or stores objects: given some OID, the object manager re-
turns the address (a pointer) to the object in main memory (possibly after loading it
into main memory from the disk); or the object manager persists an object from the
main memory to disk.

Objects and references between objects can be seen as a directed graph: Objects are
vertices (nodes) and references are directed edges (from the referencing to the refer-
enced object). This graph of the objects in an application and references between them
is often called the virtual heap of the application. During execution of the applica-
tion the heap is traversed along the references by going from one object to the next
via method calls. Not the entire heap of an application is loaded into main memory:
hence, inside the virtual heap we make a distinction between resident objects (the ones loaded into main memory) and non-resident objects (the ones not in the main
memory but only stored on disk). When traversing the virtual heap at runtime, the ap-
plication may use a reference attribute inside a resident object to access the contents
or execute a method of a non-resident object. This leads to an object fault (a notifica-
tion that the referenced object is not yet in the main memory) which causes the object
manager to make the referenced object resident; that is, the object manager loads the
required object into main memory from disk, and returns to the referencing object a pointer to the newly resident object. More precisely, inside the main memory we can
usually again distinguish between the ODBMS page buffer (which is directly managed by the object manager) and an application’s local object cache; hence, the ODBMS will
first load the object into the DBMS page buffer and then copy the object to the local ob-
ject cache of the application for further processing. Because in the single-level storage
the ODBMS can handle objects directly, the steps to load an object differ from the steps
required in the two-level storage (see Section 9.2.4):
1. The application needs to access some object stored in an ODBMS; the application
hence produces some query to access the object in the database (for example, by
telling the DBMS the OID of the object, or specifying search conditions for finding
relevant objects);
2. the DBMS locates a page containing the demanded object on disk (possibly using
indexes that help find matching objects);
3. the DBMS copies this page into its page buffer;
4. the DBMS locates the object’s representation inside the page;
5. the DBMS copies the object representation into the application’s local object cache
(potentially, conversions of the DBMS object representation into the OOPL-specific representation are necessary);
6. the application can access and update the object’s attribute values in its local ob-
ject cache;
7. When the application wants to store an updated object, the DBMS transfers the
modified object representation into the DBMS page buffer (possibly converting it
from the OOPL-specific representation into the DBMS object representation);
8. the DBMS eventually writes pages containing modified objects back from the
DBMS page buffer onto disk.

9.5.3 Reference Management

As mentioned above, the objects inside a virtual heap are connected by references.
We discuss three basic options for reference management: direct referencing, indirect
referencing and OID-based referencing.
As a first option, pointer-based direct referencing uses direct addressing with
physical memory addresses. The main memory address (that is, a pointer to the referenced object in main memory) is stored in the reference attribute. This pointer corre-
sponds to the actual physical address which a resident object currently occupies in
main memory. When a non-resident object is loaded into main memory, it must be
loaded exactly into the physical address which is contained in the reference attribute
referencing this loaded object; this hence requires a sophisticated main memory management to avoid that the same pointer is used to reference different objects.
As a second option for implementing references, with indirect referencing refer-
ence attributes contain virtual memory addresses (see Section 1.2). Using virtual ad-
dresses introduces a level of indirection because it requires a mapping from virtual
to the actual physical address; it requires appropriate management of pointers in an
indirection table. The virtual (and also the physical) address space is usually a lot
smaller than the persistent OID space; this could lead to difficulties with long-term
storage of objects (because after some time, different objects may be located at the
same address).
Another option is to use OID-based referencing: The OID of the referenced object is
stored in the reference attribute. Applications have to make sure that the OIDs are per-
manent and persistent (see the discussion on OID permanence in Section 9.1.1). When
following an OID-based reference, the DBMS has to map the OID to a main memory ad-
dress. This task requires looking up the OID in the Resident Object Table: When an ob-
ject is loaded into main memory, it may reside on an arbitrary physical address in the
ODMBS page buffer (or application’s local object cache). A lookup mechanism must
map the OID of a resident object to its current physical memory address. This mapping
can be stored in a lookup-table called Resident Object Table (ROT) maintained by the
ODBMS. Whenever an OID-based reference is followed, the object manager first looks
for the OID in the ROT. If the OID is present in the ROT, the object is already resident
and the object manager can return the memory address as recorded in the ROT; when
the OID cannot be found in the ROT, the object has to be loaded from disk and its OID
has to be stored in the ROT together with the current memory address of the object.
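
A minimal sketch (all names are hypothetical, not the API of a real ODBMS) of how an object manager could resolve OID-based references via a Resident Object Table:

import java.util.HashMap;
import java.util.Map;

class ObjectManager {
    // Resident Object Table: maps OIDs of resident objects to their in-memory representation
    private final Map<Long, Object> residentObjectTable = new HashMap<>();

    Object resolve(long oid) {
        Object resident = residentObjectTable.get(oid);
        if (resident != null) {
            return resident;                 // already resident: return its memory reference
        }
        Object loaded = loadFromDisk(oid);   // object fault: make the object resident
        residentObjectTable.put(oid, loaded);
        return loaded;
    }

    private Object loadFromDisk(long oid) {
        // placeholder for locating the object's page on disk and materializing the object
        return new Object();
    }
}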

9.5.4 Pointer Swizzling

When the same OID-based reference is followed very often, the table lookup in the ROT
might be inefficient. To achieve a better performance the process of pointer swizzling
can be applied: an OID inside a resident object is temporarily replaced by the current

[Figure: three snapshots of the Resident Object Table – C not accessed; A accesses C; B accesses C using the ROT]

Fig. 9.8. Resident Object Table (grey: resident, white: non-resident)

memory pointer of the referenced object; when such an object is stored back to disk,
the swizzled pointers have to be replaced by the original OIDs (“unswizzled”). Hence,
for swizzled OIDs, the lookup in the ROT can be avoided and the referenced object can instead be accessed directly. Two forms of pointer swizzling have been analyzed:
Edge Marking: Each reference attribute (representing an edge in the object graph)
is accompanied with a tag bit. If the tag bit is set to 1, the object is already resi-
dent (that is, loaded into main memory) and the value of the reference attribute
is the appropriate main memory pointer. If otherwise the tag bit is set to 0, the
object is non-resident (has not been loaded into main memory) and the value of
the reference attribute is an OID; in this case an object fault is generated, the ob-
ject is loaded, the referencing OID is changed to the appropriate memory pointer,
and finally the tag bit is set to 1. In other words, only references that are followed
(“dereferenced”) are swizzled. It might hence happen that a resident object references an object that has already been loaded by dereferencing a reference attribute in another object. In this case, the former object is unaware that the referenced
object has been dereferenced by another object. To illustrate this (see Figure 9.9),
assume that two resident objects A and B both reference a non-resident object C;
at first, both A and B have the tag bit on their C-reference set to 0. When object A
accesses object C, object C becomes resident, the pointer from A to C is swizzled
(set to C’s current main memory address), and the tag bit on this pointer is set to 1.
The tag bit of reference from B to C is still 0: object B still assumes that object C is
non-resident as long as the reference inside object B has not been dereferenced.
Hence, object B will only notice that C is resident by looking up C in the ROT, after
which B swizzles the C-pointer and sets the tag bit to 1. When an object is stored
back to disk, first of all the pointers inside the object have to be unswizzled (con-
verted back to OIDs). Secondly, references to this object have to be unswizzled,
too, as otherwise these references would be dangling pointers. Inverse references
have to be maintained to unswizzle these references, or a counter for these references has to be maintained and removing the object from main memory can only be done when this counter is 0 (a code sketch of edge marking follows after this list).

[Figure: tag bits of the references in three snapshots – C not accessed; A accesses C; B accesses C using the ROT]

Fig. 9.9. Edge Marking (grey: resident, white: non-resident)

[Figure: fault blocks in three snapshots – A loaded; B loaded; C loaded]
Fig. 9.10. Node Marking (grey: resident, white: non-resident)

Node Marking: For node marking, artificial fault blocks are created in the main
memory that serve as placeholders for non-resident objects (see Figure 9.10).
When an object is loaded, all references are immediately replaced by memory
pointers: References to resident objects are replaced by a direct memory pointer
to the resident object (using the ROT); references to non-resident objects are re-
placed by memory pointer to a fault block. The fault block hence stands initially
for a non-resident object. If another resident object references the same non-
resident object, this reference is swizzled into a memory pointer to the same fault
block; that is, a fault block may be referenced by many resident objects. When
the non-resident object is loaded into main memory, the fault block caches its
memory pointer: later traversals towards the object retrieve its memory pointer
from the fault block and the ROT need not be used. Unnecessary fault blocks may
be removed by a garbage collector, so that pointers to the fault block are replaced
by direct links to the now resident object. When an object is stored back to disk,
however, these direct pointers must then again be replaced by pointers to a fault
block to avoid dangling references. While node marking reduces the need to ac-
cess the ROT, it has the disadvantage of increased storage consumption and time for creation of fault blocks. The indirection introduced by fault blocks also takes extra
time every time a resident object is accessed by retrieving the object’s address
from the fault block.
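
The following hedged sketch of edge marking (all names hypothetical) models a reference attribute that carries a tag bit and holds either an OID or, after the first dereference, a direct memory reference:

import java.util.function.LongFunction;

class SwizzlableRef {
    private boolean swizzled = false;   // the tag bit: false = OID, true = memory pointer
    private final long oid;
    private Object target;              // only valid while the tag bit is set

    SwizzlableRef(long oid) { this.oid = oid; }

    // the resolver stands for the object manager's ROT lookup (loading the object on a miss)
    Object dereference(LongFunction<Object> resolver) {
        if (!swizzled) {
            target = resolver.apply(oid);   // object fault handled by the object manager
            swizzled = true;                // swizzle: the reference now holds a pointer
        }
        return target;
    }

    void unswizzle() {                      // before the referencing object is written back to disk
        target = null;
        swizzled = false;
    }
}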

9.6 Implementations and Systems

We briefly survey some implementations for object persistence: DataNucleus as one reference implementation for JDO and JPA; as well as ZooDB as an academic project
developing an open source object database. In addition, OrientDB is described as a
multi-model database including an object-oriented storage engine in Section 15.4.3.

9.6.1 DataNucleus

An implementation of JDO as well as JPA is DataNucleus, which supports a variety of query languages as well as a variety of databases – not only RDBMSs but also graph
databases or JSON documents.

Web resources:
– DataNucleus: http://www.datanucleus.org/
– documentation page: http://www.datanucleus.org/documentation/
– GitHub repository: https://github.com/datanucleus

Plugins of DataNucleus for widely used data stores are the following:
HBase plugin: In general, each field of an object is mapped to a column in an HBase
table. A name for the column family and a name for the qualifier (column name) can
be set by an appropriate annotation:

@Column(name="{familyName}:{qualifierName}")

@Extension annotations can be used to modify some settings specific to HBase like
BloomFilter configuration, maximum number of stored versions, and compression.
For example, the bloom filter can be configured to consider the row key as follows:

@Extension
(key = "hbase.columnFamily.meta.bloomFilter", value = "ROWKEY")

MongoDB plugin: An object is mapped to a document, and a field of the object is mapped to a field in the document. References between objects are implemented by
storing the IDs of referenced objects in the referencing (owning) object; such references
can also be bidirectional. In some cases instead of ID-based referencing, embedding
of a referenced object (annotated with @Embedded) can be a viable alternative; em-
bedding can either be nested or flat. In a nested embedding, the referenced object is
stored as a JSON object (sub-document) in a field of the owning object – and it can
itself be nested. A flat embedding maps each field of the referenced object to a field in
the owning object.
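
As a hedged sketch with JDO annotations (the class and field names are assumptions), a nested embedding could be declared as follows, so that the Address ends up as a sub-document of the Person document:

import javax.jdo.annotations.Embedded;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;

@PersistenceCapable
class Person {
    private String name;

    @Persistent
    @Embedded                  // stored as a nested JSON object inside the Person document
    private Address address;
}

@PersistenceCapable(embeddedOnly = "true")   // Address instances exist only embedded
class Address {
    private String street;
    private String city;
}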
Neo4J plugin: Each object will be mapped to a node in the Neo4J graph; references be-
tween objects are mapped to edges in the Neo4J graph between the appropriate nodes.
Several other datastores and output formats (including REST commands) are sup-
ported and new plugins can be added by extending the AbstractStoreManager class.

9.6.2 ZooDB

ZooDB is a Java-based object database coming from an academic background which is currently under development.

Web resources:
– ZooDB: http://www.zoodb.org/
– GitHub repository: https://github.com/tzaeschke/zoodb

It supports parts of JDO and relies on a persistence manager and the ZooJdoHelper
class to interact with the database. A persistence manager can be obtained for each
database:

PersistenceManager pm = ZooJdoHelper.openOrCreateDB("dbfile");

Each object to be stored in ZooDB has to extend the ZooPC class. The ZooPC class
manages the persistence states of objects during their lifetime (as described in Sec-
tion 9.3.2); the ZooPC class executes the transitions between the different states – for example, from detached clean to detached dirty. Before writing or reading fields of an object, they have to be activated for the write or read operation to enable this state management; it is hence essential to call the appropriate activate method of the
ZooPC class before the actual write or read operation (for example, in the getter and
setter methods). Each object needs an empty constructor (to be able to use the Java re-
flection mechanism) but can also have other constructors with parameters. A simple
Person class hence looks like this:

public class Person extends ZooPC {

    private String firstname;
    private String lastname;
    private int age;

    private Person() { } // empty constructor necessary

    public Person(String firstname, String lastname, int age) {
        this.firstname = firstname;
        this.lastname = lastname;
        this.age = age;
    }

    public void setFirstname(String name) {
        zooActivateWrite();
        this.firstname = name;
    }

    public String getFirstname() {
        zooActivateRead();
        return this.firstname;
    }

    public void setLastname(String name) {
        zooActivateWrite();
        this.lastname = name;
    }

    public String getLastname() {
        zooActivateRead();
        return this.lastname;
    }

    public void setAge(int age) {
        zooActivateWrite();
        this.age = age;
    }

    public int getAge() {
        zooActivateRead();
        return this.age;
    }
}

In the main class, new Person objects can then be created and stored to the database
file:

Person alice = new Person("Alice","Smith",34);
pm.makePersistent(alice);

From the database, the entire extent can be retrieved:



Extent<Person> extent = pm.getExtent(Person.class);
for (Person p: extent) {
    System.out.println(p.getFirstname()+" "+p.getLastname());
}
extent.closeAll();

A query object can be created and executed as follows:

Query query = pm.newQuery(Person.class, "age == 34");
Collection<Person> persons = (Collection<Person>) query.execute();
for (Person p: persons) {
    System.out.println(p.getFirstname()+" "+p.getLastname());
}
query.closeAll();

9.7 Bibliographic Notes

The database books by Ricardo [Ric11] and Connolly and Begg [CB09] both con-
tain chapters on the object-oriented and the object-relational paradigm and provide
examples. The textbook by Dietrich and Urban [DU11] contains an in-depth cover-
age of object-oriented and object-relational databases and also contains case stud-
ies with db4o. Object-relational mapping technology was analyzed in the articles
[O’N08, LG07]. [MK13] discuss the generic data access object and introduce the DAO
dispatcher pattern. A Java implementation of the Data Access Object can be found at
the Perfect JPattern repository.
As for storage management, the basic ideas of node marking and edge marking
were established by Hosking, Moss and Bliss [HMB90]. Moss [Mos92] also analyzed
whether swizzling results in performance improvements. Normalization for objects
has been recently studied in [MT13]; previous approaches (like [TSS97, ME98]) focused
on the definition of object dependencies.
Development of ODBMSs started together with the emergence of the object-
oriented programming paradigms. [KZK11] compared several object databases re-
garding some of their decisive features like supported query languages. There are
some mature commercial systems on the market; the open source field is however a
bit reduced in particular since db4o has been discontinued.
Part III: Distributed Data Management
10 Distributed Database Systems
For several decades, centralized database management systems – running on a single
database server – have been predominant. There are several reasons for this:
– complexity of a single-server system was lower and administration easier;
– the prevalent use case was to evaluate short queries (frequent reads) on a coherent
set of data, whereas data modifications happened only rarely (infrequent writes);
– network communication speed was slow and hence sending data between differ-
ent servers was too costly;
– parallelization required rewriting a query into subqueries and recombining the
results and this overhead diminished the positive effects of a parallel execution
of subqueries.

Whenever there were more demanding requirements (like increased data volume or
more frequent writes), the obvious reaction was to equip the single database server
with higher capacity in terms of processor speed, memory size or disk space. This way
of improving a single server is often termed scaling up or vertical scaling. However,
the amount of data and queries a single database server can handle is limited and a
single server is always a single point of failure for which a crash might turn out to be ex-
tremely costly. Hence, scaling out or horizontal scaling (connecting several cheaper
servers in a network) is now seen as a viable – and in some cases the only – option
to improve the throughput and latency of a database system at the cost of coordina-
tion and synchronization of the database servers. This disadvantage however pays off
for large scale systems or global enterprises with several data centers – in particular,
due to increased network communication speed. In this chapter we survey the prin-
ciples of distributed database systems with a focus on failure models and epidemic
protocols.

10.1 Scaling horizontally

The ability of a database system to flexibly scale out by distributing data in a server net-
work is called horizontal scalability. Moreover, these servers can work independently:
the individual servers have their own processors, memory and disk systems and only
communicate with other servers by a network connection. This architecture is sup-
posedly cheaper than one powerful centralized machine. Historically, it is sometimes
called a shared-nothing architecture (in contrast to systems where servers share
components – like a shared-disk storage or shared-memory). The most common use
case today is a distributed database on a shared-nothing architecture; in other words,
a distributed database management system (DDBMS) that runs on a network of inde-
pendent servers. Due to this independence, the servers need not be large, expensive
ones but instead may consist of cheaper commodity hardware so that each server
can easily be replaced by a new one. In a distributed database, data is spread across
several database servers.

A distributed database is a collection of data records that are physically distributed on several servers
in a network while logically they belong together.

A distributed DBMS can become beneficial not only when handling large volumes of
data, but also when aiming for improved availability and reliability in smaller scaled
systems. Important features of a DDBMS (as identified in [DHJ+ 07]) are the following:
Load balancing: User queries and other processes should be assigned to the
servers in the network such that all servers have approximately the same load
(that is, the same amount of processing tasks); an imbalanced load – and in par-
ticular, hotspots consisting of those servers that usually execute a lot more tasks
than other servers – can lead to a lower performance of the system because the
DDBMS does not make use of all the resources available.
Flexible scalability: Servers may flexibly leave and join the network at any time
so that the DDBMS can be reconfigured according to the current storage or per-
formance demands. The term membership churn is used to describe the leaving
and joining of database servers in the network.
Heterogeneous nodes: The DDBMS may run on a network of servers where some
servers might have more capabilities than others. With support for such heterogeneous nodes, the DDBMS can for example be migrated stepwise onto more
performant nodes without the need to upgrade all nodes at once.
Symmetric configuration: Every node is configured identically to the others;
hence, each node has the ability to replace a failed node. In particular, user
queries can be handled by any server in the system.
Decentralized control: Peer-to-peer algorithms for data management improve
failure tolerance of a DDBMS because they avoid the case of a distinguished node
which would be the single point of failure for the system.

10.2 Distribution Transparency

To a user, the distributed DBMS should appear as if he was interacting with a single
centralized server. In particular, the user must be allowed to send his query to only
one node of the system and the distributed DBMS adapts and redirects his query to
one or more data nodes. Hence, for a user it must basically be transparent how the
DBMS internally handles data storage and query processing in a distributed manner.
This is the notion of distribution transparency which has several more aspects that are
relevant for database systems:

Access transparency: The distributed database system provides a uniform query and management interface to users independent of the structure of the network
or the storage organization.
Location transparency: The distribution of data in the database system (and
hence the exact location of each data item) is hidden from the user. The user can
query data without having to specify which data item is to be retrieved from which
database server in the network.
Replication transparency: If several copies of a data item are stored on different
servers (for recovery and availability reasons), the user should not be aware of
this and should not have to care about which copies he is accessing. Replication
implies that the problem of data consistency has to be handled: the distributed
database system should ensure that the different copies are updated regularly so
that users can access any copy and still retrieve correct data.
Fragmentation transparency: If a large data set has to be split into several data
items (usually called fragments, partitions or shards), the distributed database
system does this splitting internally and the user can query the database as if it
contained the entire unfragmented data set. In particular, to answer a user query,
subqueries are redirected to different servers and the data items relevant to the
query are recombined by the database system.
Migration transparency: If some data items have to be moved from one server to
another, this should not affect how a user accesses the data.
Concurrency transparency: When multiple users access the database system,
their operations must not interfere or lead to incorrect data in the database sys-
tem. Concurrency is much more difficult to manage for a distributed system than
for a centralized one. A major problem is how to resolve conflicts due to the dis-
tributed nature of the system: A conflict occurs, for example, when two users con-
currently try to each write to a different copy of a replicated data item on two dif-
ferent servers.
Failure transparency: As a distributed database system is more complex than
a centralized one, many more failure cases can arise. The distributed database
system should hence do its best to continue processing user requests even in the
presence of failures.

10.3 Failures in Distributed Systems

In a distributed system with several independent components connected by a network, parts of the network may fail. From a technical perspective, failures can for ex-
ample be the following:

Server failure: A database server may fail to process messages it receives, for example due to a faulty network component; or the server fully crashes and has to be restarted. Servers may also be delayed in processing messages (due to overload) or may send incorrect messages due to errors while processing data.
Message failures: When messages are transmitted over communication links in
the network, messages may be delayed or lost during times of high congestion of
the network. Even if a message is eventually transmitted, a receiving component
may fail to handle a delayed message due to a timeout. At times messages may
also be duplicated due to faulty components.

Link failure: A communication link between two servers may be unable to transmit messages – or it might corrupt or duplicate messages. Hence, a link failure can cause a message failure.
Network partition: A network is partitioned when it is split into two or more subnetworks that are unable to communicate because all communication links between them are broken. As a special case, a network is also called partitioned when one of the subnetworks consists of just one single server.

In a more abstract setting, failures of nodes (that is, servers) in a network can be cat-
egorized as follows:
Crash failures: A crash failure is a permanent failure of a server and corresponds
to aborting a communication protocol. That is, once the server crashed, it will
never resume operation so that communication with it is not possible any longer.
Omission failures: An omission failure corresponds to not taking (in other words
omitting) some action; for example, an omission failure occurs when some server
fails to send a message it should be sending according to some communication
protocol.
Commission failures: A commission failure corresponds to taking an action that
is not correct according to (and hence is a deviation from) a communication pro-
tocol. A server might for example send an incorrect or unnecessary message.

Crash failures are a special case of omission failures (because a crashed server fails
without recovering but until the crash the server acts according to protocol). The union
of omission and commission failures is called Byzantine failures (based on the arti-
cle [LSP82]) – hence components of the distributed system may fail in arbitrary ways.
In contrast, the term non-Byzantine failures usually refers to omission failures (like
crash failures and message loss) but in addition explicitly also covers duplication and
reordering of messages.
Distributed DBMSs have to provide a high level of fault tolerance: even in the pres-
ence of failures, the unaffected servers should continue to process user queries. A dis-
tributed system may in general be devised based on a certain failure model which
describes the set of failures that the system can tolerate. Two common failure models
are:
Fail-stop model: All server failures are crash failures that permanently render
the server unavailable and hence remove it from the set of available servers.
Fail-recover model: A server may halt but it may later resume execution – for
example after a restart. There are two particular cases for resuming execution:
the server may resume execution in the state before it was halted (it can hence
remember its internal state and all the messages it has processed previously) or
it may start from scratch (and hence it forgets any previous state it was in or any
messages it has processed).

10.4 Epidemic Protocols and Gossip Communication

Due to the many properties (like failure tolerance and scalability) that a distributed
database system should have, the propagation of information (like membership lists
or data updates) in the network of database servers is difficult to manage. In the sim-
plest scenario, whenever new information is received by one server, the server sends
a notification to all the other servers it knows. However, the initiating server might not be aware of all servers currently in the network and some of its messages might
be lost due to network failures. Moreover, the network might quickly change due to
insertions or removals of servers.
As the more flexible alternative, the database servers can be seen as participants
in a peer-to-peer network where there is no central coordinator. These peers can coor-
dinate themselves by communicating pairwise. From a database perspective, epidemic
protocols are a category of peer-to-peer algorithms, where information (for example
data updates) is spread like an infection all over the server network. Another applica-
tion of epidemic algorithms is membership of peers in the network: each server has to
maintain a list of names of those servers that are part of the network and hence are
possible communication partners. This membership list can then be kept up-to-date
by an epidemic algorithm: servers exchange their membership lists in a peer-to-peer
fashion so that the information which servers are part of the network slowly spreads
over the entire network.
The notion of epidemic protocols has its roots in an article on updates in dis-
tributed databases ([DGH+ 88]). With an epidemic algorithm, servers in the network
pass on a message like an infection. In the above membership example, a message
could for example be the notification that some new server has joined the network so
that all servers that receive this message can update their membership list accordingly.
In analogy to epidemiology, servers that have received a new message that they want
to pass on to others are called infected nodes; nodes that so far have not received
the new message are called susceptible nodes. Nodes that already have received the
message but are no longer willing to pass it on are called removed nodes.
There are three different communication modes that can be applied in epidemic algo-
rithms:
push-only: An infected server contacts another server and passes on all the new
messages it has received. That is, to spread the infection, the infected server has
to find another server that is susceptible.
pull-only: A susceptible server contacts another server and asks for new mes-
sages. To spread the infection, the susceptible server has to find an infected server.
push-pull: One server contacts another server and both exchange their new mes-
sages. After this exchange, both servers have the same state (for example an iden-
tical membership list). Both servers are susceptible and infected at the same time.
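
A minimal sketch (hypothetical classes, ignoring networking and failures) of a push-pull exchange between two peers:

import java.util.HashSet;
import java.util.Set;

class GossipNode {
    final Set<String> messages = new HashSet<>();

    // push-pull: send what the partner is missing and receive what we are missing
    void pushPullWith(GossipNode partner) {
        Set<String> toPartner = new HashSet<>(messages);
        toPartner.removeAll(partner.messages);   // messages only this node knows
        Set<String> toUs = new HashSet<>(partner.messages);
        toUs.removeAll(messages);                // messages only the partner knows
        partner.messages.addAll(toPartner);      // push
        messages.addAll(toUs);                   // pull
    }
}

After pushPullWith returns, both nodes hold the same message set; dropping the push part or the pull part yields the pull-only or push-only mode, respectively.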

A term that is often used as a synonym for epidemic message exchange across servers
is gossiping; the term expresses that messages spread in a server network like rumors
in human communication. It has its background in the graph-theoretical analysis of
the gossip problem [Ber73].
Two variants of epidemic algorithms for database updates discussed in [DGH+ 88]
are anti-entropy and rumor spreading. They have the following properties:
Anti-entropy: Anti-entropy is a periodic task that is scheduled for a fixed time
span; for example, anti-entropy can be configured to run once every minute. With
anti-entropy, one server chooses another server (from its local membership list) at
random to exchange new messages in one of the communication modes described
above. Anti-entropy is called a simple epidemic because any server is either sus-
ceptible or infective (there are no removed servers) and the infection process does
not degrade over time or due to some probabilistic decision.
Rumor spreading: With rumor spreading, the infection can be triggered by the
arrival of a new message (in which case the server becomes infective); or it can be
run periodically. With rumor spreading, the infection proceeds in several rounds.
In each round, a server chooses a set of communication partners; the number of
communication partners chosen is called the fan-out. Rumor spreading is the case
of a complex epidemic because infection of other servers is a dynamic process:
the amount of infections decreases with every round as the number of removed
servers grows. This decrease of infections can be varied as follows:
probabilistic: After each exchange with another server, the server stops be-
ing infective with a certain probability.
counter-based: After a certain number k of exchanges, the server stops being
infective. There are two extreme cases: in the infect-and-die case, the num-
ber k is equal to the fan-out – that is, the server runs one round of infection
and then stops; in the infect-forever case, the number k is infinite and the
server never stops.
blind: The server becomes removed without taking the feedback of communication partners into account. In particular, in the probabilistic case, it
decides to become removed with a certain probability after each exchange;
in the counter-based case, the server becomes removed after a number k of
exchanges.
feedback-based: The server becomes removed if it notices that the communi-
cation partners already have received the new message. In particular, in the
probabilistic case, whenever the infective server notices that the communication partner already knows the message it wants to spread, it stops being infective with a certain probability; in the counter-based case, it stops being infective after k exchanges with a server that already knows the message.
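
The following hypothetical sketch illustrates one round of the blind, probabilistic variant of rumor spreading with a configurable fan-out:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class RumorNode {
    String rumor;               // null while the node is still susceptible
    boolean infective = false;

    void spreadRound(List<RumorNode> peers, int fanOut, double stopProbability, Random rnd) {
        if (!infective) return;
        List<RumorNode> partners = new ArrayList<>(peers);
        Collections.shuffle(partners, rnd);                    // pick fan-out random partners
        for (RumorNode peer : partners.subList(0, Math.min(fanOut, partners.size()))) {
            if (peer.rumor == null) {                          // susceptible peer gets infected
                peer.rumor = rumor;
                peer.infective = true;
            }
        }
        if (rnd.nextDouble() < stopProbability) {
            infective = false;                                 // blind: the node becomes removed
        }
    }
}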

One problem that can occur with epidemic algorithms is the case of isolated subnets:
message exchanges only take place inside subnets but there are no exchanges be-
tween the subnets so that the sets of messages between the subnets always differ. This
is the case of a logical partition where the communication links are working but nev-
ertheless the subnets do not communicate (in contrast to a physical network partition
where some communication links might be broken). This is in particular a problem
when an epidemic algorithm is used for membership lists: then the servers only con-
sider the other servers in their subnet as being members of the network – without
ever becoming aware of the other subnets. A solution around this is to use a set of
seed servers: a set of servers with which every server joining the network starts the
message exchange.

10.4.1 Hash Trees

A major issue with epidemic protocols is how two servers can identify those messages
in which they differ. For a large amount of messages, a complete comparison of all
messages is not feasible as this would slow down the epidemic process tremendously:
the entire message list of one server has to be sent to the other server and the server has
to go through the two message lists sequentially to find missing messages. A simple
improvement is to use a list of hash values: comparison of hash values (which are

[Figure: a hash tree over four messages – leaf hashes Hash1 to Hash4 of Message 1 to Message 4, inner hashes Hash1,2 and Hash3,4, and the top hash Hash1,2,3,4]

Fig. 10.1. A hash tree for four messages

shorter than the entire messages) is faster; but on the downside, the hash values have
to be computed and still the list of hash values has to be compared sequentially.
Hence, a much more efficient way of comparison has to be found. This is possi-
ble with a hash tree (or Merkle tree [Mer87]): a hash tree starts with a hash of each
message in a leaf node, and then iteratively concatenates hashes, hashes the con-
catenations again and combines the hash values into a tree structure (see Figure 10.1).
For the inner nodes, the closer a hash value is to the root, the more leaves (and hence
messages) it covers. The last hash value at the root of the tree is called the top hash.
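
A minimal sketch (independent of any particular store) of computing the top hash of a sorted message list with SHA-256:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MerkleTree {

    static byte[] topHash(List<String> sortedMessages) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        List<byte[]> level = new ArrayList<>();
        for (String m : sortedMessages) {                      // leaf level: hash every message
            level.add(sha.digest(m.getBytes(StandardCharsets.UTF_8)));
        }
        while (level.size() > 1) {                             // build the inner nodes bottom-up
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                sha.update(level.get(i));
                if (i + 1 < level.size()) {
                    sha.update(level.get(i + 1));              // concatenate sibling hashes
                }
                next.add(sha.digest());
            }
            level = next;
        }
        return level.get(0);                                   // the top hash
    }

    public static void main(String[] args) throws Exception {
        byte[] h1 = topHash(List.of("Message 1", "Message 2", "Message 3", "Message 4"));
        byte[] h2 = topHash(List.of("Message 1", "Message 2", "Message 3", "Message 4"));
        System.out.println(Arrays.equals(h1, h2));             // true: identical message lists
    }
}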
Now, with a hash tree, message list comparison is improved: first of all, when the
two top hashes are identical, the two message lists are identical, too – based on the
assumption that no collisions occur that result in identical hash values for different
inputs. Hence, the case that there are no new messages to spread can be determined
by just sending the value of the top hash to the other server that compares the sent one
with its own top hash. However, if the top hashes differ, we go to the next level of the
tree and compare the hash values there. Whenever we encounter an inner node that
has identical hash values in the two hash trees under comparison, then we know that
the messages below this inner node are identical; no further comparisons are neces-
sary for the subtree starting at this node. On the other hand, as long as hash values
differ for an inner node, we have to go one level deeper and compare the hash values
of the child nodes. When we reach a leaf node with different hash values, we have
identified a message on which the two message lists differ. An important precondition
for this to work is that both servers use the same sorting order for the messages. If the
sorting order differs in the to-be-compared trees, hashes are combined in a different
manner and the comparison of the hash values reveals many differences although the
message lists are identical. That is, in order to avoid unnecessary hash comparisons,
we have to ensure identical root hashes for identical message lists. This can for example
be done as follows:
– Let the two servers agree on a sorting order, sort all messages according to this
order and then compute the Merkle tree just before the comparison. This process
obviously introduces a delay due to sorting and the on-the-fly computation of the
hash trees.
– Another option is to make each server precompute Merkle trees for any possible
sorting order of the messages. For a comparison of their trees, the servers then just
have to find those two trees with the same sorting order. This option only makes
sense for small message lists because computing all possible Merkle trees as well
as updating them whenever a new message arrives costs time; and storing all trees
requires a lot of storage space.
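
The tree construction and the top-down comparison can be summarized in a short sketch (Python and SHA-256 are arbitrary choices made only for illustration; the sketch assumes that both servers hold the same number of messages in the same sorting order):

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(messages):
    """Build the hash tree as a list of levels: leaf hashes first, root level last."""
    level = [h(m.encode()) for m in messages]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i] + (level[i + 1] if i + 1 < len(level) else b"")
            nxt.append(h(pair))          # hash of the concatenated child hashes
        levels.append(nxt)
        level = nxt
    return levels

def differing_leaves(a, b):
    """Indices of messages on which two equally long, equally sorted lists differ."""
    top = len(a) - 1
    if a[top][0] == b[top][0]:
        return []                        # identical top hashes: nothing to exchange
    candidates = [0]                     # node indices on the current level
    for depth in range(top, 0, -1):
        children = []
        for idx in candidates:
            if a[depth][idx] != b[depth][idx]:        # descend only into differing subtrees
                for child in (2 * idx, 2 * idx + 1):
                    if child < len(a[depth - 1]):
                        children.append(child)
        candidates = children
    return [i for i in candidates if a[0][i] != b[0][i]]

tree1 = build_merkle(["m1", "m2", "m3", "m4"])
tree2 = build_merkle(["m1", "mX", "m3", "m4"])
print(differing_leaves(tree1, tree2))    # -> [1]

Only subtrees whose hashes differ are descended into, so the number of hash comparisons grows with the number of differing messages rather than with the size of the entire message list.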

Merkle trees are also used for record comparison in extensible record stores or key-
value stores. Some of these stores execute a major compaction (see Section 8.2.4) be-
fore computing the Merkle tree; in this way, unnecessary records (that are masked by
delete markers) are removed and the records are sorted based on the order configured
for the keys and based on the timestamps.

10.4.2 Death Certificates

There is a drawback with the decentralized control of peer-to-peer networks when it
comes to withdrawing information. With peer-to-peer algorithms in general – and epi-
demic protocols in particular – it becomes quite complicated to delete messages. The
problem is the following: if a message is deleted locally (in one message list), then
further message exchanges with the peer servers might reintroduce the deleted mes-
sage. As an example consider membership lists: when a server leaves the network, it
does not suffice to delete the server’s entry from some membership lists; for the dele-
tion to become effective, the server’s entry has to be deleted from all membership lists
at the same time. However due to the peer-to-peer nature of the server network this
will be impossible to achieve. That is why explicit delete messages (so-called death
certificates) have to represent the withdrawal of information. Whenever a death cer-
tificate is received by a server, it will delete the corresponding original message (for
example, it will remove the server for which the death certificate was received from
the local membership list) but it will keep the death certificate. In this way the dele-
tion of messages can spread with the usual epidemic behavior: whenever a message
exchange between peers takes place all death certificates are exchanged, too, such
that messages for which a death certificate exists are immediately deleted and will
not be exchanged. This in effect leads to a local deletion of messages while avoiding
any reintroduction of deleted messages.
Death certificates are effective for the deletion of messages, but they have another
disadvantage: because death certificates themselves cannot be deleted (in order to
avoid reintroductions of messages), over time they can pile up and occupy all the stor-
age space of the servers. In practical settings the following options can help alleviate
this problem:
Time-to-live values: One option to avoid this overabundance of death certificates
is to attach a time-to-live value to each certificate. Whenever this time span has
passed, a server can delete the death certificate. It could however still happen that
some servers have not received a death certificate during its time-to-live; these
servers can then cause a reintroduction of the to-be-deleted message. The time-
to-live value must hence be high enough to keep the probability of reintroduction
tolerably small. Yet in practice, if many deletions take place, even with a time-to-
live value there may be too many death certificates around that lead to a perfor-
mance degradation of the epidemic behavior.
Dormant certificates: Another option is to permanently keep death certificates
only at some few servers; these are then so-called dormant certificates [DGH+ 88].
All other servers can delete the certificates after some time. Only in case of a rein-
troduction the corresponding dormant death certificate is reactivated and spread
again to the other servers.
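
A simplified model of death certificates with time-to-live values might look as follows (a Python sketch for illustration only; real systems additionally have to handle dormant certificates and clock issues):

import time

class GossipStore:
    def __init__(self, ttl_seconds=3600.0):
        self.messages = {}           # message id -> payload
        self.certificates = {}       # message id -> issue time of the death certificate
        self.ttl = ttl_seconds

    def delete(self, msg_id):
        # withdrawing information: drop the message but keep a death certificate
        self.messages.pop(msg_id, None)
        self.certificates[msg_id] = time.time()

    def purge_expired_certificates(self):
        now = time.time()
        self.certificates = {m: t for m, t in self.certificates.items()
                             if now - t < self.ttl}

    def merge_from(self, other):
        # death certificates are exchanged first ...
        for msg_id, issued in other.certificates.items():
            self.certificates[msg_id] = max(issued, self.certificates.get(msg_id, 0.0))
            self.messages.pop(msg_id, None)
        # ... then messages without a certificate are taken over
        for msg_id, payload in other.messages.items():
            if msg_id not in self.certificates:
                self.messages.setdefault(msg_id, payload)

a, b = GossipStore(), GossipStore()
a.messages["server7"] = "member"
b.merge_from(a)                  # b learns about server7
a.delete("server7")              # server7 leaves the network
b.merge_from(a)                  # the certificate deletes server7 at b ...
a.merge_from(b)                  # ... and a is not re-infected with the stale entry
print("server7" in a.messages, "server7" in b.messages)   # False False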

10.5 Bibliographic Notes

The standard textbook by Tanenbaum and van Steen [TvS06] gives a general overview
of distributed systems. The textbook by Özsu and Valduriez [ÖV11] provides an in-
depth treatment of distributed database systems with a focus on approaches for the
relational data model. A textbook focusing on failure tolerance and consensus is the
one by Attiya and Welch [AW04].
Gossiping and epidemic algorithms have raised scientific interest for quite some
time. Starting with a purely mathematical treatment (the gossip problem [Ber73]), one
of the first practical approaches was presented in [DGH+ 88]. Several variants have
been studied since then – for example in [LLSG92, LM99, JVG+ 07, KvS07, BGFvS09].
Hash trees have been applied by Merkle [Mer87] for authentication of digital signa-
tures.
11 Data Fragmentation
In a distributed database system, two major questions are (1) how the entire set of
data items in the database can be split into subsets and (2) how the subsets can be
distributed among the database servers in the network. Question (1) addresses the
problem of data fragmentation (also called sharding or partitioning); Question (2)
addresses the problem of data allocation.
We first survey properties and types of fragmentations from a theoretical point of
view and then discuss fragmentation approaches for different data types.

11.1 Properties and Types of Fragmentation

What good fragmentations and good allocations are for a given database highly de-
pends on the runtime characteristics of the system. In this sense, the query behavior
of users plays an important role for the quality of fragmentation and allocation: for
example, properties to be considered for the fragmentation and later on allocation of
the fragments could be:
– the type of accesses (mostly reads or mostly writes),
– the access patterns (which data records are accessed regularly and which only
rarely),
– the affinity of records (which data records are accessed in conjunction with which
other data records),
– the frequency of accesses and
– the duration of accesses.

Once a good fragmentation and allocation have been established, a distributed
database can take advantage of
– data locality (ideally, data records in the same fragment are often accessed to-
gether),
– minimization of communication costs (ideally, no data records have to be moved
to another server in order to answer a query),
– improved efficiency of data management (ideally, queries on smaller fragments
can be executed faster than on the large data set and index structures can be
smaller and hence data records can be found faster),
– and load balancing (ideally, all servers get assigned the optimal amount of data
and only have to process user queries according to the capacities of each server
and without the danger of hotspots – that is, without overloading a few servers
with the majority of user queries while the other servers are mostly idle).

However, a given data distribution is in general not optimal for every kind of query
that a user may come up with. Hence, on average a fragmented database system has
to live with several disadvantages. In particular, queries that involve subqueries over
different fragments are costly because the global query has to be split into subqueries
(a process called query decomposition), the affected fragments have to be identified (a
process called data localization) and data records have to be sent over the network to
compute the final result. Moreover, distributed transactions are extremely difficult to
manage: in case of concurrent accesses, system-wide consistency can be maintained
only with very complex processes; the same applies to distributed recovery when one
or more servers crashed during such a transaction.
Several criteria are important when designing a distributed database with frag-
mented data sets and when developing algorithms for the maintenance of data frag-
mentation. The first and foremost demand is that a given data fragmentation must be
correct in the sense that the original data set can be entirely reconstructed from its
fragmented representation. When the data set is recombined from the fragments, the
following two correctness properties are required:
Completeness: none of the original data records is lost during fragmentation and
hence no data is missing in the reconstructed data set; in other words, complete-
ness means that each bit of data from the original data set can be found in at least
one of the fragments.
Soundness: no additional data are introduced in the reconstructed data set; in
other words, only data that belong to the original data set can be reconstructed.

More formally, for a data set D, let D_1, . . . , D_n be its fragments. Let ⊕ be an operator
that combines the fragments (for example union or join). Correctness of the fragmen-
tation then ensures that D = D_1 ⊕ · · · ⊕ D_n.
Additionally, non-redundancy can be fulfilled as an optimality criterion: any frag-
ment should be minimal in the sense that it does not overlap with any other fragment.
More formally, for any two fragments D_i and D_j (where i ≠ j), it ideally holds that
D_i ∩ D_j = ∅. However, as we will see shortly, some redundant information might be nec-
essary to allow for reconstruction (for example, by using tuple IDs or shadow nodes).
In order to be practical for large data sets that are frequently modified, many more
requirements are necessary for an efficient long-term management of the distributed
data sets (some of these properties are also summarized in [AN10]):
Unit of distribution: the individual data records that the fragments of the data set
are composed of should neither be too coarse nor too fine-grained. If data records
are too coarse (like entire relational database tables), the fragmentation becomes
inflexible. If data records are too fine-grained (like individual attributes inside an
object or inside a vertex) logically coherent entities are split and their data is sepa-
rated into different fragments: this would soon make the fragmentation (and later
on the query processing) inefficient.
Fragment sizes: The sizes of the obtained fragments should be configurable in
the fragmentation algorithm. This facilitates load balancing in the distributed
database system. In particular, equally sized fragments are advantageous for an
efficient load balancing in a network of homogeneous servers.
Workload awareness: Some fragmentation approaches consider only the data
records stored in the database system, while others estimate what a typical se-
quence of read and write requests (a so-called workload) in the distributed system
will be; and then they optimize the fragmentation with respect to this workload.
One aim of a workload-aware fragmentation will be to optimize efficiency of dis-
tributed query processing. This means on the one hand, that queries should be
answered within a single fragment and without crossing fragment boundaries.
This is particularly important when the recombination of the fragments is costly
– like a distributed join or navigational access in an XML tree or a graph; it is less
a problem where subquery results are recombined by just taking their union – like
in key-based access in key-value stores or extensible record stores. On the other
hand, workload-aware fragmentation should avoid hotspots where most of the
queries will affect only a small set of the fragments; instead, queries should be
distributed well among all fragments. Workload-aware fragmentation can hence
lead to a balanced distribution; but it is more complex than data-dependent frag-
mentation, such that it might be inappropriate for a distributed database system
with real-time constraints.
Local view: The local view property of a fragmentation algorithm means that frag-
ments can be computed by just looking at a small part of the data and processing
the data set step by step. That is, there is no need to have information about the
entire data set (or even load the entire data set into memory); fragmentation algo-
rithms that need a global view of the entire data set are usually impractical for a
distributed database system.
Dynamic adaptation: An existing fragmentation can be adapted upon modifi-
cations in the data set. An algorithm can be run on the existing fragmentation
(possibly in several iterations) to improve it and adapt it to the new situation.
Distributed computation: The process of fragmentation can be distributed
among different servers. In particular, for an existing fragmentation, each frag-
ment can be modified by the server it is assigned to.

Apart from these desired properties, fragmentations can be achieved in different ways.
Types of fragmentation algorithms include:
Handmade: A handmade fragmentation requires a database administrator to
identify fragments in a data set and set up the database system accordingly. Frag-
mentation by hand is only feasible on a very high abstraction level. For example,
placing the employees of a company into a fragment for their department works
on the abstraction level of the department. A more fine-grained
abstraction level quickly becomes infeasible to maintain by hand. For example,
configuring fragments for working groups of employees where often new working
groups are created or employees switch their groups will be difficult to set up and
keep track of.
Random: In a randomized fragmentation, every data record has the same proba-
bility to be assigned to a fragment. This is a simple procedure that satisfies many
of the above properties like local view and supporting dynamic additions of data
records. Keeping partitions equally-sized requires some more complex probability
distribution if data records can be deleted and thus some fragments shrink quicker
than others. On the downside, random fragmentation is not workload-aware with
the effect that the randomized process may cause excessive communication load
between fragments or may create hotspots.
Structure-based: A structure-based fragmentation looks at the data schema def-
inition (or the data model in general) and identifies substructures that constitute
the fragments.
Value-based: A value-based fragmentation looks at the values contained in the
data items to define the fragments. For example, by specifying some selection
predicates the values contained in some fragments can be constrained. Several
such selection predicates can be combined into so-called minterms [ÖV11]: the
minterms correspond to disjunctions of selections so that each minterm defines
one fragment that is disjoint from the other fragments. The structure (and hence
if available the schema of the data) is not altered.
Range-based: As a special form of value-based fragmentation, range-based frag-
mentation splits the primary key (or another attribute that can be sorted) into dis-
joint but consecutive intervals. Each interval defines a fragment.
Hash-based: If each data record has a key (for example, the key of a key-value
pair) the hash of this key can determine the database server that the data record
is assigned to. In this setting, each database server in the distributed database
system is responsible for a subrange of all hash values (a small sketch contrasting range-based and hash-based assignment of keys follows after this list).
Cost-based: Cost-based fragmentation relies on a cost function for finding a good
fragmentation. The aim is then to find a fragmentation with minimal overall cost.
For example, for a graph data model a typical cost function is the number of cross-
fragment edges where one node of the edge lies in a different fragment than the
other node; the number of such inter-fragment edges – the so-called edge cut –
should then be minimal to reduce communication costs when traversing edges.
For a more realistic cost estimation, workload awareness can then be added to the
cost-based fragmentation process: for a given sample workload the aim is to find
a fragmentation that minimizes the cost of the given workload. For the different
data models different cost functions may be used. For example, in the case of the
graph data model we might only minimize the edge cut for edges that
are traversed in the sample workload whereas other (non-traversed) edges can be
ignored.
Affinity-based: This form of fragmentation is based on a specification of how
affine certain data records are – in other words, how often they are accessed to-
gether in one read request.
Clustering: Specialized algorithms can be used for finding coherent subsets in
the data (so-called clusters). Clustering algorithms often apply heuristics that do
not find a global optimum but find a near-optimum solution quickly.
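
As referenced in the list above, range-based and hash-based fragmentation of keys can be contrasted in a few lines (a Python sketch; the range boundaries and the use of MD5 are arbitrary choices made for illustration):

import bisect
import hashlib

# range-based fragmentation: disjoint, consecutive key intervals
RANGE_BOUNDARIES = ["g", "p"]            # fragment 0: keys < "g", 1: keys < "p", 2: the rest

def range_fragment(key: str) -> int:
    return bisect.bisect_right(RANGE_BOUNDARIES, key)

# hash-based fragmentation: the hash of the key selects the fragment
NUM_FRAGMENTS = 3

def hash_fragment(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % NUM_FRAGMENTS

for key in ("alice", "heidi", "zoe"):
    print(key, "-> range fragment", range_fragment(key), "| hash fragment", hash_fragment(key))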

11.2 Fragmentation Approaches

Depending on the data model chosen, different ways of fragmenting data are possible
and require different recombination methods. This sections surveys some of them.

11.2.1 Fragmentation for Relational Tables

In relational database theory, several alternatives of splitting tables into fragments
have been discussed (see for example [ÖV11]). The two basic approaches are verti-
cal fragmentation and horizontal fragmentation. In addition, derived fragmentation
is based on a given horizontal fragmentation and hybrid fragmentation combines both
vertical and horizontal fragmentation:
– Vertical fragmentation: Subsets of attributes (that is, columns) form the frag-
ments. Hence, vertical fragmentation is a form of structure-based fragmentation.
Rows of the fragments that correspond to each other have to be linked by a tuple
identifier. That is, a vertical fragmentation corresponds to projection operations
on the table such that all fragments have a different database schema (with only
a subset of the attributes). The fragmented tuples can then be recombined by a
join of all fragments over the tuple identifier. To illustrate this consider the orig-
inal data in Table 11.1 that is split into two fragments with an additional tuple

Table 11.1. Vertical fragmentation

Original data:
A    B    C    D
a1   b1   c1   d1
a2   b2   c2   d2
a3   b3   c3   d3

Fragment 1:            Fragment 2:
ID   A    B            ID   C    D
1    a1   b1           1    c1   d1
2    a2   b2           2    c2   d2
3    a3   b3           3    c3   d3

Table 11.2. Horizontal fragmentation

Original data:
A    B    C    D
a1   b1   c1   d1
a2   b2   c2   d2
a3   b3   c3   d3

Fragment 1:            Fragment 2:
A    B    C    D       A    B    C    D
a1   b1   c1   d1      a3   b3   c3   d3
a2   b2   c2   d2

identifier (ID). This is an example of an affinity-based vertical fragmentation where columns A,
B and C, D are assumed to be more affine than other subsets of columns (that is,
they occur in the same transaction very often for a given workload). For different
workloads, affinity may change; for example, if columns B and C are more affine
than columns A and B, then columns B and C might be placed in one fragment
while columns A and D together form the second fragment.
– Horizontal fragmentation: Subsets of tuples (that is, rows) form the fragments. A
horizontal fragmentation can be expressed by a selection condition on the table
with the effect that all fragments have the same database schema (which is identi-
cal to the original database schema). Hence, horizontal fragmentation is a form of
value-based fragmentation. Recombination is achieved by taking the union of the
tuples in the different fragments. In the example in Table 11.2, the first two rows
form the first fragment while the last row constitutes the second fragment.
– Derived fragmentation: A given horizontal fragmentation on a primary table
(the primary fragmentation) induces a horizontal fragmentation of another ta-
ble based on the semijoin with the primary table. In this case, the primary and
derived fragments with matching values for the join attributes can be stored on
the same server; this improves efficiency of a join on the primary and the derived
fragments.
– Hybrid fragmentation: A hybrid fragmentation denotes an arbitrary combination
of horizontal and vertical fragmentation steps. The more fragmentation steps are
combined, the more complex it is to recombine the original data set. The frag-
ments can then differ considerably in the schemas that they have to follow.
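
The recombination operations for the two basic approaches can be sketched as follows (an illustration in Python on the data of Tables 11.1 and 11.2, with relations represented as plain lists of dictionaries; it is not meant to mirror any database-internal implementation):

original = [
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1"},
    {"A": "a2", "B": "b2", "C": "c2", "D": "d2"},
    {"A": "a3", "B": "b3", "C": "c3", "D": "d3"},
]

# vertical fragmentation: projections plus an added tuple identifier
frag1 = [{"ID": i, "A": r["A"], "B": r["B"]} for i, r in enumerate(original, 1)]
frag2 = [{"ID": i, "C": r["C"], "D": r["D"]} for i, r in enumerate(original, 1)]

# reconstruction: join over the tuple identifier, then drop the identifier
by_id = {r["ID"]: r for r in frag2}
vertical_rejoined = [
    {k: v for k, v in {**r, **by_id[r["ID"]]}.items() if k != "ID"} for r in frag1
]
assert vertical_rejoined == original

# horizontal fragmentation: selections that split the set of rows
h_frag1 = [r for r in original if r["A"] in ("a1", "a2")]
h_frag2 = [r for r in original if r["A"] == "a3"]

# reconstruction: union of the fragments
horizontal_rejoined = h_frag1 + h_frag2
assert horizontal_rejoined == original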

11.2.2 XML Fragmentation

A recent comprehensive survey on XML fragmentation is given in [BM14]. For XML
fragmentation the first distinguishing feature is whether a set (a “collection”) of XML
documents must be fragmented into subcollections or a single large XML document
must be fragmented into subdocuments. In the first case (fragmentation of collec-
tions), a large collection (of small XML documents) is divided into subcollections
which can each be stored on different servers.
– A value-based (sometimes called horizontal) fragmentation of a collection is de-
fined by selection operations on the XML documents that identify the subcollec-
tions. For example, in a collection of XML documents describing departments of a
company, the European departments can form one subcollection (for example by
selection on a location element with value ‘Europe’) while the American depart-
ments form another (by selection on a location element with value ‘America’). If
the documents in the original collection obeyed an XML schema definition then
the documents in each of the subcollections also obey the same schema.
– A structure-based (sometimes called vertical) fragmentation of a collection splits
the XML schema definition of the original collection into separate schema defi-
nitions that all specify subtrees of the original documents. Hence the structure-
based fragmentation of the collection leads to a fragmentation of each document
contained in the collection. Continuing the above example, all subtrees of all doc-
uments describing the location of each department will be stored in one collec-
tion, while all subtrees describing the staff will be stored in another collection
and hence both subcollections have their own schema.

A single large XML document can also be fragmented: it is split into subtrees. Frag-
mentation of such a document can be divided into value-based fragmentation and
structure-based fragmentation, too. In a value-based fragmentation, the large docu-
ment is split into subtrees by applying selections to it based on values stored in text
nodes. Similar to the example above, when a document contains information on all de-
partments of a company, value-based fragmentation can extract one subtree for all Eu-
ropean departments (where a location element has the value ‘Europe’) and all Amer-
ican departments (where a location element has the value ‘America’). In a structure-
based fragmentation, structural components like element names are considered in-
stead. Continuing our example, the subtree of the company document describing the
location of each department forms one fragment (with a location element as its root
node), while the subtree describing the staff forms another fragment (with a staff ele-
ment as its root node). In both cases (value-based fragmentation and structure-based
fragmentation) the subtrees have their own schema different from the schema of the
original document. After fragmenting an XML document, the subtrees must be con-
nected by shadow nodes (also called proxy nodes [KÖD10]) to enable reconstruction
of the original XML documents. An example for an XML tree with shadow nodes is
shown in Figure 11.1. The shadow node has the same node identifier (for example, x)
as the original node but contains management information like on which server the
original node can be found. Moreover, the original node becomes the root node of a
subtree on the other server.

[Figure: a) an original XML document containing a node with identifier x; b) the fragmented XML document, in which x is replaced by a shadow node and the subtree rooted at x is stored on another server.]

Fig. 11.1. XML fragmentation with shadow nodes

11.2.3 Graph Partitioning

A graph is a more complex data model (than, for example, a tree-shaped XML docu-
ment or a set of key-value pairs) because data records (the nodes) are highly connected
and even the connections (the edges) carry information. Due to the more complex
graph model, distributing large graphs over multiple servers requires more sophisti-
cated management of the partitioning. In particular, usually more edges are cut when
distributing graph nodes compared to the XML fragmentation approaches and their
tree-shaped data model. Hence a general optimization criterion for graph partition-
ing is to reduce the edge cut: that is, the amount of edges that connect nodes that are
placed on different servers.
Some examples for graph partitioning methods are the following:
– A handmade partitioning would group related nodes into the same partition. For
example, placing nodes describing employees of a company near the company
node.
– A random partitioning would assign each node to a certain partition based on a
probability distribution. In the simplest case, each node has the same probability
to be assigned to a partition.
– A hash-based partitioning would assign a range of hash values to each server;
then it would compute a hash value for each node (for instance on the vertex
ID) and assign the node to the corresponding server. For example, if there are k
servers, for a vertex ID v and a hash function H, we would assign a vertex to server
i whenever H(v) mod k = i (see the sketch following this list).
– A workload-driven partitioning reduces the amount of distributed transactions
accessing data from different servers inside the same transaction; for graphs this
means that the amount of cut edges is reduced inside each transaction.
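
The hash-based assignment H(v) mod k and the resulting edge cut can be sketched as follows (a toy Python example; the three-vertex graph and the use of MD5 are assumptions made for illustration):

import hashlib

def H(vertex_id: str) -> int:
    return int.from_bytes(hashlib.md5(vertex_id.encode()).digest(), "big")

def assign(vertex_id: str, k: int) -> int:
    return H(vertex_id) % k        # vertex v is assigned to server i if H(v) mod k = i

edges = [("v", "w"), ("v", "x"), ("w", "x")]
k = 2
partition = {v: assign(v, k) for edge in edges for v in edge}

# the edge cut: edges whose endpoints are placed on different servers
edge_cut = [edge for edge in edges if partition[edge[0]] != partition[edge[1]]]
print(partition, "edge cut:", len(edge_cut))

Because the hash of a vertex ID ignores the graph structure, such a partitioning typically produces a large edge cut; cost-based or workload-driven partitioning tries to reduce it.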

[Figure: an original graph with vertices v, w, x; Fragment 1 holds v and w together with a shadow node x′ for x, Fragment 2 holds x together with shadow nodes v′ and w′; shadow edges connect the fragments.]

Fig. 11.2. Graph partitioning with shadow nodes and shadow edges

For all partitioning approaches (except the hash-based partitioning), global meta-
information must be maintained that records which vertex (with which ID) is stored
on which server, so that vertices can quickly be found based on their ID.
Moreover, whenever a graph query traverses an edge between two vertices that are
located on two different servers, the graph database must quickly identify the second
server that hosts the corresponding second vertex. In this case, shadow nodes (and
corresponding shadow edges) are maintained as shown in Figure 11.2 that enable fast
traversal between fragments.

11.2.4 Sharding for Key-Based Stores

Data stores with key-based access include not only key-value stores but also docu-
ment databases (where the document ID is the key) and column-family stores (by us-
ing the row key). In other words, the basic way to interact with a key-based data store is
to send a query containing the key, and then the data store identifies the record cor-
responding to the key and returns this record. These key-based stores rarely support
operators that combine different records; this is in contrast to relational databases that
support join operations. Consequently, when fragmenting data in key-based stores, the
fragments require less interconnection and can be
stored independently on different servers. The term used for this kind of fragmenta-
tion in key-based stores is sharding. Sharding is similar to horizontal fragmentation
for relational databases in the sense that it may be used to split a dataset into individ-
ual data records like rows or documents. However, the term sharding also connotes
that the shards can be schemaless (as can be the entire set of records in the data
store). In particular, these data stores do not support general value-based fragmen-
tation (based on a selection operation on values inside the data records); they only
support range-based or hash-based fragmentation for the keys. Moreover, there is no
need to support derived fragmentation due to the lack of a join operation.
Some stores order the records by key. In this case, order-preserving partitioning
of the records may be applied: the order on the keys provides semantic information on
how related the records are; when two keys are close in the order defined on the keys,
then they are also closely related and are often accessed by the same client. However,
order-preserving partitioning may lead to hotspots when one partition receives far
more accesses than the other partitions.

11.2.5 Object Fragmentation

Following the terminology of the relational fragmentation, some approaches [KL00]
define vertical and horizontal fragmentation for object-oriented programs. In this
case, either a single object can be fragmented into subobjects – this corresponds to a
structure-based fragmentation because new class definitions for the subobjects have
to be specified; or the set of objects is fragmented into subsets of objects as a value-
based fragmentation. Alternatively, as third and most viable option, the entire object
graph (the virtual heap; see Chapter 9) can be fragmented into subsets of objects. We
briefly sketch these different approaches and leave further elaboration as an exercise.
1. Vertical fragmentation of single objects splits an object into two or more objects.
Each of the new objects contains a subset of the original object’s variables. Then
the class definition has to be split, too, into several classes. The advantage would
then be that when accessing the object not the entire set of its variables must be
loaded from the database but only one subset. This notion of vertical fragmen-
tation can however be criticized: it contradicts the object-oriented paradigm of
encapsulation because it breaks the object boundaries. With a good design and
a good normalization of the program, objects should represent an entity and en-
capsulate all the necessary state and behavior; hence objects should not be frag-
mented any further.
2. Horizontal fragmentation of the set of objects. The set of instances of a class can
be fragmented into subsets. This can for example be based on the value of some
instance variables: a set of objects representing employees of different companies
can be fragmented into subsets for employees of each of the individual companies.
Or the fragmentation can be affinity-based by identifying a subset of objects that
are often accessed together. A derived horizontal fragmentation is also possible:
objects of other classes can be fragmented based on which objects in a horizontal
fragment they interact with or which objects they reference.
3. Graph partitioning algorithms (see Section 11.2.3) can also be applied to the virtual
heap. A minimum edge cut partitioning of the virtual heap would then group those
objects together that are highly connected by edges (representing references or
associations) between them. Edges in the object graph can occur due to direct
referencing of another object but also indirectly due to the class hierarchy (that
is, the inheritance relation between objects).

11.3 Data Allocation

For fragmented data a major problem is how to distribute the fragments over the
database servers; a data allocation mechanism has to decide which fragment should
be stored on which server. As an important side effect of data allocation, load bal-
ancing has to be taken into account: fragments should ideally be distributed evenly
across all servers. There are several forms of data allocation:
Range-based allocation relies on range-based fragmentation (see Section 11.1):
When a range-based fragmentation is obtained (for example, order-preserving
partitioning of key-based stores), the identified ranges have to be assigned to
the available servers. This range-based allocation might have an adverse effect
on load balancing: because the entire set of possible records is split into ranges,
the majority of actual records in the database might be located only in some few
ranges (while for the other ranges there are no or just few records contained in
the database). Servers getting assigned the more populated ranges hence have
higher load. But even if ranges are more or less equally populated, the load bal-
ancing depends also on the query patterns that occur: if records in one range are
accessed more often than those in other ranges, the server having this range has
more processing load than the others. That is, this server becomes a hot spot and
is more prone to delays or failures.
Hash-based allocation uses a hash function (like MD5 or MurmurHash) over the
input fragments to determine the server to which each fragment is assigned. In
this case, ranges for the servers are defined on the set of output values of the hash
function. Even with hash-based allocation, hotspots can occur because some
records on one server are accessed more often than the ones on other servers;
however, with hash-based allocation the distribution of the records among the
servers is usually more balanced because the hash function distributes its input
values well over its entire output range. For key-based stores, hash-based
allocation is often used in conjunction with hash-based fragmentation (see Sec-
tion 11.1): that is, for key-based access, first the hash function is computed on the
key of each record; then – depending on the range in which this hash value lies –
the corresponding server is determined. This hash value must then also be com-
puted when a key is queried to determine the server on which the key is located.
Consistent hashing as a popular form of hash-based allocation is presented in
detail in Section 11.3.2.
Cost-based allocation describes the data allocation task as an optimization prob-
lem. The cost model can take several parameters into account; for instance, two
important parameters are minimizing the amount of occupied servers and mini-
mizing the amount of transferred data. A more detailed discussion is provided in
Section 11.3.1.

11.3.1 Cost-Based Allocation

Cost-based allocation can be seen as a combinatorial optimization problem with sev-
eral parameters where the goal is to minimize a cost function. The optimization prob-
lem can be expressed by a so-called integer linear program (ILP).
More formally, the data distribution problem is a description of the task of dis-
tributing database fragments among a set of servers in a distributed database system.
The data distribution problem is basically a Bin Packing Problem (BPP) in the follow-
ing sense:
– K servers correspond to K bins
– bins have a maximum capacity W
– n fragments correspond to n objects
– each object has a weight (in other words a size or a capacity consumption) w i ≤ W
– objects have to be placed into a minimum number of bins without exceeding the
maximum capacity

This BPP can be written as an integer linear program (ILP) as follows – where x_ik is a
binary variable that denotes whether fragment/object i is placed in server/bin k; and
y_k denotes that server/bin k is used (that is, is non-empty):

  minimize    Σ_{k=1}^{K} y_k                                      (minimize amount of servers)              (11.1)

  subject to  Σ_{k=1}^{K} x_ik = 1       for i = 1, . . . , n      (each fragment i assigned to one server)   (11.2)

              Σ_{i=1}^{n} w_i · x_ik ≤ W · y_k   for k = 1, . . . , K   (capacity of each server k not exceeded)  (11.3)

              y_k ∈ {0, 1}     k = 1, . . . , K                                                               (11.4)

              x_ik ∈ {0, 1}    k = 1, . . . , K,  i = 1, . . . , n                                            (11.5)

To explain, Equation 11.1 means that we want to minimize the number of servers (that
is, bins) used; Equation 11.2 means that each object is assigned to exactly one bin;
Equation 11.3 means that the capacity of each server is not exceeded; and the last two
equations denote that the variables are binary – that is, the ILP is a so-called 0-1 linear
program. When solving this optimization problem, the resulting assignment of the x-
variables represents an assignment of fragments to a minimum number of servers.
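
The 0-1 program above can be handed to any off-the-shelf ILP solver. The following sketch formulates Equations 11.1–11.3 with the PuLP library (the choice of PuLP, the weights and the capacity are assumptions made purely for illustration):

from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

w = [4, 3, 3, 2]          # storage consumption w_i of each fragment (toy values)
W = 6                     # storage capacity of every server
n, K = len(w), len(w)     # at most n servers can ever be needed

prob = LpProblem("data_distribution", LpMinimize)
y = [LpVariable(f"y_{k}", cat=LpBinary) for k in range(K)]
x = [[LpVariable(f"x_{i}_{k}", cat=LpBinary) for k in range(K)] for i in range(n)]

prob += lpSum(y)                                                  # (11.1) minimize servers
for i in range(n):
    prob += lpSum(x[i][k] for k in range(K)) == 1                 # (11.2) one server per fragment
for k in range(K):
    prob += lpSum(w[i] * x[i][k] for i in range(n)) <= W * y[k]   # (11.3) capacity not exceeded

prob.solve()
allocation = {i: next(k for k in range(K) if x[i][k].value() > 0.5) for i in range(n)}
print(allocation)          # for the toy weights, two servers suffice (4+2 and 3+3)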
While in this simple example only the storage capacity (W) of the servers and the
storage consumption (w i ) of each fragment are considered, in general other cost fac-
tors can be included when calculating an optimal allocation with minimal cost. For
example, as additional background information, a typical workload can be consid-
ered and data locality of fragment allocation can be improved; in this way, network
[Figure: a hash ring with server1 (hash 2), server2 (hash 7) and server3 (hash 11) and files at hash positions 0, 1, 3, 5, 6, 8 and 10; each file is stored on the next server in clockwise direction.]

Fig. 11.3. Data allocation with consistent hashing

transmission cost is reduced because less servers have to be contacted when evaluat-
ing queries from the workload.

11.3.2 Consistent Hashing

Due to the danger of hotspots, consistent hashing as a form of hash-based allocation
is usually preferred. It provides a better and more flexible distribution of the records
among a set of servers. It originates from work on distributed caching [KLL+ 97]; it has
later been popularized for data allocation by the Dynamo system [DHJ+ 07] and now is
widely applied by distributed database systems to allow data allocation that tolerates
churn in the network (that is, adding or removing database servers).
As consistent hashing is a hash-based allocation scheme, a hash function is com-
puted on each input fragment. The first essential concept of consistent hashing is that
the hash values are now seen as a ring that wraps around: when we have reached the
highest hash value we start again from 0. The second essential concept is that a hash
value is computed not only for each fragment but also for each database server (for
example by taking the hash of the server name or IP address).
By computing a hash value for a database server, each server has a fixed position
on the ring; the advantage of these hash values is that they presumably distribute
the servers evenly on the ring (if the hash function has good distribution properties).
Similarly for each data item a hash value is computed – for example the hash of the
key of a key-value pair. In this way, data also distribute well on the ring. A widely used
allocation policy is then to store data on the next server on the ring when looking in
clockwise direction; this is illustrated in Figure 11.3.

[Figure: the hash ring of Figure 11.3; when a server is removed, the files it stored are taken over by the next server in clockwise direction.]

Fig. 11.4. Server removal with consistent hashing

A major advantage of consistent hashing is its flexible support of additions or removals
of servers. Whenever a server leaves the hash ring, all the data that it stores have to be
moved to the next server in clockwise direction (see Figure 11.4). Whenever a server
joins the ring, the data with hash values between the position of the preceding server
and the position of the new server have to be moved to the new server; they can then be deleted from the server
where they were previously located – that is, the next server in clockwise direction
(see Figure 11.5).
An important tool to make consistent hashing more flexible is to have not only
one location on the ring for each physical server but instead to have multiple loca-
tions: these locations are then called virtual servers. Virtual servers improve consis-
tent hashing in the following cases:
– The virtual servers (of each physical server) are spread along the ring in an arbi-
trary fashion and virtual servers of different physical servers may be interleaved
on the ring. In this way, all servers have a better spread on the ring; this leads to
a more even data distribution among the servers on average.
– Heterogeneous servers are supported: a server with less capacity can be repre-
sented by less virtual servers than a server with more capacity. In this way the
weaker server has to handle less load than the stronger server.
– New servers can be gradually added to the ring: instead of shifting its entire data
load onto a new server at once, virtual servers for the new server can be added
one at a time. In this way the new server has time to start up slowly and take on
its full load step by step until its full capacity is reached.

[Figure: the hash ring of Figure 11.3 after a new server4 has joined at hash position 4; files with hash values between server1 (position 2) and server4 (position 4) move from server2 to the new server.]

Fig. 11.5. Server addition with consistent hashing

In many implementations, the hash ring is divided into several ranges of equal size.
Each server is assigned a subset of these ranges. Whenever servers join or leave the
network, the ranges get reassigned.
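
A minimal consistent hashing ring with virtual servers might look as follows (a Python sketch; the hash function, the number of virtual servers per physical server and the bisect-based lookup are implementation choices made only for this illustration):

import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, virtual_nodes=8):
        self.virtual_nodes = virtual_nodes
        self.ring = []                            # sorted list of (position, server) pairs

    def add_server(self, server):
        for v in range(self.virtual_nodes):       # one ring position per virtual server
            bisect.insort(self.ring, (ring_hash(f"{server}#{v}"), server))

    def remove_server(self, server):
        self.ring = [(pos, s) for pos, s in self.ring if s != server]

    def lookup(self, key):
        # the next server position in clockwise direction (wrapping around)
        idx = bisect.bisect_right(self.ring, (ring_hash(key), chr(0x10FFFF)))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing()
for s in ("server1", "server2", "server3"):
    ring.add_server(s)
keys = ["file1", "file2", "file3", "file4", "file5"]
before = {k: ring.lookup(k) for k in keys}
ring.remove_server("server2")                     # only the keys of server2 are reassigned
after = {k: ring.lookup(k) for k in keys}
print({k: (before[k], after[k]) for k in keys if before[k] != after[k]})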

11.4 Bibliographic Notes

Fragmentation and allocation for relational databases are extensively discussed in
[ÖV11].
Ma and Schewe [MS10] discusses both vertical and horizontal fragmentation for
both object-oriented and XML data and introduces the notion of a split fragmenta-
tion. Horizontal or vertical fragmentation of XML documents while improving query
processing has been studied by several authors including [KÖD10, FBM10, ARB+ 06].
For graphs usually a minimum edge cut partitioning is obtained. The influential
article by Karypis and Kumar [KK98] describes a graph partitioning algorithm with a
coarsening, a partitioning and an uncoarsening phase. Dynamic changes in graphs
can be handled by the xDGP system of Vaquero et al. [VCLM13] by vertex migration.
With a particular focus on graph partitioning for graph databases, Averbuch and Neu-
mann present a graph partitioning algorithm that can adapt to graph changes and an
implementation and evaluation with the Neo4J graph database.
Consistent hashing was first used for distributed caching [KLL+ 97] and has later
on been widely applied for peer-to-peer systems like distributed hash tables – for ex-
ample in [SMK+ 01]. It has then been applied to data fragmentation and allocation in
key-value stores [DHJ+ 07] as well as document stores and extensible record stores.
12 Replication And Synchronization
Replication refers to the concept of storing several copies of a data record at different
database servers. These copies are called replicas and the number of copies is called
the replication factor. When applying replication to large data sets, first a fragmen-
tation of the data set is obtained and then the fragments are replicated among a dis-
tributed database system. Replication on the one hand improves reliability and avail-
ability of a distributed system because replicas can serve as backup copies whenever
one of the servers fails and becomes unavailable. But even if no failures occur, repli-
cation enables load balancing (by redirecting client requests to idle replicas) or data
locality (by redirecting client requests to replicas closest to them). Yet, this kind of con-
current access to different replicas leads to consistency problems for the data records.
Techniques for ordering these accesses have to be implemented.
In this chapter, we discuss the replication models of master-slave and multi-
master replication. We present protocols for distributed concurrency control and
consensus. We extensively discuss logical clocks as a method to order events in a
distributed database system.

12.1 Replication Models

Replication has several advantages:
– it improves the reliability of the distributed database system by offering higher
data availability: if one of the replicas fails to handle a user query, another replica
can take over.
– and it offers lower latency than a non-replicated system by enabling load bal-
ancing, data locality and parallelization: any replica can answer any user’s read
request and so the database system can redirect requests to idle copies; moreover,
replicas answering a user request should at best be closest to the user’s location
to reduce latency while other replicas answer requests of other users.

The advantages of replication, however, come with a major disadvantage: replication causes
consistency problems in a distributed database system. When a user updates a data
record at one server, network delays or even network failures may prevent the database
system from updating all other replicas of the data record quickly; this leads to replicas
having outdated data which could be read by other users in the meantime. Moreover
there is the concurrency problem: two or more users might concurrently update the
same data record on different replicas and the database system must offer a mecha-
nism to resolve this conflict.

[Figure: a master server and two slave servers; one client sends write and read requests to the master, another client reads from a slave; the master propagates updates to both slaves.]

Fig. 12.1. Master-slave replication

The two basic models of replication are master-slave and multi-master replication.
While the consistency problem exists for both, the concurrency problem is avoided
in the master-slave case but at the cost of higher latency for write requests.

12.1.1 Master-Slave Replication

In master-slave replication, write requests are handled only by a single dedicated
server that is called the master. After a write, the master is responsible for updating
all other servers that hold a replica – the so-called slaves. Read requests can be ac-
cepted by both the master and the slaves. Figure 12.1 shows an example of a master
server with two slaves, one client executing write and read requests and one client
only executing read requests. Master-slave replication offers enough redundancy in
case of a master failure: when the master fails, one of the slaves can be elected to be
the new master and all write request are redirected to it.
Having a single master server for all write requests in the database system is a
bottleneck that slows down the processing of writes tremendously. A pragmatic so-
lution is to partition the set of all data records into disjoint subsets and to each such
subset assign one server as the master server. This master assignment could be com-
bined with the partitioning process discussed in Chapter 11: In combination with data
partitioning, data records in the same partition (or fragment, or shard) are copied to
the same replication servers and one of the servers is designated master for the entire
partition while the others act as slaves. In Figure 12.2, we see two data records (A and
B) with a replication factor 2. One server is the master server for record A and only
this server will accept write requests for record A. It is then responsible to update the
second server – acting as a slave for record A. Similarly, the second server is master
for record B while the other server is slave for record B.

[Figure: two servers, one acting as master for record A and slave for record B, the other as master for B and slave for A; each master accepts writes for its record, propagates updates to the other server, and both servers answer reads for A and B.]

Fig. 12.2. Master-slave replication with multiple records

[Figure: three master servers that all accept write and read requests for the same data item and synchronize their state with one another.]

Fig. 12.3. Multi-master replication

12.1.2 Multi-Master Replication

When all servers holding a replica of a data record can process write request, they
all act as masters for the data record; hence, we can talk of multi-master replication
in this case. An equivalent term is peer-to-peer replication based on the fact that
the masters are peers with identical capabilities and they have to synchronize with
one another. Multi-master replication offers higher write availability than master-
slave replication because clients can contact any replica server with a write request
and hence write requests can be processed in parallel. In Figure 12.3 all servers accept
write and read requests for a data item and the servers have to regularly synchronize
their state among themselves. Due to the consistency problem, some clients may re-
trieve outdated data whenever the replica answering the client’s read request has not
finished the synchronization process. An even worse situation arises due to the con-
currency problem: replicas may be in conflict when different clients wrote to different
replicas without a synchronization step in between the writes.

12.1.3 Replication Factor and the Data Replication Problem

Maintaining more replicas can improve reliability (for example, reduce the amount of
data loss) in case of failures in the distributed database system. On the other hand,
more replicas lead to more overhead for keeping all replicas synchronized. When de-
signing a distributed database, the core question is: How many replicas are sufficient
[Figure: timelines of Server 1 and Server 2 with operations read1(x), write1(x), write2(x), read2(x); while Server 1 is unavailable, Server 2 handles all requests; after recovering, Server 1 synchronizes the missed writes before answering new requests.]

Fig. 12.4. Failure and recovery of a server

[Figure: timelines of Server 1 and Server 2; Server 2 accepts write2(x) while Server 1 is unavailable and then fails itself; after both servers have recovered, they reconcile x so that both reflect the most recent state.]

Fig. 12.5. Failure and recovery of two servers

for a reliable distributed database system? Assume that we have 2 as the replication
factor: that is, one master stores the primary copy and one slave stores a sec-
ondary copy in the master-slave replication case; or alternatively two master servers
each store a copy of a data item in the multi-master case. Ideally, if just one of the
servers fails or is temporarily unavailable, the other server will take over and answer
all incoming user requests including the writes. As soon as the first server recovers,
it has to synchronize with the second server and reconcile its database state to reflect
the most recent writes before it can answer any new user requests (see Figure 12.4).
However, with two-way replication, a couple of error cases can occur. For exam-
ple, if the second server fails, too, before the first one has recovered, any writes ac-
cepted by the second server will not be visible to the first server (see Figure 12.5); it
might hence return stale data to a read request. After both servers have recovered,
the writes accepted by the servers independently need to be ordered in a reconcilia-
tion step; after the reconciliation, both servers should reflect the most recent database
state. If the second server does not recover at all, all updates accepted independently
by it (without having reconciled the updates with the first server) will be lost.
A replication factor of 3 is widely accepted as a good trade-off between reliability
and complexity of replica maintenance. However, with 3-way replication, too, it might
happen that all three replicas fail or are not able to communicate for some time.

As an extension of the basic Data Distribution Problem in Section 11.3.1, the Data Repli-
cation Problem expresses that replicas will be placed on distinct servers. In the ILP
representation, the variables y_k for the bins and x_ik for the fragments are kept.

  minimize    Σ_{k=1}^{K} y_k                                      (minimize amount of servers)              (12.1)

  subject to  Σ_{k=1}^{K} x_ik = m       for i = 1, . . . , n      (each fragment i assigned to m servers)    (12.2)

              Σ_{i=1}^{n} w_i · x_ik ≤ W · y_k   for k = 1, . . . , K   (capacity of each server k not exceeded)  (12.3)

              y_k ∈ {0, 1}     k = 1, . . . , K                                                               (12.4)

              x_ik ∈ {0, 1}    k = 1, . . . , K,  i = 1, . . . , n                                            (12.5)

When solving this optimization problem, the resulting assignment of the x-variables
represents an assignment of fragments to servers where m replicas of each fragment
are assigned to m different servers.

12.1.4 Hinted Handoff and Read Repair

Hinted handoff has been devised as a flexible mechanism to handle temporary fail-
ures: if a replica is unavailable, the write requests (or any system messages) for this
replica are delegated to another available server with a hint that these requests and
messages should be relayed to the replica server as soon as possible. To maintain the
replication factor, the other server itself should not hold a replica of the affected data
item. When the connection to the replica server is reestablished, the hinted server
holding the delegated requests and messages can pass them on to the replica; the
replica can then update its state before accepting new requests.
Yet, the relaying server itself might become unavailable or might fail before it can
deliver requests and messages to the unavailable replica. To reduce the adverse effects
of such a situation, background tasks can be run to keep replicas synchronized even
in case of more complex failure cases. One way to deal with this is to regularly run
an epidemic protocol (see Section 10.4) in the background which will synchronize
the replicas. Another option, called read repair, proceeds as follows. Whenever a
data item is requested by a client, a coordinator node sends out the read request to
a couple of replicas; after the responses are retrieved, the coordinator checks these
responses for inconsistencies. The replicas holding outdated data are updated with
the current value which in turn is also returned to the requesting client. Read repair
is often combined with read quorums (see Section 13.1.1): because the coordinator
node needs a set of unambiguous responses anyway, it contacts a set of replicas larger
than the needed quorum, returns the majority response to the client and sends repair
instructions to those replicas that are not yet synchronized.
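
Read repair as described above can be sketched as follows (a simplified Python model that assumes timestamped versions of each record; the class and function names are invented for illustration):

from dataclasses import dataclass

@dataclass
class Version:
    value: str
    timestamp: int                 # e.g. a write timestamp or logical clock value

class Replica:
    def __init__(self):
        self.store = {}            # key -> Version

    def read(self, key):
        return self.store.get(key)

    def write(self, key, version):
        current = self.store.get(key)
        if current is None or version.timestamp > current.timestamp:
            self.store[key] = version

def coordinated_read(key, replicas, read_quorum):
    responses = [(r, r.read(key)) for r in replicas]     # contact more replicas than needed
    answered = [(r, v) for r, v in responses if v is not None]
    if len(answered) < read_quorum:
        raise RuntimeError("not enough replicas answered")
    newest = max((v for _, v in answered), key=lambda v: v.timestamp)
    for r, v in responses:                               # repair stale or missing replicas
        if v is None or v.timestamp < newest.timestamp:
            r.write(key, newest)
    return newest.value                                  # the freshest value goes to the client

r1, r2, r3 = Replica(), Replica(), Replica()
r1.write("x", Version("old", 1))
r2.write("x", Version("new", 2))                         # r1 and r3 missed this update
print(coordinated_read("x", [r1, r2, r3], read_quorum=2))   # "new"; r1 and r3 are repaired
print(r1.read("x").value, r3.read("x").value)               # new new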

12.2 Distributed Concurrency Control

Distributed concurrency control ensures the correct execution of operations (more
generally, transactions) that affect data that are stored in a distributed fashion on dif-
ferent database servers. In case of data replication, a typical application of a concur-
rency control protocol is synchronizing all replicas of a record (on all database servers
that hold a replica) when a write on the record was issued to one of the database
servers. That is, if there are n replicas and one of them has been updated, the remain-
ing n − 1 replicas have to be updated, too.
Specifications of concurrency control protocols use the term agent for each stake-
holder participating in the protocol; each agent can furthermore have different roles.
The most important role is the coordinator: the coordinator is the database server
that communicates with all other agents and is responsible for either leading the dis-
tributed operation to success or aborting it in its entirety. Concurrency protocols hence
offer a solution to the consensus problem: the consensus problem requires a set of
agents to agree on a single value.
The two-phase commit protocol presented in Section 12.2.1 requires all partici-
pating agents to agree to a proposed value in order to accept the value as the cur-
rently globally valid state among all agents. In contrast, quorum consensus proto-
cols only require a certain majority of agents to agree on a proposed value where the
exact definition of majority depends on a balance between read and write behaviour
and the types of failures that the protocol should be resistant against. The Paxos al-
gorithm as a prominent and widely-used quorum consensus protocol is presented in
Section 12.2.2. Lastly, multi-version concurrency control as a timestamp-based con-
currency mechanism that offers non-blocking reads is introduced in Section 12.2.3.

12.2.1 Two-Phase Commit

The two-phase commit (2PC) addresses the execution of a distributed transaction
where all agents have to acknowledge a successful finalization of the transaction. 2PC
is initiated by the coordinator of the transaction who wants to reach a consensus for
the transaction results by all agents. In the simplest case, all agents try to agree on
accepting a single value of an update request that has been received by the coordina-
tor. The two-phase commit (2PC) protocol has a voting phase and a decision phase.
In each phase, the coordinator sends one message to all agents and – if everything is
working according to the default protocol – receives a reply from each agent. Timeouts and restarts of the coordinator or any of the other agents have to be handled by additional protocols.

Fig. 12.6. Two-phase commit: commit case
In the case that no timeouts and no restarts occur, the agents can either jointly
agree to commit the value or the coordinator decides to abort the transaction. In the
voting phase, the coordinator sends all agents a prepare message asking them whether
they are able to commit the transaction. Each agent can then vote by either replying
ready (the agent is willing to accept the transaction) or failed (the agent does not accept
the transaction) – or it does not reply at all which causes a timeout protocol to handle
this situation.
In the decision phase, the coordinator notifies the agents of a common decision
resulting from the votes: the transaction can only be globally committed if all agents
voted ready in which case the coordinator sends a commit message to all agents; af-
terwards, all agents have to send an acknowledgement to the coordinator to achieve a
global commit. This commit case is shown in Figure 12.6.
The abort case applies if at least one agent voted failed. In order to abort the trans-
action globally, the coordinator has to send an abort message to all agents that have
voted ready. Afterwards, the agents that voted ready have to acknowledge the abort
and have to internally undo all transaction operations (rollback).
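The two phases can be summarized in a minimal coordinator sketch (Python); the prepare, commit and abort methods of the agents are assumed interfaces, the log is a stand-in for the coordinator's persistent log, and timeout handling is omitted.

def two_phase_commit(coordinator_log, agents, transaction):
    """Sketch of a 2PC coordinator; agent.prepare/commit/abort are assumed
    interfaces and timeout/recovery handling is left out."""
    # Phase 1 (voting): ask every agent whether it can commit the transaction.
    votes = [agent.prepare(transaction) for agent in agents]  # 'ready' or 'failed'

    if all(vote == "ready" for vote in votes):
        # Phase 2 (decision): global commit only if all agents voted ready.
        coordinator_log.append(("commit", transaction))
        for agent in agents:
            agent.commit(transaction)      # agents acknowledge the commit
        return "committed"
    else:
        # At least one agent voted failed: abort globally.
        coordinator_log.append(("abort", transaction))
        for agent, vote in zip(agents, votes):
            if vote == "ready":
                agent.abort(transaction)   # agent undoes its operations (rollback)
        return "aborted"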
A major problem with a large-scale application of the two-phase commit protocol
is that a failure of a single agent will lead to a global abort of the transaction. Moreover,
the protocol highly depends on the central role of a single coordinator.

Fig. 12.7. Two-phase commit: abort case

In particular, the state between the two phases – before the coordinator sends his decision to all agents – is called the in-doubt state: in case the coordinator irrecoverably fails before sending his decision, the agents cannot proceed to either commit or abort the
transaction. This problem can only be solved by a complex recovery procedure that
contacts all other agents and asks for their votes again. Even more severe is the case
that both the coordinator and one agent fail during the in-doubt state. In this case,
one cannot say whether the failed agent already received a commit or an abort mes-
sage from the failed coordinator. The entire system is hence blocked until at least one
of them recovers.
A so-called three-phase commit protocol adds another phase (the pre-commit
phase) to avoid this blocking behavior; it however comes at the cost of more message
delays and hence a larger roundtrip time.

12.2.2 Paxos Algorithm

Starting with [Lam98] a family of consensus protocols called Paxos algorithms has
been devised. Different versions of the Paxos algorithm can ensure progress under
different failure models provided that a certain majority of agents is alive and working
correctly.
The first and basic Paxos algorithm is meant to cope with non-Byzantine failures
(in particular, crash failures, message loss, duplication and reordering of messages)
as long as more than half of the accepting agents follow the protocol. From a database
system point of view, the Paxos algorithm can be applied to keep a distributed DBMS
in a consistent state. A client can for example issue a read request for some database
record; the database servers then have to come to a consensus on what the current
state (and hence the most recent value) of the record is. In this setting, the database
servers act as one or more agents in the Paxos protocol. The following types of agents
take part in a Paxos protocol:
Proposer: A proposer is an agent that waits for a client request and then initi-
ates the consensus process by asking acceptor agents to send a possible response
value. A proposer assigns a number to its request (called proposal number, or
sometimes command number or ballot number). Depending on the answers re-
ceived from the acceptors, the proposer chooses a response value. This response
value is sent to all acceptors once more to obtain a final consensus.
Leader: For handling a specific client request, one of the proposers is elected to
be the leader for this specific client request. There may be several proposers com-
peting to be leader and they may issue proposals with different proposal numbers.
Acceptor: An acceptor can accept a proposal based on the proposed value and on
the proposal number; a correct acceptor only accepts proposals which are num-
bered higher than any proposal it has accepted before. This is why an acceptor has
to always remember the highest proposal it accepted so far – even if it crashes and
later on restarts; thus persistent disk storage and recovery is needed for acceptors.
Learner: A learner is any other agent that is interested in the value on which the
acceptors agreed. Learners will be informed of a response value by each of the
acceptors. If a majority of acceptors advocates a certain value, this value is chosen
as the consensus outcome. One of the learners can return the response value to
the client. Usually, the leader is also a learner so that he receives the notification
that his chosen value was finally agreed to by a majority of acceptors.

The basic Paxos algorithm consists of a read phase (Phase 1) and a write phase
(Phase 2) as shown in Figure 12.8. In the read phase, the leader prepares the con-
sensus by communicating with the acceptors to retrieve their current state. This is
necessary because the acceptors might be in different states and they might have al-
ready answered requests sent by other proposers (with different proposal numbers).
For the read phase to be successful, the leader has to receive an answer from a major-
ity of acceptors. In the write phase, the leader can choose a value from the answers
sent to him by the acceptors; he then sends the chosen value to all acceptors. Each
acceptor will notify all learners of the chosen value. Again for a learner to effectively
learn the value, he has to receive notifications from a majority of acceptors. The two
phases each consist of two messages:
Phase 1a: The leader chooses a proposal number to identify himself. He sends a
prepare message with his proposal number propNum to the acceptors.
Phase 1b: The acceptors reply with an acknowledgement message promise. The
acknowledgement contains the proposal number propNum as well as the high-
est proposal number maxAcceptPropNum for which the acceptor has previously
sent an accepted message and the corresponding value of this accepted message;
that is, maxAcceptPropNum is different from propNum and stems from a previous run of Phase 2 for which however no consensus was reached due to some failure. By sending an acknowledgement message promise for proposal number propNum each acceptor informs the leader of the value he is willing to accept as the consensus result. And the acceptor promises not to send any more accepted messages for proposal numbers less than propNum.

Fig. 12.8. A basic Paxos run without failures
Phase 2a: When the leader receives a promise message from a majority of accep-
tors, he compares all replies and identifies the one with the highest proposal num-
ber maxAcceptPropNum; he chooses the corresponding value as the possible con-
sensus value. If there is no such highest proposal number, he can choose a value
freely – hoping that a majority of acceptors will finally approve his choice. After
choosing a value, the leader sends an accept message (with his proposal number
propNum and the chosen value) to the acceptors for final approval.
Phase 2b: An acceptor accepts a chosen value for a given propNum by sending
an accepted message (with proposal number propNum and the chosen value) to
all learners – as long as he has not sent a promise message for a higher proposal
number (in a different run of phase 1). As an alternative the acceptors can send
their accepted messages to the leader who in turn notifies all the learners; this
reduces the amount of messages sent (when there is more than one learner) at the
cost of introducing an additional message delay (because learners have to wait for
the notification from the leader).
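To illustrate the acceptor side of Phases 1b and 2b as well as the value choice of Phase 2a, the following simplified sketch (Python) may help; message transport and the persistent storage of the acceptor state are assumed to exist elsewhere.

class Acceptor:
    """Sketch of a basic Paxos acceptor; in a real system this state must be
    kept on persistent storage so that it survives crashes and restarts."""

    def __init__(self):
        self.promised_num = None     # highest propNum for which a promise was sent
        self.accepted_num = None     # highest propNum for which accepted was sent
        self.accepted_value = None   # value of that accepted message

    def on_prepare(self, prop_num):
        """Phase 1b: answer a prepare(propNum) message."""
        if self.promised_num is None or prop_num > self.promised_num:
            self.promised_num = prop_num
            # promise(propNum, maxAcceptPropNum, value)
            return ("promise", prop_num, self.accepted_num, self.accepted_value)
        return ("reject", prop_num, None, None)   # proposal number too low

    def on_accept(self, prop_num, chosen_value):
        """Phase 2b: answer an accept(propNum, chosenValue) message."""
        if self.promised_num is None or prop_num >= self.promised_num:
            self.promised_num = prop_num
            self.accepted_num = prop_num
            self.accepted_value = chosen_value
            return ("accepted", prop_num, chosen_value)  # sent to the learners
        return ("not_accepted", prop_num, None)

def choose_value(promises, own_value):
    """Phase 2a: among the promises of a majority, pick the value with the
    highest maxAcceptPropNum; otherwise the leader may choose freely."""
    accepted = [(num, val) for (_, _, num, val) in promises if num is not None]
    return max(accepted, key=lambda pair: pair[0])[1] if accepted else own_value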
With the basic Paxos protocol the following safety properties are guaranteed to hold
for each individual run of the protocol (see [Lam05]):
Nontriviality: Any value that a learner learns must have been proposed by a pro-
poser.
Stability: A learner learns one single value or none at all.
Consistency: All learners learn the same value; that is, no two learners learn dif-
ferent values.

Paxos also has the following liveness property (for a given learner and a given pro-
posed value) under the assumption that the given learner, the proposer of the value
and the necessary majority of acceptors are working correctly and all messages even-
tually reach their destination:
Liveness: If some value has been proposed, a learner will learn a value – although
not necessarily the proposed one and although several rounds of communications
(that is, message delays) might be necessary to establish the learned value.

The basic Paxos protocol can support non-Byzantine failures of the participating
agents as follows:
Failures of proposers: The leader can fail as long as there is at least one backup
proposer who can be the new leader and eventually gets his chosen value ac-
cepted. That is why the read phase (Phase 1) is necessary: in Phase 1 the new
leader retrieves the highest proposal number for which an accepted message has
ever been sent previously by any of the acceptors – but no learner has received
a majority of accepted messages for this proposal number so far. In case the pro-
posal number chosen by the new leader (in his prepare message in the current
run of Phase 1) is superseded by a proposal number sent by any other proposer,
the leader can start a new run of Phase 1 with a higher proposal number. This
is shown in Figure 12.9. One problem of this restart behavior is the case of com-
peting proposers (also called dueling proposers; see Figure 12.10): two or more
proposers are trying to be leaders but they are never able to finish Phase 2 because
in the meantime a majority of acceptors has already sent a promise message for
another higher proposal number – hence making it impossible for them to send
an accepted message for the lower proposal number. In this case it can happen
that the competing proposers each try to increase their proposal numbers forever
without making progress so that no consensus is reached. A practical recommen-
dation to avoid this case is to introduce small random delays before starting a
new run of Phase 1 so that one of the competing proposers gets a chance to finish
Phase 2.
Failures of learners: If all learners fail, the consensus value will not be sent to
the client although a majority of acceptors accepted a chosen value. Hence, there
must be at least one learner working correctly (and timely). A practical solution is
to have one distinguished learner (who could act as the leader at the same time)
to be responsible for returning the consensus value to the client – but to also have one or more backup learners that can take over in case the distinguished learner fails (or delays).

Fig. 12.9. A basic Paxos run with a failing leader
Failures of acceptors: Regarding the acceptors, for a consensus to be successful
with a certain proposal number, the leader has to receive promise messages for
the proposal number from a majority of acceptors and later on at least one learner
has to receive accepted messages for the proposal number from a majority of ac-
ceptors – although not necessarily the same acceptors have to send the promise
and accepted messages. Now we can establish the minimum quorum size needed
in basic Paxos. The basic Paxos protocol can tolerate F faulty acceptors as long as
there are at least F + 1 non-faulty acceptors that agree on a value. That is, for a
total of N > 2 · F acceptors, basic Paxos can work reliably, even if F acceptors fail.
In other words, basic Paxos can proceed with a quorum size of F + 1 as long as
there are at most F faulty acceptors – and their messages are eventually received
by proposers and learners (see Figure 12.11).

If however a majority of acceptors fails, a new run of Phase 1 has to be started with a
higher proposal number and with a quorum containing other acceptors than the failed
ones (see Figure 12.12).
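The quorum arithmetic of the acceptor case can be stated in a few lines (a sketch, not tied to any particular implementation):

def paxos_quorum(n_acceptors):
    """Smallest majority quorum for N acceptors and the number of acceptor
    failures F it tolerates (so that N > 2 * F holds)."""
    quorum_size = n_acceptors // 2 + 1              # F + 1 for N = 2F + 1
    tolerated_failures = n_acceptors - quorum_size  # F faulty acceptors tolerated
    return quorum_size, tolerated_failures

# Example: 5 acceptors -> quorum of 3, tolerating 2 failures (5 > 2 * 2).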
Fig. 12.10. A basic Paxos run with dueling proposers

Fig. 12.11. A basic Paxos run with a minority of failing acceptors
Several variants of the basic Paxos protocol have been devised to offer enhanced func-
tionalities. One such variant of Paxos called cheap Paxos relies on the optimistic per-
spective that F failing acceptors can be tolerated by having F + 1 active acceptors and
F auxiliary acceptors. The auxiliary acceptors need not take part in the Paxos protocol
as long as the active acceptors are working correctly; only if one of the active acceptors
fails, one auxiliary acceptor is included in the protocol as a replacement for the faulty
acceptor – and only until the faulty acceptor has recovered. This version of Paxos re-
duces message transmissions (because only F + 1 acceptors are contacted in the best
case) and the auxiliary acceptors can be idle (or processing other tasks) as long as no
faults occur.
A further generalization of Paxos – called generalized Paxos – relies on the obser-
vation that commutative commands can be executed in any order. Instead of enforcing
a total order of all commands, it hence suffices to have a partial order; in other words,
non-commutative commands have to be executed in the right sequence by all agents,
whereas commutative commands can be executed in any order by the agents. Note
that in this case consensus is reached not only for a single client request but for a
continuous sequence of commands.
Fig. 12.12. A basic Paxos run with a majority of failing acceptors

12.2.3 Multiversion Concurrency Control

In a distributed database system, lock-based concurrency control on replicated data is very expensive as locks must be managed globally for all servers. It may also lead to deadlocks more often than on a single centralized server.
That is why a form of multiversion concurrency control (MVCC) is often employed.
MVCC-based transactions consist of
– a read phase where a local copy for the transaction is created which the transac-
tion can operate on;
– a validation phase where the MVCC system checks whether the transaction is
allowed to apply its modification to the global authoritative data set;
– and a write phase where the modified data are copied to the global data set.

The main advantage of MVCC is that each client sees its own copy of the current val-
ues in the database; that is, the database is always accessible for read access with-
out restrictions – a feature called non-blocking reads. When writes take place in-
side a transaction then the client version is compared with the current version in the
database system at commit time: when the client version is older than the database
version, writes of other transactions have occurred in between. To avoid consistency
problems, the client transaction must be aborted and restarted with a new version.
Although MVCC offers non-blocking reads and less overhead than lock-based approaches, it also has some disadvantages. For one thing, maintaining versions for different clients raises a storage space problem: several copies of data items have to be held available for the accessing clients, and the clients produce new versions during the interaction. Moreover, due to the late abort at commit time, many restarts of transactions may occur. For more details on MVCC see Section 13.1.2 on snapshot isolation.
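The version check at commit time can be sketched as follows (Python); a single version counter per record is assumed, which is a simplification of real MVCC implementations that keep several versions per data item.

class MvccStore:
    """Very simplified optimistic sketch: each record carries a version number;
    a transaction reads a private copy and validates it at commit time."""

    def __init__(self):
        self.data = {}   # key -> (value, version)

    def begin_read(self, key):
        # read phase: hand out a private copy together with its version
        value, version = self.data.get(key, (None, 0))
        return value, version

    def commit_write(self, key, new_value, read_version):
        # validation phase: the write is only allowed if no other transaction
        # has committed a newer version in the meantime
        _, current_version = self.data.get(key, (None, 0))
        if read_version != current_version:
            return False   # abort: the client must restart with a new version
        # write phase: install the new value with an incremented version
        self.data[key] = (new_value, current_version + 1)
        return True

A transaction that gets False back has read a stale version and has to be restarted, which corresponds to the late abort at commit time mentioned above.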

12.3 Ordering of Events and Vector Clocks

Assuming a network of servers that communicate by sending messages, these messages usually have to be processed in a certain order because there are causal
dependencies between the messages. For example, when a server receives a message
from a second server, the message may influence internal computations of the first
server and may also influence messages that the first server sends to other servers.
In a distributed system, message delivery is usually not synchronized and mes-
sages might be delayed without a time bound. When a server receives several mes-
sages, he cannot be sure that the order in which the messages arrive is the same order
in which they were produced by the other servers. To make things worse, in a dis-
tributed system, it is difficult to establish a common notion of time in general: the
system clocks of every server might differ; that is why attaching the current local sys-
tem time (of the sending server) as a timestamp to a message is of no use for ordering
messages chronologically in a global sense.
However it is important to observe that no exact time is needed when ordering
messages in a network: instead of relying on physical clocks a notion of a logical clock
suffices. A logical clock basically consists of a counter that counts events. Events can be
sending or receiving messages; moreover, there may be internal events inside a server
where no communication with other servers is needed but the internal events increase
the local counter of a server, too. Each increment of the counter represents a single
logical clock tick from the perspective of a server. Hence, we may not only talk about
messages exchanged between servers but, more generally, about events in processes
that are run by a set of servers. When one event in a process happened before another
event, the first event may have caused the second. That is why we want to process
events according to a causality order.

12.3.1 Scalar Clocks

Logical clocks as described by Lamport [Lam78] (and hence often called Lamport
clocks) provide a partial order of all messages exchanged in a distributed system.
Messages are sent and received by servers in a network; or more precisely, by a set of
client processes which run on the servers in the network.
A Lamport clock orders events (like sending and receiving messages) based on a
relation called happened-before relation; this relation is used as a formal notion for
the fact that we are not interested in the exact physical time but only in the ordering of
events. In general, many internal events may happen inside each client process,
but for the global perspective of the system only the communication events between
processes are relevant. If in the global perspective of the distributed system one event
e1 happened before another event e2 , it is denoted as e1 → e2 . The happened-before
relation is induced by (1) the total ordering of events happening at a single server in
a single process and (2) the fact that the send event of a message must have happened
before the receive event of that message. Moreover (3), if one event happened before a
second event, and the second event happened before a third event, then the first event
must also have happened before the third event; hence the happened-before relation
is transitive. More formally these three properties can be written as follows.

Happened-before relation:
1. if e1 is an event inside a process and e2 is an event happening in the same process after e1 , then
e1 also happened before e2 in the global view of the system: e1 → e2 .
2. if e1 is the event of sending a message m and e2 is the event of receiving m, then e1 happened
before e2 : e1 → e2 .
3. if it is the case that e1 → e2 and e2 → e3 then also e1 → e3 .
To implement Lamport clocks, each client process has its own counter (that has a
scalar value) to denote its local (and logical) time. The initial value of each local
counter is 0. Whenever a process sends a message it attaches its local counter to the
message (incremented by one); in other words, a message “piggybacks” a timestamp.
When a process receives a message, it also has to increment its local counter: the
important point is that other processes in the system might have processed more mes-
sages and some counters might hence be ahead of others. This difference is corrected
when receiving a message with a timestamp higher than the local clock: the receiving
process takes the maximum of its local clock and the timestamp to be its new local
clock; then this new local clock is incremented by one to account for the message
retrieval event. Hence more formally, we define C i to be the counter for process i and
let m1 , m2 , ... and so on be the messages that are exchanged between the processes.
Then the counters have the following properties:
1. Initialization: Initially, all counters are 0: C i = 0 for all i
2. Send event: Before process i sends a message m j , it increments its local counter
and attaches the counter as a logical timestamp to the message:
(a) C i = C i + 1
(b) send (m j , C i ) to receiving process(es)
3. Receive event: Whenever process i receives a message m j , it reads the attached
timestamp t and in case the timestamp is greater than its local counter it advances
the counter to match the timestamp; before processing the message, the counter
is incremented:
(a) receive (m j , t)
(b) C i = max(C i , t)
(c) C i = C i + 1
(d) process m j
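These rules translate directly into a small class; the sketch below (Python, illustrative only) also replays the message exchange of Figure 12.13.

class LamportClock:
    """Scalar Lamport clock for one process."""

    def __init__(self):
        self.counter = 0                    # initialization: C_i = 0

    def send(self, message):
        self.counter += 1                   # increment before sending
        return (message, self.counter)      # piggyback the timestamp

    def receive(self, message, timestamp):
        self.counter = max(self.counter, timestamp)  # catch up with the sender
        self.counter += 1                   # account for the receive event
        return message                      # now the message can be processed

# Replaying Figure 12.13:
a, b = LamportClock(), LamportClock()
m1, t1 = a.send("m1")    # a.counter == 1
b.receive(m1, t1)        # b.counter == 2
m2, t2 = b.send("m2")    # b.counter == 3
a.receive(m2, t2)        # a.counter == 4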

The example in Figure 12.13 shows how message exchange advances the local coun-
ters. Before sending message m1 , Process a increments its counter C a by 1 and attaches
it as the timestamp t1 to the message. Next, Process b receives the message and no-
tices that its counter is behind the timestamp and it has to adjust it: max(C b , t1 ) =
max(0, 1) = 1; before processing the message, C b is incremented by 1 yielding a value
of 2. After some time, Process b prepares another message m2 , increments its counter
C b to be 3, and sends the message m2 together with the value 3 as its timestamp t2 .
Now, Process a has to advance its clock (which is still 1) by taking the maximum:
max(C a , t2 ) = max(1, 3) = 3 and immediately increments C a to be 4 before processing
the message.
Figure 12.14 shows a more advanced situation with three processes. In particular,
we see here the case that with Lamport clocks it may happen that two events in differ-
ent processes have the same clock value: In Process a the scalar clock 5 denotes the
event of receiving message m3 while in Process b the scalar clock 5 denotes the event
of sending message m4 .
Fig. 12.13. Lamport clock with two processes

Fig. 12.14. Lamport clock with three processes

Moreover, scalar clocks do not provide the notion of a globally total order and hence
for some (concurrent) events their processing in Process b may occur in arbitrary order.
Consider for example Figure 12.14: here we see that messages m1 and m2 are being
sent at the same global time 1; that is also why the timestamps for the two messages
are identical.

A globally total order over all processes in the distributed servers may be necessary to schedule mes-
sages with an identical timestamp in an unambiguous way.

For example, with a total order in Figure 12.14, the two messages with identical times-
tamp m1 (sent by Process a) and m2 (sent by Process c) may only be processed in a
certain order. A simple way to establish such a total order of events is breaking ties
by using process identifiers: the process IDs can be totally ordered (on each server)
and they may contain the name of the server they are running on to obtain a globally
unique identifier. We can let the clock of a process with a lower ID always take prece-
dence over a process with a higher ID. In our example, we can assume that the process
IDs are ordered as a < b < c, and append the process ID pid to the timestamp t such
that messages piggyback the combination t.pid. We can then compare two timestamps
t1 .pid1 and t2 .pid2 as follows.

Comparison based on process IDs: For timestamps t1 and t2 and process IDs pid1 and pid2 as well as
a predefined ordering on process IDs, it holds that t1 .pid1 < t2 .pid2 whenever t1 < t2 or in the case
that t1 = t2 whenever pid1 < pid2 .
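Such a tie-breaking comparison can be realized by comparing pairs (a Python sketch; the process IDs here are plain strings chosen for illustration):

def total_order_key(timestamp, process_id):
    """Key for the total order t.pid: compare timestamps first,
    break ties by the predefined ordering on process IDs."""
    return (timestamp, process_id)

# Example: with a < c, (1, "a") < (1, "c"), so the message tagged 1.a is
# scheduled before the message tagged 1.c.
assert total_order_key(1, "a") < total_order_key(1, "c")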

In Figure 12.15 we see that message m1 hence has to be processed before message m2
at Server b. This ordering by process IDs is however somewhat arbitrary and may not
capture any semantic order of messages that may be necessary for a correct processing of the messages.

Fig. 12.15. Lamport clock totally ordered by process identifiers

12.3.2 Concurrency and Clock Properties

As already mentioned, the happened-before relation provides a partial ordering: that is, some events might happen in parallel and we don’t care what their actual order is.
For these events we can neither say that one happened before the other nor the other
way round; they are called concurrent events. More formally, for two events e1 and
e2 , when we know that e1 ↛ e2 and e2 ↛ e1 , then we say that e1 and e2 are concurrent and write e1 ∥ e2 .

Concurrency of Events: For two events e1 and e2 in a distributed system with the happened-before relation →, the two events are concurrent (written as e1 ∥ e2 ) if and only if e1 ↛ e2 and e2 ↛ e1 .

Another important notion when talking about events in a distributed system is causal-
ity: does one event influence another event – and in particular, can the first event po-
tentially be the cause of the other event? Ideally, we want to have a global clock with
the property that whenever the global clock value of one event is less than the global
clock value of a second event – that is, C(e1 ) < C(e2 ) – then we can be sure that e1
may at least potentially influence another event e2 . However, Lamport clocks do not
have this property of representing causality: the clock of one event may be less than
the clock of another, yet the former cannot (not even potentially) influence the latter.
To illustrate this case, Figure 12.16 shows a message exchange between four processes
which however happen totally independently. In particular, the receipt of message m1
happens when the global clock is 2, and the sending of message m4 happens when the
global clock is 3; however the receipt of message m1 cannot have caused the sending
of message m4 in any way because there is no communication at all between Processes a and b on the one hand and Processes c and d on the other.
Fig. 12.16. Lamport clock with independent events

Hence, all we can say about Lamport clocks is that they satisfy the following weak
clock property: if one event is in a happened-before relation another one, then the
global Lamport clock of the first event is less than the one of the second.

Weak Clock Property: For two events e1 and e2 in a distributed system, the happened-before relation
→, and a global clock C: if e1 → e2 then C(e1 ) < C(e2 ).

What we can derive from this property (by using the contrapositive) is that when the
Lamport clock of one event is not less than the one of the other, then the former event
cannot be in a happened-before relation to the latter: if C(e1 ) ≮ C(e2 ) then e1 ↛ e2 .
A more helpful property is the opposite direction: whenever the global clock of
the first event is less than the one of the second, then we can be sure that the first
event is in a happened-before relation to the second. And from this fact we can draw
the conclusion that the first event may have had an influence on – or may have caused
– the second event. The strong clock property says that both directions are fulfilled.

Strong Clock Property: For two events e1 and e2 in a distributed system, the happened-before relation
→, and a global clock C: e1 → e2 if and only if C(e1 ) < C(e2 ).

The strong clock property is not satisfied for Lamport clocks.

12.3.3 Vector Clocks

The strong clock property is satisfied for vector clocks: instead of a single counter
for the whole system, a vector clock is a vector of counters with one counter for each
client process. In the distributed database setting, a client process handles read and
write requests coming from a database user. With vector clocks it is crucial to have a
separate counter for each client process (even for those running on the same server)
as otherwise servers could not accept concurrent write requests from multiple users.
The problem with a single server counter (instead of individual process counters) is
that write attempts by different processes on the same server would simply increase
the server counter and there is no way to tell apart different processes; this can lead to
lost updates.

Vector clocks provide a partial order of events.

The partial order of vector clocks expresses concurrency and causality better than
Lamport clocks because it exposes which client process has seen which message. Mes-
sages piggyback the current vector clock of the sending process. Differences in the vec-
tor clocks are consolidated when receiving a message by taking the maximum for each
vector element. Before sending and after receiving a message only the vector element
of the sending or the receiving process is stepped forward. More formally, vector clocks
are maintained with the following steps:
1. Initialization: For n client processes, a vector clock is a vector (or an array) of n
elements. Each of the n processes maintains one local vector clock. Initially, for
each process all elements are 0: for the vector clock VC i of process i, VC i [j] = 0 for
all j (where i, j ∈ {1, . . . , n})
2. Send event: Before process i sends a message m k , it increments only the i-th ele-
ment of the local vector clock and attaches the entire vector as a logical timestamp
to the message:
(a) VC i [i] = VC i [i] + 1
(b) send (m k , VC i ) to receiving process(es)
3. Receive event: Whenever process i receives a message m k , it reads the attached
timestamp vector t; it iterates over all elements and in case the timestamp ele-
ment t[j] is greater than its local vector clock element VC i [j] it advances the clock
element to match the timestamp element; before processing the message, the i-th
vector element is incremented:
(a) receive (m k , t)
(b) for j = 1 to n: VC i [j] = max(VC i [j], t[j])
(c) VC i [i] = VC i [i] + 1
(d) process m k
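A direct transcription of these rules is sketched below (Python; processes are assumed to be indexed from 0 to n−1):

class VectorClock:
    """Vector clock for process `pid` in a system of `n` processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n                 # initialization: all elements 0

    def send(self, message):
        self.clock[self.pid] += 1            # step only the own element forward
        return (message, list(self.clock))   # piggyback a copy of the vector

    def receive(self, message, timestamp):
        # element-wise maximum of the own clock and the piggybacked timestamp
        self.clock = [max(c, t) for c, t in zip(self.clock, timestamp)]
        self.clock[self.pid] += 1            # account for the receive event
        return message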

In Figure 12.17 we see how Process b steps its own vector element forward when send-
ing and receiving messages, and how the other two processes advance their clocks to
match the timestamp when receiving a message.
In order to define causality and concurrency of events we must be able to compare
vector clocks – that is, we have to specify a partial ordering on the vectors. This is done
by a pairwise comparison of the vector elements. More specifically, let VC(e1 ) be the
vector clock for an event e1 and let VC(e2) be the vector clock for an event e2 (potentially in different processes), we then define VC(e1) ≤ VC(e2) and VC(e1) < VC(e2) as follows.

Fig. 12.17. Vector clock

Vector Clock Comparison: Let e1 and e2 be events in a distributed system, VC(e1 ) be the vector
clock for event e1 and VC(e2 ) be the vector clock for event e2 , then
– VC(e1 ) is less than or equal to VC(e2 ) (written as VC(e1 ) ≤ VC(e2 )) if and only if the vector clock
elements of VC(e1 ) are less than or equal to the elements of VC(e2 ); that is, for all i ∈ {1, . . . , n}
VC(e1 )[i] ≤ VC(e2 )[i].
– VC(e1 ) is less than VC(e2 ) (written as VC(e1 ) < VC(e2 )) if and only if the vector clock elements of
VC(e1 ) are less than or equal to the ones of VC(e2 ) and at least one element of VC(e1 ) is strictly
less than the corresponding one of VC(e2 ); that is, for all i ∈ {1, . . . , n} VC(e1 )[i] ≤ VC(e2 )[i]
and there is at least one j ∈ {1, . . . , n} for which VC(e1 )[j] ≠ VC(e2 )[j].

Due to the fact that vector clocks enjoy the strong clock property, we can define causal-
ity and concurrency by the vector clocks of events: For one particular event e, we can
define all events with lower vector clocks as causes of e and all events with higher vec-
tor clocks as effects of e. All other events have vector clocks incomparable to the one
of e; these are the events concurrent to e. More formally, we define the sets of causes,
effects and concurrent events as follows:
– For a given event e the set of causes is causes(e) = {e′ | VC(e′ ) < VC(e)}
– For a given event e the set of effects is effects(e) = {e′ | VC(e) < VC(e′ )}
– For a given event e the set of concurrent events is
concurrent(e) = {e′ | VC(e′ ) ≮ VC(e) and VC(e) ≮ VC(e′ )}
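These comparisons can be written as short predicates (a Python sketch operating on plain lists such as those produced by the vector clock sketch above):

def leq(vc1, vc2):
    """VC(e1) ≤ VC(e2): every element is less than or equal."""
    return all(x <= y for x, y in zip(vc1, vc2))

def less(vc1, vc2):
    """VC(e1) < VC(e2): ≤ holds and at least one element is strictly smaller."""
    return leq(vc1, vc2) and vc1 != vc2

def concurrent(vc1, vc2):
    """Neither clock is less than the other: the events are concurrent."""
    return not less(vc1, vc2) and not less(vc2, vc1)

# Values taken from Figure 12.17: [0,1,1] < [1,3,1] < [2,3,1],
# while the timestamps [1,0,0] and [0,0,1] are concurrent.
assert less([0, 1, 1], [1, 3, 1]) and less([1, 3, 1], [2, 3, 1])
assert concurrent([1, 0, 0], [0, 0, 1])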
Fig. 12.18. Vector clock with independent events

In the example of Figure 12.17 we can see that the event of receiving message m2 (with
vector clock [0,1,1]) is by this definition a cause of the event of sending of message
m3 (with vector clock [1,3,1]) and also the event of receiving message m3 (with vec-
tor clock [2,3,1]); that is, these three vector clocks are comparable and we have that
[0,1,1]<[1,3,1]< [2,3,1]. On the other hand, we cannot compare the timestamp of mes-
sage m1 (which is [1,0,0]) and the timestamp of message m2 (which is [0,0,1]) and we
hence know that the events of sending m1 and m2 are concurrent.
In Figure 12.18 (as opposed to Figure 12.16) with vector clocks we can now easily
determine that the communication between Process a and b is entirely independent
of the communication between Process c and d: All events in Processes a and b are
incomparable to all events in Processes c and d.

12.3.4 Version Vectors

While vector clocks are a mechanism for stepping forward the time in a message-
passing system, version vectors are a mechanism to consolidate and synchronize
several replicas of a data record. Usually, one data record is replicated on a small set of
servers; that is, the number of replicas of a data record is considerably lower than the
large overall number of servers or the number of interacting clients. This will be useful
when trying to handle scalability problems of version vectors (see Section 12.3.5).
In order to determine which view of the state of the database contents each user
has, for every read request for a data record the answer contains the current version
vector for that data record. When subsequently the user writes that same data record,
the most recently read version vector is sent with the write request as the so-called
context. With this context, the database system can decide how to order or merge
the writes. For example, it might happen that a replica receives write requests out of
(chronological) order – or the replica might even miss some writes due to message
loss or other failures. In this case, version vectors can be used to decide in which or-
der the writes should be applied and which write is the most recent one. Moreover,
if the version vectors at two replicas of the same data record differ, they are said to
be in conflict. Conflicting versions can for example occur with multi-master replica-
tion, where clients can concurrently update the data record at any server that holds
a replica; this is the case of conflicting writes. A synchronization process reconciles
conflicting replicas. The resulting synchronized version is tagged with a version vector
greater than any of the conflicting ones so that the conflict is resolved.
A simple form of synchronization of two conflicting replicas is to take the union
of them. A typical application for this union semantics is an online shopping cart: if
the shopping cart of some user is replicated on two servers and the two versions differ
(maybe due to failures), the final order contains all items in both versions and the
user might have to manually remove duplicates before placing the order. If such an
automatic synchronization is not possible when the version vectors of the replicas are
concurrent, the conflict has to be resolved by a user. That is, the user has to read the
conflicting versions, decide how to resolve the conflict and then issue another write
to resolve the conflict. The written value gets assigned a version vector that subsumes
both conflicting ones.
The process of maintaining version vectors is slightly different from the one for
maintaining vector clocks. The aim is to reconcile divergent replicas into one common
version – that is, a version with an identical version vector at all replicas. In contrast,
the vector clocks of client processes in a message passing system usually differ. We
assume that we start with an initial version that is identical at all replicas. Further
modifications of replicas are possible by updates (a client issues a write to one replica)
and synchronization (two replicas try to agree on a common version). We now describe
version vector maintenance with union semantics for the synchronization process.
In this setting, each data record consists of a set of values and the synchronization of
the data record computes the set union; that is, the result of a merge is again a set of
values.
For updates, if the context is equal to or greater than the current version vector,
then the current version in the database is overwritten, because the client has previ-
ously read the current version or even a more recent version from a different replica.
Then the write context is taken to be the new version vector and the vector element of
the writing client is advanced. If the context is smaller than the current version vector,
it means that the version that was read by the client is outdated and has been over-
written by some other client. However, we do not want to lose this write even if it is
based on an outdated version. Hence, the database system merges the current version
and the written version and then advances the vector element of the writing client. If
the context and the current version vector in the database are in conflict, the client
has read from one replica (holding a version written by another client) but writes to
another replica (holding a version of a concurrent write of yet another client). If the
database does not want to lose any of those writes, it merges the written version and
the current version; then, it takes the maximum of the context and current version
vector elements, and lastly advances the vector element of the writing client.
If in a multi-master setting clients are allowed to update a data record at any
replica, the synchronization step ensures that all replicas have the same version of
the data record. The synchronization can then be implemented as an epidemic algo-
rithm as described in Section 10.4. Whenever one replica has a larger version vector
than the other replica, the larger version vector is taken as the most recent one and the
corresponding data record replaces the other one. In case the two version vectors are
in conflict, with the union semantics we replace both version vectors by their element-
wise maximum and merge the two data records by taking the union.
1. Initialization: For n client processes, a version vector is a vector (or an array) of
n elements. Each replica of a data record maintains one version vector. Initially,
for each process all elements are 0: for version vector VV i at replica i, VV i [j] = 0
for all j (where j ∈ {1, . . . , n})
2. Update: When a client process j sends a write request to overwrite a set of values
vali at replica i, it sends the new set of values valj and the context ctxj (that is, the
version vector of the last read). Based on the context, the replica checks whether
it has to overwrite its value set or take the union, then computes the maximum
over the context and its own version vector and lastly advances the element of the
writing client:
(a) if ctxj ≥ VV i , then set vali = valj ; else set vali = vali ∪ valj
(b) for k = 1, . . . , n: VV i [k] = max{VV i [k], ctxj [k]}
(c) VV i [j] = VV i [j] + 1.
3. Synchronization: Whenever two replicas i and j have different version vectors
(that is, VV i ≠ VV j ) for one data record, the synchronization process reconciles the
two versions by either overwriting one value set (if one vector clock supersedes the
other) or by taking the union of the two value sets (and merging the version vectors
by taking their element-wise maximum). After the synchronization process, the
replicas have identical values vali and valj as well as identical version vectors VV i
and VV j for the data record:
– if VV i > VV j : set valj = vali and VV j = VV i
– if VV j > VV i : set vali = valj and VV i = VV j
– else set vali = valj = vali ∪ valj and for k = 1, . . . , n: VV i [k] = VV j [k] = max{VV i [k], VV j [k]}

Fig. 12.19. Version vector synchronization with union merge
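The update and synchronization steps with union semantics can be sketched as follows (Python; a replica of one data record is assumed to hold a set of values and a version vector indexed by client ID from 0 to n−1):

def vv_leq(v1, v2):
    """Element-wise comparison: v1 ≤ v2."""
    return all(x <= y for x, y in zip(v1, v2))

class Replica:
    """Replica of one data record with union-merge version vector semantics."""

    def __init__(self, n_clients):
        self.values = set()
        self.vv = [0] * n_clients

    def update(self, client_id, new_values, context):
        # overwrite if the client has read the current (or a newer) version;
        # otherwise keep both write sets by taking the union
        if vv_leq(self.vv, context):
            self.values = set(new_values)
        else:
            self.values |= set(new_values)
        self.vv = [max(a, b) for a, b in zip(self.vv, context)]
        self.vv[client_id] += 1     # advance the element of the writing client

    def synchronize(self, other):
        # reconcile two replicas: keep the newer version or merge concurrent ones
        if vv_leq(other.vv, self.vv):
            other.values, other.vv = set(self.values), list(self.vv)
        elif vv_leq(self.vv, other.vv):
            self.values, self.vv = set(other.values), list(other.vv)
        else:
            merged_values = self.values | other.values
            merged_vv = [max(a, b) for a, b in zip(self.vv, other.vv)]
            self.values, other.values = set(merged_values), set(merged_values)
            self.vv, other.vv = list(merged_vv), list(merged_vv)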

More advanced forms of merging usually involve a semantic decision that requires
interaction of the user or a more intelligent application logic. If such a user inter-
action is needed, sibling versions for a data record have to be maintained until
the user writes a merged version: the database system stores all concurrent ver-
sions – the siblings – together with their attached version vectors. That is, for each
data record the replica i maintains a set D i of tuples of values and version vectors:
Di = {(vali1 , VV i1 ), (vali2 , VV i2 ), . . .}. As soon as a user writes a merged version that
is meant to replace the siblings, the version vector of the merged version is set to
be larger than all the version vectors of the siblings; the siblings (and their version
vectors) can then be deleted. Likewise, the context of a writing client j is a set C of
version vectors containing all the version vectors of those siblings that were returned
in the answer to the last read request for the data record: Cj = {ctxj1 , ctxj2 , . . .}. More
formally, version vector maintenance with sibling semantics works as follows:
1. Initialization: For n client processes, a version vector is a vector (or an array)
of n elements. Each replica of a data record maintains a set D of pairs of values
and version vectors. Initially, the set contains a single pair (vali1 , VV i1 ) where the
version vector element for each process is 0: for version vector VV i1 at replica i,
VV i1 [j] = 0 for all j (where j ∈ {1, . . . , n})
Fig. 12.20. Version vector synchronization with siblings

2. Update: When a client process j sends a write request to overwrite some or all values in the data set Di at replica i, it sends the new value valj and the context Cj (that is, the set of version vectors of the last read). The replica checks which siblings in Di are covered by Cj , then computes the maximum over the context vectors as the new version vector for valj , advances the element of the writing
client, and adds valj to its data set:
(a) if there is a pair (val, VV) ∈ D i and a ctxjl ∈ C j such that ctxjl ≥ VV, then
remove (val, VV) from D i
(b) for k = 1, . . . , n: VV new [k] = max{ctxjl [k] | ctxjl ∈ C j }
(c) VV new [j] = VV new [j] + 1
(d) add (valj , VV new ) to D
Note that only the siblings with version vectors less than the newly generated
VV new are overwritten; other siblings may not be deleted because
– the reconciling client did not want to overwrite them;
– the reconciling client did not read these siblings (stale read on an outdated
replica);
– they have been introduced by an intermediate write by another client.
3. Synchronization: Whenever two replicas i and j have different data sets D i and
D j for one data record, the synchronization process reconciles the two versions
by only keeping the values with the highest version vectors. After the synchro-
nization process, the replicas have identical data sets D i = D j = D′ for the data
record:
– D′ = {(val, VV) | (val, VV) ∈ D i ∪ D j and there is no (val′ , VV ′ ) such that
(val′ , VV ′ ) > (val, VV)}
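The synchronization step with sibling semantics amounts to keeping only the maximal elements of the combined sibling sets; a Python sketch of this dominance filter follows (duplicates are not removed, for brevity):

def dominated(vv, other_vv):
    """True if other_vv is strictly greater than vv (element-wise ≥ and not equal)."""
    return vv != other_vv and all(x <= y for x, y in zip(vv, other_vv))

def keep_maximal_siblings(d_i, d_j):
    """Synchronization with siblings: from the combined sets, keep only the pairs
    (value, version vector) whose version vector is not dominated by another one."""
    combined = d_i + d_j     # lists of (value, version vector) pairs
    return [(val, vv) for (val, vv) in combined
            if not any(dominated(vv, other_vv) for (_, other_vv) in combined)]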

Figure 12.20 shows a synchronization step that creates two siblings because the ver-
sion vectors of the two synchronized replicas are concurrent; this concurrency was
caused by clients a and b because their writes were concurrently based on the same
(initial) version. The siblings are later replaced by a new version written by client c.
More generally, replica versions can be maintained with a fork-and-join semantics: concurrent modifications lead to a fork in a graph of versions and after a merge these concurrent versions can be joined again (see for example [SS05]).

12.3.5 Optimizations of Vector Clocks

In distributed systems with a large number of processes and with a high communi-
cation frequency between the processes, traditional vector clocks do not scale well.
Indeed, for large systems, vector clocks raise problems due to the following reasons:
– The size of a vector clock (that is, the number of its elements) grows with the num-
ber of client processes taking part in the distributed communication: each vector
clock contains one element for each client process. Vector clocks hence have as
their size the overall number of client processes although some client processes
may not actively take part in the communication. It can hence quickly turn out to
be a problem in a distributed system to maintain vector clocks for a large num-
ber of client processes. What is more, modern distributed systems are dynamic:
new processes can join or leave the system at any time. Due to this, the overall
number of processes is not known in advance and the vector clocks must be able
to grow and shrink to support the dynamics of the system. Vector clocks should
hence be implemented as sets of tuples (processID, counter) of the process identi-
fier and the counter value of the process; tuples can be added to or removed from
this set whenever processes join or leave the system. Yet, this entails the problem
of having system-wide unique process identifiers which themselves must be long
enough to ensure this uniqueness. Moreover, the size of the tuple sets cannot be
bounded as there is no upper bound on the number of processes in the system.
Hence, even with the dynamic tuple-based implementation, vector clocks do not
scale well when the number of active processes increases.
– For long-running systems with frequent communication, the counter values in the
elements of the vector clock are incremented rapidly and hence furthermore in-
crease the size of the vector clock; when the size of counters is limited, this causes
an overflow of the vector elements. In other words, vector clocks do not scale well with the number of message exchanges.
– Message sizes increase when they piggyback large vector clocks. Hence the vector
clock alone may quickly exceed any reasonable message size and lead to unac-
ceptable communication overhead.

Distributed systems with vector clocks thus need some kind of vector clock bounding
in order to support a large number of client processes over a long period of time. Some
of these optimizations – while ensuring some size bounds for the vector clocks – either
incorrectly introduce causalities (two events are considered causally related although
they are not) or they incorrectly introduce concurrencies (two events are considered
concurrent although they are not). In contrast, for version vectors the so-called dotted
version vectors [PBA+ 10] promise full correctness of the causality relation while only
needing vector clocks of the size of the replication factor. We survey some of these
options below. The following options for bounding have been analyzed:
Approximate vector clocks: Instead of using one element for each client pro-
cess, several client processes can share one vector element. That is, several client
process IDs i1 , i2 , i3 are mapped to the same index i and the vector clock only
has as many elements as there are different groups of client processes (where all
client IDs of a group are mapped to the same index). This approach leads to the
case that although the vector clocks for two events e1 and e2 can be ordered (that
is, VC(e1 ) < VC(e2 )), this ordering can no longer differentiate whether indeed e1
happened before e2 or whether they are concurrent. In other words, these vector
clocks only approximate the happened-before relation (see [BR02]). While they
satisfy the weak clock property, they do not satisfy the strong clock property but
only the property that if VC(e1 ) < VC(e2 ) then e1 → e2 or e1 ||e2 . Note that when
the vector clock only consists of one element (all process IDs map to the same
index), such approximate vector clocks coincide with the basic scalar Lamport
clock.
Client IDs versus replica IDs: For version vectors, one way to reduce the size of
the vectors is to use replica IDs instead of client IDs. This is based on the obser-
vation that synchronization takes place only between replicas and hence version
vectors only need one element per replica for each data record; that is, version vec-
tors have the size of the replication factor. However, with a simple counter for each
of the replica IDs we run into the following problem of lost updates. Two clients
might concurrently write to the same replica based on an identical context; or –
more generally – a client might issue a write with a stale context: another client
might already have overwritten the read version. With a version vector based on
client IDs, the database could simply handle the stale write as concurrent (for
example, by creating siblings). However, with version vectors simply based on
replica IDs, this concurrency cannot be expressed. The replica has basically only
two options to handle this: it could either reject the stale write (in this case this
write will be lost) or it could overwrite the existing newer version by stepping for-
ward the replica counter appropriately (in this case the version in the database
will be lost). Hence in both cases one of the concurrent versions will be lost – an
undesirable behavior for a database system. Note that this setting also leads to a
form of semantic ambiguity of version vectors: if the second write was instead is-
sued to a different replica, it would be handled as concurrent to the first one. This
difference is illustrated in Figures 12.21 and 12.22: in Figure 12.21 both clients write
to the same replica and Client b uses a stale context in its write request which
causes the replica to lose one of the updates (of either Client a or Client b); in
contrast, in Figure 12.22 Client b writes to a different replica than Client a and
hence both writes are correctly handled as concurrent because both replicas independently step forward their version vectors and can later on be synchronized as usual. In conclusion, while using replica IDs instead of client IDs leads to smaller version vectors, the lost update problem seriously restricts the reliability of the system.

Fig. 12.21. Version vector with replica IDs and stale context
Dotted Version Vectors: Fortunately, dotted version vectors [PBA+ 10] come to the
rescue: dotted version vectors use replica IDs (instead of client IDs) in conjunction
with more sophisticated version counters. More precisely, note that simple coun-
ters actually represent an interval of versions: from the initial value 0 up to the
current value of the counter. A dotted version vector not only uses such simple
counters (that represent an interval of versions) for each replica, but in addition
it can use a single point of time (the so-called dot) that is independent from the
interval. With this mechanism, two concurrent writes on the same replica can be
handled as concurrent because the dot will be different for both writes. Hence,
dotted version vectors enable the use of replica IDs for version vectors without
the problem of lost updates – even if clients have the same context or if one client
has a stale context.
Vector clock pruning: Pruning means that some vector elements are discarded;
this can be implemented by assigning a timestamp to each element and setting
the timestamp to the local system time whenever a process advances the element.
Whenever the last update of an element becomes older than a specified time-to-
live value, this element is simply removed from the vector. Vector clock pruning
can raise situations where two comparable vector clocks become incomparable
due to discarded elements. In such cases, manual conflict resolution by a user
might become necessary. However, such cases should happen only rarely; for example, when a process that has been idle for a long time rejoins the communication.

Fig. 12.22. Version vector with replica IDs and concurrent write
Vector clock resetting: The vector clock elements can be reset to 0 when the
counter values exceed some limit; in this way, the size of each vector clock ele-
ment can be bounded. Resettable vector clocks rely on the notion of communi-
cation phases: a new phase is started by sending a reset message that resets all
vector clocks in the system. Resettable vector clocks work well when causality of
events only has to be checked for events inside the same phase; comparison of
events from different phases may lead to incorrect results due to the reuse of vec-
tor clocks in the different phases. Design criteria for good resettable vector clocks
are low message overhead (for sending reset messages), non-blocking reset mes-
sages and fault-tolerance [AKD06, YH97].
Incremental vector clocks: It is often the case that not all vector clock elements
change in between two message sending events between two clients. Hence the
size of the piggybacked timestamp can be decreased when only the changed vec-
tor clock elements are piggybacked (instead of the entire current vector clock). In
other words, only the increment since the last communication must be sent. There
is an extra overhead involved in this incremental vector clock maintenance: In or-
der to determine which elements have changed, each client process has to keep
track of the last timestamp for each client process it has ever communicated with.
[SK92] propose such a system that relies on first-in-first-out channels between
the client processes; that is, message ordering must be ensured, so that no mes-
sage overtaking can occur as otherwise the causal order cannot be reestablished.
[WA09] rely on re-ordering of version vectors to quickly determine the parts of the
vectors that have changed.
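
As a rough illustration of this incremental maintenance (the class and method names below are invented, and FIFO channels are simply assumed), a process can remember the clock it last sent to each partner and piggyback only the entries that changed since then:

class IncrementalClock:
    """Vector clock that piggybacks only the entries changed since the last
    message to a given partner (a sketch of incremental maintenance)."""

    def __init__(self, process_id):
        self.process_id = process_id
        self.clock = {process_id: 0}   # full local vector clock
        self.last_sent = {}            # partner id -> clock as of the last send

    def tick(self):
        # local event: advance the own component
        self.clock[self.process_id] = self.clock.get(self.process_id, 0) + 1

    def delta_for(self, partner):
        # send event: transmit only the entries that changed since the last
        # message to this partner (correct only over FIFO channels)
        self.tick()
        previous = self.last_sent.get(partner, {})
        delta = {p: c for p, c in self.clock.items() if previous.get(p, -1) != c}
        self.last_sent[partner] = dict(self.clock)
        return delta

    def receive(self, delta):
        # receive event: merge the piggybacked entries element-wise, then tick
        for p, c in delta.items():
            self.clock[p] = max(self.clock.get(p, 0), c)
        self.tick()

a, b = IncrementalClock("a"), IncrementalClock("b")
b.receive(a.delta_for("b"))   # the first delta contains the full clock of "a"
a.tick()
print(a.delta_for("b"))       # the second delta contains only a's changed entry
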

12.4 Bibliographic Notes

Replication is a long and widely studied area in distributed systems [BG82, KA10].
Gray et al [GHOS96] discuss several replication models for transactions as well as
simpler non-transactional replication like commutative updates. Wiesmann et al
[WPS+ 00] analyze 1) replication model, 2) server interaction and 3) voting as the
three parameters of replication. A comprehensive survey of optimistic replication ap-
proaches is given by Saito and Shapiro in [SS05]. Several approaches – for example
[LKPMJP05, TDW+ 12, LLC14] – propose a middleware between the database client and
the database backend to support data replication. Hinted handoff has been applied
for example in Amazon’s Dynamo system [DHJ+ 07].
Quorum systems also have a long history (see for example Thomas [Tho79] or Gif-
ford [Gif79]). A survey of quorum systems is given in [JPPMAK03].
The basic idea of the Paxos algorithm was developed by Lamport [Lam98] and
several variations of it have been proposed including fast Paxos [Lam06], Byzantine
Paxos [Lam11], and high-throughput Paxos [KA14]. It has since been used in several
data management systems [RST11, CGR07].
Multiversion concurrency control (MVCC) has a long history, too. Serializability
theory for MVCC was discussed by Bernstein and Goodman in [BG83]. Several varia-
tions of MVCC are applied in modern database systems to improve concurrency per-
formance; see for example [SCB+ 14].
An influential starting point for the investigation of logical clocks and the defi-
nition of the happened-before relation in distributed systems was the seminal article
about scalar clocks (that is, Lamport clocks) [Lam78]. Several adaptations ensued lead-
ing to vector clocks [SK92, TRA96, YH97, AL97, PST+ 97, BR02, AKD06] and version
vectors [PPR+ 83, PBA+ 10].
13 Consistency
In a distributed database system with replicated data, consistency has a wider conno-
tation than consistency in single-server systems (as in the ACID properties; see Chap-
ter 2). For example, a user may write a new value for a data item on one replica (resid-
ing at one server); but the same user may later on read an older value from another
server where the replica has not yet been updated due to delays or failures in the sys-
tem. Moreover, different users might try to update data items concurrently on differ-
ent replicas leading to a conflict. Different notions of consistency have been devised
to specify desired properties in replicated database systems. A database system with-
out replication (a “one-copy” database) is however the gold standard for distributed
systems; that is, ideally all updates should be immediately visible at all replicas and
should be applied in the same order.

13.1 Strong Consistency

Strong consistency denotes the ideal world for users of a replicated database system.
It demands that users never read stale data and writes are applied in the same order at
all replicas basically instantaneously after the user issued the write request. However,
such an ideal system behavior can never be achieved in a distributed shared-nothing
database system due to network delays, failures and concurrent operations at different
replicas.
To see how sequences of user requests at different replicas can interfere with each
other, consider a replicated data item x. Reads are only local but writes are propagated
to all other replicas. For example in Figure 13.1, Replica 2 reads x, processes a write
request from Replica 1 and then writes its own result (based on a now stale value of
x); this result is then propagated to all other replicas overwriting the previous value.
That is, we observe the case of a stale read in Replica 2 and the case of a lost update of
the write operation of Replica 1 because its write operation is not taken into account by
subsequent read operations – although logically some process might have to operate
on the value written by Replica 1.
While in reality instantaneous writes at all replicas and always up-to-date reads
are impossible, some definitions of strong consistency aim at an appropriate global
ordering of writes at all replicas at the cost of high synchronization requirements be-
tween the replicas. One early definition of strong consistency originating from dis-
tributed programming is called sequential consistency. Sequential consistency was
originally applied in multiprocessor systems. To improve performance in a distributed
program running on multiple processors, reordering of individual operations (with-
out however changing the final output) might be possible. Lamport [Lam79] observed
the problem that this reordering may lead to erroneous behavior when using a multi-
processor system and demanded to maintain the ordering of each of the processors'
programs.

Fig. 13.1. Interfering operations at three replicas (Replica 1: read1(x), write1(x), write2(x); Replica 2: read2(x), write1(x), write2(x); Replica 3: write1(x), write2(x))

When transferring the notion of sequential consistency to a distributed database
system, it demands to maintain a global order of the write requests that are propagated
to all replicas. With sequential consistency, transactions are not considered. Instead,
at each replica, sequences of independent read and write operations are executed. The
write operations are propagated to all other replicas and hence a global interleaving
of all writes issued by the different replicas has to be found. Sequential consistency
ensures that
– there is a global ordering of all writes; that is, all replicas apply all writes in the
same order;
– the local operation order at each replica is preserved; in other words, if two op-
erations occur in sequence locally at one replica, they cannot be swapped in the
global ordering.

The sequence of operations shown in Figure 13.1 complies with the definition of se-
quential consistency, because all write operations are executed in the same order at
all replicas while also respecting the correct order of read and write operations locally
at each replica. If – in contrast – the order of the two write operations were swapped
at one of the replicas, then sequential consistency would be violated. However, we see
that sequential consistency cannot avoid stale reads because it is defined on individ-
ual operations: sequential consistency has no notion of transactions and with it no
means to define indivisible sequences of read and write operations.
A common definition for strong consistency respecting transactions is one-copy
serializability. A one-copy database is one that does not replicate data and hence

Fig. 13.2. Serial execution at three replicas (Replica 1: read1(x), write1(x), write2(x); Replica 2: write1(x), read2(x), write2(x); Replica 3: write1(x), write2(x))

a write of a data item will only be directed to the server maintaining the single copy
of the data item. Serializability has been defined in Section 2.5.2 for non-replicated
databases: every interleaving of transactions must be such that it corresponds (that is,
is equivalent) to a serial execution of the transactions one after the other. One-copy se-
rializability extends this definition to multiple replicas: all replicas see interleavings of
different transactions (that is, request sequences of different users) in the same order.
This order must be serializable in the usual sense: the values read and written must
be the same as if the transactions were executed serially one after the other. However,
even in single-copy databases, serializability is hard to verify and instead locking or
timestamping are used. In distributed database systems, the situation is even worse:
the coordination overhead required between the replicas does not scale well.
To illustrate one-copy serializability, we can consider two simple transactions:
one executed by Replica 1 as T1: ⟨read1(x), write1(x)⟩ and one executed by Replica
2 as T2: ⟨read2(x), write2(x)⟩. To fulfill the requirement of one-copy serializability,
we see that no interleaving is possible: if we interleaved these transactions for
example as ⟨read1(x), read2(x), write1(x), write2(x)⟩ (with a distributed execution as
shown in Figure 13.1), the read values will be different from the values read in a
serial execution of these transactions where a write has to occur in between the reads
– thus violating the serializability requirement. That is, to achieve one-copy serializ-
ability we indeed have to execute the transactions serially: either in the order ⟨T1, T2⟩
(as shown in Figure 13.2), or in the order ⟨T2, T1⟩. Hence, the stale read at Replica 2 is
avoided, because the write in T1 is executed before the read in T2.

Fig. 13.3. Read-one write-all quorum (left) and majority quorum (right)

13.1.1 Write and Read Quorums

A flexible mechanism to avoid stale reads and lost updates among a group of servers
in a replicated database system is to use quorums when reading and writing data:
– a read quorum is defined as a subset of replicas that have to be contacted when
reading data; for a successful read, all replicas in a read quorum have to return
the same answer value.
– a write quorum is defined as a subset of replicas that have to be contacted when
writing data; for a successful write, all replicas in the write quorum have to ac-
knowledge the write request.

More generally, let R denote the size of a read quorum, W denote the size of a write
quorum, and N denote the replication factor. A usual requirement for a quorum-based
system is that any read and write quorums overlap: in other words, the sum of read
quorum size and write quorum size is larger than the replication factor – that is, as
a formula R + W > N. In this way, it can be ensured that at least one replica (indeed,
all replicas in the overlap) has acknowledged all previous writes and hence is able to
return the most recent value. Two typical variants of quorum-based systems (ROWA
and majority) are defined as follows. Read-one write-all (ROWA) requires that writes
are acknowledged by all replicas, but for reads it suffices to contact one replica. Hence
in a ROWA system, reads are fast but writes are slow. A majority quorum for both
reads and writes requires that (for N replicas) at least ⌊N/2⌋ + 1 replicas acknowledge
the writes and at least ⌊N/2⌋ + 1 replicas are asked when reading a value. Both variants
are shown in Figure 13.3.
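
As a small illustration (a minimal Python sketch, not tied to any particular database system), the overlap condition R + W > N can be checked for the two variants:

def quorums_overlap(r, w, n):
    # read and write quorums intersect whenever R + W > N
    return r + w > n

n = 5                                      # replication factor
rowa = (1, n)                              # read-one write-all: R = 1, W = N
majority = (n // 2 + 1, n // 2 + 1)        # R = W = floor(N/2) + 1 = 3 for N = 5
for name, (r, w) in [("ROWA", rowa), ("majority", majority)]:
    print(name, "overlap:", quorums_overlap(r, w, n))
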
When all replicas in a read quorum return an identical value, the requesting client
can be sure that this is the most recent value and hence the read value is strongly con-
sistent. It might of course happen that a read quorum contains replicas with stale data.
For example, the majority read quorum in Figure 13.3 contains servers S1 and S2 which
might (not yet) have received the most recent write. In this case, the returned values
will be ambiguous. To then determine the most recent value, it is common practice to
combine quorums with version vectors (see Section 12.3.4). Hence strong consistency
for reads can be achieved with quorums: provided that the read quorum and the write
quorum of each data item overlap, the value with the highest version vector can be
chosen as the most recently written value.
Even with intersecting read and write quorums, cases of concurrent writes can oc-
cur: the most recent value is not unique because there are concurrent version vectors.
That is why a similar intersection requirement can be put on the write quorum sizes to
achieve strong consistency during write operations. The requirement is that any two
write quorums must overlap: in other words, twice the write quorum size is larger than
the replication factor: 2W > N. Now, in order to avoid concurrent writes, write quorum
intersection can again be combined with version vectors to achieve strong consistency
for the writes: The servers in a write quorum can enforce total order of the writes by
rejecting any write requests that are incomparable to the current version vector stored
on each server. This forces the requesting client to read the current version and retry
the write with the up-to-date version vector.
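
The version vector comparison used here can be written down in a few lines of Python (assuming, for illustration, that version vectors are dictionaries mapping replica IDs to counters):

def dominates(v1, v2):
    # True if version vector v1 is at least as recent as v2 in every entry
    replicas = set(v1) | set(v2)
    return all(v1.get(r, 0) >= v2.get(r, 0) for r in replicas)

def compare(v1, v2):
    # classify two version vectors: 'newer', 'older', 'equal' or 'concurrent'
    if dominates(v1, v2) and dominates(v2, v1):
        return "equal"
    if dominates(v1, v2):
        return "newer"
    if dominates(v2, v1):
        return "older"
    return "concurrent"

def accept_write(stored_vector, write_context):
    # a replica rejects write requests whose context is incomparable
    # (concurrent) to the stored version vector, forcing a read-and-retry
    return compare(write_context, stored_vector) != "concurrent"

print(compare({"rep1": 2, "rep2": 0}, {"rep1": 1, "rep2": 0}))       # newer
print(compare({"rep1": 2, "rep2": 0}, {"rep1": 1, "rep2": 1}))       # concurrent
print(accept_write({"rep1": 1, "rep2": 1}, {"rep1": 2, "rep2": 0}))  # False: conflict
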
Quorums have the following positive effects in a distributed database system:
– Quorums enforce the same order of write operations on all replicas when com-
bined with version vectors. Note however, that this order is not necessarily seri-
alizable; to achieve one-copy serializability extra synchronization effort between
the replicas is required.
– Availability of data is ensured as long as the desired quorum is reachable by the
client.
– Latency is reduced when the quorum size is smaller than the replication factor,
because not all replicas need to be contacted.
– Partition-tolerance is achieved as long as the desired quorum is part of an entire
partition and the partition is reachable for the client.

Flexibility comes from the fact that different quorums can be chosen for each data
item. These quorums can also be of different size for each data item.
Quorums can also be used when strong consistency is not required: Indeed, quo-
rum sizes can be chosen on every read and write request. If the restriction is dropped
that quorums overlap, they are called partial quorums. More precisely, if quorums are
not required to overlap, strong consistency cannot be ensured any longer – and hence
stale reads or conflicting updates can occur. Some database systems can be configured
to aim at majority quorums as the optimal case – but if only partial quorums can be
established, they are accepted by the database system as a possibility to react to fail-
ures. In this case, the database system continues nevertheless after a timeout even if
no majority of acknowledging servers can be reached. In this way, weak consistency
(see Section 13.2) can be achieved on demand to improve latency of the read or write
operation.

13.1.2 Snapshot Isolation

One property of multiversion concurrency control (MVCC; see Section 12.2.3) is called
snapshot isolation [BBG+ 95, EPZ05]. A snapshot x i of a data item x is a version of the data
item which was written by transaction T i . A snapshot for one transaction is a set of
snapshots of those data items that the transaction accesses.
Following the customary notation [ASS13, SPAL11], we write transactions as se-
quences of operations. An interleaving of several transactions is called a history. A
history consists of several read (r i ) and write (w i ) operations which happen in a trans-
action T i . Each transaction ends with either a commit (c i ) or abort (a i ) operation.
Moreover, for each transaction T i there is one snapshot operation s i which happens
before any other operations in the transaction. The write set of a transaction T i is the
set of data items written by the transaction; it is denoted as writeset(T i ). The read set
(denoted readset(T i )) is the set of data items read by the transaction. For simplicity of
notation we assume that every transaction first reads a data item before writing it and
each transaction writes a data item at most once. We write o < o′ when operation o
occurs before operation o′ in a history.
Snapshot isolation ensures the following two properties for any transactions
T i , T j , T k (for i ≠ j ≠ k) that are interleaved in a history h:
1. Read Rule: Whenever a transaction T i reads a version x j written by another trans-
action T j (that is, r i (x j ) ∈ h), and furthermore another transaction T k writes data
item x (that is, w k (x k ) ∈ h) and later on commits (that is, c k ∈ h), then the follow-
ing holds
– Transaction T j commits: c j ∈ h
– Transaction T j commits before transaction T i takes its snapshot: c j < s i
– Transaction T k commits before transaction T j commits, or transaction T i
takes its snapshot before transaction T k commits: c k < c j or s i < c k
2. Write Rule: For two transactions T i and T j that both commit (that is, c i ∈ h and
c j ∈ h) and whose write sets intersect (that is, writeset(T i ) ∩ writeset(T j ) ≠ ∅),
one must have committed before the other takes its snapshot: c i < s j or c j < s i .

The read rule enforces an ordering of the history such that a transaction only sees
values in its snapshot that have been written by a transaction that actually committed
before; and, if one transaction T i sees the value x j , then any transaction T k that also
writes x either must have committed before T j committed (so that T j overwrites the
value written by T k ) or will commit later so that T i will not observe any values written
by T k at all.
The write rule ensures that whenever two transactions modify the same data item
then one must have committed before the other one takes its snapshot. In this way,
the write rule also enforces a “first committer wins” strategy: if two transactions con-
currently try to commit although one has not seen the effects of the other, then only
the first committing transaction succeeds while the second one is aborted.
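
The "first committer wins" behavior of the write rule can be illustrated with a small Python sketch (the class and the logical timestamps are invented for this example and do not reflect a real MVCC implementation):

class SnapshotIsolationChecker:
    """Toy commit check implementing 'first committer wins' (a sketch only)."""

    def __init__(self):
        self.committed = []   # list of (snapshot_time, commit_time, writeset)
        self.time = 0

    def now(self):
        self.time += 1
        return self.time

    def try_commit(self, snapshot_time, writeset):
        # abort if a transaction with an overlapping write set committed
        # after our snapshot was taken (the write rule would be violated)
        for _, commit_time, other_ws in self.committed:
            if commit_time > snapshot_time and writeset & other_ws:
                return False                       # the second committer loses
        self.committed.append((snapshot_time, self.now(), set(writeset)))
        return True

checker = SnapshotIsolationChecker()
s1 = checker.now()                     # T1 takes its snapshot
s2 = checker.now()                     # T2 takes its snapshot (concurrent with T1)
print(checker.try_commit(s1, {"x"}))   # True: T1 commits first
print(checker.try_commit(s2, {"x"}))   # False: T2 is aborted (write-write conflict)
print(checker.try_commit(s2, {"y"}))   # True: disjoint write set, may commit
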

Snapshot isolation does not provide serializability. In particular, an anomaly called
write skew may occur under snapshot isolation but not under serializability: two
concurrent transactions each read two data items, each update a different one of the
two items (so that their write sets do not intersect and both may commit), and the
combined result violates a constraint that spans both items. A further
property [EPZ05] can be checked at runtime to ensure serializability:
Dynamic Serializability Rule For any two transactions T i and T j that commit con-
currently, T i may not read data items that T j writes; that is, the read set of one
transaction may not intersect with the write set of the other: for s i ∈ h, c i ∈ h and
c j ∈ h, if s i < c j < c i , then readset(T i ) ∩ writeset(T j ) = ∅.

One-copy serializability would require that even in a distributed system, a transac-
tion always takes a snapshot based on a global real time. As this is impossible to
achieve, snapshot isolation has been generalized [EPZ05] to be allowed to take a snap-
shot based on any older transaction that committed previously – not necessarily being
the latest transaction in the distributed database system. In this way, snapshot isola-
tion supports lazy replication: a transaction can take a snapshot of the current state of
any replica locally – although another transaction might have committed on a remote
replica at a later point of time.
It has been shown [ASS13] that snapshot isolation can be decomposed into four
properties. In other words, instead of ensuring the read rule and the write rule above,
we can as well ensure the four properties of avoiding cascading aborts (ACA),
strictly consistent snapshots (SCONS), snapshot monotonicity (MON), and write-
conflict freedom (WCF):
1. ACA: A history h avoids cascading aborts, if for every read r i (x j ) in h, commit c j
occurs before it: c j < r i (x j ).
2. SCONS: A strictly consistent snapshot reads all data items that the transaction
accesses at the same point in time; this can be expressed by the two properties
that have to hold for any transactions T i , T j , T k , and T l with k ≠ j:
(a) SCONSa: when transaction T i observes writes of both T j and T l then T l may
not commit after T i read the modified value; more formally, when r i (x j ) ∈ h
and r i (y l ) ∈ h then r i (x j ) is not allowed to precede the commit c l , that is,
r i (x j ) ≮ c l ; however, they may be concurrent.
(b) SCONSb: when transaction T i observes writes of both T j and T l , and when T k
writes the same data item that T j writes and T i reads, and if it then holds that
T k commits before T l , then T k also commits before T j : when r i (x j ) ∈ h and
r i (y l ) ∈ h and w k (x k ) ∈ h, if c k < c l then also c k < c j .
3. MON: Snapshots in a history h are monotonic if they can be partially ordered so
that if a transaction T i takes its snapshot before another transaction T l com-
mits, the transaction T i will never read a value written by T l or any other transac-
tion T j that reads a value written by T l .
4. WCF: A history h is write-conflict free if two independent transactions never write
to the same object; two transactions T i2 and T in−1 are independent if there is no
information flow by a cascade of read operations between the transactions; in
other words, no chain of reads r i2 (x i1 ), r i3 (y i2 ), . . . , r in (z in−1 ) occurs in h.

13.2 Weak Consistency

Ensuring strong consistency is usually costly (in terms of latency) or may even lead
to indefinite blocking or aborting of operations (hence reducing availability for these
operations). Achieving strong consistency might also be overly restrictive (or overly
pessimistic) in the sense that operations are suspended which could actually be exe-
cuted immediately without causing any problems. For example, a replica in a quorum
might respond with a delay larger than the others and the slower replica then keeps
the other replicas waiting. Moreover, strong consistency requires a high amount of
synchronization between replicas. For a write-heavy system this will turn out to be a
bottleneck. Reducing the synchronization requirements leads to weaker forms of con-
sistency. Weak consistency can improve performance of the overall system (in terms
of latency or availability); however, weak consistency can cause conflicts and incon-
sistencies of the data and may even lead to data loss.
Optimistic replication [SS05] takes the perspective that inconsistencies and con-
flicts may occur but they occur only rarely and they can be resolved after they have
been detected. However with optimistic replication only weaker forms of consistency
can be ensured. Weak consistency leads to advantages for other properties of repli-
cated systems like:
Availability: Requests are not blocked but every request can be completed; for ex-
ample, [COK86] measure availability as the fraction of requests (or transactions)
in the entire system that complete.
Reduced latency: Requests return faster without waiting for acknowledgements
of (all) other replicas.
Failure tolerance: Under several failure scenarios, the system (or parts of it) still
remains functional; as a special case, partition tolerance means that even when
the distributed system is split into several subsystems (the partitions) with no
means of communication between the partitions, at least a part of the system is
still able to provide the requested functionality – for example, at least one of the
partitions can still accept write requests.
Scalability: A larger number of replicas (hence a larger replication factor for the
individual data items) can be supported because less synchronization between
replicas is needed.

There are different ways to weaken the notion of strong consistency. These weaker
definitions of consistency are easier to implement than strong consistency but pro-
vide less consistency guarantees with respect to supported operations, the amount of
records that can be accessed or the operation ordering that is enforced:
Operations: Enforcing consistency for sequences of individual operations (op-
erations are seen as individual commands not belonging to a larger transaction)
at the replicas is a weaker requirement than enforcing consistency for read-only
transactions which is again a weaker requirement than enforcing consistency for
read-write transactions.
Records: Consistency for operations or transactions may be restricted to span
only one individual record which is weaker than enforcing consistency when ac-
cessing multiple records in an operation or transaction.
Ordering: Different operation orderings may be enforced by the consistency def-
initions – like real-time ordering (which requires a fully synchronized global
clock); causality ordering (based on some causality relation between operations
where some operations might be concurrent); transaction ordering (taking into
account the ordering inside transactions as well as between transactions); oper-
ation sequence ordering (as prescribed by the operations issued at the individual
replicas); or arbitrary ordering.

For strong consistency, eager replication is necessary: write operations have to be
propagated to the replicas (or at least a quorum of replicas); these replicas have to
acknowledge that the write succeeded before any other operation can be executed. In
contrast, weak consistency can also rely on lazy replication: only one replica han-
dling the write request has to successfully execute it; then the write operation is prop-
agated to the other replicas, but it is optimistically assumed that the propagated write
will succeed and hence there is no need to wait for acknowledgements. That is also
why with lazy replication the propagation need not occur instantaneously but for ex-
ample could be executed in a batch. Lazy replication improves latency of writes, but
requires conflict handling should a conflicting write be detected at a later point of
time. Moreover, with lazy replication, stale reads might happen more often, because
some replicas may not have received the most recent update before answering read
requests. The duration from the acceptance of a write request by the first replica until
the last successful write execution by all replicas is called the inconsistency window.
Ideally, the inconsistency window is only very short in order to reduce the amount of
stale reads.
Several forms of weak consistency have been proposed in the literature. They
are customarily divided into data-centric and client-centric consistency models. Data-
centric consistency models focus on the propagation and ordering of read and write
operations between the replicas in a distributed database system. Client-centric con-
sistency models in contrast analyze the effects of consistency maintenance that are
visible to a user of the distributed database system.

13.2.1 Data-Centric Consistency Models

Data-centric consistency definitions look at the internals of communication between
the replicas. They aim to achieve consistency by restricting the order of read and write
operations on the replicas.

Eventual consistency: As defined in [TDP+ 94], replicas may hold divergent versions
of a data item due to concurrent updates and propagation delays. Eventual consis-
tency demands that these versions converge in case no new updates arrive. In other
words, replicas agree on a version after some time of inconsistency and as long as no
new updates are issued by users. Hence, eventual consistency requires the two prop-
erties (1) total propagation of writes to all replicas and (2) convergence of all replicas
towards a unique common value.
Causal consistency: Causal consistency relies on the happened-before relation that
is also used for the ordering of events by logical clocks (see Section 12.3.1). The three
properties of the causality relation can simply be restated in terms of read and write
operations on replicas as:
1. ordering of reads and writes on a single replica must be maintained
2. reads-from relation: any read operation accessing a value of a write operation
propagated by another replica must be scheduled after the write (where the write
operation corresponds to the receive event)
3. transitivity must be maintained

The ordering requirement of causal consistency is that concurrent operations can be
executed in any order while for causally related operations the same ordering is required
at all replicas.
As shown in Section 12.3.1, Lamport’s happened-before relation [Lam78] covers all
possible causes of an operation. Each operation then has to wait for all those previous
operations to complete that the current operation causally depends on. In complex
systems, it is impractical to track all the causal dependencies of all events: in other
words, it is difficult to keep track of complex causality graphs. As a semantic refine-
ment of Lamport’s happened-before relation, the notion of effective causality restricts
the set of possible causes to a set of actual causal dependencies between events. With
this refinement it is possible to only track the much smaller set of effective causes
which can be specified by the user for each event.
With causal consistency, eventual consistency cannot be guaranteed, because
replicas can execute concurrent writes in an arbitrary order. That is, the replicas may
not converge towards a unique value because no dependencies might exist between
writes to two different replicas of a data item. Ensuring convergence of replicas has
therefore been defined and analyzed in [LFKA11, MSL+ 11].
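
A standard way to implement the required ordering is to buffer a propagated write until all writes it causally depends on have been applied locally; the following Python sketch (my own simplified formulation with vector clocks indexed by replica IDs) shows the usual deliverability test:

def deliverable(msg_clock, sender, local_clock):
    """A write propagated by `sender` with vector clock `msg_clock` may be
    applied locally once (1) it is the next write expected from the sender and
    (2) all writes it causally depends on have already been applied."""
    for replica, counter in msg_clock.items():
        if replica == sender:
            if counter != local_clock.get(replica, 0) + 1:
                return False          # not the next write from the sender
        elif counter > local_clock.get(replica, 0):
            return False              # a causal dependency is still missing
    return True

local = {"rep1": 2, "rep2": 1}
print(deliverable({"rep1": 3, "rep2": 1}, "rep1", local))  # True: next from rep1
print(deliverable({"rep1": 3, "rep2": 2}, "rep1", local))  # False: depends on an
                                                           # unseen write of rep2
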
Parallel snapshot isolation (PSI): PSI [SPAL11] relaxes the properties of snapshot
isolation. While conventional snapshot isolation requires all commits to occur in the
same order on all replicas, with PSI replicas may use different orderings of the commit
operations in a history.
Non-monotonic snapshot isolation (NMSI): NMSI [ASS13] disregards condition
MON and replaces condition SCONS by a relaxed consistency condition CONS: A
transaction T i in a history h observes a consistent snapshot if for r i (x j ) ∈ h and
w k (x k ) ∈ h, when T i depends on T k (that is, there is a cascade of reads between a
value written by T k and T i ), then c k < c j .

13.2.2 Client-Centric Consistency Models

Client-centric consistency takes the perspective of the user interacting with the
database system. For the user the internal ordering of read and write operations
in a replicated database system is irrelevant as long as the database system presents
a consistent view to the user. The user can for example read a value from one replica,
update the value on another replica and then read the value again from a third replica.
In this case, client-centric consistency should ensure that the interaction of the user
makes sense and the database system avoids any inconsistent behavior (like returning
a stale value after an update). This can be achieved by restricting the read access to
those replica servers that have already processed the update; in other words, not all
replicas will be available for the read access. In particular, with client-centric consis-
tency it is allowed that different users indeed see different orderings of read and write
operations because different users may require different guarantees.
One approach to client-centric consistency are session guarantees [TDP+ 94]:
they automatically guarantee certain properties inside a session – where a session
is a time-restricted interaction sequence of a single user. Some session guarantees as
the following ones can be combined to yield a stronger notion of consistency:
Read your writes (RYW) Once the user has written a value, subsequent reads will
return this value (or newer versions if other writes occurred in between); the user
will never see versions older than his last write. Note however that RYW does not
ensure isolation of different users: if other users write the same data item, values
are simply overwritten.
Monotonic reads (MR) Once a user has read a version of a data item on one
replica server, it will never see an older version on any other replica server; in
other words, any subsequent read will return the same or a more recent version
even when reading from a different replica server. This is ensured by requiring
that all write accesses that are relevant for the first read R1 on server S1 will also
be processed by any server S2 before S2 can serve a subsequent read R2.
Writes follow reads (WFR) If a user reads a data item from one replica server, but
subsequently writes a new value for the data item on a different replica server, the
second server must have processed all those writes that are relevant for the read
first, before processing the write.
Monotonic writes (MW) Once a user has written a new value for a data item in
a session, any previous write has to be processed before the current one. In other
words, MW strictly maintains the order of writes inside the session.
Consistent prefix (CP) Each replica has processed a subset of all writes according
to a common global order of all writes. That is, some writes may not yet be visible
at some replicas so that stale reads are possible; but those writes that have already
been processed comply with the global order.
Bounded staleness (BS) BS puts a limit to the staleness of each read value (see
for example [BVF+ 12, BZS14]). This can be done in terms of real time or version
(corresponding to logical time for counting the versions):
– a time-based definition of BS is t-visibility: the inconsistency window com-
prises at most t time units; that is, any value that is returned upon a read
request was up to date at most t time units ago.
– a version-based definition of BS is k-staleness: the inconsistency window comprises
at most k versions; that is, the returned value lags at most k versions behind the
most recent version.
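
On the client side, read your writes and monotonic reads can be approximated by remembering, per data item, the highest version the session has seen so far and accepting only replica answers that are at least that recent. The following Python sketch (with an invented, highly simplified replica interface and totally ordered version numbers) illustrates the idea:

class Replica:
    """Toy replica: stores (version, value) pairs; updates may lag behind."""
    def __init__(self):
        self.store = {}
    def write(self, key, value):
        version = self.store.get(key, (0, None))[0] + 1
        self.store[key] = (version, value)
        return version
    def read(self, key):
        return self.store.get(key, (0, None))

class Session:
    """Client session enforcing read-your-writes and monotonic reads by
    tracking the highest version seen per data item (a simplified sketch)."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.seen = {}                # key -> highest version read or written
    def write(self, key, value):
        version = self.replicas[0].write(key, value)
        self.seen[key] = max(self.seen.get(key, 0), version)
    def read(self, key):
        # accept only answers at least as recent as what the session has seen
        for replica in self.replicas:
            version, value = replica.read(key)
            if version >= self.seen.get(key, 0):
                self.seen[key] = version
                return value
        raise RuntimeError("no sufficiently recent replica reachable")

r1, r2 = Replica(), Replica()         # r2 lags behind (no propagation in this toy)
session = Session([r1, r2])
session.write("x", "v1")
print(session.read("x"))              # returns "v1", never an older version
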

13.3 Consistency Trade-offs

Trade-offs between consistency and other desired properties in distributed systems
have been discussed for a long time in several papers like [RG77, TGGL82, FM82,
BG84, DGMS85, COK86, GHOS96, FB99, GL02]. An early comprehensive survey of
protocols for consistency in partitioned networks was given in [DGMS85]. Even in
non-replicated single-server databases, the so-called isolation levels of RDBMS have
been discussed for decades as an improvement of latency by reducing consistency
requirements; these isolation levels have been recently surveyed in [BFG+ 13].
The discussion centered around the relation of the three properties strong con-
sistency (C) versus high availability (A) versus partition tolerance (P) has gained new
momentum with the formulation of the strong CAP principle in [FB99]. The strong
CAP principle says that in a distributed system from the three properties C, A and P at
most two of them can be achieved at the same time.
A more diplomatic formulation of this conjecture is also given in [FB99] with the weak
CAP principle: if higher guarantees are required for two of the three properties, only
weaker guarantees can be assured for the third. Based on this distinction, distributed
systems can roughly be categorized based on the following three types:
AP systems: Whenever a network partition occurs, an AP system prefers to be
available at the cost of inconsistencies that can be introduced. For example, in
a quorum-based system, partitions with only a minority of replica servers might
still accept write operations, although the write operations issued to different par-
titions might be conflicting. Inconsistencies must then later be resolved as soon
as the partition has been resolved.
CP systems: Whenever a network partition occurs, a CP system prefers to main-
tain consistency at the cost of reduced availability. For example, in a system using
majority quorum, any minority partition has to reject write and read operations
(in effect making them unavailable); only a partition with a majority of replica
servers can still process incoming user requests. As a second example, a ROWA
quorum system can still answer read requests (with any running replica server) –
but all write requests have to be rejected, as long as at least one replica server is
partitioned from the rest.
CA systems: As long as there is no partition, a CA system should achieve as much
consistency and availability as possible. However, as soon as a partition happens,
the system can give no guarantees regarding consistency and availability any
longer.

Note however that there are no clear boundaries between these categories and sys-
tems usually offer different guarantees in each of the categories. In particular, in dis-
tributed systems, partitions (or server crashes resulting in singleton partitions) can-
not be avoided. Moreover, partitions can usually not be distinguished from arbitrary
message losses. Hence any reliable distributed system must take precautions for these
communication failures; if a distributed system forsakes partition tolerance, it might
fail entirely due to a partition resulting in an unavailable system. As a consequence, a
common interpretation of the CAP principle can be stated as: In case of a network par-
tition, a distributed system can choose between maintaining either high availability
or strong consistency.
It has also been advocated that – instead of availability in general – the trade-
off is more between latency and consistency during normal (partition-free) operation
[Aba12]: the PACELC notion states that if there is a partition (P), how does the system
trade off availability and consistency (A and C); else (E), when the system is running
normally in the absence of partitions, how does the system trade off latency (L) and
consistency (C)?
Knowing about these trade-offs is important to configure the read and write behav-
ior of distributed database systems. From a practitioner’s perspective a good option is
to use adaptable consistency levels for each individual read and write call – if this is
offered by the database API.
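
As a purely hypothetical illustration of such an adaptable API (all names and consistency levels below are invented and do not correspond to a specific product), a client might map a per-call consistency level to a quorum size:

class TunableClient:
    """Hypothetical client with a per-request consistency level that is
    mapped to a quorum size (names and levels invented for illustration)."""
    def __init__(self, n_replicas):
        self.n = n_replicas
    def quorum_size(self, level):
        return {"one": 1, "quorum": self.n // 2 + 1, "all": self.n}[level]
    def read(self, key, consistency="one"):
        needed = self.quorum_size(consistency)
        print(f"read {key}: wait for {needed} of {self.n} replicas")
    def write(self, key, value, consistency="quorum"):
        needed = self.quorum_size(consistency)
        print(f"write {key}: wait for {needed} of {self.n} acknowledgements")

client = TunableClient(n_replicas=3)
client.read("x", consistency="one")         # low latency, possibly stale
client.write("x", 1, consistency="quorum")  # stronger guarantee, higher latency
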

13.4 Bibliographic Notes

Strong consistency has long since been analyzed from a mostly theoretical perspective
[BG85]. Numerous weaker consistency models have been proposed in the last decades.
The ones surveyed here can for example be found in [TDP+ 94, BVF+ 12, BZS14]. Gray
et al [GHOS96] compare eager and lazy replication in the settings of the multi-master
(update-anywhere) and the single-master replication cases. [WPS+ 00] survey safety
properties of distributed protocols.
The notion of atomicity can be used to define staleness in a formal way. Starting
from the basic form of atomicity in [Lam86], extended definitions of atomicity include
k-atomicity [AAB05] (bounding staleness of read operations to the last k versions) and
∆-atomicity [GLS11] (bounding staleness of read operations to ∆ time units). Bailis et
al [BVF+ 12] predict staleness in a probabilistic model.
Snapshot isolation as one way to implement multiversion concurrency control has
been used and extended in several approaches; for example non-monotonic snapshot
isolation [ASS13], snapshot isolation with vector clocks [PRT14] or serializable gener-
alized snapshot isolation [BHEF11]. Lin et al [LKPMJP05] describe a middleware sys-
tem that provides snapshot isolation for read-one-write-all replication. They use the term one-
copy snapshot isolation for a system that provides local snapshot isolation at each replica
and in which all local schedules can be merged into a single snapshot-isolation schedule.
Although the trade-offs between consistency, availability and partition toler-
ance have been discussed for decades (for example in [RG77, TGGL82, FM82, BG84,
DGMS85, COK86, GHOS96, FB99, GL02]), the strict formulation as the strong CAP
principle in [FB99] has spawned several discussions. Brewer [Bre12] addresses some
of these discussions in retrospect. Gilbert et al [GL02] argue that the principle holds in two
different theoretical models (the asynchronous and the partially synchronous net-
work model) when arbitrary message loss may occur. The PACELC view on distributed
systems was introduced by Abadi in [Aba12].
Eventual consistency has been discussed from various perspectives for example
in [BG13, BGY13, BD13].
Part IV: Conclusion
14 Further Database Technologies
Several other database management technologies have been studied for decades apart
from the technologies surveyed in this book. Some of them have been optimized for
specialized applications. This chapter provides a brief overview of some of these tech-
nologies.

14.1 Linked Data and RDF Data Management

An individual data item may not contain sufficient information on its own. More valu-
able information can be derived when several data items are connected by links and
these links are annotated with semantic information about the relationship of the data
items. As such, linked data correspond in general to the property-graph model de-
fined in Section 4.3. However, the term linked data often refers more specifically to
data that is specified in the Resource Description Framework (RDF) where the actual
data items are referenced by uniform resource identifiers (URI). An RDF data set con-
sists of triples where two data items (the subject and the object) are linked by their
relationship (the predicate). Each of the elements (subject, predicate or object) can
either be a URI pointing to the actual data or a string literal.

Web resources:
– W3C recommendations:
– RDF: http://www.w3.org/TR/#tr_RDF
– SPARQL: http://www.w3.org/TR/#tr_SPARQL
– AllegroGraph: http://franz.com/agraph/allegrograph/
– documentation page: http://franz.com/agraph/support/documentation/
– GitHub repository: https://github.com/franzinc
– Apache Jena: http://jena.apache.org/
– documentation page: http://jena.apache.org/documentation/
– GitHub repository: https://github.com/apache/jena
– Sesame: http://rdf4j.org/
– documentation page: http://rdf4j.org/documentation.docbook
– OpenLink Virtuoso: http://virtuoso.openlinksw.com/
– documentation page: http://docs.openlinksw.com/virtuoso/
– GitHub repository: https://github.com/openlink/virtuoso-opensource

Several data stores are available that are specialized in storing RDF triples – they are
also called triple stores. A widely-used query language for RDF graphs is SPARQL –
a query language that has an SQL-like syntax.
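
To make the triple model concrete, the following Python sketch stores an RDF-style data set as a set of (subject, predicate, object) triples and evaluates a single SPARQL-like triple pattern (the URIs and the pattern notation are invented for illustration):

# A tiny RDF-style data set: each triple links a subject and an object
# via a predicate (all URIs below are invented example identifiers).
triples = {
    ("http://example.org/alice", "http://example.org/knows", "http://example.org/bob"),
    ("http://example.org/alice", "http://example.org/name", "Alice"),
    ("http://example.org/bob", "http://example.org/name", "Bob"),
}

def match(pattern):
    """Evaluate a single triple pattern; None plays the role of a SPARQL variable."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# in the spirit of "SELECT ?o WHERE { <alice> <knows> ?o }":
for _, _, obj in match(("http://example.org/alice", "http://example.org/knows", None)):
    print(obj)
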

14.2 Data Stream Management

A data stream is an infinite sequence of transient values; that is, the data are not stored
persistently for later retrieval but instead they are processed “on the fly” as they are
produced. A data stream management system (DSMS) processes this data sequence
by running so-called continuous queries on the stream; in general, these are queries
that are executed indefinitely. Hence a data stream management system must handle
queries that might be running for months or even years.
The continuous queries consume the data in the stream step-by-step and usually
produce an infinite output stream. That is, the result will change over time. Continu-
ous queries can aggregate data from the entire stream (for example, calculating the
overall average of all values) or look at subsets of the data stream independently
(for example, by using a sliding window and evaluating the query only on the data in-
side the window). Data streams can for example be produced by sensor networks or by
network traffic monitoring. Applications of data stream management are for example
real-time decision support systems or intrusion detection systems.

Web resources:
– Apache Flink: http://flink.apache.org/
– documentation page: https://ci.apache.org/projects/flink/flink-docs-master/
– GitHub repository: https://github.com/apache/flink
– Apache Samza: http://samza.apache.org/
– documentation page: http://samza.apache.org/learn/documentation/
– GitHub repository: https://github.com/apache/samza
– Apache Storm: http://storm.apache.org/
– documentation page: http://storm.apache.org/documentation/Home.html
– GitHub repository: https://github.com/apache/storm

A large number of data stream management systems are based on the relational data
model and their continuous query languages are similar to SQL. In this case, one item
in the data stream can be represented as a pair ⟨timestamp, tuple⟩ where the tuple then
corresponds to a row of an infinite relational table (that is, a table with infinitely many
tuples) and all tuples adhere to the same relation schema.
When using the sliding window semantics in a continuous query one can usually
specify
– the range of the window: this can either be measured by the size of the window
(how many stream items a window must contain) or by a time constraint (for ex-
ample, the window must contain all items that were produced in the last 30 sec-
onds).
– the slide length: the slide length can be measured again by size (how many data
items must pass by before starting a new window) or by time interval (how many
seconds must elapse before starting a new window).
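
A count-based sliding window with these two parameters can be sketched in a few lines of Python (the window size and slide length below are arbitrary example values); each completed window is aggregated independently:

from collections import deque

def sliding_average(stream, window_size=4, slide=2):
    """Count-based sliding window: emit the average over the last
    `window_size` items every `slide` items (a simplified sketch)."""
    window = deque(maxlen=window_size)    # oldest items fall out automatically
    for i, item in enumerate(stream, start=1):
        window.append(item)
        if i >= window_size and (i - window_size) % slide == 0:
            yield sum(window) / len(window)

# a (finite) stand-in for an infinite sensor stream:
print(list(sliding_average([3, 5, 7, 9, 11, 13, 15, 17])))
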

From time to time it might happen that there are peaks in the data stream where there
are too many data items to process in real time. A data stream management system
must be prepared for this situation. A simple solution is to drop data items when they
cannot be processed immediately; the disadvantage is then that usually the accuracy
of the result is reduced. If high accuracy is required, then the excessive items can be
persisted to disk and processed later in idle times. For some queries it is also possible
to use only the summary (a synopsis) of several items and then take the synopsis as
the input for more complex queries.

14.3 Array Databases

Array databases organize data along multiple dimensions and can be used to store and
manipulate data with complex structures. Those complex data often occur in natural
sciences like for example astronomical data obtained from satellite observations.

Web resources:
– Rasdaman: http://www.rasdaman.org/
– documentation page: http://www.rasdaman.org/wiki/Documentation
– source repository: http://www.rasdaman.org/browser
– SciDB: http://scidb.org/
– documentation page: http://www.paradigm4.com/resources/documentation/

In the array data model data are stored in multidimensional arrays. Each array cell
contains a tuple of a certain length; the elements of such a tuple can either be scalar
values or they can themselves be arrays. In other words, the array data model can
express arbitrary nestings of arrays. The tuples are addressed by specifying the corre-
sponding dimensions; in SciDB [SBZB13] the individual scalar values of a tuple can
furthermore be addressed by named attributes. An example from the SciDB paper
[SBZB13] shows how to specify a two-dimensional matrix (along the two dimensions
I and J) where each cell contains a tuple with two attributes (an attribute named M of type
integer and an attribute named N of type float):

CREATE ARRAY example <M: int, N: float> [I=1:1000, J=1000:20000]

Array databases offer several advanced functions to manipulate array data. For exam-
ple, tuples can be aggregated or specialized join operators can be executed.

14.4 Geographic Information Systems

Geographic information (like map data) has long been considered a particular form
of data with special storage and evaluation needs. These needs have been answered
by Geographic Information Systems (GIS) and several databases offer GIS functional-
ity. GIS data often require specialized data types for geometric elements of maps (like
points, lines or polygons). On these data types, specific evaluation operations are usu-
ally offered (like computing the intersection of two elements).

Web resources:
– Open Geospatial Consortium: www.opengeospatial.org/
– standards: http://www.opengeospatial.org/standards
– Open Source Geospatial Foundation: http://www.osgeo.org/
– GeoNetwork opensource: http://geonetwork-opensource.org/
– documentation page: http://geonetwork-opensource.org/docs.html
– GitHub repository: https://github.com/geonetwork/
– GeoServer: http://geoserver.org/
– documentation page: http://docs.geoserver.org/
– GitHub repository: https://github.com/geoserver/geoserver
– PostGIS: http://postgis.net/
– documentation page: http://postgis.net/documentation
– GitHub repository: https://github.com/postgis/postgis/
– QGIS: www.qgis.org/
– documentation page: www.qgis.org/en/docs/
– GitHub repository: https://github.com/qgis/QGIS
– GRASS GIS: http://grass.osgeo.org/
– documentation page: http://grass.osgeo.org/documentation/
– SVN repository: http://trac.osgeo.org/grass/browser
– GeoJSON: http://geojson.org/
– specification: http://geojson.org/geojson-spec.html

The GIS community has developed a wide range of standards and specifications to
enable interoperability of several systems. For example, GeoJSON is a recent specifi-
cation to describe GIS data in JSON format. A simple example for the specification of a
single point (by defining its x-coordinate and its y-coordinate) looks like this (see the
GeoJSON specification for more details):

{ "type": "Point", "coordinates": [100.0, 0.0] }



14.5 In-Memory Databases

In-memory databases rely on servers with large-scale main memory. The main mem-
ory is the primary storage location for the data. This makes data management a lot
faster because it avoids the memory-to-disk bottleneck when writing and the disk-to-
memory bottleneck when reading data. In particular, in-memory data management
works at the granularity of memory addresses and not at the granularity of data blocks
like the memory pages that have to be transferred from and to the disk.
Durability of data is not ensured when data are just maintained in the main mem-
ory. A system crash or a power outage will usually erase the main memory and all data
is lost. Durability can for example be added to in-memory-databases by
Logging: Transaction logs are stored to the disk and then applied upon recovery
from a system crash.
Snapshots: The state of the database is stored to disk periodically; in other words,
a regular snapshot of the database is taken and stored durably.
Replication: All data is replicated to other in-memory database servers (at best at
geographically dispersed locations). In case of a crash of a single server, a repli-
cation server can take over.
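
The logging approach can be illustrated with a small Python sketch (the file name and log format are invented; real systems combine the log with snapshots and log truncation): every update is appended and flushed to a log file on disk before the in-memory dictionary is changed, and the state can be rebuilt from the log after a restart:

import json, os

class InMemoryStore:
    """In-memory key-value store with a simple write-ahead log for durability
    (a sketch; real systems add checkpoints/snapshots and log truncation)."""

    def __init__(self, log_path="store.log"):
        self.log_path = log_path
        self.data = {}
        self._recover()

    def _recover(self):
        # replay the log to rebuild the in-memory state after a restart
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    entry = json.loads(line)
                    self.data[entry["key"]] = entry["value"]

    def put(self, key, value):
        # durability: append to the log (and flush) before updating memory
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)     # reads are pure main-memory accesses

store = InMemoryStore()
store.put("x", 42)
print(store.get("x"))
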

Several databases offer a main-memory mode as an alternative to disk-based storage.

Web resources:
– Aerospike: http://www.aerospike.com/
– documentation page: http://www.aerospike.com/docs/
– GitHub repository: https://github.com/aerospike
– Apache Geode: http://geode.incubator.apache.org/
– documentation page: http://geode.incubator.apache.org/docs/
– GitHub repository: https://github.com/apache/incubator-geode
– Hazelcast: http://hazelcast.com/
– documentation page: http://hazelcast.org/documentation/
– GitHub repository: https://github.com/hazelcast
– Scalaris: http://scalaris.zib.de/
– documentation: https://github.com/scalaris-team/scalaris/tree/master/user-dev-guide
– GitHub repository: https://github.com/scalaris-team/scalaris
– VoltDB: http://voltdb.com/
– documentation page: http://docs.voltdb.com/
– GitHub repository: https://github.com/VoltDB/voltdb

14.6 NewSQL Databases

A major criticism towards traditional relational, SQL-based database systems was
their inability to run efficiently as a distributed database system. However, SQL has
the huge advantage of being a standardized language and as such being accepted by
the majority of database administrators and users. Hence, as a kind of reaction to the
diversity of NoSQL database systems and the overabundance of their different inter-
faces, interest has arisen in enhancing relational database systems with a better
support for the changed requirements while keeping their relational data model; the
term NewSQL thus describes the adoption of the design principles underlying NoSQL
systems to build new distributed relational database systems with a SQL interface.
A redesign of conventional RDBMSs would in particular make them scalable: in a
NewSQL database, relational data can be stored on a varying number of servers while
efficiently answering SQL queries. Moreover, it would tolerate failures in the network
while maintaining the ACID properties.

Web resources:
– TokuDB: http://www.tokutek.com/tokudb-for-mysql/
– documentation page: http://docs.tokutek.com/tokudb/
– GitHub repository: https://github.com/Tokutek/tokudb-engine

14.7 Bibliographic Notes

An invaluable book on in-memory data management is the monograph by Plattner
and Zeier [PZ11]. The VoltDB system is another in-memory data store [SW13]. Data
stream management is the topic of the books by Golab and Özsu [GÖ10] and by Garo-
falakis, Gehrke and Rastogi [MG12]. The notion of linked data and its underlying graph
semantics is covered in the position paper by Bizer, Heath and Berners-Lee [BHBL09];
the book by Wood et al covers the linked data paradigm from the practical perspec-
tive of RDF stores and the book by DuCharme [DuC13] focuses on SPARQL queries. A
good resource for geographic information systems including spatial data modeling is
the textbook by Chang [Cha10] and by Heywood, Cornelius and Carter [HCC11]. Array
data management is extensively surveyed in [RC13]. Recent array database systems in-
clude SciDB [SBZB13], rasdaman [BS14] and the SciLens platform on top of MonetDB
[IGN+ 12]. Stonebraker [Sto12] gives a brief discussion on NewSQL data stores.
15 Concluding Remarks
NoSQL data stores excel with their distribution features and the schemaless storage
of data. However, given the huge variety of modern data stores and data manage-
ment systems, a major question for a production environment remains: which sys-
tem is the best choice for the application at hand? Moreover, several steps are nec-
essary to transfer a legacy application to support new data models, new query lan-
guages and new access methods and hence to fully integrate the new data store with
the existing application. In this chapter we provide a discussion on the consequences
that might ensue from changing the data models or precautions that must be taken
to smoothly integrate the novel systems. In addition, we survey approaches to enable
polyglot database architectures: database and storage systems that integrate multiple
data models into a common framework.

15.1 Database Reengineering

The notion of database reengineering covers all the aspects that are involved with
restructuring existing data management solutions and migrating data to new data
stores. It also includes the impact that a change of the data management level can
have on applications or users accessing the data store. A database reengineering pro-
cess can be seen as a special case of a general software engineering project. It hence re-
quires an appropriate project lifecycle management that supports the different phases
of the project; such phases can for example comprise (following the “lean and mean”
strategy of [RL14]):
Migration planning: The planning phase should clarify all constraints and re-
quirements involved in the reengineering process – including business goals, time
constraints, budget, and volume of migrated data – as well as analyze potential
risks.
State Analysis: This analysis phase first of all determines the state of the legacy
system in terms of data model and used technology as well as all dependencies
to external applications. Next the projected state of the future system should be
assessed in detail.
Gap Analysis: This analysis phase should identify the differences (that is, the
“gaps”) between the legacy and the future system states. It should also investigate
which steps are necessary to fill these gaps.
Data Analysis: In this phase a more detailed analysis of the future data model is
due. It should assess future access requirements on the data. Based on this and
the previous phases, the future data model is chosen.
Data Design: This phase comprises an in-depth design of the future data model.
Even for schemaless data stores, a so-called implicit schema must be derived as
an interface for all external applications.
Data Model Transformation: In this phase a transformation on the model level
is executed. Formal schema mapping rules can be obtained.
Data Conversion: This activity comprises the conversion of the legacy data set
into the future data set. Ideally, this can be a fully automated process; yet errors
can occur due to inconsistencies or invalidity of the legacy data set.
Data Validation: Before the actual migration process is executed and production
systems are switched to the new data store, a validation of the converted data set
is due. This step ensures correctness of the converted data and hence can avoid
unwanted rollbacks.
Data Distribution: This phase includes the final migration to the new data store.
For a large data set, it is useful to split the data set into several subsets and to sub-
sequently execute a step-wise (incremental) migration on the smaller data sets.

Data migration is only one side of the coin. Another major activity is adapting the data
access layer of legacy systems to the new data management layer. This in particular
involves translating and rewriting queries.
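
A minimal Java sketch of such a query-rewriting adapter is shown below; the interface and class names (QueryTranslator, TargetStoreClient, DataAccessAdapter) are purely illustrative and do not refer to any existing product API.

// Hypothetical adapter between a legacy application and a new data store:
// legacy queries are translated into the target store's query language.
import java.util.List;
import java.util.Map;

interface QueryTranslator {
    // rewrite a legacy (for example, SQL) query into the target query language
    String translate(String legacyQuery);
}

interface TargetStoreClient {
    // execute a query expressed in the target store's own language
    List<Map<String, Object>> execute(String nativeQuery);
}

class DataAccessAdapter {
    private final QueryTranslator translator;
    private final TargetStoreClient client;

    DataAccessAdapter(QueryTranslator translator, TargetStoreClient client) {
        this.translator = translator;
        this.client = client;
    }

    // the legacy application keeps issuing its old queries;
    // translation happens transparently inside the adapter
    List<Map<String, Object>> query(String legacyQuery) {
        return client.execute(translator.translate(legacyQuery));
    }
}

In practice, the translation rules would be derived from the schema mapping obtained in the Data Model Transformation phase.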

15.2 Database Requirements

When choosing the appropriate data store in the Data Analysis phase of the database
reengineering process, several characteristics of the target data store must be consid-
ered. Due to the diversity of modern data stores, choosing a data store requires looking
carefully at the requirements the store should fulfill in order to comply with legacy ap-
plications as well as future demands. Here we survey some decisive features:
Data model: Which data model is suited best for the raw data (for example nested
and hierarchical data)? Which data model requires only minimal and error-free
transformation operations for converting the raw data into data records of the data
model? Is referencing from one data record to another data record supported? Is
a normalization of the data required to allow for non-redundancy in the data; or
is some duplication of data acceptable to reduce complexity of the data model?
Data types: Does the data store support the needed basic data types (for example
a date type of sufficient granularity)? Does it support container types (like lists,
sets or arrays)? Can database users extend the type system and define their own
data types?
Database schema: Is schemaless or schema-based storage needed? How is
schema information stored? Can schema information automatically be generated
for a given set of records? Is schema validation offered – or even an advanced
constraint checking mechanism with active triggers? Is referential integrity
supported that checks whether referenced data records exist in the data store?
How easy is it to implement schema evolution with the data store? Which kinds
of schema change require a restart of the data store? Schemaless storage is not
always a good option: an implicit schema must be agreed upon by accessing
applications; hence, schema changes also require consolidation at the applica-
tion level. Lastly, without an explicit schema, storage or query optimizations are
harder to implement.
Operating System support: Which operating system does the data store run on?
Does the database system integrate well with the existing infrastructure?
Tools and APIs: How well does the database integrate with existing applications
or development environments? Are there extra tools for development, visualiza-
tion, or maintenance? Are all necessary APIs (like REST-based access) offered?
Query languages and expressiveness: Which programming languages or query
languages are supported? Can the queries of all accessing applications be ex-
pressed in the query language? Does the data store support advanced operations
(for example, joins or link walking) to avoid additional data processing at the
application side? With such advanced operations complexity of the accessing ap-
plication can be reduced. Are transactions needed – and which is the scope of the
transaction support (for example, only single record-based transactions, read-
only transactions versus full-blown read-write transactions)? Are aggregation
operators necessary to allow for analytical queries?
Standard support: Does the database adhere to standards in terms of persistence
(like for example, JPA compliance)? Does it support standardized query languages
so that portability of query code (both data manipulation and data definition op-
erations) is possible and some kind of platform independence is provided?
Search and retrieval: How does the database support search on data records?
Is full-text search available? Which kinds of indexes are supported to speed up
search?
Versioning: Is it necessary that multiple versions of the same data record can be
maintained to allow for an analysis of how data have evolved over time? How can
different versions of a data record be accessed?
Workloads and performance: What kind of workload is expected to run on the
data store? Should the data store be read-optimized (and hence better support
read-heavy workloads) or should it be write-optimized (with a better support for
write-heavy workloads)? Are usually individual data values accessed and updated
directly or is it sufficient to support only aggregation-oriented writes and reads (like
reading and writing an entire JSON document)?
Concurrency: How is concurrency handled? If a lock-based concurrency control
is employed: how fine-grained can this locking be configured; what is the level
of locking employed? If multiversion concurrency control is implemented, how
are conflicts resolved? Does the concurrency control approach comply with the
requirements of the accessing applications?
Distribution and scalability: Is a distributed data store required? How well does
the data store support a distributed installation? How does the system handle
churn (additions and removals of servers)? Does it support automatic partitioning
(“auto-sharding”) of the data? Is replication supported – and if so is multi-master
or master-slave replication more appropriate? Can the system support distributed
counters that can be auto-incremented among the database servers?
Consistency: Which consistency level is provided? Can consistency be configured
on a per-query basis? How failure-tolerant is the system and what is the failure
model supported by the system?
Maturity and support: Is a commercial data store or a commercial support
needed? In case of open source systems, how can a lack of community support
and major changes in the APIs be handled?
Security: Are security mechanisms required? Is role-based access control sup-
ported and how are users authenticated? Can access to certain records in the
database be restricted by an access control policy? Is a form of encryption of-
fered?

15.3 Polyglot Database Architectures

When designing the data management layer for an application, several of the identi-
fied database requirements may be contradictory. For example, regarding access pat-
terns some data might be accessed by write-heavy workloads while others are accessed
by read-heavy workloads. Regarding the data model, some data might be of a differ-
ent structure than other data; for example, in an application processing both social
network data and order or billing data, the former might usually be graph-structured
while the latter might be semi-structured data. Regarding the access method, a web
application might want to access data via a REST interface while another application
might prefer data access with a query language. It is hence worthwhile to consider a
database and storage architecture that includes all these requirements.

15.3.1 Polyglot Persistence

Instead of choosing just one single database management system to store the entire
data, so-called polyglot persistence could be a viable option to satisfy all require-
ments towards a modern data management infrastructure. Polyglot persistence (a
term coined in [FS12]) denotes that one can choose as many databases as needed so
that all requirements are satisfied. Polyglot persistence can in particular be an optimal
solution when backward-compatibility with a legacy application must be ensured.
The new database system can run alongside the legacy database system; while the
legacy application still remains fully functional, novel requirements can be taken into
account by using the new database system.

Fig. 15.1. Polyglot persistence with integration layer (applications issue analytical queries, graph traversals, write-heavy SQL transactions and REST-based accesses against an integration layer – query decomposition, query redirection, result recombination, synchronization – on top of a graph database, a key-value store, an SQL database and an in-memory store)

Polyglot persistence however comes with severe disadvantages:
– there is no unique query interface or query language, and hence access to the
database systems is not unified and requires knowledge of all needed database
access methods;
– cross-database consistency is a major challenge because referential integrity must
be ensured across databases (for example if a record in one database references
a record in another database) and in case data are duplicated (and hence occur
in different representations in several databases at the same time) the duplicates
have to be updated or deleted in unison.

It should obviously be avoided to push the burden of all of these query handling and
database synchronization tasks to the application level – that is, in the end to the pro-
grammers that maintain the data processing applications. Instead it is usually better
to introduce an integration layer (see Figure 15.1). The integration layer then takes care
of processing the queries – decomposing queries into several subqueries, redirecting
queries to the appropriate databases and recombining the results obtained from the
accessed databases; ideally, the integration layer should offer several access meth-
ods, and should be able to parse all the different query languages of the underlying
database systems as well as potentially translate queries into other query languages.
Moreover, the integration layer should ensure cross-database consistency: it must
synchronize data in the different databases by propagating additions, modifications
or deletions among them.
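
A very reduced Java sketch of such an integration layer is given below; the Store interface and the decomposition logic are illustrative assumptions, not an existing API.

// Illustrative integration layer for polyglot persistence: decompose a query,
// redirect subqueries to the underlying stores, recombine the partial results,
// and propagate writes to keep duplicated data synchronized.
import java.util.ArrayList;
import java.util.List;

interface Store {
    List<String> query(String subquery); // run a subquery in the store's own language
    void write(String record);           // apply a write to this store
}

class IntegrationLayer {
    private final List<Store> stores;

    IntegrationLayer(List<Store> stores) {
        this.stores = stores;
    }

    // query decomposition, redirection and result recombination
    List<String> query(String query) {
        List<String> combined = new ArrayList<>();
        for (Store store : stores) {
            combined.addAll(store.query(decomposeFor(store, query)));
        }
        return combined;
    }

    // synchronization: propagate additions, modifications or deletions to all stores
    void write(String record) {
        for (Store store : stores) {
            store.write(record);
        }
    }

    private String decomposeFor(Store store, String query) {
        return query; // placeholder: real decomposition depends on the data models involved
    }
}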

15.3.2 Lambda Architecture

When real-time (stream) data processing is a requirement, a combination of a slower
batch processing layer and a speedier stream processing layer might be appropriate.
This architecture has been recently termed lambda architecture [MW15] (see Fig-
ure 15.2). The lambda architecture processes a continuous flow of data in the following
three layers:
Speed Layer: The speed layer collects only the most recent data. As soon as data
have been included in the other two layers (batch layer and serving layer), the data
can be discarded from the speed layer dataset. The speed layer incrementally com-
putes some results over its dataset and delivers these results in several real-time
views; that is, the speed layer is able to adapt its output based on the constantly
changing data set. Due to the relatively small size of the speed layer data set, the
runtime penalty of incremental computations is still within acceptable limits.
Batch Layer: The batch layer stores all data in an append-only and immutable
fashion in a so-called master dataset. It evaluates functions over the entire
dataset; the results are delivered in so-called batch views. Computing the batch
views is an inherently slow process. Hence, recent data will only be gradually
reflected in the results.
Serving Layer: The serving layer makes batch views accessible to user queries.
This can for example be achieved by maintaining indexes over the batch views.

User queries can be answered by merging data from both the appropriate batch views
and the appropriate real-time views.
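
The essence of the serving step is the merge of batch views and real-time views. The following hypothetical Java sketch illustrates this merge for a simple count metric; the class and method names are assumptions for illustration only.

// Hypothetical sketch: answering a count query in a lambda architecture by
// merging a precomputed batch view with an incrementally maintained speed view.
import java.util.HashMap;
import java.util.Map;

class LambdaCountQuery {
    // batch view: counts computed by the slow batch layer over the master dataset
    private final Map<String, Long> batchView = new HashMap<>();
    // speed view: counts computed incrementally over the recent data set only
    private final Map<String, Long> speedView = new HashMap<>();

    void loadBatchView(Map<String, Long> precomputed) {
        batchView.putAll(precomputed);        // delivered periodically by the batch layer
    }

    void recordRecentEvent(String key) {
        speedView.merge(key, 1L, Long::sum);  // incremental update in the speed layer
    }

    // a query answer merges the batch view with the real-time view
    long count(String key) {
        return batchView.getOrDefault(key, 0L) + speedView.getOrDefault(key, 0L);
    }
}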

15.3.3 Multi-Model Databases

Relying on different storage backends increases the overall complexity of the system
and raises concerns like inter-database consistency, inter-database transactions and
interoperability as well as version compatibility and security. It might hence be advan-
tageous to use a database system that stores data in a single store but provides access
to the data with different APIs according to different data models. Databases offering
this feature have been termed multi-model databases. Multi-model databases either
support different data models directly inside the database engine or they offer layers
for additional data models on top of a single-model engine. Figure 15.3 shows an
example multi-model database with a key-value store as the main engine and a graph
layer on top of it.

Fig. 15.2. Lambda architecture (an incoming data stream is appended both to the batch layer, whose master data set yields batch views that are indexed in the serving layer, and to the speed layer, whose recent data set yields speed views; query results merge batch views and speed views)

Several advantages come along with this single-database multi-model approach:
– Reduced database administration: maintaining a single database installation is
easier than maintaining several different database installations in parallel, keep-
ing up with their newest versions and ensuring inter-database compatibility. Config-
uration and fine-tuning database settings can be geared towards a single database
system.
– Reduced user administration: In a multi-model database only one level of user
management (including authentication and authorization) is necessary.
– Integrated low-level components: Low-level database components (like memory
buffer management) can be shared between the different data models in a multi-
model database. In contrast, polyglot persistence with several database systems
requires each database engine to have its own low-level components.
– Improved consistency: With a single database engine, consistency (including syn-
chronization and conflict resolution in a distributed system) is a lot easier to en-
sure than consistency across several different database platforms.
– Reliability and fault tolerance: Backup just has to be set up for a single database
and upon recovery only a single database has to be brought up to date. Intra-
database fault handling (like hinted handoff) is less complex than implementing
fault handling across different databases.

Fig. 15.3. A multi-model database (a graph layer, accessed via graph traversals, write-heavy transactions and REST-based access, on top of a key-value store)

– Scalability: Data partitioning (in particular “auto-sharding”) as well as profiting
from data locality can best be configured in a single database system – as opposed
to more complex partitioning design when data are stored in different distributed
database systems.
– Easier application development: Programming efforts regarding database admin-
istration, data models and query languages can focus on a single database sys-
tem. Connections (and optimizations like connection pooling) have to be man-
aged only for a single database installation.
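
To make the layered approach of Figure 15.3 concrete, the following minimal Java sketch implements a tiny graph access path on top of a plain key-value map; the storage layout (adjacency lists serialized as comma-separated strings under keys prefixed with "adj:") is an illustrative assumption.

// Minimal illustration of multi-model layering: a graph API on top of a
// simple key-value store (here just an in-memory map).
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GraphOverKeyValue {
    private final Map<String, String> kv = new HashMap<>(); // the single underlying store

    // key-value access path
    void put(String key, String value) { kv.put(key, value); }
    String get(String key) { return kv.get(key); }

    // graph access path: adjacency lists stored under "adj:<vertex>"
    void addEdge(String from, String to) {
        String key = "adj:" + from;
        String current = kv.get(key);
        kv.put(key, current == null ? to : current + "," + to);
    }

    List<String> neighbors(String vertex) {
        String adjacency = kv.get("adj:" + vertex);
        return adjacency == null ? Collections.emptyList() : Arrays.asList(adjacency.split(","));
    }
}

A production multi-model engine would of course add indexes, transactions and a query language on top of such a layering, but the sketch shows how both access paths can share the same low-level storage.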

15.4 Implementations and Systems

The systems surveyed here are a polyglot data processing system, a real-time event
processing framework as well as two multi-model database systems.

15.4.1 Apache Drill

Apache Drill is inspired by the ideas developed in Google’s Dremel system [MGL+ 10].

Web resources:
– Apache Drill: http://drill.apache.org/
– documentation page: http://drill.apache.org/docs/
– GitHub repository: https://github.com/apache/drill

Apache Drill’s aim is to support several storage sources that are connected to Drill by
storage plugins. These plugins provide an interface to the data sources as well as query
optimization rules for the specific query languages. The two main principles in Drill
(and Dremel) to achieve a high performance are:
Move code not data: instead of transmitting large amounts of data to the servers
with appropriate processing routines, the data remain on the storage servers. Each
query is decomposed into a multi-level execution tree where the subqueries lower
in the tree (and closer to the data) process the data on the storage servers.
Process data “in situ”: Data transformations (between different encodings or for-
mats) are avoided by providing native query execution routines for the attached
data stores.

Data sources can be database systems but also files in a distributed file system: Drill
supports plain textual formats with a flat structure (comma-separated, tab-separated
or pipe-separated files) as well as textual formats with more complex structures (JSON,
Avro or Parquet files).
Several service processes called Drillbits accept requests from clients, process the
data, and return the results. The Drillbit that handles a specific client request is called
the foreman for this query. Its main task is to parse the incoming SQL query and ob-
tain a logical execution plan for it (containing several logical operators). In a next step,
this logical plan is optimized for example by swapping some operators. After the opti-
mization a physical plan is obtained that describes how and with which data sources
which part of the query should be answered. The physical plan is then converted into
a multi-level execution tree where the leaf operators of the tree can be run in parallel
on the appropriate data sources. The leaf operators then obtain partial results based
on executing their subquery on the appropriate data sources. As the partial results
move back up the tree (towards the root), the data of different sources are exchanged
and combined (for example, aggregated). Data type conversions from the data types
specific to the database system or file format into SQL data types have to be executed
before passing the results back up the tree. Drill can process nested data in a schema-
less way. An internal schema is obtained while reading in the data; this process is
called schema discovery.
Apache Drill implements some extensions to SQL in order to be able to handle
self-describing, nested data in different formats. Drill’s SQL dialect uses backticks to
quote terms that would otherwise be interpreted as reserved keywords in SQL or to
quote file paths.
As an example consider a selection query on a file in the distributed file system
(DFS). The storage plugin used in the query is called dfs and a comma-separated (CSV)
file is accessed by specifying the path and file name. Drill splits the comma-separated
values in each line into an array of columns that can be addressed by an index (starting
with 0). Assume we query a CSV file containing a pair of book identifier and book title
(separated by a comma) in each line of a file called books.csv:

SELECT COLUMNS[0] as ID, COLUMNS[1] as Title FROM dfs.`/books.csv`;


A query can also be applied to a directory: in effect, the query will be executed on all
files in a directory and the output will contain the relevant values from all files. All
these files however have to be compatible; for example, in CSV files the columns must
be of the same type in all files and occur in all files in the same order. Drill can also
work with a directory hierarchy: if a query is executed on a directory containing sub-
directories (which themselves may again contain subdirectories), Drill will descend
into the subdirectories, execute that query on the files located there and aggregate the
results of all subdirectories. Several functions can be used in a query to exactly tell
Drill which subdirectories it should descend into.
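
Such queries can also be issued programmatically. The following Java sketch uses standard JDBC; the driver and the connection URL jdbc:drill:zk=local (for a local installation) are assumptions that should be checked against the Drill documentation, while the query itself is the books.csv example from above.

// Sketch: running the books.csv query via Drill's JDBC interface.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
    public static void main(String[] args) throws Exception {
        // connection URL for a local Drill setup (assumption)
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT COLUMNS[0] AS ID, COLUMNS[1] AS Title FROM dfs.`/books.csv`")) {
            while (rs.next()) {
                System.out.println(rs.getString("ID") + " | " + rs.getString("Title"));
            }
        }
    }
}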

15.4.2 Apache Druid

Druid offers the functionality to merge real-time and historical data.

Web resources:
– Apache Druid: http://druid.io/
– documentation page: http://druid.io/docs/0.8.0/
– GitHub repository: https://github.com/druid-io/druid

Worker nodes in the Druid architecture are divided into real-time nodes and histori-
cal nodes. Real-time nodes can process streams of data while batches of data can be
loaded into the historical nodes. Druid can be backed up by so-called “deep storage”
to enable long-term persistence of the data. The data format processed by Druid in the
real-time nodes are JSON files; the historical nodes can process JSON as well as CSV
(comma-separated) or TSV (tab-separated) data. More precisely, a data item processed
in Druid is a timestamped event. A segment is a group of such events for a certain pe-
riod of time. Each event is represented by a data record consisting of a timestamp col-
umn, several attributes called dimension columns as well as several attributes called
the metrics columns. The format of the input events has to be declared in a so-called
input specification by providing schema information – in particular, the names and
types of the columns. In the real-time nodes, several events can then be grouped by times-
tamp or a subset of the dimensions; next, data in the groups can be processed further
– for example, counting the size of each grouping; or aggregating (for example, sum-
ming) over the metrics columns for each group; or otherwise filtering the data (for
example, with selection or pattern matching).
Real-time nodes keep data (for a certain time segment) in an index structure; these
index structures are periodically transferred to the deep storage. From the deep stor-
age the data can be loaded into the historical nodes where they are immutable (that is,
read-only). Druid stores data in a column-oriented way to enable column compression
and data locality. Broker nodes are responsible for handling client requests; this
includes relaying requests to multiple real-time or historical nodes as well as
aggregating partial results before returning results to the client. Coordinator nodes
manage data distribution among historical nodes.
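
As an illustration of how a client talks to a broker node, the following Java snippet assembles a native JSON query (a per-day count over a time interval) that would be POSTed to the broker's HTTP query endpoint. The field names follow Druid's documented native query format, but both the field names and the data source name "events" should be treated as assumptions to be verified against the deployed Druid version.

// Sketch: building a native Druid JSON timeseries query as a string.
public class DruidQuerySketch {
    public static void main(String[] args) {
        String query =
            "{\n" +
            "  \"queryType\": \"timeseries\",\n" +
            "  \"dataSource\": \"events\",\n" +
            "  \"granularity\": \"day\",\n" +
            "  \"intervals\": [\"2015-01-01/2015-02-01\"],\n" +
            "  \"aggregations\": [ { \"type\": \"count\", \"name\": \"event_count\" } ]\n" +
            "}";
        // send this JSON via HTTP POST to a broker node's query endpoint
        System.out.println(query);
    }
}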

15.4.3 OrientDB

As a multi-model database, OrientDB offers a document API, an object API, and a
graph API; it implements extensions of the SQL standard to interact with all three
APIs. Alternatively, Java APIs are available. The Java Graph API is compliant with Tin-
kerPop (see Section 4.6.1).

Web resources:
– OrientDB: http://orientdb.com/
– documentation page: http://orientdb.com/docs/last/
– GitHub repository: https://github.com/orientechnologies/

Classes describe the type of a record. Classes can be explicitly created as an OClass
object in the database schema and inheritance is supported where the superclass must
also be registered in the schema:

OClass person = db.getMetadata().getSchema().createClass("Person");
OClass employee = db.getMetadata().getSchema()
    .createClass("Employee").setSuperClass(person);

Records are organized into clusters (by default one cluster per class). In OrientDB,
each record (document, object, vertex or edge) is identified by a recordID of the form
#<cluster-id>:<cluster-position> that represents the physical position of the
record in the database. Records can be linked by storing the recordID of the target
record in the source record; this avoids joins (based on IDs) as well as embedding.
The APIs offered by OrientDB are the following:
Graph API: The graph API offers commands to create vertices and edges with proper-
ties as follows where V and E are the default vertex and edge classes, respectively:

CREATE VERTEX V SET name = ’Alice’
Output: Created vertex with RID #13:1
CREATE VERTEX V SET name = ’Bob’
Output: Created vertex with RID #13:2
CREATE EDGE E FROM #13:1 TO #13:2 SET knows_since = ’2010’

To obtain custom vertex and edge classes, V and E can be extended.


For example, a new friend class can be obtained for edges and a new friend edge cre-
ated as follows:

CREATE CLASS Friend EXTENDS E
CREATE EDGE Friend FROM #13:1 TO #13:2

Traversing the edge can be done by calling the in (incoming edges), out (outgoing
edges) or both (bidirectional edges) functions; for example, Alice’s friends can be ob-
tained:

SELECT EXPAND( OUT( ’Friend’ ) ) FROM Person WHERE name = ’Alice’

Edges without properties can be stored as lightweight edges that do not have a record
identifier but are stored as a link to the target vertex in the source vertex.
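
The same graph can also be created through the TinkerPop-compliant Java Graph API. The following sketch assumes the Blueprints OrientGraph class and an embedded (plocal) database; class and method names should be checked against the OrientDB documentation for the version in use.

// Sketch: creating the Alice/Bob friendship via the Blueprints-compatible Java Graph API.
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;

public class GraphApiExample {
    public static void main(String[] args) {
        OrientGraph graph = new OrientGraph("plocal:/persondb", "user", "password");
        try {
            Vertex alice = graph.addVertex("class:Person"); // vertex of class Person
            alice.setProperty("name", "Alice");
            Vertex bob = graph.addVertex("class:Person");
            bob.setProperty("name", "Bob");
            Edge knows = graph.addEdge(null, alice, bob, "Friend"); // edge with label Friend
            knows.setProperty("knows_since", "2010");
            graph.commit();     // the graph API is transactional
        } finally {
            graph.shutdown();
        }
    }
}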
Document API: The document API is Java-based. A document database can be
opened within a transaction ODatabaseDocumentTx by specifying a URL; the URL
determines whether the database is in-memory (memory:), embedded in the Java
application (plocal:) or running on a remote server (remote:). The open method
accepts a username and a password as strings. When the database is closed, all re-
sources will be released. The ODocument class represents a JSON document in the
database. It can be created, its fields can be set and then it can be saved. For exam-
ple, we can create a person document in an embedded database called persondb as
follows:

ODatabaseDocumentTx db =
    new ODatabaseDocumentTx("plocal:/persondb");
db.open("user", "password");
try {
    ODocument alice = new ODocument("Person");
    alice.field( "firstname", "Alice" );
    alice.field( "lastname", "Smith" );
    alice.field( "age", 31 );
    alice.save();
} finally {
    db.close();
}

Note that Person is the class name for the new document. The database offers methods
to iterate over all documents for a class (browseClass) and over all documents stored
in a cluster (browseCluster).
SQL queries can be passed to the database by calling the query() method and
passing an OSQLSynchQuery object with a SQL string.
For example:

List<ODocument> result = db.query(
    new OSQLSynchQuery<ODocument>(
        "select * from Person where firstname = ’Alice’"));

OrientDB also supports asynchronous queries (where instead of collecting all results
a result listener returns result records step by step), non-blocking queries (so that the
thread does not wait for the answer) and prepared queries (that accept various param-
eters).
SQL commands (like updates) can be passed to the database by calling the
command() method and passing an OCommandSQL object with a SQL string:

int recordsUpdated = db.command(
    new OCommandSQL(
        "update Person set lastname = ’Miller’ where firstname = ’Alice’")).execute();

Object API: Internally, OrientDB uses a mapping of objects into documents. When
reading in objects, they are constructed from the documents by using Java Reflection.
That is why each persisted class has to provide an empty constructor as well as getter
and setter methods for its non-transient and non-static fields.
An object database is accessed through a transaction OObjectDatabaseTx that is
created by specifying a URL and then opened. Immediately after opening, an entity manager
must be notified which objects are persistent; this registration can be done for indi-
vidual classes or entire packages. A non-proxied object is one that does not have a
representation in the database; a proxied object is one represented in the database.
With the newInstance method, an object will be created that is immediately proxied.

OObjectDatabaseTx db = new OObjectDatabaseTx("plocal:/persondb");
db.open("user", "password");
db.getEntityManager().registerEntityClasses(Person.class);
try {
    Person p = db.newInstance(Person.class);
    p.setFirstname("Alice");
    p.setLastname("Smith");
    p.setAge(31);
    db.save(p);
} finally {
    db.close();
}

If instead the object is created with the usual constructor, it will only be proxied after
calling the save method; then a proxied version of the object is returned that should
be reassigned to the old (non-proxied) reference:

OObjectDatabaseTx db = new OObjectDatabaseTx("plocal:/persondb");
db.open("user", "password");
db.getEntityManager().registerEntityClasses(Person.class);
try {
    Person p = new Person();
    p.setFirstname("Alice");
    p.setLastname("Smith");
    p.setAge(31);
    p = db.save(p);
} finally {
    db.close();
}

15.4.4 ArangoDB

ArangoDB is a multi-model database with a graph API, a key-value API and a docu-
ment API. Its query language AQL (ArangoDB query language) resembles SQL in parts
but adds several database-specific extensions to it.

Web resources:
– ArangoDB: https://www.arangodb.com/
– documentation page: https://www.arangodb.com/documentation/
– GitHub repository: https://github.com/arangodb

Documents are stored in collections. Each collection has an internal ID as well as a
unique name; the name can be set by the user. Whenever a document is created, a
document key is assigned to it (stored in a document field _key) that is unique inside
the collection; by default the key is system-generated but can also be provided by the
user. In addition, a system-wide ID (denoted by _id) is maintained that allows for
cross-collection accesses; it consists of the collection name and the key.
Inserting values into a database corresponds to inserting a JSON document as fol-
lows:

INSERT {firstname: "Alice", lastname: "Smith", Age:31} IN persons


A query is expressed with a for-in-return statement:

FOR person IN persons RETURN person

The output format can be modified in the return statement (for example, concatenat-
ing first and last name):

FOR person IN persons RETURN
    {name: CONCAT(person.firstname," ",person.lastname)}

A selection condition is expressed with a filter statement:

FOR person IN persons FILTER person.age > 30 RETURN person

The update statement modifies individual fields of a document:

FOR person IN persons FILTER person._key == 1 UPDATE person WITH
    {lastname: "Miller"} IN persons

The replace statement, in contrast, replaces the entire document:

FOR person IN persons FILTER person._key == 1
    REPLACE person WITH {firstname: "Jane", lastname: "Doe", Age:29}

In addition, AQL supports several statements to express joins between two documents
as well as aggregation and sorting. It can be extended by custom JavaScript code.
In the graph API, vertices are normal documents; edges are documents that
have the additional internal attributes _from and _to. To make edge handling more
efficient in the graph API, dedicated edge collections can be created that are auto-
matically indexed to allow for fast traversals. For example, all neighbors of a vertex
can be obtained by using the NEIGHBORS statement based on the vertex collection
persons, as well as the edge collection friendedges, and starting from the node with
ID "persons/alice" along outbound edges with the label knows:

NEIGHBORS(persons, friendedges, "persons/alice", "outbound",
    [ { "$label": "knows" } ] )

15.5 Bibliographic Notes

A critical discussion of NOSQL data stores in general is taking place; see for example
[IS12, Lea10, Sto10].

Several attempts have been undertaken to define common interfaces for different data
stores; see for example [HTV10, ABR14, GBR14]. However, the limitation of such an
approach is that expressiveness is limited: a uniform interface cannot support all the
special features and particularities of the individual data stores.
Other articles compare the features of several NoSQL stores [Cat11, HJ11, Pok13,
GHTC13] – yet as the feature set of the databases is rapidly changing, these results are
quite short-lived.
Moreover a variety of performance comparisons have been carried out [PPR+ 09,
CST+ 10, PPR+ 11, FTD+ 12, LM13, AB13, KSM13a, PPV13, ABF14] – however mostly over
synthetic data sets and with artificial workloads and in particular not in a production
environment.
The notion of polyglot persistence – as coined in [FS12] – describes the approach
that applications represent data in different data models and use a mix of database
systems: the application can choose the data store and access method that is most
appropriate for the current task.
Drill system internals have been described in [HN13] and Druid system internals
in [YTL+ 14].
Bibliography
[AAB05] Amitanand Aiyer, Lorenzo Alvisi, and Rida A Bazzi. On the availability of non-strict
quorum systems. In Distributed Computing, pages 48–62. Springer, 2005.
[AB13] Veronika Abramova and Jorge Bernardino. NoSQL databases: MongoDB vs Cas-
sandra. In Bipin C. Desai, Ana Maria de Almeida, and Sudhir P. Mudur, editors,
International C* Conference on Computer Science & Software Engineering (C3S2E),
pages 14–22. ACM, 2013.
[Aba07] Daniel J. Abadi. Column stores for wide and sparse data. In Third Biennial Confer-
ence on Innovative Data Systems Research, pages 292–297, 2007.
[Aba12] Daniel J. Abadi. Consistency tradeoffs in modern distributed database system
design: CAP is only part of the story. IEEE Computer, 45(2):37–42, 2012.
[ABF14] Veronika Abramova, Jorge Bernardino, and Pedro Furtado. Testing cloud benchmark
scalability with cassandra. In Services (SERVICES), 2014 IEEE World Congress on,
pages 434–441. IEEE, 2014.
[ABH+ 13] Daniel J. Abadi, Peter A. Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel
Madden. The design and implementation of modern column-oriented database
systems. Foundations and Trends in Databases, 5(3):197–280, 2013.
[ABPA+ 09] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, and
Alexander Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS tech-
nologies for analytical workloads. Proceedings of the VLDB Endowment, 2(1):922–
933, 2009.
[ABR14] Paolo Atzeni, Francesca Bugiotti, and Luca Rossi. Uniform access to NoSQL sys-
tems. Information Systems, 43:117–133, 2014.
[ACL+ 07] Mustafa Atay, Artem Chebotko, Dapeng Liu, Shiyong Lu, and Farshad Fotouhi.
Efficient schema-based XML-to-relational data mapping. Information Systems,
32(3):458–476, 2007.
[ADA12] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. Data Management in the
Cloud: Challenges and Opportunities. Synthesis Lectures on Data Management.
Morgan & Claypool, 2012.
[AKD06] Anish Arora, Sandeep S. Kulkarni, and Murat Demirbas. Resettable vector clocks.
Journal of Parallel and Distributed Computing, 66(2):221–237, 2006.
[AL97] Atul Adya and Barbara Liskov. Lazy consistency using loosely synchronized clocks.
In ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC),
pages 73–82. ACM, 1997.
[AMF06] Daniel J. Abadi, Samuel Madden, and Miguel Ferreira. Integrating compression and
execution in column-oriented database systems. In ACM SIGMOD International
Conference on Management of Data, pages 671–682. ACM, 2006.
[AMH08] Daniel J. Abadi, Samuel Madden, and Nabil Hachem. Column-stores vs. row-stores:
how different are they really? In ACM SIGMOD International Conference on Manage-
ment of Data, pages 967–980. ACM, 2008.
[AN10] Alex Averbuch and Martin Neumann. Partitioning graph databases – a quantita-
tive evaluation. Master’s thesis, KTH Stockholm, 2010. The Computing Research
Repository abs/1301.5121.
[ARB+ 06] Alexandre Andrade, Gabriela Ruberg, Fernanda Baião, Vanessa P Braganholo, and
Marta Mattoso. Efficiently processing XML queries over fragmented repositories
with PartiX. In Current Trends in Database Technology (EDBT), pages 150–163.
Springer, 2006.
[ASS13] Masoud Saeida Ardekani, Pierre Sutra, and Marc Shapiro. Non-monotonic snapshot
isolation: scalable and strong consistency for geo-replicated transactional systems.
In 32nd International Symposium on Reliable Distributed Systems (SRDS), pages
163–172. IEEE, 2013.
[AW04] Hagit Attiya and Jennifer Welch. Distributed computing: fundamentals, simulations,
and advanced topics, volume 19. John Wiley & Sons, 2004.
[BAB+ 12] Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell,
Jeremy Kepner, Andrew McCabe, Peter Michaleas, Julie Mullen, David O’Gwynn,
Andrew Prout, Albert Reuther, Antonio Rosa, and Charles Yee. Driving big data with
big compute. In IEEE Conference on High Performance Extreme Computing (HPEC),
pages 1–6. IEEE, 2012.
[BB13] Pablo Barceló Baeza. Querying graph databases. In 32nd ACM Symposium on
Principles of Database Systems (PODS), pages 175–188. ACM, 2013.
[BBG+ 95] Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O’Neil, and Patrick
O’Neil. A critique of ANSI SQL isolation levels. In ACM SIGMOD Record, volume 24,
pages 1–10. ACM, 1995.
[BD13] Philip A. Bernstein and Sudipto Das. Rethinking eventual consistency. In ACM
SIGMOD International Conference on Management of Data, pages 923–928. ACM,
2013.
[BDF+ 10] Gerth Stølting Brodal, Erik D. Demaine, Jeremy T. Fineman, John Iacono, Stefan
Langerman, and J. Ian Munro. Cache-oblivious dynamic dictionaries with up-
date/query tradeoffs. In Twenty-First Annual ACM-SIAM Symposium on Discrete
Algorithms SODA, pages 1448–1456. Society for Industrial and Applied Mathemat-
ics, 2010.
[Ber73] Gerald Berman. The gossip problem. Discrete Mathematics, 4(1):91, 1973.
[BFCF+ 07] Michael A. Bender, Martin Farach-Colton, Jeremy T. Fineman, Yonatan R. Fogel,
Bradley C. Kuszmaul, and Jelani Nelson. Cache-oblivious streaming B-trees. In 19th
Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA),
pages 81–92. ACM, 2007.
[BFG+ 13] Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M Hellerstein, and Ion Stoica. HAT,
not CAP: towards highly available transactions. In Proceedings of the 14th USENIX
conference on Hot Topics in Operating Systems, pages 24–24. USENIX Association,
2013.
[BG82] Daniel Barbará and Hector Garcia-Molina. How expensive is data replication? an
example. In 3rd International Conference on Distributed Computing Systems, pages
263–268. IEEE Computer Society, 1982.
[BG83] Philip A Bernstein and Nathan Goodman. Multiversion concurrency control – theory
and algorithms. ACM Transactions on Database Systems (TODS), 8(4):465–483,
1983.
[BG84] Philip A Bernstein and Nathan Goodman. An algorithm for concurrency control
and recovery in replicated distributed databases. ACM Transactions on Database
Systems (TODS), 9(4):596–615, 1984.
[BG85] Philip A Bernstein and Nathan Goodman. Serializability theory for replicated
databases. Journal of Computer and System Sciences, 31(3):355–374, 1985.
[BG13] Peter Bailis and Ali Ghodsi. Eventual consistency today: limitations, extensions,
and beyond. Communications of the ACM, 56(5):55–63, 2013.
[BGFvS09] Rena Bakhshi, Daniela Gavidia, Wan Fokkink, and Maarten van Steen. An analytical
model of information dissemination for a gossip-based protocol. Comput. Netw.,
53(13):2288–2303, August 2009.
[BGK+ 08] Prosenjit Bose, Hua Guo, Evangelos Kranakis, Anil Maheshwari, Pat Morin, Jason
Morrison, Michiel H. M. Smid, and Yihui Tang. On the false-positive rate of Bloom
filters. Information Processing Letters, 108(4):210–213, 2008.
[BGS+ 11] Dhruba Borthakur, Jonathan Gray, Joydeep Sen Sarma, Kannan Muthukkaruppan,
Nicolas Spiegelberg, Hairong Kuang, Karthik Ranganathan, Dmytro Molkov, Aravind
Menon, Samuel Rash, Rodrigo Schmidt, and Amitanand S. Aiyer. Apache Hadoop
goes realtime at Facebook. In ACM SIGMOD International Conference on Manage-
ment of Data, pages 1071–1080. ACM, 2011.
[BGY13] Sebastian Burckhardt, Alexey Gotsman, and Hongseok Yang. Understanding even-
tual consistency. Technical report, Technical Report MSR-TR-2013-39, Microsoft
Research, 2013.
[BHBL09] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far.
International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.
[BHEF11] Mihaela A Bornea, Orion Hodson, Sameh Elnikety, and Alan Fekete. One-copy serial-
izability with snapshot isolation under the hood. In Data Engineering (ICDE), 2011
IEEE 27th International Conference on, pages 625–636. IEEE, 2011.
[BHH09] Sebastian Bächle, Theo Härder, and Michael Peter Haustein. Implementing and
optimizing fine-granular lock management for XML document trees. In 14th Inter-
national Conference on Database Systems for Advanced Applications (DASFAA),
volume 5463 of Lecture Notes in Computer Science, pages 631–645. Springer, 2009.
[Blo70] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors.
Communications of the ACM, 13(7):422–426, 1970.
[BM14] Vanessa Braganholo and Marta Mattoso. A survey on XML fragmentation. ACM
SIGMOD Record, 43(3):24–35, 2014.
[BR02] Roberto Baldoni and Michel Raynal. Fundamentals of distributed computing: A
practical tour of vector clock systems. IEEE Distributed Systems Online, 3(2), 2002.
[Bre12] Eric Brewer. CAP twelve years later: How the "rules" have changed. Computer,
45(2):23–29, 2012.
[BRJ05] Grady Booch, James Rumbaugh, and Ivar Jacobson. The Unified Modeling Language
User Guide. Addison-Wesley, 2nd edition, 2005.
[BS14] Peter Baumann and Heinrich Stamerjohanns. Towards a systematic benchmark
for array database systems. In Specifying Big Data Benchmarks, pages 94–102.
Springer, 2014.
[BVF+ 12] Peter Bailis, Shivaram Venkataraman, Michael J Franklin, Joseph M Hellerstein,
and Ion Stoica. Probabilistically bounded staleness for practical partial quorums.
Proceedings of the VLDB Endowment, 5(8):776–787, 2012.
[BWK07] Hendrik Blockeel, Tijn Witsenburg, and Joost N. Kok. Graphs, hypergraphs, and
inductive logic programming. In Mining and Learning with Graphs. MLG, 2007.
[BZS14] David Bermbach, Liang Zhao, and Sherif Sakr. Towards comprehensive measure-
ment of consistency guarantees for cloud-hosted data storage services. In Perfor-
mance Characterization and Benchmarking, pages 32–47. Springer, 2014.
[CAH12] Marek Ciglan, Alex Averbuch, and Ladialav Hluchy. Benchmarking traversal opera-
tions over graph databases. In 28th International Conference on Data Engineering
Workshops (ICDEW), pages 186–189. IEEE, 2012.
[Cat11] Rick Cattell. Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4):12–
27, 2011.
[CB09] Thomas M. Connolly and Carolyn E. Begg. Database Systems: A Practical Approach
to Design, Implementation and Management. Addison-Wesley, 5th edition, 2009.
[CCA+ 10] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy,
and Russell Sears. MapReduce online. In Networked Systems Design and Imple-
mentation, pages 313–328. USENIX Association, 2010.
[CDG+ 06] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach,
Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A
distributed storage system for structured data. In 7th Symposium on Operating Sys-
tems Design and Implementation (OSDI’06), pages 205–218. USENIX Association,
2006.
[CGR07] Tushar D Chandra, Robert Griesemer, and Joshua Redstone. Paxos made live: an
engineering perspective. In Proceedings of the twenty-sixth annual ACM symposium
on Principles of distributed computing, pages 398–407. ACM, 2007.
[Cha10] Kang-tsung Chang. Introduction to geographic information systems. McGraw-Hill
New York, 2010.
[Che76] Peter Pin-Shan Chen. The entity-relationship model – toward a unified view of data.
ACM Transactions on Database Systems (TODS), 1(1):9–36, 1976.
[CK85] George P. Copeland and Setrag Khoshafian. A decomposition storage model. In
ACM SIGMOD International Conference on Management of Data, pages 268–279.
ACM, 1985.
[Cod70] Edgar F. Codd. A relational model of data for large shared data banks. Communica-
tions of the ACM, 13(6):377–387, 1970.
[COK86] Brian A. Coan, Brian M. Oki, and Elliot K. Kolodner. Limitations on database avail-
ability when networks partition. In Joseph Y. Halpern, editor, Fifth Annual ACM
Symposium on Principles of Distributed Computing, pages 187–194. ACM, 1986.
[CST+ 10] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell
Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st
ACM symposium on Cloud computing, pages 143–154. ACM, 2010.
[CZ12] Gary Chartrand and Ping Zhang. A first course in graph theory. Courier Dover
Publications, 2012.
[Dat07] Chris J. Date. Logic and Databases: The Roots of Relational Theory. Trafford Publish-
ing, 2007.
[DG04] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on
large clusters. In Operating Systems Design and Implementation (OSDI), pages
137–150. USENIX Association, 2004.
[DG10] Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing tool.
Communications of the ACM, 53(1):72–77, 2010.
[DGH+ 88] Alan J. Demers, Daniel H. Greene, Carl Hauser, Wes Irish, John Larson, Scott
Shenker, Howard E. Sturgis, Daniel C. Swinehart, and Douglas B. Terry. Epidemic al-
gorithms for replicated database maintenance. Operating Systems Review, 22(1):8–
32, 1988.
[DGMS85] Susan B Davidson, Hector Garcia-Molina, and Dale Skeen. Consistency in a parti-
tioned network: a survey. ACM Computing Surveys (CSUR), 17(3):341–370, 1985.
[DHJ+ 07] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall,
and Werner Vogels. Dynamo: Amazon’s highly available key-value store. In Sympo-
sium on Operating Systems Principles (SOSP), pages 205–220. ACM, 2007.
[Die12] Reinhard Diestel. Graph Theory, volume 173 of Springer Graduate Texts in Mathe-
matics. Springer, 2012.
[DQRJ+ 10] Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, and
Jörg Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even
noticing). Proceedings of the VLDB Endowment, 3(1):518–529, 2010.
[DS12] Miyuru Dayarathna and Toyotaro Suzumura. XGDBench: A benchmarking platform
for graph stores in exascale clouds. In IEEE International Conference on Cloud
Computing Technology and Science (CloudCom), pages 363–370. IEEE, 2012.
[DSMBMM+ 11] David Dominguez-Sal, Norbert Martinez-Bazan, Victor Muntes-Mulero, Pere Baleta,
and Josep Lluis Larriba-Pey. A discussion on the design of graph database bench-
marks. In Performance Evaluation, Measurement and Characterization of Complex
Systems – TPC Technology Conference Revised Selected Papers, volume 6417 of
Lecture Notes in Computer Science, pages 25–40. Springer, 2011.
[DSUBGV+ 10] David Dominguez-Sal, P. Urbón-Bayes, Aleix Giménez-Vañó, Sergio Gómez-Villamor,
Norbert Martínez-Bazan, and Josep-Lluis Larriba-Pey. Survey of graph database
performance on the HPC scalable graph analysis benchmark. In International Con-
ference on Web-Age Information Management (WAIM) Workshops, volume 6185 of
Lecture Notes in Computer Science, pages 37–48. Springer, 2010.
[DU11] Suzanne W Dietrich and Susan D Urban. Fundamentals of Object Databases: Object-
Oriented and Object-Relational Design. Synthesis Lectures on Data Management.
Morgan & Claypool, 2011.
[DuC13] Bob DuCharme. Learning Sparql. O’Reilly, 2013.
[ECM13] ECMA International. ECMA-404: The JSON data interchange format, 2013.
[EPZ05] Sameh Elnikety, Fernando Pedone, and Willy Zwaenepoel. Database replication us-
ing generalized snapshot isolation. In 24th IEEE Symposium on Reliable Distributed
Systems, pages 73–84. IEEE, 2005.
[FB99] Armando Fox and Eric A. Brewer. Harvest, yield and scalable tolerant systems. In
Workshop on Hot Topics in Operating Systems, pages 174–178, 1999.
[FBM10] Guilherme Figueiredo, Vanessa P. Braganholo, and Marta Mattoso. Processing
queries over distributed XML databases. Journal of Information and Data Manage-
ment, 1(3):455–470, 2010.
[Fie00] Roy Thomas Fielding. Architectural styles and the design of network-based software
architectures. PhD thesis, University of California, Irvine, 2000.
[FM82] Michael J. Fischer and A. Michael. Sacrificing serializability to attain high availabil-
ity of data. In Symposium on Principles of Database Systems (PODS), pages 70–75.
ACM, 1982.
[FM12] Lizhen Fu and Xiaofeng Meng. Efficient processing of updates in dynamic graph-
structured XML data. In 13th International Conference on Web-Age Information
Management (WAIM), volume 7418 of Lecture Notes in Computer Science, pages
254–265. Springer, 2012.
[FS12] Martin J. Fowler and Pramodkumar J. Sadalage. NoSQL Distilled: A Brief Guide to the
Emerging World of Polyglot Persistence. Prentice Hall, 2012.
[FTD+ 12] Avrilia Floratou, Nikhil Teletia, David J. DeWitt, Jignesh M. Patel, and Donghui
Zhang. Can the elephants handle the NoSQL onslaught? Proceedings of the VLDB
Endowment, 5(12):1712–1723, 2012.
[GBR14] Felix Gessert, Florian Bucklers, and Norbert Ritter. Orestes: A scalable database-as-
a-service architecture for low latency. In 6th International Workshop on Cloud Data
Management – Data Engineering Workshops (ICDEW), pages 215–222. IEEE, 2014.
[GHMT12] Michael T. Goodrich, Daniel S. Hirschberg, Michael Mitzenmacher, and Justin
Thaler. Cache-oblivious dictionaries and multimaps with negligible failure prob-
ability. In First Mediterranean Conference on Algorithms, volume 7659 of Lecture
Notes in Computer Science, pages 203–218. Springer, 2012.
[GHOS96] Jim Gray, Pat Helland, Patrick E. O’Neil, and Dennis Shasha. The dangers of repli-
cation and a solution. In ACM SIGMOD International Conference on Management of
Data, pages 173–182. ACM, 1996.
[GHTC13] Katarina Grolinger, Wilson A Higashino, Abhinav Tiwari, and Miriam AM Capretz.
Data management in cloud environments: Nosql and newsql data stores. Journal of
Cloud Computing: Advances, Systems and Applications, 2(1):22, 2013.
[Gif79] David K Gifford. Weighted voting for replicated data. In Proceedings of the seventh
ACM symposium on Operating systems principles, pages 150–162. ACM, 1979.
[GL02] Seth Gilbert and Nancy A. Lynch. Brewer’s conjecture and the feasibility of consis-
tent, available, partition-tolerant web services. SIGACT News (ACM), 33(2):51–59,
2002.
[GLP93] Giorgio Gallo, Giustino Longo, and Stefano Pallottino. Directed hypergraphs and
applications. Discrete Applied Mathematics, 42(2):177–201, 1993.
[GLS11] Wojciech Golab, Xiaozhou Li, and Mehul A Shah. Analyzing consistency properties
for fun and profit. In 30th annual ACM SIGACT-SIGOPS symposium on Principles of
distributed computing, pages 197–206. ACM, 2011.
[GMUW08] Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. Database Systems:
The Complete Book. Prentice Hall, 2nd edition, 2008.
[GÖ10] Lukasz Golab and M Tamer Özsu. Data stream management. Synthesis Lectures on
Data Management. Morgan & Claypool Publishers, 2010.
[Gro08] W3C XML Core Working Group. Extensible markup language (XML) 1.0 (fifth edition)
W3C recommendation, 2008.
[Gru02] Torsten Grust. Accelerating XPath location steps. In ACM SIGMOD International
Conference on Management of Data, pages 109–120. ACM, 2002.
[HCC11] Ian Heywood, Sarah Cornelius, and Steve Carter. An Introduction to Geographical
Information Systems. Pearson, 2011.
[HD08] Allison L. Holloway and David J. DeWitt. Read-optimized databases, in depth. Pro-
ceedings of the VLDB Endowment, 1(1):502–513, 2008.
[HJ11] Robin Hecht and S Jablonski. NoSQL evaluation: A use case oriented survey. In
International Conference on Cloud and Service Computing, pages 336–341, 2011.
[HM10] Terry Halpin and Tony Morgan. Information Modeling and Relational Databases.
Morgan Kaufmann, 2010.
[HMB90] Antony L. Hosking, J. Eliot B. Moss, and Cynthia Bliss. Design of an object faulting
persistent smalltalk. COINS Technical Report, pages 90–45, 1990.
[HN13] Michael Hausenblas and Jacques Nadeau. Apache Drill: interactive ad-hoc analysis
at scale. Big Data, 1(2):100–104, 2013.
[HP13] Florian Holzschuher and René Peinl. Performance of graph query languages: com-
parison of Cypher, Gremlin and native access in Neo4j. In Proceedings of the Joint
EDBT/ICDT 2013 Workshops, pages 195–204. ACM, 2013.
[HRSD07] Allison L. Holloway, Vijayshankar Raman, Garret Swart, and David J. DeWitt. How to
barter bits for chronons: compression and bandwidth trade offs for database scans.
In ACM SIGMOD International Conference on Management of Data, pages 389–400.
ACM, 2007.
[HTV10] Till Haselmann, Gunnar Thies, and Gottfried Vossen. Looking into a REST-based
universal API for database-as-a-service systems. In Commerce and Enterprise
Computing (CEC), 2010 IEEE 12th Conference on, pages 17–24. IEEE, 2010.
[IGN+ 12] Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, K. Sjoerd Mullender,
and Martin L. Kersten. MonetDB: Two decades of research in column-oriented
database architectures. IEEE Data Engineering Bulletin, 35(1):40–45, 2012.
[Int11] International Organization for Standardization. ISO/IEC 9075:2011 Information
technology – Database languages – SQL, 2011.
[IS12] Maria Indrawan-Santiago. Database research: Are we at a crossroad? reflection on
NoSQL. In 15th International Conference on Network-Based Information Systems
(NBiS), pages 45–51. IEEE, 2012.
[JC06] Kuen-Fang Jack Jea and Shih-Ying Chen. A high concurrency XPath-based locking
protocol for XML databases. Information & Software Technology, 48(8):708–716,
2006.
[JPPMAK03] Ricardo Jiménez-Peris, Marta Patiño-Martínez, Gustavo Alonso, and Bettina Kemme.
Are quorums an alternative for data replication? ACM Transactions on Database
Systems (TODS), 28(3):257–294, 2003.
[Juk13] Nenad Jukic. Database Systems: Introduction to Databases and Data Warehouses.
Pearson, 2013.
[JV13] Salim Jouili and Valentin Vansteenberghe. An empirical comparison of graph
databases. In International Conference on Social Computing (SocialCom), pages
708–715. IEEE, 2013.
[JVG+ 07] Márk Jelasity, Spyros Voulgaris, Rachid Guerraoui, Anne-Marie Kermarrec, and
Maarten van Steen. Gossip-based peer sampling. ACM Trans. Comput. Syst., 25(3),
August 2007.
[KA10] Bettina Kemme and Gustavo Alonso. Database replication: a tale of research across
communities. Proceedings of the VLDB Endowment, 3(1-2):5–12, 2010.
[KA14] Vinit Kumar and Ajay Agarwal. Ht-paxos: High throughput state-machine replication
protocol for large clustered data centers. CoRR, abs/1407.1237, 2014.
[KCJ+ 87] Setrag Khoshafian, George P. Copeland, Thomas Jagodis, Haran Boral, and Patrick
Valduriez. A query processing strategy for the decomposed storage model. In
International Conference on Data Engineering, pages 636–643. IEEE Computer
Society, 1987.
[KGGS15] Lukas Kircher, Michael Grossniklaus, Christian Grün, and Marc H Scholl. Efficient
structural bulk updates on the Pre/Dist/Size XML encoding. In 31st International
Conference on Data Engineering (ICDE), pages 447–458. IEEE, 2015.
[KGT+ 10] Jens Krueger, Martin Grund, Christian Tinnefeld, Hasso Plattner, Alexander Zeier,
and Franz Faerber. Optimizing write performance for read optimized databases. In
Database Systems for Advanced Applications, pages 291–305. Springer, 2010.
[KK98] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for
partitioning irregular graphs. SIAM Journal on scientific Computing, 20(1):359–392,
1998.
[KL00] Kamalakar Karlapalem and Qing Li. A framework for class partitioning in object-
oriented databases. Distributed and Parallel Databases, 8(3):333–366, 2000.
[KLL+ 97] David R. Karger, Eric Lehman, Frank Thomson Leighton, Rina Panigrahy, Matthew S.
Levine, and Daniel Lewin. Consistent hashing and random trees: Distributed
caching protocols for relieving hot spots on the world wide web. In Frank Thom-
son Leighton and Peter W. Shor, editors, Twenty-Ninth Annual ACM Symposium on
the Theory of Computing, pages 654–663, 1997.
[KM08] Adam Kirsch and Michael Mitzenmacher. Less hashing, same performance: Build-
ing a better Bloom filter. Random Structures & Algorithms, 33(2):187–218, 2008.
[KÖD10] Patrick Kling, M Tamer Özsu, and Khuzaima Daudjee. Generating efficient execution
plans for vertically partitioned XML databases. Proceedings of the VLDB Endow-
ment, 4(1):1–11, 2010.
[KSM13a] Vojtěch Kolomičenko, Martin Svoboda, and Irena Holubová Mlỳnková. Experimental
comparison of graph databases. In Proceedings of International Conference on
Information Integration and Web-based Applications & Services, page 115. ACM,
2013.
[KSM13b] Vojtěch Kolomičenko, Martin Svoboda, and Irena Holubová Mlýnková. Experimental
comparison of graph databases. In Proceedings of International Conference on
Information Integration and Web-based Applications & Services, pages 115:115–
115:124. ACM, 2013.
[KvS07] Anne-Marie Kermarrec and Maarten van Steen. Gossiping in distributed systems.
SIGOPS Operating Systems Review, 41(5):2–7, October 2007.
[KZK11] Euclid Keramopoulos, Michael Zounaropoulos, and George Kourouleas. A compar-
ison study of object-oriented database management systems. Fourth International
Theoretical and Practical Conference on Object Systems, 2011.
[Lam78] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system.
Communications of the ACM, 21(7):558–565, 1978.
[Lam79] Leslie Lamport. How to make a multiprocessor computer that correctly executes
multiprocess programs. IEEE Transactions on Computers, 28(9):690–691, 1979.
[Lam86] Leslie Lamport. On interprocess communication. Distributed computing, 1(2):86–
101, 1986.
[Lam98] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems
(TOCS), 16(2):133–169, 1998.
[Lam05] Leslie Lamport. Generalized consensus and Paxos. Technical Report MSR-TR-2005-33,
Microsoft Research, 2005.
[Lam06] Leslie Lamport. Fast Paxos. Distributed Computing, 19(2):79–103, 2006.
[Lam11] Leslie Lamport. Byzantizing Paxos by refinement. In Distributed Computing, pages
211–224. Springer, 2011.
[Lar05] Craig Larman. Applying UML and Patterns: An Introduction to Object-Oriented Anal-
ysis and Design and Iterative Development. Pearson, 2005.
[LC01] Dongwon Lee and Wesley W. Chu. CPI: Constraints-preserving inlining algorithm for
mapping XML DTD to relational schema. Data & Knowledge Engineering, 39(1):3–
25, 2001.
[Lea10] Neal Leavitt. Will NoSQL databases live up to their promise? Computer, 43(2):12–
14, 2010.
[LFKA11] Wyatt Lloyd, Michael J Freedman, Michael Kaminsky, and David G Andersen. Don’t
settle for eventual: scalable causal consistency for wide-area storage with COPS. In
Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles,
pages 401–416. ACM, 2011.
[LFV+ 12] Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, Nga Tran, Ben Vandiver, Lyric
Doshi, and Chuck Bear. The Vertica analytic database: C-store 7 years later. Pro-
ceedings of the VLDB Endowment, 5(12):1790–1801, 2012.
[LG07] Fakhar Lodhi and Muhammad Ahmad Ghazali. Design of a simple and effective
object-to-relational mapping technique. In ACM Symposium on Applied Computing
(SAC), pages 1445–1449. ACM, 2007.
[Lin12] Jimmy Lin. MapReduce is good enough? If all you have is a hammer, throw away
everything that’s not a nail! CoRR, abs/1209.2191, 2012.
[LKPMJP05] Yi Lin, Bettina Kemme, Marta Patiño-Martínez, and Ricardo Jiménez-Peris. Mid-
dleware based data replication providing snapshot isolation. In ACM SIGMOD
International Conference on Management of Data, pages 419–430. ACM, 2005.
[LL95] Mark Levene and George Loizou. A graph-based data model and its ramifications.
IEEE Transactions on Knowledge and Data Engineering, 7(5):809–823, 1995.
[LLC+ 11] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon.
Parallel data processing with MapReduce: a survey. ACM SIGMOD International
Conference on Management of Data, 40(4):11–20, 2011.
[LLC14] Aldelir Fernando Luiz, Lau Cheuk Lung, and Miguel Correia. MITRA: Byzantine fault-
tolerant middleware for transaction processing on replicated databases. ACM
SIGMOD Record, 43(1):32–38, 2014.
[LLH08] Changqing Li, Tok Wang Ling, and Min Hu. Efficient updates in dynamic XML data:
from binary string to quaternary string. The International Journal on Very Large Data
Bases, 17(3):573–601, 2008.
[LLSG92] Rivka Ladin, Barbara Liskov, Liuba Shrira, and Sanjay Ghemawat. Providing high
availability using lazy replication. ACM Transactions on Computer Systems (TOCS),
10(4):360–391, 1992.
[LM99] Meng-Jang Lin and Keith Marzullo. Directional gossip: Gossip in a wide area net-
work. In EDCC, pages 364–379. Springer, 1999.
[LM10] Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured
storage system. SIGOPS Operating Systems Review, 44(2):35–40, 2010.
[LM13] Yishan Li and Sathiamoorthy Manoharan. A performance comparison of SQL and
NoSQL databases. In Pacific Rim Conference on Communications, Computers and
Signal Processing (PACRIM), pages 15–19. IEEE, 2013.
[LS71] Raymond A. Lorie and Andrew J. Symonds. A relational access method for interac-
tive applications. In Courant Computer Science Symposia, volume 6, pages 99–124,
1971.
[LSP82] Leslie Lamport, Robert E. Shostak, and Marshall C. Pease. The Byzantine generals
problem. ACM Transactions on Programming Languages and Systems (TOPLAS),
4(3):382–401, 1982.
[MA11] Giorgos Margaritis and Stergios V. Anastasiadis. RangeMerge: Online performance
tradeoffs in NoSQL datastores. Technical Report DCS 2011-13, Department of
Computer Science, University of Ioannina, 2011.
[ME98] Wai Yin Mok and David W. Embley. Using NNF to transform conceptual data models
to object-oriented database designs. Data & Knowledge Engineering, 24(3):313–
336, 1998.
[Mer87] Ralph C. Merkle. A digital signature based on a conventional encryption function. In
International Cryptology Conference (CRYPTO), pages 369–378. Springer, 1987.
[MG12] Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Data Stream Management:
Processing High-Speed Data Streams. Springer, 2012.
[MGL+ 10] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivaku-
mar, Matt Tolton, and Theo Vassilakis. Dremel: interactive analysis of web-scale
datasets. Proceedings of the VLDB Endowment, 3(1-2):330–339, 2010.
[MK13] Pavel Mička and Zdeněk Kouba. DAO dispatcher pattern: A robust design of the
data access layer. In Fifth International Conferences on Pervasive Patterns and
Applications (PATTERNS), pages 1–6. IARIA, 2013.
[Mos92] J. Eliot B. Moss. Working with persistent objects: to swizzle or not to swizzle. IEEE
Transactions on Software Engineering, 18(8):657–673, 1992.
[MS10] Hui Ma and Klaus-Dieter Schewe. Fragmentation of XML documents. Journal of
Information and Data Management, 1(1):21–34, 2010.
[MSL+ 11] Prince Mahajan, Srinath Setty, Sangmin Lee, Allen Clement, Lorenzo Alvisi, Mike
Dahlin, and Michael Walfish. Depot: Cloud storage with minimal trust. ACM Trans-
actions on Computer Systems (TOCS), 29(4):12, 2011.
[MT13] Vojtěch Merunka and Jakub Tuma. Normalization rules of the object-oriented data
model. In Innovations and Advances in Computer, Information, Systems Sciences,
and Engineering, volume 152 of Lecture Notes in Electrical Engineering, pages
1077–1089. Springer, 2013.
[MW15] Nathan Marz and James Warren. Big Data: Principles and best practices of scalable
realtime data systems. Manning Publications Co., 2015.
[OCGO96] Patrick E. O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth J. O’Neil. The log-
structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.
[Oli07] Antoni Olivé. Conceptual modeling of information systems. Springer, 2007.
[O’N08] Elizabeth J. O’Neil. Object/relational mapping 2008: Hibernate and the entity data
model (EDM). In ACM SIGMOD International Conference on Management of Data,
pages 1351–1356. ACM, 2008.
[OOP+ 04] Patrick E. O’Neil, Elizabeth J. O’Neil, Shankar Pal, Istvan Cseri, Gideon Schaller,
and Nigel Westbury. ORDPATHs: Insert-friendly XML node labels. In ACM SIGMOD
International Conference on Management of Data, pages 903–908. ACM, 2004.
[OR13] Martin F. O’Connor and Mark Roantree. FibLSS: A scalable label storage scheme for
dynamic XML updates. In 17th East European Conference on Advances in Databases
and Information Systems (ADBIS), volume 8133 of Lecture Notes in Computer Sci-
ence, pages 218–231. Springer, 2013.
[ORS+ 08] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew
Tomkins. Pig latin: a not-so-foreign language for data processing. In Proceedings
of the 2008 ACM SIGMOD international conference on Management of data, pages
1099–1110. ACM, 2008.
[ÖV11] M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems.
Springer, 2011.
[PBA+ 10] Nuno M. Preguiça, Carlos Baquero, Paulo Sérgio Almeida, Victor Fonte, and Ricardo
Gonçalves. Dotted version vectors: Logical clocks for optimistic replication. The
Computing Research Repository, abs/1011.5808, 2010.
[PK07] Peter Pleshachkov and Sergei Kuznetsov. SXDGL: Snapshot based concurrency
control protocol for XML data. In 5th International XML Database Symposium on
Database and XML Technologies (XSym), volume 4704 of Lecture Notes in Computer
Science, pages 122–136. Springer, 2007.
[PL94] Alexandra Poulovassilis and Mark Levene. A nested-graph model for the repre-
sentation and manipulation of complex objects. ACM Transactions on Information
Systems (TOIS), 12(1):35–68, 1994.
[Pla11] Hasso Plattner. SanssouciDB: An in-memory database for processing enterprise
workloads. In BTW, volume 20, pages 2–21, 2011.
[Pok13] Jaroslav Pokorny. NoSQL databases: a step to database scalability in web environ-
ment. International Journal of Web Information Systems, 9(1):69–82, 2013.
[PPR+ 83] Douglas Stott Parker, Gerald J. Popek, Gerard Rudisin, Allen Stoughton, Bruce J.
Walker, Evelyn Walton, Johanna M. Chow, David A. Edwards, Stephen Kiser, and
Charles S. Kline. Detection of mutual inconsistency in distributed systems. IEEE
Transactions on Software Engineering (TSE), 9(3):240–247, 1983.
[PPR+ 09] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt,
Samuel R. Madden, and Michael Stonebraker. A comparison of approaches to large
scale data analysis. In ACM SIGMOD International Conference on Management of
Data, Providence, Rhode Island, USA, 2009. ACM.
[PPR+ 11] Swapnil Patil, Milo Polte, Kai Ren, Wittawat Tantisiriroj, Lin Xiao, Julio López, Garth
Gibson, Adam Fuchs, and Billie Rinaldi. YCSB++: benchmarking and performance
debugging advanced features in scalable table stores. In Proceedings of the 2nd
ACM Symposium on Cloud Computing, page 9. ACM, 2011.
[PPV13] Zachary Parker, Scott Poe, and Susan V Vrbsky. Comparing NoSQL MongoDB to an
SQL DB. In Proceedings of the 51st ACM Southeast Conference, page 5. ACM, 2013.
[PRT14] Vinit Padhye, Gowtham Rajappan, and Anand Tripathi. Transaction management
using causal snapshot isolation in partially replicated databases. In 33rd Interna-
tional Symposium on Reliable Distributed Systems (SRDS), pages 105–114. IEEE,
2014.
[PST+ 97] Karin Petersen, Mike Spreitzer, Douglas B. Terry, Marvin Theimer, and Alan J. De-
mers. Flexible update propagation for weakly consistent replication. In Symposium
on Operating Systems Principles (SOSP), pages 288–301. ACM, 1997.
[PZ11] Hasso Plattner and Alexander Zeier. In-memory data management. Springer,
Heidelberg, 2011.
[RC13] Florin Rusu and Yu Cheng. A survey on array storage, query languages, and sys-
tems. arXiv preprint arXiv:1302.0103, 2013.
[RG77] James B. Rothnie and Nathan Goodman. A survey of research and development
in distributed database management. In Third International Conference on Very
Large Data Bases (VLDB), pages 48–62. VLDB Endowment, 1977.
[RH93] Mark A. Roth and Scott J. Van Horn. Database compression. ACM SIGMOD Record,
22(3):31–39, 1993.
[Ric11] Catherine Ricardo. Databases Illuminated. Jones & Bartlett Learning, 2011.
[RL14] Maryam Razavian and Patricia Lago. A lean and mean strategy: a data migration
industrial study. Journal of Software: Evolution and Process, 26(2):141–171, 2014.
[RST11] Jun Rao, Eugene J. Shekita, and Sandeep Tata. Using paxos to build a scalable,
consistent, and highly available datastore. Proceedings of the VLDB Endowment,
4(4):243–254, 2011.
[RW12] Eric Redmond and Jim R. Wilson. Seven Databases in Seven Weeks: A Guide to
Modern Databases and the NoSQL Movement. Pragmatic Programmers, 2012.
[RWE13] Ian Robinson, Jim Webber, and Emil Eifrem. Graph databases. O’Reilly, 2013.
[SAB+ 05] Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cher-
niack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J.
O’Neil, Patrick E. O’Neil, Alex Rasin, Nga Tran, and Stanley B. Zdonik. C-store: A
column-oriented DBMS. In International conference on very large databases (VLDB).
ACM, 2005.
[SB09] Russell Sears and Eric A. Brewer. Segment-based recovery: write-ahead logging
revisited. Proceedings of the VLDB Endowment, 2(1):490–501, 2009.
[SBZB13] Michael Stonebraker, Paul Brown, Donghui Zhang, and Jacek Becla. SciDB: A
database management system for applications with complex analytics. Comput-
ing in Science & Engineering, 15(3):54–62, 2013.
[SCB+ 14] Mohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee, Fabian Nagel,
and Kenneth A Ross. Reducing database locking contention through multi-version
concurrency. Proceedings of the VLDB Endowment, 7(13), 2014.
[SFKP13] David Schwalb, Martin Faust, Jens Krueger, and Hasso Plattner. Physical column
organization in in-memory column stores. In Database Systems for Advanced Appli-
cations, pages 48–63. Springer, 2013.
[SK92] Mukesh Singhal and Ajay D. Kshemkalyani. An efficient implementation of vector
clocks. Information Processing Letters, 43(1):47–52, 1992.
[SLJ12] Weifeng Shan, Husheng Liao, and Xueyuan Jin. XML concurrency control protocols:
A survey. In International Conference on Web-Age Information Management (WAIM)
Workshops, volume 7419 of Lecture Notes in Computer Science, pages 299–308.
Springer, 2012.
[SMK+ 01] Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari Balakrish-
nan. Chord: A scalable peer-to-peer lookup service for internet applications. In ACM
SIGCOMM Conference, pages 149–160. ACM, 2001.
[SPAL11] Yair Sovran, Russell Power, Marcos K Aguilera, and Jinyang Li. Transactional storage
for geo-replicated systems. In Proceedings of the Twenty-Third ACM Symposium on
Operating Systems Principles, pages 385–400. ACM, 2011.
[Spi12] Richard Paul Spillane. Efficient, Scalable, and Versatile Application and System
Transaction Management for Direct Storage Layers. PhD thesis, Computer Science
Department, Stony Brook University, 2012.
[SS05] Yasushi Saito and Marc Shapiro. Optimistic replication. ACM Computing Surveys,
37(1):42–81, 2005.
[Sto10] Michael Stonebraker. SQL databases v. NoSQL databases. Communications of the
ACM, 53(4):10–11, 2010.
[Sto12] Michael Stonebraker. New opportunities for new SQL. Communications of the ACM,
55(11):10–11, 2012.
[STZ+ 99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt,
and Jeffrey F. Naughton. Relational databases for querying XML documents: Limita-
tions and opportunities. In 25th International Conference on Very Large Data Bases
(VLDB), pages 302–314. Morgan Kaufmann, 1999.
[SW13] Michael Stonebraker and Ariel Weisberg. The VoltDB main memory DBMS. IEEE
Data Engineering Bulletin, 36(2):21–27, 2013.
[TBM+ 11] Andrew Twigg, Andrew Byde, Grzegorz Milos, Tim D. Moreton, John Wilkes, and Tom
Wilkie. Stratified B-trees and versioning dictionaries. CoRR, abs/1103.4282, 2011.
[TDP+ 94] Douglas B. Terry, Alan J. Demers, Karin Petersen, Mike Spreitzer, Marvin Theimer,
and Brent B. Welch. Session guarantees for weakly consistent replicated data. In
Conference on Parallel and Distributed Information Systems (PDIS), pages 140–149.
IEEE Computer Society, 1994.
[TDW+ 12] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao,
and Daniel J. Abadi. Calvin: Fast distributed transactions for partitioned database
systems. In ACM SIGMOD International Conference on Management of Data. ACM,
2012.
[TGGL82] Irving L Traiger, Jim Gray, Cesare A Galtieri, and Bruce G Lindsay. Transactions
and consistency in distributed database systems. ACM Transactions on Database
Systems (TODS), 7(3):323–342, 1982.
[Tho79] Robert H Thomas. A majority consensus approach to concurrency control for multi-
ple copy databases. ACM Transactions on Database Systems (TODS), 4(2):180–209,
1979.
[TRA96] Francisco J. Torres-Rojas and Mustaque Ahamad. Plausible clocks: Constant size
logical clocks for distributed systems. In Workshop on Distributed Algorithms
(WDAG), pages 71–88. Springer, 1996.
[TRL12] Sasu Tarkoma, Christian Esteve Rothenberg, and Eemil Lagerspetz. Theory and
practice of Bloom filters for distributed systems. IEEE Communications Surveys and
Tutorials, 14(1):131–155, 2012.
[TSJ+ 09] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka,
Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehous-
ing solution over a map-reduce framework. Proceedings of the VLDB Endowment,
2(2):1626–1629, 2009.
[TSK+ 10] Ilya Taranov, Ivan Shcheklein, Alexander Kalinin, Leonid Novak, Sergei Kuznetsov,
Roman Pastukhov, Alexander Boldakov, Denis Turdakov, Konstantin Antipin, Andrey
Fomichev, Peter Pleshachkov, Pavel Velikhov, Nikolai Zavaritski, Maxim Grinev,
Maria Grineva, and Dmitry Lizorkin. Sedna: Native XML database management sys-
tem (internals overview). In ACM SIGMOD International Conference on Management
of Data, pages 1037–1046. ACM, 2010.
[TSS97] Zahir Tari, John Stokes, and Stefano Spaccapietra. Object normal forms and depen-
dency constraints for object-oriented schemata. ACM Transactions on Database
Systems (TODS), 22(4):513–569, 1997.
[TVB+ 02] Igor Tatarinov, Stratis Viglas, Kevin S. Beyer, Jayavel Shanmugasundaram, Eugene J.
Shekita, and Chun Zhang. Storing and querying ordered XML using a relational
database system. In ACM SIGMOD International Conference on Management of
Data, pages 204–215. ACM, 2002.
[TvS06] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and
Paradigms (2nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2006.
[VCLM13] Luis Vaquero, Félix Cuadrado, Dionysios Logothetis, and Claudio Martella. Adaptive
partitioning for large-scale dynamic graphs. In Proceedings of the 4th Annual Sym-
posium on Cloud Computing, SOCC ’13, pages 35:1–35:2, New York, NY, USA, 2013.
ACM.
[vL09] Axel van Lamsweerde. Requirements Engineering: From System Goals to UML Mod-
els to Software Specifications. John Wiley & Sons, 2009.
[WA09] Weihan Wang and Cristiana Amza. On optimal concurrency control for optimistic
replication. In 29th IEEE International Conference on Distributed Computing
Systems (ICDCS), pages 317–326. IEEE, 2009.
[WH10] Andreas M. Weiner and Theo Härder. An integrative approach to query optimization
in native XML database management systems. In International Database Engineer-
ing and Applications Symposium (IDEAS), pages 64–74. ACM, 2010.
[Woo12] Peter T. Wood. Query languages for graph databases. ACM SIGMOD Record,
41(1):50–60, 2012.
[WPS+ 00] Matthias Wiesmann, Fernando Pedone, André Schiper, Bettina Kemme, and Gus-
tavo Alonso. Database replication techniques: A three parameter classification. In
19th IEEE Symposium on Reliable Distributed Systems, pages 206–215. IEEE, 2000.
[WV01] Gerhard Weikum and Gottfried Vossen. Transactional information systems: theory,
algorithms, and the practice of concurrency control and recovery. Morgan Kauf-
mann, 2001.
[XLW12] Liang Xu, Tok Wang Ling, and Huayu Wu. Labeling dynamic XML documents: an
order-centric approach. IEEE Transactions on Knowledge and Data Engineering,
24(1):100–113, 2012.
[YH97] Li-Hsing Yen and Ting-Lu Huang. Resetting vector clocks in distributed systems.
Journal of Parallel and Distributed Computing, 43(1):15–20, 1997.
[YTL+ 14] Fangjin Yang, Eric Tschetter, Xavier Léauté, Nelson Ray, Gian Merlino, and Deep
Ganguli. Druid: a real-time analytical data store. In Proceedings of the 2014 ACM
SIGMOD international conference on Management of data, pages 157–168. ACM,
2014.
[ZB12] Marcin Zukowski and Peter A. Boncz. Vectorwise: Beyond column stores. IEEE Data
Engineering Bulletin, 35(1):21–27, 2012.
[ZCD+ 12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy
McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed
datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceed-
ings of the 9th USENIX conference on Networked Systems Design and Implementa-
tion, pages 2–2. USENIX Association, 2012.
Index

2PC see two-phase commit class 12, 193
2PL see two-phase locking client-centric consistency 305–306
clock 276–292
acceptor 269, 272 cloud database 39
ACID properties 26, 39, 295, 316 clustering 249
adjacency 45, 60 collision 176
adjacency list 50–51 column family 161, 163
adjacency matrix 46–48 column name 163
Aerospike 315 column qualifier 163, 167–169, 176
affinity 249 column store 143
agent 266 column striping 151
all-or-nothing principle 3, 26 combine 108
AllegroGraph 311 comission failure 238
allocation see data allocation compaction 173–175, 187, 189, 191, 243
Ambari 120 compatibility matrix 99
anomaly 19, 20, 33, 161, 196, 301 complete graph 42, 43
anti-entropy 240 composite attribute 9, 11, 18, 19, 217
ArangoDB 330 compression 144
array database 313 concurrency 24, 25, 131, 237, 261, 263, 280,
association 13, 194, 197, 198, 204 282, 283, 288, 290, 319
association class 13, 198, 204 concurrency control 26–28, 97–100, 139,
atomicity 26 266–276, 308
attribute 9, 11–14, 17, 19, 22, 34, 53–56, 71, 72, concurrent events 280
77, 85, 100, 103, 193, 213, 217–219 consensus problem 266
– composite 9, 11, 18, 19, 217 consistency 4, 26, 237, 261, 263, 271, 295–307,
– key 18, 19, 207 320, 321, 323
– multi-valued 9, 11, 13, 18, 203, 204, 218 – eventual see eventual consistency
attribute table 56 – trade-offs 306
Avro 120, 325 – weak see weak consistency
axis 81 consistent hashing 257
convergent replicated data types 130
B-tree 92, 94, 96, 171 coordinator 266
backward traversal 52 Couchbase 139
big data 38 CouchDB 136
bit-vector encoding 145 counter column 170
Bloom filter 175–181 crash failure 238
breadth-first search 44
bucket 108 dangling references 200
Byzantine failure 239 DAO see Data Access Object
Data Access Object 202
candidate key 21 data allocation 255–259
CAP principle 306 data distribution problem 256
causal consistency 304 data locality 108, 144, 163
causality 277, 280 data replication problem 265
– effective 304 data stream 312
data-centric consistency 303 finite state machine 156
database-as-a-service 39 Flink 312
DataNucleus 229 FLOWR expression 82
decision phase 266 Flume 120
definition level 152, 156 foreign key 18–20, 33, 84, 86, 87, 89, 111, 163,
depth-first search 44 203–205, 212, 213
derived fragmentation 250 forward traversal 52
DeweyID 80, 96 fragmentation 245–254
dictionary encoding 146 frame of reference encoding 146
difference 22
differential encoding 148 generalized hyperedge 59
directed graph 43 Geode 315
directed hyperedge 58 geographic information system 314
directed multigraph 43 GeoJSON 314
distribution transparency 236 GeoServer 314
Document Object Model 76 GIS see geographic information system
document order 76 gossip 239–241
Document Type Definition 71–73 graph 41–45
dotted version vectors 131, 291 – complete 42, 43
Dremel 151 – directed 43
Drill 324 – multi-relational 53
Druid 326 – oriented 43
DTD see Document Type Definition – simple 42, 43
durability 26 – single-relational 53
Dynamo 257 – undirected 42
– weighted 44
edge 41, 42 graph partitioning 252
edge cut 248, 254 graph problems 45
edge label 53, 54 graph traversal 44
edge list 46 GRASS GIS 314
edge marking 227
edge table 56 Hadoop 118
end tag 69 Hamilton Cycle 45
entity 8, 17–19, 33, 71, 161, 163, 254 Hamilton Path 45
entity lifecycle 211, 215 happened before 277
entity-relationship model 8–11 happened-before relation 277
epidemic protocol 239–241, 265 hash function 176, 252, 255, 257
ERM see entity-relationship model hash tree 241–243
Eulerian Cycle 45 Hazelcast 315
Eulerian Path 45 HDFS 118
eventual consistency 304 head set 59
eXistDB 100 hinted handoff 265
Extensible Markup Language 69–71 history 300
extensible record store 161 Hive 127
homogeneity 35, 143
fail-recover 239 horizontal fragmentation 249
fail-stop 239 hybrid fragmentation 250
failure 171, 237–239, 262, 264, 265, 267, 268, hyperedge 58
271, 272 – directed 58
– generalized 59 leader 269, 271
– oriented 59 lean and mean 317
– undirected 58 learner 269, 271
hypergraph 58 linked data 311
HyperGraphDB 66 lock escalation 99
hypernode 61 locking 7, 27, 98, 319
Log-Structured Merge Tree 171
idempotent 117 logical clock 277
identifier overflow 96 lost update 282, 290, 291, 295, 298
identifier stability 96
immutable data files 166 main memory 4–7, 83, 92, 143, 162, 163, 170,
in-memory database 315 171, 208, 209, 223–228, 315
incidence 45 main memory address 6, 226, 227
– negative 53 main memory table 167
– positive 53 map 106, 107
incidence list 51–53, 61 map-reduce 106–109, 118
incidence matrix 48–49, 61 master-slave replication 262
inconsistency window 303, 306 memtable 167, 172
index 90, 100, 103, 171 Merkle tree see hash tree
inlining 86 method 12, 193
integrity 4, 24, 25, 118 method chaining 64
interface 194, 201 migration 237, 317, 318
interrelational constraints 18 MonetDB 158
intersection 22 MongoDB 133
intrarelational constraints 18 multi-level index 171
inverse attributes 200 multi-master replication 263
isolation 26, 97, 306 multi-model database 322, 324, 327, 330
multi-relational graph 53
Java Data Objects 215–217 multi-valued attribute 9, 11, 13, 18, 203, 204,
Java Persistence API 209–214 218
Java Persistence Query Language 209 multiedge 43
Java Script Object Notation 101, 110–112, 116, multigraph 42, 43
229, 314, 319, 325, 326, 328 – directed 43
JDO see Java Data Objects – undirected 43
Jena 311 multiplicities 13, 151
join 23, 34, 35, 89, 91, 92, 124, 162, 203, 206, multiversion concurrency control 276
207, 209, 214, 247, 249, 250, 253, 327, 331 MVCC see multiversion concurrency control
JPQL see Java Persistence Query Language
JSON see Java Script Object Notation natural join 22
JSON object 110 negative incidence 53
JSON Schema 112–116 Neo4J 65
nested graph 61
key attribute 18, 19, 207 network partition 238
key-value pair 63, 101, 103, 105, 106, 109, 110, NewSQL 315
113, 119, 155, 162, 169, 173 node 41
node label 53, 54, 56
labeling scheme see numbering scheme node marking 228
lambda architecture 322 node table 56
Lamport clock see scalar clock node test 81
non-blocking reads 276 prefix numbering 80
non-redundancy 4, 246, 318 preorder numbering 78
non-resident 225 primary key 19–21, 86, 88, 182, 196, 211, 215
normalization 19, 20, 33, 161, 196–199, 218 projection 22, 214, 249
Not only SQL 38 property 54
null suppression 149 property graph 53–55
nullipotent 117 proposer 269, 271
numbering scheme 78–81
QGIS 314
Object Data Management Group 201 quorum 265, 298–299
object identifier 194–196
Object Management Group 201 range query 168
object normal form 196–199 Rasdaman 313
object-relational databases 217–222 RDF see resouce description framework
object-relational impedance mismatch 194 reachability 213, 215, 224
object-relational mapping 202–217 read phase 269, 276
omission failure 238 read repair 265
one-copy serializability 296, 297, 301 Read-one write-all 298
operator tree 23 recovery 172, 264
optimistic concurrency control 26, 98 recursion 34, 35
OrdPath 80, 97 Redis 132
OrientDB 327 redo logging 171
oriented graph 43 reduce 106, 107
oriented hyperedge 59 redundancy 4, 8, 19, 33, 203, 206, 218, 262
reengineering 317
page buffer 5, 167, 208, 209, 225, 226 referential integrity 20, 200
page split 95 relation schema 17, 18, 34, 37
Parquet 158, 325 relational algebra 22
partial quorum 299 relational calculus 22, 23
partition tolerance 302, 306, 307 relational query language 22
partitioning 108, 245 relationship 9, 19, 34, 41, 58, 65, 194, 200, 213,
Paxos 268–274 224
peer-to-peer replication 263 reliability 4, 236, 261, 264, 291, 323
persistence 4, 200, 202, 213, 215, 216, 223, 224 renaming 22
pessimistic concurrency control 26, 98 renumbering 80
Pig 121 repetition level 152, 156
point query 168 replication 4, 237, 261–266, 301–303, 315
pointer swizzling 226–228 replication factor 261, 263, 298
polyglot persistence 320 Representational State Transfer 116–117
position bit-string 150 resident 225
position list 149 resident object table 226
position range 150 resilient distributed datasets 121
positive incidence 53 resource description framework 311
PostGIS 314 REST see Representational State Transfer
postorder numbering 78 Riak 129
Pre/Dist/Size encoding 102 round-tripping 90
pre/post diagram 79 row key 162, 163, 167–170, 176
pre/post numbering 78 ROWA see Read-one write-all
predicate 81 rumor spreading 240
run-length encoding 144 Structured Query Language 22, 23
subclass 12, 14, 194, 199, 204–208, 212
Samza 312 superclass 14, 194, 199, 204, 205, 207, 208,
scalability 3, 38, 235, 236, 302, 320, 324 211, 212
scalar clock 277–281 synchronization 263, 285, 286, 288, 290
Scalaris 315
schedule 7, 27 tail set 58
schema evolution 8, 37–39, 174, 223, 319 target node 43, 46, 54
schema independence 37 target set 59
schema-based mapping 84, 86 Tez 120
schemaless 37, 38, 105, 253, 318, 319, 325 three-phase commit 268
schemaless mapping 84, 89 time-to-live value 164, 166, 167, 169, 173, 187,
SciDB 313 244, 291
selection 22, 248, 250, 251, 253 timestamp 166
semantic overloading 34, 41 timestamp scheduler 27
semi-structured 3, 69, 109, 320 TinkerPop 62, 327
sequential consistency 295, 296 TokuDB 316
serializability 27, 296, 297, 299, 301 tombstone 167, 172
service level agreement 39 trailer 171
Sesame 311 transaction 24–28, 33, 36, 65, 97, 99, 133, 172,
session guarantees 305 211, 216, 246, 250, 252, 266, 276, 296,
shadow node 251, 253 297, 300–303, 319, 322, 328, 329
sharding 245, 253 transitive closure 34
shared-nothing architecture 235 transparency 236
shuffle 106, 107 traversal 44
sibling version 287 – backward 52
Simple API for XML 76 – forward 52
simple graph 42, 43 triple store 311
single-level storage 224 TTL see time-to-live value
single-relational graph 53 tuple reconstruction 144, 148
sliding window 312 two-level storage 208
snapshot 26, 118, 120, 223, 315 two-phase commit 266
snapshot isolation 300–301 two-phase locking 27
– non-monotonic 304 typed table 217, 221, 222
– parallel 304
source node 43, 46, 54 UML see Unified Modeling Language
source set 58 undirected graph 42
spanning tree 45 undirected hyperedge 58
Spark 120 undirected multigraph 43
SPARQL 311 Unified Modeling Language 11, 201
specialization 14, 194, 196, 202, 204 union 22, 34, 35, 207, 247, 250
split 106, 107 upsert 117, 166, 172
SQL see Structured Query Language
SQL object 217 validation phase 276
SQL/XML 84 vector clock 281–284, 289–292
Sqoop 128 vector clock bounding 289
start tag 69 vector clock comparison 283
Storm 312 version vector 284–289
strong clock property 281 versioning 37, 166, 174, 223, 319
vertex 41, 42 XML see Extensible Markup Language
vertical fragmentation 249 XML Parser 75
virtual heap 225, 254 XML Schema 73–75
Virtuoso 311 XML tree 76
visibility 12, 191, 193, 306 XPath 81–82
VoltDB 315 XQuery 82–83
voting phase 266 XSD see XML Schema
XSLT 83–84
weak clock property 281
weak consistency 39, 299, 302–303
YARN 120
weighted graph 44
wide column store 161
write phase 269, 276 ZooDB 230
write-ahead logging 172 ZooKeeper 120