DZone Refcard 335: Distributed SQL Essentials, 2022


335 BROUGHT TO YOU IN PARTNERSHIP WITH

Distributed SQL Essentials

ANDREW OLIVER
SR. DIRECTOR OF PRODUCT MARKETING, MARIADB

CONTENTS

•  What Is Distributed SQL?
•  Fundamentals of Distributed SQL Architecture
   −  Consider Your Workload
   −  Mixed Models
   −  Additional Infrastructure
   −  Consider Mixed Workload Support
   −  Making and Meeting Performance Requirements
   −  Understanding Consistency
   −  Cloud, Home, or Hybrid
•  Conclusion and Further Reading

REFCARD | JANUARY 2022

Distributed SQL databases combine the resilience and scalability of a NoSQL database with the full functionality of a relational database. Since the 1980s, the relational model and the SQL query language have been the dominant approach that businesses adopt to develop critical applications.

While the 2010s saw the emergence of databases ranging from NoSQL to various data stores in the Hadoop ecosystem, most mission-critical transactional applications have remained on traditional relational databases like Oracle and SQL Server.

WHAT IS DISTRIBUTED SQL?
Distributed SQL databases are becoming popular with organizations that are interested in moving data infrastructure to the cloud and/or cloud-native environments in order to reduce TCO and move away from the horizontal scaling limitations of monolithic RDBMSs like Oracle, PostgreSQL, MySQL, and SQL Server.

Basic characteristics of distributed SQL:

•  A SQL API for querying and modeling data with support for traditional RDBMS features like foreign keys, partial indexes, stored procedures, and triggers

•  Automatic distributed query execution so that no single node becomes a bottleneck

•  Automatic and transparent distributed data storage, including indexes sharded across multiple nodes of the cluster, so that no single node becomes a bottleneck; data distribution ensures high performance and high availability

•  Replication with strong consistency and distributed ACID transactions

Traditional relational databases were architected for transactional integrity, low disk storage, and high performance, all on a single-server architecture. This requires a larger server or instance whenever there is an increase in the amount of data, query volume, or data complexity. Monolithic architectures do not, by themselves, support high availability and disaster recovery.

Clustering technologies like Oracle RAC and Galera have addressed some of the issues of high availability, and replication technologies like Oracle Streams and Oracle GoldenGate have allowed for forms of disaster recovery and cross-data center replication, but they come at a heavy cost in both dollars and performance.

Traditional databases added forms of data partitioning to better use storage hardware, but these partitioning methods increase application complexity and do not deliver performance at "internet scale" without in-memory caching technologies. When combined, caching, clustering, replication, and manual sharding strategies lead to an extremely brittle architecture.

NoSQL has not been a panacea, either. The lack of transactional integrity, flexible queries, and join functionality is the primary reason organizations have not adopted NoSQL technologies outside of low-risk applications and niche areas where there was no choice but to sacrifice database functionality in order to deal with the required high data volume and multi-data center deployments.

Traditional databases excel at joins and transactional data integrity, but not when combined with the requirement to handle large data volumes and global data replication.

The promise of distributed SQL databases is that they allow you to achieve internet scale and high availability without sacrificing data integrity, full RDBMS features (such as joins), and a standard query language (SQL). These new distributed architectures embrace and take advantage of modern technologies like cloud computing and Kubernetes.

HOW DISTRIBUTED SQL DATABASES WORK
NoSQL databases dropped much of the functionality of SQL databases because it is difficult to implement at scale. Features like transactional consistency were not as strictly important for the "big data" use cases these databases were created for. As Google faced increasingly critical workloads, it developed a technology called "Spanner," which was outlined in a paper published in 2012.

The basic structure outlined in the paper is common across all distributed SQL databases. Distributed SQL databases include a SQL execution engine that distributes queries to multiple servers. They include a distributed storage engine that shards and replicates data (see Figure 1).

Figure 1
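The two components shown in Figure 1 can be illustrated with a minimal Python sketch. It is not any vendor's implementation: it assumes hash sharding on the primary key and a simple scatter-gather aggregate, and the names `Node`, `Cluster`, `shard_for`, and `total_where` are hypothetical. Real engines also replicate every shard and push down far richer query plans.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """One storage node holding a shard of the table as key -> row pairs."""
    name: str
    rows: dict = field(default_factory=dict)

    def partial_sum(self, column, predicate):
        # Runs locally on the node: filter and aggregate only its own shard.
        return sum(row[column] for row in self.rows.values() if predicate(row))


class Cluster:
    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def shard_for(self, key):
        # Stable hash so the same key always routes to the same node.
        digest = hashlib.sha256(str(key).encode()).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def insert(self, key, row):
        self.shard_for(key).rows[key] = row

    def get(self, key):
        # A point lookup touches exactly one node.
        return self.shard_for(key).rows.get(key)

    def total_where(self, column, predicate):
        # Scatter the aggregate to every node, then gather the partial results.
        return sum(node.partial_sum(column, predicate) for node in self.nodes)


cluster = Cluster(["node-a", "node-b", "node-c"])
for order_id in range(1, 101):
    cluster.insert(
        order_id,
        {"id": order_id, "amount": order_id * 10, "paid": order_id % 2 == 0},
    )

paid_total = cluster.total_where("amount", lambda row: row["paid"])
```

The point lookup in `get` and the aggregate in `total_where` show the two routing patterns: a query that names a key goes to one node, while a query over many rows is split so that each node does a share of the work.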

Google Spanner and most distributed SQL databases fall on the "CP" side of the CAP theorem, meaning they favor consistency and partition tolerance. According to the CAP theorem, this means these databases sacrifice "100% availability"; however, this is a "theoretical" analysis of their storage and replication model. If the database achieves a high enough level of reliability through other means, then the theoretical loss of availability is less of a concern.

Data in a distributed SQL database is exposed as tables but stored in key-value pairs. Tables in distributed SQL databases are divided into smaller units, essentially sliced into "partitions" (which the Spanner paper calls tablets). Each partition is replicated to a number of server nodes. Peers live on separate instances and generally separate availability zones and data centers. For instance, if the replication factor is 3, then there are three replicas of each table partition. This replication has strong consistency.

One node is elected as the leader using a consensus protocol (Paxos in Spanner, Raft in most other distributed SQL databases). When an application writes to the database, a set of keys is locked, and the state change is replicated among the nodes in strict order. A log ensures ordered data replication regardless of failures or cluster changes.

Distributed SQL databases allow for multi-zone deployments, where data for each partition is replicated using the consensus protocol. Leaders coordinate writes and are usually distributed evenly among multiple zones.

In the event a zone or region becomes unavailable, a new leader is elected in one of the remaining zones or regions. Data is copied from surviving replicas to existing nodes to maintain fault tolerance and data distribution (see Figures 2 and 3). If new nodes are added, the data is rebalanced among the new nodes, increasing distribution and performance.
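The replication behavior described above can be sketched in a few lines of Python. This is a simplified model, not Paxos or Raft: it assumes a replication factor of 3, one replica per zone, an already elected leader, and a rule that a write is accepted only when a majority of replicas is reachable. The names `Replica`, `Partition`, `write`, and `recover` are hypothetical, and leader election, terms, and log repair are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    zone: str
    up: bool = True
    log: list = field(default_factory=list)  # ordered, append-only change log


class Partition:
    """One table partition with one replica in each of `replication_factor` zones."""

    def __init__(self, zones, replication_factor=3):
        if len(zones) < replication_factor:
            raise ValueError("need at least one zone per replica")
        self.replicas = [Replica(zone) for zone in zones[:replication_factor]]
        self.leader = self.replicas[0]  # assumed to be already elected

    @property
    def quorum(self):
        return len(self.replicas) // 2 + 1  # majority, e.g. 2 of 3

    def write(self, entry):
        live = [replica for replica in self.replicas if replica.up]
        if not self.leader.up or len(live) < self.quorum:
            # The "CP" choice: refuse the write rather than risk divergence.
            raise RuntimeError("no quorum: write rejected")
        for replica in live:
            replica.log.append(entry)  # same entry, same order, on every live replica
        return len(self.leader.log)

    def recover(self, replica):
        # A returning replica catches up by copying the leader's ordered log.
        replica.log = list(self.leader.log)
        replica.up = True


partition = Partition(["zone-1", "zone-2", "zone-3"])
partition.write("SET balance:42 = 100")

partition.replicas[2].up = False   # one zone fails: 2 of 3 is still a majority
partition.write("SET balance:42 = 90")

partition.replicas[1].up = False   # second failure: only the leader remains
try:
    partition.write("SET balance:42 = 80")
    rejected = False
except RuntimeError:
    rejected = True

partition.recover(partition.replicas[1])
partition.recover(partition.replicas[2])
logs = [replica.log for replica in partition.replicas]
```

The rejected third write is the availability trade-off of a "CP" design: with only one of three replicas reachable, the partition stops accepting writes instead of letting replicas diverge.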


Figure 2

Figure 3
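The failover and rebalancing steps illustrated by Figures 2 and 3 can be approximated with the Python sketch below. It is a toy placement function under stated assumptions: every zone is assumed to hold a replica of every partition, so any surviving zone can take over leadership, and a new zone is assumed to have received its copy of the data before it leads anything. The function name `rebalance` and the zone and partition names are hypothetical. Real systems choose leaders through the consensus protocol rather than a central function.

```python
from collections import Counter


def rebalance(assignment, nodes):
    """Return a new partition -> node map that uses only `nodes`.

    A partition moves only if its node is gone or already holds its fair share.
    """
    nodes = list(nodes)
    fair_share = -(-len(assignment) // len(nodes))  # ceiling division
    load = {node: 0 for node in nodes}
    result, to_move = {}, []
    for partition, node in assignment.items():
        if node in load and load[node] < fair_share:
            result[partition] = node      # stays where it is
            load[node] += 1
        else:
            to_move.append(partition)
    for partition in to_move:
        node = min(nodes, key=load.get)   # least-loaded node takes over
        result[partition] = node
        load[node] += 1
    return result


partitions = [f"p{i}" for i in range(6)]
zones = ["zone-a", "zone-b", "zone-c"]
# Leaders start out spread evenly across the three zones.
initial = {p: zones[i % len(zones)] for i, p in enumerate(partitions)}

# zone-b becomes unavailable: only its two partitions get new leaders.
after_failure = rebalance(initial, ["zone-a", "zone-c"])
moved = {p for p in partitions if after_failure[p] != initial[p]}

# A new zone is added: load is spread out again.
after_scale_out = rebalance(after_failure, ["zone-a", "zone-c", "zone-d"])
load_after_failure = Counter(after_failure.values())
load_after_scale_out = Counter(after_scale_out.values())
```

In this run only the two partitions led from the failed zone change hands, and adding a zone brings each zone back to two partitions.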

NEWSQL VS. DISTRIBUTED SQL
Before the Spanner paper, a number of "NewSQL" databases were developed. These databases do not offer the same level of functionality as distributed SQL databases. In some cases, they are merely a sharding strategy built on existing monolithic databases like MySQL and PostgreSQL. Some do not provide global consistency or failover.

Distributed SQL databases are a subset of NewSQL databases that scale, shard, fail over, and provide global consistency. Because there is no governing standard for distributed SQL databases, some NewSQL vendors have begun calling their systems "distributed SQL," despite lacking distributed SQL features.

FUNDAMENTALS OF DISTRIBUTED SQL ARCHITECTURE
Using or migrating to a distributed SQL database requires considering your workload, application architecture, performance, consistency requirements, geographic topology, and execution environment.

Match application requirements to the capabilities and configuration of a distributed SQL database to ensure your business and customer needs are met.

CONSIDER YOUR WORKLOAD
Distributed SQL databases are good for moving mission-critical and system-of-record workloads to cloud computing environments while providing resilience and consistency across redundant zones. However, this distribution inherently causes more latency. Some legacy workloads may not tolerate distribution. In those cases, a client-server RDBMS with read replicas or multiple writers may be more appropriate even though it does not scale to the same level as a distributed SQL database.

To migrate an existing application, consider whether the distributed SQL database is sufficiently compatible with your previous database’s APIs and SQL dialect. Most distributed SQL databases are compatible with either MariaDB and MySQL or PostgreSQL. Some distributed SQL databases also provide compatibility features for Oracle (even including PL/SQL) and SQL Server.

If an application is written using an Object-Relational Mapping (ORM) tool, consider a distributed SQL database that is well supported by that ORM.

MIXED MODELS
While NoSQL databases, particularly document databases, are famous for storing JSON, SQL databases now also support JSON. In fact, JSON support is part of the SQL:2016 standard. Different distributed SQL and SQL databases have extended support for JSON. It may be possible to consolidate both SQL and JSON workloads into a single distributed SQL database.
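As a rough illustration of mixing relational and JSON data in one SQL database, the sketch below uses Python's built-in `sqlite3` module as a stand-in. SQLite is not a distributed SQL database, and the example assumes a SQLite build that includes the JSON functions. The `orders` table and its columns are made up for the example. `json_extract` is SQLite's function name; other databases spell this differently (SQL:2016 defines `JSON_VALUE`, for instance), so check the dialect of the database you are evaluating.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, details TEXT NOT NULL)"
)
orders = [
    (1, "acme", {"channel": "web", "items": 3}),
    (2, "acme", {"channel": "store", "items": 1}),
    (3, "globex", {"channel": "web", "items": 5}),
]
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(oid, customer, json.dumps(details)) for oid, customer, details in orders],
)

# One query mixes a relational column (customer) with fields inside the JSON document.
rows = conn.execute(
    """
    SELECT customer, SUM(json_extract(details, '$.items')) AS items
    FROM orders
    WHERE json_extract(details, '$.channel') = 'web'
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()
```

Here the document-style fields (`channel`, `items`) live next to ordinary columns, and a single SQL statement filters and aggregates across both.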


ADDITIONAL INFRASTRUCTURE
A full distributed SQL system is more than a database. It requires administration tools, load balancers, and a database proxy in order to fully serve its purpose. A load balancer may not be sufficient for a modern everyday application. In the case of a node failure, a database proxy can potentially retry in-flight transactions and fail over sessions to the new instance. This capability reduces the burden on application clients and removes the need for developers to write complicated exception-handling code for transaction failures.

CONSIDER MIXED WORKLOAD SUPPORT
Distributed SQL databases are transactional by nature, meaning they support a large number of short reads, writes, and simple queries that pull back a small number of rows. Traditional relational databases supported parallel queries and other types of relatively long-running, semi-analytical queries. Some distributed SQL databases support columnar indexes and the ability to run more ad hoc and analytical queries. Look at your entire workload and usage patterns to ensure the distributed SQL database can handle most of your application needs. If consuming the database as a service, look at add-on capabilities such as columnar storage.

MAKING AND MEETING PERFORMANCE REQUIREMENTS
Using your business requirements as a foundation, establish technical requirements for scalability and latency. Scalability requirements are expressed in terms of queries per second, writes per second, reads per second, data volume, and relative growth. This enables the cluster to be sized appropriately. Latency requirements should be expressed as nominal and maximum delay for writes and reads on the system. As the application matures, develop more specific requirements for different components or services. For instance, a sale might take 10ms, but other requests might be allowed to take 3ms. Consider the database as a part of your total "latency budget."

UNDERSTANDING CONSISTENCY
Business-critical applications rely upon ACID guarantees. When data is written in one row in one table on one server, it is easy for databases to be "ACID." When multiple rows or multiple tables are written, it is important to understand the database’s consistency guarantees and their failure modes. Ensuring consistency requires some degree of locking or copying and waiting on acknowledgment, so when seeking higher throughput or lower latency, it may make sense to compromise read consistency guarantees for select datasets or clients.

Write consistency should never be compromised for system-of-record applications. For anything from ever-growing data to informational status, consistency may not be as important, or at least not equally as important to all applications or application clients. The database may offer different modes of isolation, for instance, REPEATABLE_READ and SERIALIZABLE. These offer different trade-offs in terms of consistency and performance under concurrent access. Making these compromises in integrity depends on your database’s capabilities, the nature of the data, and the needs of your application.

CLOUD, HOME, OR HYBRID
Many applications and services have moved to public clouds. Some distributed SQL databases are offered as a vendor-hosted database as a service. Although it may be more costly, the distributed SQL database provider has the most expertise with running that type of database and can apply the best practices learned across multiple customer installs. The cost may also balance out when labor, hardware, and risk are accounted for.

The downside of a managed offering is the loss of control. A vendor may not support a really old version of their database because of costs. An organization may not want to assume the risk of an upgrade until their application or service is revised. Some applications, datasets, or industries cannot consider the public cloud due to cost, data sensitivity, or specific security requirements. Running locally also may achieve lower latency with data at the edge of the network.

Many organizations have a combination of sensitive and less-sensitive data and may want a hybrid cloud consisting of local instances as well as public cloud instances. Data may even be replicated between private and public instances in various configurations. While there are many potential advantages of a hybrid cloud approach to database hosting, it is more complex to operate and requires a platform or tool to manage the database cost effectively.

COST CONSIDERATIONS
Distributed SQL databases replicate at the database layer as opposed to the storage layer. Competitive technologies like Amazon Aurora use the storage layer for replication and redundancy. Generally, distributed SQL databases are cost-competitive compared to storage layer replication. However, when measuring cost in cloud-hosted distributed SQL databases, instance pricing is only one small component of the overall system. Generally speaking, IOPS are a larger component. Additionally, while one database may support cheaper instances, it may require more of them to achieve the same performance. When measuring and comparing costs, price for performance is the most important factor.

CONCLUSION
Distributed SQL databases are critical infrastructure for taking systems of record to the cloud or operating them at internet scale. These systems match the scale and resilience of a NoSQL database with the full-featured performance of a relational database. Whether the consideration is for an existing workload or a new system of record, any application that requires scale, low latency, transactional integrity, or general resilience is a good candidate for a distributed SQL database. Anyone architecting a modern application that takes advantage of a distributed database should investigate tools like Hasura or Prisma as well as Object-Relational Mapping tools like Hibernate. These tools may allow for overall distributed application optimization including optimizing database operations. Deploying and managing a distributed database is more complex than a monolithic database and requires either extensive DevOps expertise and tooling or a distributed SQL management platform that takes care of the details. This is also true when monitoring key performance and cost metrics. The rewards of distributed SQL technology, including globally consistent transactions, high availability, and scalability, make these challenges worthwhile, especially when deployed with the right DevOps tooling and architecture.

FURTHER READING

•  "Spanner: Google's Globally-Distributed Database" – https://research.google/pubs/pub39966/

•  "What You Need to Know About Distributed SQL" – https://dzone.com/articles/what-you-need-to-know-about-distributed-sql

•  "Getting Started With Distributed SQL" Refcard – https://dzone.com/refcardz/getting-started-with-distributed-sql

•  "Beyond NoSQL: The Case for Distributed SQL" – https://www.infoworld.com/article/3564543/beyond-nosql-the-case-for-distributed-sql.html

WRITTEN BY ANDREW OLIVER, SR. DIRECTOR OF PRODUCT MARKETING, MARIADB
Andrew C. Oliver is the Senior Director of Product Marketing for MariaDB. He is a prolific writer about technology, particularly open-source and distributed database technologies. In the past, he served on the board of the Open Source Initiative, founded Apache POI, and was an early part of JBoss, Inc. before its acquisition by Red Hat. Find him over on Twitter @acoliver.

600 Park Offices Drive, Suite 300
Research Triangle Park, NC 27709
888.678.0399 | 919.678.0300

At DZone, we foster a collaborative environment that empowers developers and tech professionals to share knowledge, build skills, and solve problems through content, code, and community. We thoughtfully, and with intention, challenge the status quo and value diverse perspectives so that, as one, we can inspire positive change through technology.

Copyright © 2022 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means of electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
