
DISTRIBUTED DATA STORE
-UNDER DBMS FLIP CLASSROOM BY
LOKESH DANGRE
MADHURI KURHADE
OJAS SONEWANE
What is a Distributed Data Store?

A distributed data store is a system that stores and processes data on multiple machines.
WHY DO WE NEED IT?

Performance, Scalability, and Reliability


PERFORMANCE

Performance is how quickly and efficiently a system does its work, usually measured as latency and throughput.


Performance is critical. Studies have quantified the business impact of delays as short as 100 ms.
Slow response times don't just frustrate people; they cost traffic, sales, and ultimately revenue.
SCALABILITY

Scalability is the ability to increase or decrease infrastructure resources.
Applications today often experience rapid growth and cyclical usage patterns. To meet these load
requirements, we "scale" our distributed data stores: we provision more or fewer resources on demand.
Scalability comes in two forms: vertical scaling (adding resources to a single machine) and horizontal
scaling (adding more machines).
RELIABILITY

Reliability is the probability that a system runs without failure over a given period.


Some applications are so critical to our lives that even seconds of failure are unacceptable. These
applications cannot use single-machine data stores because of the unavoidable hardware and network
failures that could compromise the entire service. Instead, we use distributed data stores because they can
tolerate the failure of individual computers or network paths.
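As a rough illustration of why several machines help, the sketch below computes the probability that at least one copy of the data is reachable. It assumes that machines fail independently and that each is up with the same probability; both are simplifications, and the function name and the numbers are hypothetical.

```python
def availability_with_replicas(p_single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` machines is up.

    Assumption (not from the slides): every machine is up with the same
    probability `p_single`, and failures are independent of each other.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All machines are down with probability (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_single) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, availability_with_replicas(0.99, n))
```

Real failures are often correlated (a shared rack, power supply, or network link), so the figure this gives is an optimistic upper bound rather than a prediction.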
HOW DOES IT WORK?

PARTITIONING, QUERY ROUTING, AND REPLICATION


PARTITIONING
• Data sets are often too large to be stored on a single machine. To overcome
this, we partition our data into smaller subsets that individual machines can
store and process. There are many ways to partition data, each with its own
tradeoffs. The two main approaches are vertical and horizontal partitioning.
• Vertical partitioning means splitting data up by related fields. Fields can be
related for many reasons. They might be properties of some common object.
They might be fields that are commonly accessed together by queries. They
might even be fields that are accessed at similar frequencies or by users with
similar permissions. The exact way you vertically partition data across
machines ultimately depends on the properties of your data store and the
usage patterns you are optimizing for.
• Horizontal partitioning (also known as sharding) is when we split up data
into subsets all with the same schema. For example, we can horizontally
partition a relational database table by grouping rows into shards to be stored
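on separate machines. We shard data when a single machine cannot handle
either the amount of data or the query load for that data. Sharding strategies
fall into two categories, algorithmic and dynamic, though hybrids exist.
Algorithmic sharding computes the shard from the key with a fixed function;
dynamic sharding looks the shard up in a separate locator service.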
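The two approaches above can be sketched in a few lines. This is a minimal illustration, not a production design: the function names, the field groups, and the use of a SHA-256 hash modulo the shard count (one simple form of algorithmic sharding) are all assumptions made for the example.

```python
import hashlib


def vertical_partition(record: dict, key_field: str, field_groups: dict) -> dict:
    """Split one record into sub-records, one per group of related fields.

    The key field is copied into every part so the parts can be rejoined.
    """
    parts = {}
    for group_name, fields in field_groups.items():
        part = {key_field: record[key_field]}
        for field in fields:
            part[field] = record[field]
        parts[group_name] = part
    return parts


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (algorithmic sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def horizontal_partition(rows: list, key_field: str, num_shards: int) -> list:
    """Group rows into shards; every shard keeps the full schema."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(str(row[key_field]), num_shards)].append(row)
    return shards


if __name__ == "__main__":
    user = {"id": 7, "name": "Ada", "email": "ada@example.com", "last_login": "2024-01-01"}
    groups = {"profile": ["name", "email"], "activity": ["last_login"]}
    print(vertical_partition(user, "id", groups))

    rows = [{"id": i, "name": f"user{i}"} for i in range(10)]
    for index, shard in enumerate(horizontal_partition(rows, "id", 3)):
        print(index, [row["id"] for row in shard])
```

A plain hash-modulo scheme moves most keys when the number of shards changes, which is one reason real systems often use other placement schemes or a dynamic lookup service instead.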
QUERY ROUTING
Partitioning the data is only part of the story. We still need to route queries from the client to
the correct backend machine. Query routing can happen at different levels of the software
stack. Let’s see the three basic cases.
• Client-side partitioning is when the client holds the decision logic for which backend node to
query. The advantage is the conceptual simplicity, and the disadvantage is that each client must
implement query routing logic.
• Proxy-based partitioning is when the client sends all queries to a proxy. This proxy then
determines which backend node to query. This can help reduce the number of concurrent
connections on your backend servers and separate application logic from routing logic.
• Server-based partitioning is when the client connects to any backend node, and the node will
either handle, redirect, or forward the request.
In practice, most distributed data stores handle query routing for you. Typically, you configure a
client and then query through it. However, if you are building your own distributed data
store, or using a product that does not route queries for you (standalone Redis, for example),
you will need to take this into consideration.
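For readers who do need to route queries themselves, here is a minimal sketch of the proxy-based case. The `BackendNode` and `RoutingProxy` classes are hypothetical, the backends are in-memory dictionaries standing in for real servers, and hash-modulo placement is an assumption chosen for brevity.

```python
import hashlib


class BackendNode:
    """Stand-in for one backend machine; stores data in memory."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def put(self, key: str, value):
        self.data[key] = value

    def get(self, key: str):
        return self.data.get(key)


class RoutingProxy:
    """Receives every query and forwards it to the node that owns the key."""

    def __init__(self, nodes: list):
        if not nodes:
            raise ValueError("at least one backend node is required")
        self.nodes = list(nodes)

    def node_for(self, key: str) -> BackendNode:
        # Stable hash so the same key always reaches the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key: str, value):
        self.node_for(key).put(key, value)

    def get(self, key: str):
        return self.node_for(key).get(key)


if __name__ == "__main__":
    proxy = RoutingProxy([BackendNode("node-a"), BackendNode("node-b"), BackendNode("node-c")])
    proxy.put("user:1", "Ada")
    print(proxy.node_for("user:1").name, proxy.get("user:1"))
```

Embedding the same `node_for` logic in each client instead of a shared proxy would turn this into client-side partitioning; the trade-off is the one described above, since every client must then carry and update the routing logic.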
REPLICATION
Replication means storing multiple copies of the same data. This has many
benefits:
• Data redundancy: When hardware inevitably fails, the data is not lost because
there is another copy.
• Data accessibility: Clients can access the data from any replica. This increases
resiliency against data center outages and network partitions.
• Increased read throughput: More machines can serve the data, so the overall
read capacity is higher.
• Decreased network latency: Clients can read from the replica closest to them.
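A small sketch can make these benefits concrete. It assumes synchronous writes to every live replica and reads from the live replica with the lowest latency; the `Replica` and `ReplicatedStore` classes and the latency figures are hypothetical, and the sketch deliberately ignores how a recovered replica catches up on writes it missed.

```python
class Replica:
    """Stand-in for one copy of the data on one machine."""

    def __init__(self, name: str, latency_ms: int):
        self.name = name
        self.latency_ms = latency_ms  # assumed distance from the client
        self.alive = True
        self.data = {}


class ReplicatedStore:
    """Writes go to every live replica; reads go to the nearest live one."""

    def __init__(self, replicas: list):
        if not replicas:
            raise ValueError("at least one replica is required")
        self.replicas = list(replicas)

    def live_replicas(self) -> list:
        return [r for r in self.replicas if r.alive]

    def put(self, key: str, value) -> int:
        live = self.live_replicas()
        if not live:
            raise RuntimeError("no live replica available")
        for replica in live:
            replica.data[key] = value
        return len(live)  # number of copies written

    def get(self, key: str):
        live = self.live_replicas()
        if not live:
            raise RuntimeError("no live replica available")
        nearest = min(live, key=lambda r: r.latency_ms)
        return nearest.data.get(key), nearest.name


if __name__ == "__main__":
    near, far = Replica("near", 5), Replica("far", 80)
    store = ReplicatedStore([far, near])
    store.put("k", "v")
    print(store.get("k"))   # served by the closest replica
    near.alive = False      # simulate a hardware failure
    print(store.get("k"))   # the other copy still answers
```

Keeping copies in sync is the hard part that this sketch leaves out: a replica that was down during a write will serve stale data until it is repaired.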
