
Chapter Two

Functional Requirements
Computer systems are designed to meet one or more objectives. Without an explicit statement of such requirements, it is impossible to compare alternative designs.

When designing distributed systems, it can be helpful to distinguish between functional requirements (FRs) and non-functional requirements (NFRs).

A functional requirement specifies something the system must be able to do: a feature, expressed as a specific function or as a broadly defined behavior. The system either has the feature, and thus provides the function or behaves as specified, or it does not.

Examples of functional requirements of a multiplayer online game include:

1. The player can connect to the game servers and see who is connected to each server.
2. The player can join their friends and exchange messages.
3. Players engaged in parkour against the clock can do so wherever they are in the world,
even if the clocks on their computers are not synchronized.
4. Specific actions where multiple players compete to retrieve the same object are correctly
resolved by the game, and, subsequently, each player sees the same outcome.
5. Modifications made to an in-game object by the players are seen in the correct order by all
nearby players.

In contrast, a non-functional requirement focuses on how well the feature works, defining quality
attributes for the distributed system. Examples related to the same online game include:

1. Players experience that their actions (clicks) receive a reaction from the game within 150 milliseconds, and this lag is not only bounded but also stable.
2. Players can access the game services every time they try.
3. Each machine used by the gaming platform is utilized at more than 50% of its capacity, each day.
4. The gaming platform consumes at most 1 MWh of electricity per day.

2.1 A Framework of Functional Requirements


Which Functional Properties Do We Expect?

• Naming: We expect that components of the same computer system find each other, by
identifier or name, and can communicate easily.
• Clock Synchronization: We also expect that the system provides a clock and implicitly
that the clock is synchronized for all system components; in other words, acting 'at the same
time' (synchronously) is trivial for components in a computer system.

• Consensus: Similarly, components should trivially reach consensus, that is, easily agree on a single value, such as the maximum value across the values stored by all components, or which component has received the most votes in the last election. (Surprisingly, real-world elections struggle with this issue much like distributed computer systems do.)
• Consistency: Last for this lecture, we expect the data to which multiple components write to show consistency, meaning, simplified, that if some components modify the data, all components can afterward read the result and agree it is correct.

Distributed systems run counter to all these intuitions. Because the machines hosting components of the
system are physically distributed, the laws of physics have an important impact: the real-world
time it takes for information to get from one component to another can be orders of magnitude
higher than it normally takes in the computers and smartphones we are used to. This real-world
information delay changes everything. Components in distributed systems cannot easily name or
communicate with other components. Distributed systems cannot easily achieve clock
synchronization, consensus, or consistency. Instead, all these functions require specialized
approaches in distributed systems.

2.2 Naming and Communication in Distributed Systems


Challenge of Communication in Distributed Systems

The ability to communicate is essential for systems. Even single machines are constructed around
the movement of data, from input devices, to memory and persistent storage, to output devices.
Although computers are increasingly complex, this communication is well understood. Typical functional requirements, which modern single-machine systems already meet, include that messages arrive correctly at the receiver, that there is an upper limit on the amount of time it takes to read or write a message, and that developers know how much data can be safely exchanged between applications at any point in time. In a distributed system, none of this is true without additional effort.

Similar to applications running on a single machine,

• A distributed system can only function if its components are able to communicate.
• The components in a distributed system are asynchronous: they run independently from,
and do not wait for, other components.
• Components must continue to function, even though communication with other
components can start and stop at any time.
• The networks used for communication in distributed systems are unreliable. Networks may
drop, delay, or reorder messages arbitrarily, and components need to take these possibilities
into account.

a) Protocols for Computer Networking

To enable communication between computers, they need to speak the same protocol. A protocol
defines the rules of communication, including

• Which entity can speak,
• When, and
• How to represent the data that is communicated.

How a protocol is defined depends on the technology that underlies it. Protocols that directly use the network's transport layer need to define their data fields as a sequence of bits or bytes.
Defining a protocol on this level, however, has multiple disadvantages. It is labor intensive, the
binary messages are challenging to debug, and it is difficult to achieve backward compatibility.

When a protocol defines data fields on the level of bits and bytes, adding or changing what data
can be sent while still supporting older implementations is difficult. For these and other reasons,
distributed systems often define their protocols on a higher layer of abstraction.

One of the simplest abstractions on top of byte streams is plain-text messages. These are used
widely in practice, especially in the older technologies that form the core of the Internet. For
example, all of the following protocols use plain-text messages: -

• The Domain Name System (DNS),


• The Hypertext Transfer Protocol (HTTP), and
• The Simple Mail Transfer Protocol (SMTP).

Instead of defining fields with specified bit or byte lengths, plain-text protocols are typically line-
based, meaning every message ends with a new line character (“\n”). The advantages of such
protocols are that they are easy to debug by both humans and computers, and that they offer
increased flexibility due to variable-length fields. Text-based protocols can easily be changed into
binary protocols without losing their advantages, by compressing the data before it is sent over the
network.
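To make the idea concrete, here is a minimal sketch of a line-based plain-text protocol, assuming a hypothetical GET/SET key-value command format (not any standard protocol); a real server would read such lines from a socket rather than from hard-coded strings.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a line-based, plain-text protocol handler.
// The GET/SET command format is hypothetical, for illustration only.
public class LineProtocolDemo {
    private static final Map<String, String> store = new HashMap<>();

    // Each message is one line; fields are variable-length, space-separated.
    static String handle(String line) {
        String[] parts = line.trim().split(" ", 3);
        switch (parts[0]) {
            case "SET":                       // SET <key> <value>
                store.put(parts[1], parts[2]);
                return "OK";
            case "GET":                       // GET <key>
                String value = store.get(parts[1]);
                return value == null ? "ERR not_found" : "OK " + value;
            default:
                return "ERR unknown_command";
        }
    }

    public static void main(String[] args) {
        System.out.println(handle("SET player42 online"));  // prints: OK
        System.out.println(handle("GET player42"));         // prints: OK online
    }
}

Note how trivial the message format is to debug by eye, and how variable-length fields come for free.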

Moving one more level up brings us to structured-text protocols. Such protocols use specialized
languages designed to represent data, and use them to define messages. For example, REST APIs
typically use JSON to exchange data. A structured-text format comes with its own (more complex) rules on how to format messages. Fortunately, many parser libraries exist for popular structured-text formats such as XML and JSON, making it easier for distributed-system developers to use these formats without writing the tools themselves.

Finally, structs and objects in programming languages can also be used as messages. Typically,
these structs are translated to and from structured-text or binary representations with little or no
work required from developers. Mapping a programming-language-specific data structure to a message format is called marshaling, and mapping it back is called unmarshaling. Marshaling libraries and tools take care of both directions. Examples of marshaling libraries for
structured-text protocols include Jackson for Java and the built-in JSON library for Golang.
Examples of marshaling libraries for binary formats include Java’s built-in Serializable interface
and Google’s Protocol Buffers.
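As a concrete illustration of marshaling with Jackson, consider the following sketch; the PlayerUpdate message type and its fields are hypothetical, chosen only for illustration.

import com.fasterxml.jackson.databind.ObjectMapper;

// Marshaling a Java object to JSON and back with Jackson.
public class MarshalingDemo {
    public static class PlayerUpdate {
        public String player;  // public fields so Jackson can (un)marshal them
        public int x, y;
        public PlayerUpdate() {}  // Jackson needs a no-argument constructor
        public PlayerUpdate(String player, int x, int y) {
            this.player = player; this.x = x; this.y = y;
        }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Marshal: Java object -> JSON text that can be sent over the network.
        String json = mapper.writeValueAsString(new PlayerUpdate("alice", 4, 4));
        System.out.println(json);  // {"player":"alice","x":4,"y":4}
        // Unmarshal: JSON text received from the network -> Java object.
        PlayerUpdate update = mapper.readValue(json, PlayerUpdate.class);
        System.out.println(update.player + " at (" + update.x + "," + update.y + ")");
    }
}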

b) Communication Models for Message Passing

Message passing always occurs between a sender and a receiver. It requires messages to traverse
a possibly unreliable communication environment and can start and end at synchronized or
arbitrary moments. This leads to multiple ways in which message-passing communication can
occur, and thus multiple useful models.

Depending on whether the message in transit through the communication environment is stored
(persisted) until it can be delivered or not, we distinguish between transient and persistent
communication:

• Transient communication only maintains the message while the sender and the receiver are online, and only if no transmission error occurs. This model is the easiest to implement and matches well with typical Internet routers, which are based on store-and-forward or cut-through technology. For example, real-time games may occasionally drop updates and rely on local correction mechanisms. This allows in many cases the use of relatively simple designs, but for some game genres it can lead to perceived lag or choppy movement of the avatars and objects.
• Persistent communication requires the communication environment to store the message until
it is received. This is convenient for the programmer, but much more complex to guarantee by
the distributed system. Worse, this typically leads to lower scalability than approaches based on transient communication, due to the higher latency of the message broker storing incoming messages on a persistent storage device, as well as potential limits on the number of messages that can be persisted at the same time. An example of the use of a persistent communication
system appears in the email system. Emails are sent and received using SMTP and IMAP
respectively. SMTP copies email from a client or server to another server, and IMAP copies
email from a server to a client. The client can copy the email from their server repeatedly
because the email is persisted on the server.

Depending on whether the sender and/or the receiver has to wait (is blocked) in the process of
transmitting or receiving, we distinguish between asynchronous communication and
synchronous communication:

o In asynchronous communication, asynchronous senders and receivers attempt to transmit and receive, respectively, but continue with other activities immediately after the attempt, regardless of its outcome. UDP-like communication uses the (transient) asynchronous model (a minimal sketch follows this list).
o In synchronous communication, synchronous senders and receivers block until their
operation (request) is confirmed (synchronized). We identify three useful
synchronization points: (1) when the request is submitted, that is, when the request
has been acknowledged by the communication environment; (2) when the request
is dispatched (message/operation delivery), that is, when the communication
environment acknowledges the request has been delivered for execution to the other
side of the communication; (3) when the request is fully processed (operation
completed), that is, when the request has reached its destination and has been fully
processed, but before the result of the processing has been sent back (as another message, possibly with the same communication model). In practice, the Message Passing Interface (MPI) standard provides the programmer with the flexibility of choosing between all these approaches to synchronization.
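A minimal sketch of the transient asynchronous model with UDP in Java follows; the destination port and message content are arbitrary illustrative choices. The key observation is that send() returns immediately, without any delivery guarantee.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Transient, asynchronous message passing over UDP: fire-and-forget.
public class UdpSendDemo {
    public static void main(String[] args) throws Exception {
        byte[] payload = "position 4 4".getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            DatagramPacket packet = new DatagramPacket(
                payload, payload.length, InetAddress.getByName("localhost"), 9876);
            // send() returns as soon as the packet is handed to the OS; the
            // sender does not wait for delivery, and the network may drop,
            // delay, or reorder the packet.
            socket.send(packet);
        }
        System.out.println("Sender continues immediately; no acknowledgment expected.");
    }
}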

Remote Procedure Calls

Due to the unavailability of shared memory in distributed systems, communication is a means to achieve a common goal among multiple computers. Often, we communicate because one machine, the caller, wants to ask (call on) another machine, the callee, to perform some work on its behalf. This pattern is typical of how people have used telecommunication systems since their inception, except that here we are dealing with machine-to-machine communication. Remote procedure calls (RPCs) are an abstraction of this communication-enabled process in distributed computer systems.
Internally, RPC is typically built on top of message passing, and thus can occur according to any
of the models introduced in the previous block. RPC or derivatives are still much used today. For
instance, Google’s gRPC is a common building block of large parts of the Google datacenter stack.
Many uses of REST also closely mimic RPC semantics, much to the disdain of purists who
emphasize that REST is supposed to be resource-centric and not procedure-centric. Modern object-
oriented variants of RPC such as RMI are often used in the Java community for building distributed
applications.

Implementation

Functionally, RPC maintains the illusion that the program calls a local implementation of the service. Since the caller and the callee reside on different machines, they need to agree on a definition of the procedure: its name and parameters. This information is often encoded in
an interface written in an Interface Definition Language (IDL).

In order for local programs to be able to call the service, a stub is created that implements the
interface but, instead of doing function execution locally, encodes the name and the argument
values in a message that is forwarded to the callee. Since this is a mechanical, deterministic
process, the stub can be compiled automatically by a stub generator.

On the server side, the message is received and the arguments need to be unmarshaled from the message so that the function can be invoked on behalf of the client. This is again performed by an automatically generated stub, on this side of the system often referred to as a skeleton (or server stub).

The dynamics of this operation are as follows. When the client calls the procedure on the local
stub, the client stub marshals both the procedure name and the provided arguments into a message.
This message is then sent to a server that contains the requested procedure. Upon receipt, the
receiving stub unmarshals the message and calls the corresponding procedure with the provided
arguments. The returned value is then sent back to the client, using the same approach. RPC uses transient synchronous communication to create an interface that is as close as possible to regular procedure calls.
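The following hand-written sketch shows what a generated stub pair does, assuming a hypothetical add(a, b) procedure and a plain-text wire format; a real system would send the request over the network, and a stub generator would emit equivalent code automatically.

// Sketch of an RPC client stub and server skeleton for a hypothetical
// add(a, b) procedure. The "transport" is a direct method call here,
// standing in for the network, to keep the sketch self-contained.
public class RpcStubDemo {
    // Client-side stub: offers the same signature as a local procedure.
    static int add(int a, int b) {
        String request = "add " + a + " " + b;  // marshal name + arguments
        String reply = transport(request);      // send and wait (synchronous)
        return Integer.parseInt(reply);         // unmarshal the result
    }

    // Stand-in for the network: delivers the request to the server skeleton.
    static String transport(String request) {
        return skeleton(request);
    }

    // Server-side skeleton: unmarshals the message and invokes the procedure.
    static String skeleton(String request) {
        String[] parts = request.split(" ");
        if (parts[0].equals("add")) {
            int result = Integer.parseInt(parts[1]) + Integer.parseInt(parts[2]);
            return Integer.toString(result);    // marshal the return value
        }
        return "ERR unknown_procedure";
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3));  // looks like a local call; prints 5
    }
}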

From Remote Procedures to Remote Objects

Since the late 1950s, the programming language community has proposed and developed
programming models that consider objects, rather than merely operations and procedures for
program control. Object-oriented programming languages, such as Java (and Kotlin), Python, and
C++, remain among the most popular programming languages. It is thus meaningful to ask the
question: Can RPC be extended to (remote) objects?

An object-oriented equivalent of RPC is remote method invocation (RMI). RMI is similar to RPC,
but has to deal with the additional complexity of remote-object state. In RMI, the object is located
on the server, together with its methods (equivalent of procedures for RPC). The client calls a
function on a proxy, which fulfills the same role as the client-stub in RPC. On the server side, the
RMI message is received by a skeleton, which executes the method call on the correct object.
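A minimal Java RMI sketch follows, assuming a hypothetical Counter service; the registry port and binding name are illustrative, and client and server run in one process only for brevity.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Minimal Java RMI sketch for a hypothetical Counter remote object.
public class RmiDemo {
    // The remote interface: the contract shared by client and server.
    public interface Counter extends Remote {
        int increment() throws RemoteException;
    }

    // The server-side object; its state lives on the server.
    public static class CounterImpl implements Counter {
        private int value = 0;
        public synchronized int increment() { return ++value; }
    }

    public static void main(String[] args) throws Exception {
        // Server side: export the object and register it under a name.
        Counter stub = (Counter) UnicastRemoteObject.exportObject(new CounterImpl(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("counter", stub);

        // Client side: look up the proxy by name and invoke a method;
        // the call executes on the remote object, which holds the state.
        Counter proxy = (Counter) LocateRegistry.getRegistry("localhost", 1099)
                                                .lookup("counter");
        System.out.println(proxy.increment());  // prints 1
    }
}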

Communication Patterns

The messages that are sent between machines, be they sent as plain messages, or as the underlying
technology of RPC, show distinct patterns over time depending on the properties of the system
that uses them. Below we describe some of the most prevalent communication patterns.

• Request-Reply Request-reply is a one-to-one communication pattern where a sender starts by asking something from the receiver. The receiver then takes an action and replies to the sender.
• Publish-Subscribe In publish-subscribe, or pub-sub, one or multiple machines generate
periodic updates. If another machine is interested in these updates, they can subscribe to
them at the sender. The sender then sends the updates to all the machines that are
subscribed. This can work well in games, where players may indicate that they want to
receive some updates, but not others.
• Pipeline communication works with producers, consumers, and prosumers. In this
messaging pattern, a producer wants to send messages to a particular type of receiver but
does not care about which machine this is specifically. This pattern allows long chains and
easy load-balancing, by adding more machines to the system with a particular role.
• Broadcast A broadcast is a message sent out by a source and addressed, or received, by all
other entities on the network. Broadcast messages are useful for bootstrapping or
communicating the global state. An example of bootstrapping is the broadcast DHCP
message sent out by a client requesting an IP address. An example of broadcasting the global state
can be found in games, where a server may inform all players when a new player has joined.
• Flooding is a pattern where a broadcast is repeated by its receivers. This works well for fast
information dissemination on non-star-topology networks but also uses large amounts of
network resources. Systems that use flooding must also actively stop the flooding once all
machines have received the message. One way of doing this is to allow machines to
propagate the broadcast only once.
• Multicast In between one-to-one communication such as request-reply, and one-to-all
communication such as broadcast is multicast. Here a sender wants to send messages to a
particular set of receivers. In a game, the receivers could be teammates, whom the player
sends data about themselves that they do not want to share with the opposite team.
• Gossip For some systems, the large amount of resources required for flooding is out of the question. To still disseminate data, gossip provides an alternative. In a gossiping exchange pattern, machines periodically contact one random neighbor. They exchange messages, and then go back to sleep. This pattern allows information to propagate through the entire network with high likelihood, without the intensive resource utilization that flooding requires (a minimal sketch follows this list).
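Here is a minimal simulation of push-based gossip, under simplifying assumptions (a fully connected topology and synchronized rounds); it illustrates why information reaches everyone quickly without flooding the network.

import java.util.Random;

// Simulation of gossip dissemination: in each round, every machine that
// knows the rumor pushes it to one randomly chosen peer.
public class GossipDemo {
    public static void main(String[] args) {
        int n = 100;
        boolean[] informed = new boolean[n];
        informed[0] = true;                  // machine 0 starts with the rumor
        Random random = new Random(42);

        int rounds = 0, informedCount = 1;
        while (informedCount < n) {
            boolean[] next = informed.clone();
            for (int i = 0; i < n; i++) {
                if (informed[i]) {
                    int peer = random.nextInt(n);  // contact one random neighbor
                    if (!next[peer]) {
                        next[peer] = true;
                        informedCount++;
                    }
                }
            }
            informed = next;
            rounds++;
        }
        // With high probability, everyone is informed in roughly O(log n) rounds.
        System.out.println("All " + n + " machines informed after " + rounds + " rounds");
    }
}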

Communication in Practice

Here are some examples of popular messaging systems:

• RabbitMQ is a message broker, a middleware system that facilitates sending and receiving
messages. It is similar to Kafka, but not specifically built for stream-processing systems.

• Protocol Buffers are a cross-platform, cross-language method of (de)marshaling data. This
is useful for HTTP or other services that do not come with their own data marshaling
libraries.
• Netty is a framework for building networked event-driven systems. It supports both custom
and existing protocols such as HTTP and Protocol Buffers.
• Akka HTTP is part of the Akka distributed-systems framework. It provides both a client-
and server-side framework for building HTTP-based services.
• gRPC is an RPC library that marshals messages using Protocol Buffers.
• CORBA is a long-standing standard for RPC-style communication between distributed objects.

2.2.1 Naming in Distributed Systems


Distributed systems can consist of hundreds or even thousands of entities at any given point. An entity can be a machine, a service, or a serverless function. Depending on the workload, an entity may be active for any duration, from multiple years down to only a few seconds. Whatever their lifetime, these entities must be able to communicate with each other. This is where naming comes in. Naming enables the entities in the distributed system to identify each other and to set up communication channels. Naming involves two elements:

• Naming schemas: the techniques used to assign names to entities, and
• Naming services: systems that use the naming schemas and other elements to offer name-related services to the distributed system.

a) Naming Schema in Distributed Systems

Naming schemas are the rules by which names are given to individual entities. There is an infinite number of ways to assign names to entities. In this section, we identify and discuss three categories of naming schemas:

• Simple naming,
• Hierarchical naming, and
• Attribute-based naming.

Simple Naming: Focusing on uniquely identifying one entity among many is the simplest way to
name entities in a distributed system. Such a name contains no information about the entity’s
location or role.

Advantages: The main advantage of this approach is simplicity. The effort required to assign a name is low; the only requirement is that the name is not already taken. Various approaches can simplify even this verification step, at the cost of a (very low) probability that the chosen name collides with another.

Disadvantages: A simple name carries no information about the entity, shifting the complexity of locating it to the naming service.

Addressing the downside of simple naming, distributed systems can use rich names. Such names
not only uniquely identify an entity, but also contain additional information, for example, about the entity's location. This additional information can simplify the task of the naming service. As
examples of rich names, we discuss hierarchical naming and attribute-based naming.

Hierarchical Naming: In hierarchical naming, names are allowed to contain other names, creating
a tree structure.

Namespaces are commonly used in practice. Examples include file systems, the DNS, and package
imports in Java and other languages. These names consist of a concatenation of words separated
by a special character such as “.” or “/”. The tree structure forms a name hierarchy, which combines
well with, but is not the same as, a hierarchical name resolution approach. When a hierarchical
naming scheme is combined with hierarchical name resolution, a machine is typically responsible
for all names in one part of the hierarchy. For example, when using DNS to look up the name rvu.com.et, we first contact one of the DNS root servers. These refer us to the “et” servers, which refer us to the “com.et” servers, which in turn know where to find “rvu.com.et”.

Figure 1. Simplified attribute-based naming for a Minecraft-like game. Steps 1-3 are discussed in
the text.

Attribute-Based Naming names entities by concatenating the entity’s distinguishing attributes. For example, a Minecraft server located in an EU datacenter may be named “R=EU/G=Minecraft”, where “R” indicates the region and “G” indicates the game.

Figure 1 illustrates how a player in the example might find a game of Minecraft located in an EU
datacenter. In step 1, the game client on the player's computer automates this, by querying the
naming service with the query search((R=“EU”)(G=“Minecraft”)). Because the entries in attribute-based
naming are key-value pairs, searches are easy to make, and also partial searches can result in
matches. In step 2, the naming service returns the information that "server 42" is a server matching the requested properties. In step 3, the game client resolves the name and connects to the specific machine that runs the Minecraft server in the EU.
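A minimal in-memory sketch of attribute-based search follows, assuming the hypothetical attributes R (region) and G (game) from the example; a production service such as an LDAP server adds persistence, distribution, and a richer query language.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory attribute-based naming: entities register key-value attributes,
// and a search returns every entity whose attributes match the query.
public class AttributeNamingDemo {
    private final Map<String, Map<String, String>> entries = new HashMap<>();

    void register(String name, Map<String, String> attributes) {
        entries.put(name, attributes);
    }

    // Partial searches also work: an entity matches if it contains
    // all the requested key-value pairs (it may have more).
    List<String> search(Map<String, String> query) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : entries.entrySet()) {
            if (entry.getValue().entrySet().containsAll(query.entrySet())) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        AttributeNamingDemo naming = new AttributeNamingDemo();
        naming.register("server42", Map.of("R", "EU", "G", "Minecraft"));
        naming.register("server7", Map.of("R", "US", "G", "Minecraft"));
        // Steps 1-2 from the text: query by attributes, get matching names back.
        System.out.println(naming.search(Map.of("R", "EU", "G", "Minecraft")));
        // prints: [server42]
    }
}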

Naming schemas in practice: The Lightweight Directory Access Protocol (LDAP) is a name-resolution protocol that uses both hierarchical and attribute-based naming. Names consist of
attributes, and can be found by performing search operations. The protocol returns all names with
matching attributes. In addition to the attributes, names also have a distinguished name, which is
a unique hierarchical name similar to a file path. This name can change when, for example, the entry is moved to a different server. Because LDAP is a protocol, multiple implementations exist.
ApacheDS is one of these implementations.

b) Naming Services in Distributed Systems

Once every entity in the system has a name, we would like to use those names to address our
messages. Networking approaches assume that we know, for each entity, on which machine it is
currently running. In distributed systems, we want to break free of this limitation. Modern
datacenter architectures often run systems inside virtual machines that can be moved from one
physical machine to the next in seconds. Even if instances of entities are not moved, they may fail
or be shut down, while new instances of the same service are started on other machines. Naming
services address such complexity in distributed systems.

Name resolution: A subsystem is responsible for maintaining a mapping between entity names
and transport-layer addresses. Depending on the scalability requirements of the system, this could
be implemented on a single machine, as a distributed database, etc.

Publish-Subscribe systems: The entities only have to indicate which messages they are interested
in receiving. In other words, they subscribe to certain messages with the naming service. This
subscription can be based on multiple properties. Common properties for publish-subscribe
systems include messages of a certain topic, with certain content, or of a certain type. When an
entity wants to send a message, it sends it not to the interested entities, but to the naming service.
The naming service then proceeds by publishing the message to all subscribed entities.

The publish-subscribe service is reminiscent of the bus found in single-machine systems. For this reason, the publish-subscribe service is often called an "enterprise service bus". A bus provides a single
channel of communication to which all components are connected. When one component sends a
message, all others are able to read it. It is then up to the recipients to decide if that message is of
interest to them. Publish-subscribe differs from this approach by centralizing the logic that decides
which messages are of interest to which entities.

Figure 2. Publish-subscribe example.

Figure 2 illustrates the operation of publish-subscribe systems in practice. In step 1, user A announces their intention to receive all updates from user D; this forms a subscription for all messages from user D. The request could be much more complex, filtering only some messages. Also, the same user can make multiple subscriptions. The system will ensure each subscription is enforced.

In step 2, users B and C send updates (messages) to the publish-subscribe system. These are stored
(published) and may be forwarded to users other than A.

In step 3, user D sends a new message to the publish-subscribe system. The system analyzes this
message and decides it fits the subscription made by user A. Consequently, in step 4, user A will
receive the new message.
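A minimal in-memory sketch of the pattern in Figure 2 follows, assuming subscriptions are keyed by sender name; real brokers such as Apache Kafka add persistence, partitioning, and fault tolerance.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// In-memory publish-subscribe: subscribers register interest by sender;
// the broker, not the sender, decides who receives each message.
public class PubSubDemo {
    private final Map<String, List<Consumer<String>>> subscriptions = new HashMap<>();

    // Step 1: a subscriber announces interest in all messages from `sender`.
    void subscribe(String sender, Consumer<String> subscriber) {
        subscriptions.computeIfAbsent(sender, k -> new ArrayList<>()).add(subscriber);
    }

    // Steps 3-4: a publisher sends its message to the broker, which forwards
    // it to every matching subscriber (possibly none).
    void publish(String sender, String message) {
        for (Consumer<String> subscriber :
                subscriptions.getOrDefault(sender, List.of())) {
            subscriber.accept(message);
        }
    }

    public static void main(String[] args) {
        PubSubDemo broker = new PubSubDemo();
        broker.subscribe("userD", msg -> System.out.println("userA received: " + msg));
        broker.publish("userB", "update from B");  // no subscribers for userB
        broker.publish("userD", "update from D");  // delivered to user A
    }
}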

In practice: The publish-subscribe approach is widely deployed in production systems. Apache Kafka is a publish-subscribe platform for message streams. It is suitable for systems that produce and reliably process large quantities of messages. Kafka is part of a larger ecosystem: it can be used with other systems such as Hadoop, Spark, Storm, and Flink.

2.3 The Relation between Clock Synchronization, Consensus, and Consistency

At first glance, clock synchronization, consensus, and consistency seem very different from each other. Yet, each tries to get multiple components in a distributed system to agree on a particular value. This observation forms the basis of our conceptual framework for these functional requirements.

Figure 1. Comparison of functional requirements.

Clock synchronization focuses on agreement between components about the time, a single numerical value that changes continuously but unidirectionally (it only increases, if one counts the elapsed number of milli- or microseconds since a commonly agreed start time, as computer systems do) and monotonically (with the clock frequency). We also observe that a synchronized clock enables establishing a happens-before relationship between events recorded in the system; even without a physical clock, if we can otherwise establish this relationship, we have the equivalent of a logical clock. Using the happens-before relationship between any two events, we can create a total order of these events.
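A minimal sketch of a Lamport logical clock follows; it shows how the happens-before relationship can be captured without any synchronized physical clock (the two-process scenario is illustrative).

// Minimal Lamport logical clock: timestamps respect happens-before
// without any synchronized physical clock.
public class LamportClock {
    private long time = 0;

    // A local event, including a send: advance the clock and stamp the event.
    public synchronized long tick() {
        return ++time;
    }

    // On receiving a message stamped with the sender's clock, jump past it,
    // so the receive is ordered after the send.
    public synchronized long onReceive(long senderTime) {
        time = Math.max(time, senderTime) + 1;
        return time;
    }

    public static void main(String[] args) {
        LamportClock a = new LamportClock();
        LamportClock b = new LamportClock();
        long sendStamp = a.tick();                // A sends a message at time 1
        long recvStamp = b.onReceive(sendStamp);  // B receives: its clock becomes 2
        // The send happens-before the receive, and indeed sendStamp < recvStamp.
        System.out.println(sendStamp + " < " + recvStamp);
    }
}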

Consensus focuses on a value of any type, so, unlike the clock, not only numerical. More importantly, the value subject to consensus does not need to change as clocks do; in fact, it may not change at all. Consensus may focus on a single value but, by reaching consensus repeatedly, can also enable a total ordering of events; such an approach, however, is expensive in both time and resources.

Consistency focuses on any value from the many included in a dataset, creating a flexible order. Consistency protocols in distributed systems define the kind of order that can be achieved, for example, total order, and, more loosely, when the order will be achieved, for example, after each operation, after some guaranteed maximum number of operations, or eventually. Ordering events more weakly than total ordering, and even tolerating some discrepancies between how different components see the values in the dataset, are useful for many classes of applications, because these weaker guarantees can often be achieved much faster and with much more scalable techniques.

2.4 Consensus
Coordination, with a Focus on Consensus
A simple program performs a series of operations to obtain some desired result. In distributed
systems, these operations are performed by multiple machines communicating through unreliable
network environments. While the system is running, machines must coordinate to operate
correctly. For example, the system may need to agree on whether a certain action happened - to
reach a consensus on this question.

2.4.1 What is Consensus?


Consider a distributed key-value store that uses replication. Users submit read and write operations
to whichever process is closest to them, reducing latency. To give the user the illusion of a single
system, the processes must agree on the order to perform the queries, and especially keep the
results of writing (changing) data and reading data in the correct order. Clock synchronization techniques could work here, but having each machine ask every other machine, for each operation, whether the operations it received should come in some other order is prohibitively expensive in both resources and time. Another class of techniques therefore focuses on the consensus problem.

In a distributed system,

Consensus is the ability to have all machines agree on a value. Consensus protocols ensure this
ability.

For a protocol to ensure consensus, three things have to happen.

• First, the system as a whole must agree on a decision.
• Second, machines do not apply decisions that are not agreed upon.
• Third, the system cannot change a decision.

Consensus protocols (distributed algorithms) can create a total order of operations by repeatedly
agreeing on what operation to perform next.

Why Consensus is impossible in some Circumstances

Theoretical computer science has considered for many decades the problem of reaching consensus.
When machine failures can occur, reaching consensus is surprisingly difficult. If the delay of
transmitting a message between machines is unbounded, it has been proved that, even when using reliable networks, no distributed consensus protocol is guaranteed to complete. The proof itself is
known as the FLP proof, after the acronym of the family names of its creators. It can be found in
the aptly named article “Impossibility of Distributed Consensus with One Faulty Process” [1].

[1] The FLP impossibility result, attributed to Fischer, Lynch, and Paterson, proves that in a fully asynchronous distributed system where even a single process may have a crash failure, it is impossible to have a deterministic algorithm that always achieves consensus.

Suppose the claim is not true: there exists a consensus protocol, a distributed algorithm that always reaches consensus in bounded time. For the algorithm to be correct, all machines that decide on a value must decide on the same value. This prevents the algorithm from simply letting the machines guess a value. Instead, they need to communicate to decide which value to choose.

This communication is done by sending messages. Receiving, processing, and sending messages
makes the algorithm progress toward completion. At the start of the algorithm, the system is in an
undecided state. After exchanging a certain number of messages, the algorithm decides. After a
decision, the algorithm - and the system - can no longer “change its mind.” The FLP proof shows
that there is no upper bound on the number of messages required to reach consensus.

General Consensus and an Approach to Reach It

To achieve consensus, consensus protocols must have two properties:

1. Safety, which guarantees that "nothing incorrect can happen". The consensus protocol must decide on a single value; it cannot decide on two or more values at once.
2. Liveness, which guarantees that "something correct will happen, even if only slowly". The consensus protocol, left without hard cases to address - for example, no failures for some amount of time - can and will reach its decision on which value is correct.

Many protocols have been proposed to achieve consensus, differing in the forms of failure and messaging delay they tolerate, among other capabilities.

Among the protocols that are used in practice, Paxos, multi-Paxos, and more recently Raft seem
to be very popular. For example, etcd is a distributed database built on top of the Raft consensus
algorithm. Its API is similar to that of Apache ZooKeeper (a widely-used open-source coordination
service), allowing users to store data in a hierarchical data-structure. Etcd is used by Kubernetes
and several other widely-used systems to keep track of shared state.

We sketch here the operation of the Raft approach to reach consensus. Raft is a consensus
algorithm specifically designed to be easy to understand. Compared to other consensus algorithms,
it has a smaller state space (the number of configurations the system can have), and fewer parts.

Figure 3. Raft overview.

Figure 3 gives a Raft overview. There are four main components:

1. Raft first elects a leader ("leader election" in Figure 3). The other machines become
followers. Once a leader has been elected, the algorithm can start accepting new log entries
(data operations).
2. The log (data) is replicated across all the machines in the system ("log replication" in the
figure).
3. Users send new entries only to the leader.
4. The leader asks every follower to confirm each new entry. If a majority confirms, the entry is committed to the log (the operation is performed).

We describe three key parts of Raft. These do not form the entirety of Raft, which is indicative
that even a consensus protocol designed to be easy to understand still has many aspects to cover.

The Raft leader election: Having a leader simplifies decision-making. The leader decides on the
values. The other machines are followers, accepting all decisions from the leader. Easy enough.
But how do we elect a leader? All machines must agree on who the leader is: leader election itself requires reaching consensus, and must have the safety and liveness properties.

In Raft, machines can try to become the new leader by starting an election. Doing so changes their
role to candidate. Leaders are appointed until they fail, and followers only start an election if they
believe the current leader to have failed. A new leader is elected if a candidate receives the majority
of votes. With one exception, which we discuss in the section on safety below, followers always
vote in favor of the candidate.

Raft uses terms to guarantee that voting is only done for the current election, even when messages
can be delayed. The term is a counter shared between all machines. It is incremented with each
election. A machine can only vote once for every term. If the election completes without selecting
a new leader, the next candidate increments the term number and starts a new election. This gives
machines a new vote, guaranteeing liveness. It also allows distinguishing old from new votes by
looking at the term number, guaranteeing safety.

An election is more likely to succeed if there are fewer concurrent candidates. To this end,
candidates wait a random amount of time after a failed election before trying again.

Log replication: In Raft, users only submit new entries to the leader, and log entries only move
from the leader to the followers. Users that contact a follower are redirected to the leader.

Figure 4. Log replication in Raft. The crown marks the leader.

New entries are decided, or “chosen,” once they are accepted by a majority of machines. As Figure 4 illustrates, this happens in a single round-trip: (a) the leader propagates the entries to the followers, and (b) it counts the votes and accepts the entry only if a majority in the system voted positively.

Log replication is relatively simple because it uses a leader. Having a leader means, for example,
that there cannot be multiple log entries contending for the same place in the log.

Safety in Raft: Electing a leader and then replicating new entries is not enough to guarantee safety.
For example, it is possible that a follower misses one or multiple log entries from the leader, the
leader fails, the follower becomes a candidate and becomes the new leader, and finally overwrites
these missed log entries. (Sequences of events that can cause problems are a staple of consensus-
protocol analysis.) Raft solves this problem by setting restrictions on which machines may be
elected leader. Specifically, machines vote “yes” for a candidate only if that candidate’s log is at
least as up-to-date as theirs. This means two things must hold:

1. The candidate’s term must be at least as high as the follower’s, and
2. The candidate’s log entry index must be at least as high as the follower’s.

When machines vote according to these rules, it cannot occur that an elected leader overwrites
chosen (voted upon) log entries. It turns out this is sufficient to guarantee safety; additional
information can be found in the original article.
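A minimal sketch of this vote-granting rule follows; the field names are illustrative, the comparison uses the last log term and index as in the Raft paper, and the rest of the Raft state machine is omitted.

// Sketch of Raft's vote-granting rule: a follower votes for a candidate at
// most once per term, and only if the candidate's log is at least as
// up-to-date as its own.
public class RaftVoteDemo {
    long currentTerm = 5;    // highest term this follower has seen
    long votedForTerm = -1;  // term in which this follower last voted
    long lastLogTerm = 5;    // term of the follower's last log entry
    long lastLogIndex = 12;  // index of the follower's last log entry

    boolean shouldGrantVote(long candidateTerm,
                            long candidateLastLogTerm,
                            long candidateLastLogIndex) {
        if (candidateTerm < currentTerm) return false;    // stale election
        if (votedForTerm >= candidateTerm) return false;  // one vote per term
        // The candidate's log must be at least as up-to-date as ours, so an
        // elected leader can never overwrite entries that were already chosen.
        boolean upToDate =
            candidateLastLogTerm > lastLogTerm
            || (candidateLastLogTerm == lastLogTerm
                && candidateLastLogIndex >= lastLogIndex);
        if (upToDate) votedForTerm = candidateTerm;
        return upToDate;
    }

    public static void main(String[] args) {
        RaftVoteDemo follower = new RaftVoteDemo();
        System.out.println(follower.shouldGrantVote(6, 5, 11));  // false: log too short
        System.out.println(follower.shouldGrantVote(6, 5, 12));  // true: as up-to-date
    }
}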

2.5 Consistency in Distributed Systems
The Data Store

The essence of any discussion about consistency is the abstract notion of the data store. Data stores
can differ when servicing diverse applications, types of operations, and kinds of transactions, but
essentially a data store:

1. Stores data across multiple machines (replicas),
2. Receives a stream of various operations, of which the most common are Read (query data) and Write (update data); it is common for data stores to support only a few operations, sometimes only Reads and Writes,
3. Enforces that the operations execute correctly, that is, deliver consistent results, and
4. In practice, also supports other functional and non-functional requirements.

Figure 1. Data store with a single primary user.

Many applications only have a single primary user. You are likely the only one accessing your
email, for business or leisure. You may have a private Dropbox folder, which you may want to
access at home, on the train, wherever you stay long enough to want to store new photos, etc. Many
mobile-first users recognize these and similar applications. Figure 1 depicts the data store for the
single primary user. Here, the user can connect from one location (or device), write new
information - a new email, a new Dropbox file - and then disconnect. After moving to a new location
(or device), and reconnecting, the user should be able to resume the email and access the latest
version of the file.

Other applications have multiple users, who together write information to the same shared document, change the state of an online game, or make transactions affecting many shared accounts in a large data management system. Here, the data store again has to manage the data updates, and deliver correct results when users query (read) the data.

2.5.1 What is Consistency?

The main goal of consistency is:

Goal: achieving consistency, which means establishing and enforcing a mutual agreement between
the data store and its client or clients, on the expected effect of system operations on data.

In a distributed system, achieving consistency falls upon the consistency model and consistency
(enforcing) mechanisms.

• Consistency models determine which data and operations are visible to a user or process,
and which kind of read and write operations are supported on them.
• Consistency mechanisms update data between replicas to meet the guarantees specified by
the model.

Classes of consistency models: The consistency model offers guarantees, but outside the
guarantees, almost anything is allowed, even if it seems counter-intuitive.

We identify two main classes of consistency models:

1. Strong consistency: an operation, particularly a query, can return only a consistent state.
2. Weak consistency: an operation, particularly a query, can return an inconsistent state, but there is an expectation that there will be a moment when a consistent state is returned to the client. Sometimes, the model guarantees which moment, or which (partial) state.

The strictest forms of consistency are so costly to maintain that, in practice, there may be some tolerance for a bit of inconsistency after all. The CAP theorem suggests availability may suffer under these strict models, and the PACELC framework further suggests that performance, too, is traded off against how strict the consistency model can be.

Weak Consistency Models

Many views on consistency models exist. Traditional results from theoretical computer science
and formal methods indicate

• what single-operation, single-object guarantees and
• what multi-operation, multi-object guarantees of consistency can be given for data stores.

Notions of (i) linearizability and (ii) serializability emerged to indicate that Write operations can appear instantaneous while, respectively, a real-time or an arbitrary total order is enforced.

Building from these results,

(1) in operation-centric consistency models, a single client can access a single data object,

(2) in transaction-centric consistency models, multiple clients can access any of the multiple data
objects, and

(3) in application-centric consistency models, specific applications can tolerate some
inconsistency or have special ways to avoid some of the costly update operations.

Operation-Centric Consistency Models (single client, single data object, data store with multiple
replicas):

Several important models emerged in the past four decades, and more may continue to emerge:

Sequential consistency: All replicas see the same order of operations as all other replicas. This is desirable, but prohibitively expensive to enforce at scale.

What other operation-centric consistency models can designers use?

Causal consistency weakens the promises of sequential consistency, but also its cost of operation: as in sequential consistency, causally related operations must still be observed in the same order by all replicas. However, for operations that are not causally related, different replicas may see a different order of operations and thus of outcomes. Important cases of causal consistency, with important applications, include:

1. Monotonic Reads: Subsequent reads by the same process always return a value that is at
least as recent as a previous read. Important applications include Calendar, inventories in
online games, etc.
2. Monotonic Writes: Subsequent writes by the same process follow each other in that order.
Important applications include email, coding on multiple machines, your bank account,
bank accounts in online games, etc.
3. Read Your Writes: A client that writes a value will, upon reading it, see a version that is at least as recent as the version it wrote. Updating a webpage should always, in our expectation, make the page refresh show the update (see the sketch after this list).
4. Writes Follow Reads: A client that first reads and then writes a value, will write to the
same, or a more recent, version of the value it read. Imagine you want to post a reply on
social media. You expect this reply to appear following the post you read.
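A minimal sketch of enforcing one of these guarantees, Read Your Writes, with version numbers follows; the replica structure and the retry-by-returning-null policy are simplifying assumptions.

import java.util.HashMap;
import java.util.Map;

// Read-your-writes session guarantee: the client remembers the version it
// wrote, and only accepts reads from replicas that have caught up to it.
public class ReadYourWritesDemo {
    static class Replica {
        long version = 0;  // how far this replica has caught up
        final Map<String, String> data = new HashMap<>();
    }

    static class Client {
        long lastWrittenVersion = 0;  // session state carried by the client

        void write(Replica primary, String key, String value) {
            primary.data.put(key, value);
            primary.version++;
            lastWrittenVersion = primary.version;
        }

        // Returns null if the replica is too stale for this session,
        // signaling the client to retry elsewhere or wait.
        String read(Replica replica, String key) {
            if (replica.version < lastWrittenVersion) return null;
            return replica.data.get(key);
        }
    }

    public static void main(String[] args) {
        Replica primary = new Replica(), stale = new Replica();
        Client client = new Client();
        client.write(primary, "profile", "v2");
        System.out.println(client.read(stale, "profile"));    // null: replica too stale
        System.out.println(client.read(primary, "profile"));  // v2: guarantee holds
    }
}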

Causal consistency is still remarkably difficult to ensure in practice. What could designers use that is so lightweight it can scale to millions of clients or more? The key difficulty in scaling causal consistency is that updates that require multiple replicas to coordinate can hit the system concurrently, effectively slowing it down to non-interactive response times and breaking scalability requirements. A consistency model that can delay the moment when replicas need to coordinate would be very useful to achieve scale.

Eventual consistency: Absent new writes, all replicas will eventually have the same contents. Here,
the coordination required to achieve consistency can be delayed until the system is less busy, which
may mean indefinitely in a very crowded system; in practice, many systems are not heavily
overloaded much of the time, and eventual consistency can achieve good consistency results in a
matter of minutes or hours.

Application-Centric Consistency Models: We only sketch the principle of operation of these
consistency models.

Under special circumstances, there is no need for the heavy, scalability-breaking coordination needed to ensure consistency that we saw for operation-centric consistency models (and which is even heavier for transaction-centric consistency models). Identifying such
circumstances in general has proven very challenging, but good patterns have emerged for specific
(classes of) applications.

Applications where small inconsistencies can be tolerated include social media, where for example
information can often be a bit stale (but not too much!) without much impact, online gaming, where
slight inconsistencies between the positions of in-game objects can be tolerated (but large
inconsistencies cannot), and even banking where inconsistent payments are tolerated as long as
their sum does not exceed the maximum amount allowed for the day. Consistency models where
limited inconsistency is allowed, but also tracked and not allowed to go beyond known bounds,
include conits (we discuss them in the next section).

Conflict-free Replicated Data Types (CRDTs) focus on efficiently reconciling inconsistent situations. To this end, they are restricted to data types that only allow monotonic operations, and
whose replicas can always be correctly reconciled by taking the union of operations across all
replicas. For example, suppose the data-object represents a set, to which items can only be added.
In this case, it does not matter in which order the objects are added, or which replica executes the
operation of addition. In the end, the correct set is obtained by executing every addition operation
in the system across all replicas. In this example, removal or modification of an object would not
be allowed because they are not monotonic.
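A minimal sketch of such a CRDT, a grow-only set (G-Set) matching the add-only example above, follows; the replica names are illustrative.

import java.util.HashSet;
import java.util.Set;

// Grow-only set (G-Set) CRDT: the only operation, add, is monotonic, so any
// two replicas can be reconciled by taking the union of their contents.
public class GSetDemo {
    static class GSet<T> {
        private final Set<T> items = new HashSet<>();

        void add(T item) { items.add(item); }

        // Merge is a set union: commutative, associative, and idempotent,
        // so the order and repetition of merges never matter.
        void merge(GSet<T> other) { items.addAll(other.items); }

        Set<T> value() { return items; }
    }

    public static void main(String[] args) {
        GSet<String> replicaA = new GSet<>(), replicaB = new GSet<>();
        replicaA.add("sword");    // applied at replica A
        replicaB.add("shield");   // applied concurrently at replica B
        replicaA.merge(replicaB); // replicas exchange state...
        replicaB.merge(replicaA);
        // ...and converge to the same set, regardless of merge order.
        System.out.println(replicaA.value());
        System.out.println(replicaB.value());
    }
}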

We conclude by observing that designers of distributed systems and applications must have at least a basic grasp of consistency, and of the known classes of consistency models with a proven record for exactly their kind of system or application. This can be a challenging learning process, and mistakes are costly.

2.5.2 Consistency for Online Gaming, Virtual Environments, and the Metaverse
Dead Reckoning

One of the earliest consistency techniques in games is dead reckoning. The technique addresses the key problem that information arriving over the network may be stale by the moment of arrival, due to network latency. The main intuition behind this technique is that many values in the game follow a predictable trajectory, so updates to these values over time can largely be predicted. Thus, as a latency-hiding technique, dead reckoning uses a predictive technique, which estimates the next value and, while no new information arrives over the network from the other nodes in the distributed system, updates the value to match the prediction.

Players are not extremely sensitive to the accuracy of updates; as long as the updated values seem to follow an intuitive trajectory, they experience the game as smooth. They are, however, sensitive to jumps in values. Thus, when the locally predicted values and the values arriving over the network diverge, dead reckoning cannot simply replace the local value with the newly arrived one; such an approach would lead to value jumps that disturb the players. Instead, dead reckoning interpolates between the locally predicted and the newly arrived values, using a convergence technique.

The interplay between the two techniques, the predictive and the convergence, makes dead
reckoning an eventually consistent technique, with continuous updates and managed
inconsistency.

Advantages: Although using two internal techniques may seem complex, dead reckoning is a
simple technique with excellent properties when used in distributed systems. It is also mature, with
many decades of practical experience already available.

For many gaming applications, trajectories are bound by limitations on allowed operations, so the
numerical inconsistency can be quantified as a function of the staleness of information.

Drawbacks: As a significant drawback, dead reckoning works only for applications where the two
techniques, especially the predictive, can work with relatively low overhead.

Figure 1. Example of dead reckoning.

Example: Figure 1 illustrates how dead reckoning works in practice. In this example, an object is
located in a 2D space (so, has two coordinates), in which it moves with physical velocity (so, a 2D
velocity vector expressed as a pair). The game engine updates the position of each object after
each time tick, so at time t=0, t=1, t=2, etc. In the example, the local game engine receives an
update about the object, at t=0; this update positions the object at position (0,0), with velocity (2,2).
The dead reckoning predictor can easily predict the next positions the object will take during the
next time ticks: (2,2) at t=1, (4,4) at t=2, etc. If the local game engine receives no further updates,
this predictor can continue to update the object, indefinitely.

However, the object, controlled remotely, moves differently, and the local game engine receives
at t=1 an update that the object is now located at (3,1), with the new velocity (4,2). The game
engine has already updated the local value, to (2,2). For the next tick, t=2, the dead reckoning
technique must now interpolate between the next predicted value (4,4) and the value derived from the received update (7,3). If it simply replaces the next value with the value derived from the received update,
the player will observe a sudden jump, because the next value follows the intuitive path of the
previous values the player observed locally, whereas the value derived from the received update
does not. Instead, the dead reckoning technique computes interpolated values, which will smoothly
converge to the correct value if no new information is received.

If the local game engine keeps receiving new information, dead reckoning ensures a state of
smooth inconsistency, which the players experience positively.
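A minimal sketch of the predictor and convergence steps, using the numbers from Figure 1, follows; the blending factor is an illustrative assumption (real games tune convergence per genre).

// Dead reckoning on a 2D position: predict along the last known velocity,
// then converge smoothly toward fresher remote information instead of jumping.
public class DeadReckoningDemo {
    // Predictor: extrapolate one tick ahead from position and velocity.
    static double[] predict(double[] pos, double[] vel) {
        return new double[]{pos[0] + vel[0], pos[1] + vel[1]};
    }

    // Convergence: blend the local prediction toward the remote-derived value.
    // alpha = 1 would snap (a visible jump); alpha < 1 converges smoothly.
    static double[] converge(double[] local, double[] remote, double alpha) {
        return new double[]{
            local[0] + alpha * (remote[0] - local[0]),
            local[1] + alpha * (remote[1] - local[1])
        };
    }

    public static void main(String[] args) {
        double[] pos = {0, 0}, vel = {2, 2};  // update received at t=0
        pos = predict(pos, vel);              // t=1: predicted position (2,2)

        // At t=1 an update arrives: position (3,1), velocity (4,2).
        double[] localNext = predict(pos, vel);                                // (4,4)
        double[] remoteNext = predict(new double[]{3, 1}, new double[]{4, 2}); // (7,3)
        double[] shown = converge(localNext, remoteNext, 0.5);                 // (5.5,3.5)
        System.out.printf("t=2 rendered position: (%.1f, %.1f)%n", shown[0], shown[1]);
    }
}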

Lock-step Consistency

Toward the end of 1997, multiplayer gaming was already commonplace, and games like Age of
Empires were launched with much acclaim and sold to millions. The technical conditions were
much improved over the humble beginnings of such games, around the 1960s for small-scale
online games and through the 1970s for large-scale games with hundreds of concurrent players
(for example, in the PLATO metaverse). Players could connect with the main servers through high-
speed networks... of 28.8 Kbps, with connections established over dial-up (phone) lines with
modems. So, following a true Jevons' paradox, gaming companies developing real-time strategy
games focused on scaling up, from a few tens of units to hundreds, per player.

Consequently, the network became a main bottleneck - sending around information about hundreds to thousands of units (location, velocity, direction, status, and every other tracked variable), about 20 times per second as required in this game genre at the time, would quickly exceed the limit of about 3,000 bytes per second. Expert designers estimated these network conditions could support a couple of hundred, but not 1,000, units. In a game like Age of Empires, the target set by designers was even higher: 1,500 units across 8 players. How to ensure consistency under these
circumstances? (Similar situations continue to occur: For each significant advance in the speed of
the network and the processing power of the local gaming rig, game developers embark again on
new games that quickly exceed the new capabilities.)

One more ingredient is needed to have a game where the state of every unit - location, appearance,
activity, etc. - appears consistent across all players: the state needs to be the same at the same
moment because players are engaged in a synchronous contest against each other. So, the missing
ingredient is a synchronized clock linked to the consistency process.

Lock-step consistency occurs when simulations progress at the same rate and achieve the same
status at the end (or start) of each step (time tick).

One approach to achieve lock-step consistency is for all the computers in the distributed system running the game to synchronize their game clocks. Players input their commands to their local game engines, and each local game engine communicates them over the network to all other game engines. Then, every game engine updates the local status based on the input received, either locally or over the network. The game moves in lock-step, and each step consists of the sequence: input, communication, then local updates that include the input.

A main benefit of this approach is that it trades off communication for local computation: the communication is reduced to only the necessary updates, such as player inputs, and the game engines recompute the state of the game using dead reckoning and the inputs. The network bandwidth is therefore sufficient for a game like Age of Empires with 1,500 moving units.

As a drawback, this approach uses a sequence of three operations, which is prohibitive when the
target is to complete all of them in under 50 milliseconds (to enable updates 20 times per second,
as described earlier in the section). Suppose performance variability occurs in the distributed
system, either in transferring data over the Internet or in computing the updates, for any of the
players. In this case, the next step either cannot complete in time or has to wait for the slowest
player to complete (lock-step really means the step is locked until everyone completes it).

Another approach pipelines the communication and computation processes, that is, it updates the state while receiving input from players. To prevent inconsistent results, this approach again uses time ticks, and input received during one step is always enforced two steps later.

Advantages: Such an approach guarantees that performance variability in communicating input between players can be tolerated, as long as it does not exceed twice the step duration (so, 400 ms for the typical step duration of 200 ms). The tolerance threshold of 400 ms was not chosen lightly: it corresponds to the tolerance to latency players exhibit for this game genre, as reported by many empirical studies and summarized by meta-studies such as [2]. In other words, for this game genre players still enjoy their in-game experience, even when a latency of 400 ms is added to their input, as long as the results are smooth and consistent.

Disadvantages: As for the first approach, performance variability, which is predominantly caused
by the processing of the slowest computer in the distributed game or by the highest-latency Internet
connection, can cause problems. The problems occur only when the performance variability is
extreme, closer to 1,000 ms than to 400 ms above the typical performance at the time.

A third approach improves on the second by allowing turns to have variable lengths and thus
match performance variability. This approach works like the second whenever performance is stable
near normal levels: the turn length stays at 200 ms, with ticks for communication and computation
set at 50 ms.

Whenever performance degrades, this approach provides a technique to lengthen the step duration,
typically up to 1,000 ms; beyond this value, empirical studies indicate the game becomes much
less enjoyable. Not only can the turn lengthen when needed, but how it is allocated between
computation and communication tasks, next to local updates and rendering, can also change. This
approach allocates, from the total turn duration, more time for computation to accommodate a slower
computer among the players, or more time for communication to accommodate slow Internet
connections.

To make decisions on turn duration, and on its specific allocation to communication and computation
tasks, the system uses a distributed monitoring technique: each player reports to the leader
(the host for the typical Age of Empires deployment, an elected leader for the peer-to-peer
deployment), during each turn, the duration of its local computation task and the latency it
observed when sending a ping message to each other player. The leader then computes the
maximum of the received values for the computation and communication tasks, and makes appropriate
decisions. A typical situation could occur when the Internet latency increases, for example to 500
ms, with the turn length increasing correspondingly. In another typical situation, with some
computation tasks taking longer than usual, for example 95 ms, the turn would stay at the normal
duration, 200 ms, but the computation tick inside it would increase to 100 ms.
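
The leader's decision could be sketched as follows. The clamping bounds match the 200 ms and
1,000 ms values above, but the 100 ms reserved for the remaining ticks is an illustrative
assumption, not a value from the original system.

    // Hypothetical sketch of the leader's turn-length decision, based on the
    // per-turn reports described above (max computation time, max ping).
    public class TurnScheduler {
        static final long MIN_TURN_MS = 200, MAX_TURN_MS = 1000;

        // computeMs[i]: local computation time reported by player i;
        // pingMs[i]: worst ping player i observed to any other player.
        static long nextTurnLength(long[] computeMs, long[] pingMs) {
            long maxCompute = java.util.Arrays.stream(computeMs).max().orElse(0);
            long maxPing = java.util.Arrays.stream(pingMs).max().orElse(0);
            // Leave room (assumed 100 ms) for local updates and rendering.
            long needed = maxCompute + maxPing + 100;
            return Math.max(MIN_TURN_MS, Math.min(MAX_TURN_MS, needed));
        }

        public static void main(String[] args) {
            // e.g., slowest computation 95 ms, worst ping 60 ms -> 255 ms turn
            System.out.println(nextTurnLength(new long[]{40, 95}, new long[]{30, 60}));
        }
    }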

Beyond mere lock-step consistency: Lock-step approaches still suffer from a major drawback:
when every player has to simulate every input locally, the amount of computation can quickly
overwhelm slower computers, especially when game designers intend to scale well beyond 8
players, to possibly tens, hundreds, or thousands for real-time strategy games.

A common remedy combines two techniques, sketched below. First, we partition the virtual world into
areas so that the game engine can select only those of interest for each player. Second, the game
engine updates areas judiciously. Some areas do not receive updates because no player is interested
in them. Areas interesting for only one player are updated on that player's machine. Each area that
is interesting for two or more players is updated with lock-step or communication-only consistency
protocols, depending on the computation and communication capabilities of the players interested in
the area.
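
A minimal sketch of this selection logic, with hypothetical names, appears below; the mapping from
player interest to an update strategy is the essence of the technique.

    import java.util.*;

    // Hypothetical sketch of area-of-interest filtering: the world is split
    // into areas, and each area is updated only where players need it.
    public class AreaOfInterest {
        enum Update { SKIP, LOCAL_ONLY, SHARED_PROTOCOL }

        // Decide how to update an area from the set of interested players.
        static Update updateFor(Set<Integer> interestedPlayers) {
            if (interestedPlayers.isEmpty()) return Update.SKIP;         // nobody cares
            if (interestedPlayers.size() == 1) return Update.LOCAL_ONLY; // one machine
            // Two or more players: lock-step or communication-only protocol,
            // chosen from the players' computation and communication capabilities.
            return Update.SHARED_PROTOCOL;
        }

        public static void main(String[] args) {
            Map<String, Set<Integer>> interest = Map.of(
                    "forest", Set.of(), "base1", Set.of(1), "battlefield", Set.of(1, 2, 3));
            interest.forEach((area, players) ->
                    System.out.println(area + " -> " + updateFor(players)));
        }
    }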

In summary: Trading off communication for computation needs is a typical problem for online
games, virtual environments, and metaverses. Lock-step consistency provides a solution based on
this trade-off, with many desirable properties. Still, lock-step consistency is challenging when the
system exhibits unstable behavior, such as performance variability. In production, games must often
cope with unstable behavior. Then, monitoring the system carefully while it operates, and
conducting careful empirical studies of how players experience the game under different levels
of performance, is essential to addressing the unstable behavior satisfactorily.

Conit-based Consistency

Although lock-step consistency is useful, in games where many changes do not fit local
predictors, and for which dead reckoning and other computationally efficient techniques are
therefore difficult to find, it is better to allow some inconsistency to occur when scaling the
virtual world. Games such as Minecraft, in particular, could benefit from this.

Conits, an abbreviation of 'consistency unit', have been designed to support consistency approaches
where inconsistency can occur but should be quantified and managed. In the original design by Yu
and Vahdat [1], conits quantify three dimensions of inconsistency:

• Staleness - how old is this update?
• Numerical error - how large is the impact of this update?
• Order error - how does this update relate to other updates?

Any conit-based consistency protocol uses at least one conit to capture the inconsistency in the
system along the three dimensions. Time elapsed and data-changing operations lead to updates to
the conit state, typically increasing inconsistency values along one or more dimensions. At
runtime, when the limit of inconsistency set by the system operators is exceeded, the system
triggers a consistency-enforcing protocol and the conit is reset to (near-)zero inconsistency across
all dimensions.
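
A minimal sketch of a conit, assuming illustrative thresholds and hypothetical method names, could
look as follows; a real protocol would fill in the consistency-enforcing step.

    // Hypothetical sketch of a conit tracking the three inconsistency
    // dimensions from Yu and Vahdat's design; thresholds are illustrative.
    public class Conit {
        double stalenessSec, numericalError;
        int orderError;

        static final double MAX_STALENESS = 5.0, MAX_NUMERICAL = 100.0;
        static final int MAX_ORDER = 10;

        void onTimeElapsed(double sec) { stalenessSec += sec; check(); }

        void onTentativeWrite(double impact) { // a write not yet globally ordered
            numericalError += impact;
            orderError += 1;
            check();
        }

        private void check() {
            if (stalenessSec > MAX_STALENESS || numericalError > MAX_NUMERICAL
                    || orderError > MAX_ORDER) {
                enforceConsistency();
                stalenessSec = 0; numericalError = 0; orderError = 0; // reset conit
            }
        }

        void enforceConsistency() { /* e.g., synchronize with the other replicas */ }
    }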

Conits provide a versatile base for consistency approaches. Still, so far they have not been used
much in practice, for two main reasons: first, not many applications exist that can tolerate
significant amounts of inconsistency; second, setting the thresholds after which consistency must
be enforced is error-prone and application-dependent.

2.6 Replication in Distributed Systems in More Detail

To provide a seamless experience to their users, distributed systems often rely on data replication.
Replication allows companies such as Amazon, Dropbox, Google, and Netflix to move data close
to their users, significantly improving non-functional requirements such as latency and reliability.

We study in this section what replication is and what the main concerns are for the designer when
using replication. One of the main such concerns, consistency of data across the replicas, relates
to an important functional requirement and will be the focus of the next sections in this module.

What is Replication?
The core idea of replication is to repeat essential operations by duplicating, triplicating, and
generally multiplying the same service or physical resource, thread or virtual resource, or, at a
finer granularity and with a higher level of abstraction, data or code (computation).

Figure 1. Data and service sharing.

Like resource sharing, replication can occur (i) in time, where multiple replicas (instances) co-exist
on the same machine (node), simultaneously, or (ii) in space, where multiple instances exist on
multiple machines. Figure 1 illustrates how data or services could be replicated in time or in space.
For example, data replication in space (Figure 1, bottom-left quadrant) places copies of the data
from Node 1 on several other nodes, here, Node 2 through n. As another example, service
replication in time (Figure 1, top-right quadrant) launches copies of the service on the same node,
Node 1.

To clarify replication, we must further distinguish it from other operational techniques that use
copies of services, physical resources, threads, virtual resources, data, computation, etc.
Replication differs from other techniques in many ways, including:

1. Unlike partitioning data or computation, replication makes copies of (and then uses) entire
sources, so entire datasets, entire compute tasks, etc. (A variant of replication, more
selective, focuses on making copies of entire sources, but only if sources are considered
important enough.)
2. Unlike load balancing, replication makes copies of the entire workload. (Selective
replication only makes copies of the part of the workload considered important enough.)
3. Unlike data persistence, checkpointing, and backups, replication techniques repeatedly act
on the replicas, and access to the source replica is similar to accessing the other replicas.
4. Unlike speculative execution, replication techniques typically consider replicas as
independent contributors to overall progress with the workload.
5. Unlike migration, replication techniques continue to use the source.

General benefits of replication:

Replication can increase performance. When more replicas can service users, if each can deliver
roughly the performance of the source replica, the service effectively increases its performance
linearly with the number of replicas; in such cases, replication also increases the scalability of the
system. For example, grid and cloud computing systems replicate their servers, thus allowing the
system to scale to many users with similar needs.

When replicating in space, because the many nodes are unlikely to all be affected by the same
performance issue when completing a share of the workload, the entire system delivers relatively
stable performance; in this case, replication also decreases performance variability.

Geographical replication, where nodes can be placed close to users, can lead to important
performance gains, because the laws of physics, particularly the speed of light, bound how fast
information can travel.

Replication can lead to higher reliability and to what practice considers high availability: in a
system with more replicas, more of them need to fail before the entire system becomes unavailable,
relative to a system with only one replica. The danger of a single point of failure (see also the
discussion about scheduler architectures, in Module 4) is alleviated.

Consider one of the largest outages in the past year, occurring at Microsoft's Teams, Xbox Live,
and Azure services, an outage so serious it made the news, as reported by the BBC [1]. According
to the BBC, "tens of thousands of users" self-reported failures, and Microsoft admitted the failures
touched "a subset of users". However, the Microsoft approach of replicating their services, both in
time and in space, allowed most of its hundreds of millions of users to continue working, even
while the others were experiencing failures when accessing the replicas servicing them. The many
replicas of the same service prevent partial failures from spreading system-wide. In one of our
studies [2], we observed many of these situations across the services provided by Microsoft,
Amazon, Google, Apple, and others; some of these failures are not considered important enough
to be reported by these companies.

General drawbacks of replication:

In high-availability services, replication is a common mechanism to make the system available.
Typically, a single replica provides the service, with the others available if the primary replica
fails. Such systems cost more to operate. (The extra cost may be worthwhile. For example, many
business-critical services run with at least two replicas; although the cost of operating the IT
infrastructure effectively doubles, the higher likelihood that the services will be available when
needed prevents much more costly situations when the service is needed but not available. There is
also a reputational cost at play, where the absence of service may cause bad publicity well beyond
the cost of the service.)

When multiple replicas can perform the same service concurrently, their local state may become
different, a consequence of the different operations performed by each replica. In this situation, if
the application cannot tolerate the inconsistency, the distributed system must enforce a consistency
protocol to resolve the inconsistency, either immediately, at some point in time but with specific
guarantees, or eventually. As explained in the introduction, the CAP theorem indicates
consistency is one of the properties of distributed systems that cannot be easily achieved, and in
particular it presents trade-offs with availability (and performance, as we will learn at the end of
this module). So, enforcing consistency may offset and even negate some of the benefits discussed
earlier in this section.

Replication Approaches
In a small-scale distributed system, replication is typically achieved by executing the incoming
stream of tasks (requests in web and database applications, jobs in Module 4) either (i) passively,
where the execution happens on a single replica, which then broadcasts the results to the others, or
(ii) actively, where each replica receives the input stream of tasks and executes it. However, many
more considerations appear as soon as the distributed system becomes larger than a few nodes
serving a few clients.
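
The contrast between the two modes can be sketched as follows; the Replica interface and both
methods are hypothetical, introduced only to illustrate the difference.

    import java.util.List;

    // Hypothetical sketch contrasting passive and active replication.
    public class ReplicationModes {

        interface Replica {
            void applyState(String state);   // install a state computed elsewhere
            void execute(String request);    // execute a request locally
        }

        // Passive: a single replica executes, then broadcasts the resulting state.
        static void passive(Replica primary, List<Replica> backups, String request) {
            primary.execute(request);
            String result = "state-after-" + request; // stands in for the computed state
            for (Replica b : backups) b.applyState(result);
        }

        // Active: every replica receives the same request stream and executes it.
        // Execution must be deterministic, or the replicas will diverge.
        static void active(List<Replica> all, String request) {
            for (Replica r : all) r.execute(request);
        }
    }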

Many replication approaches have been developed and tested in practice in larger, even global-
scale, distributed systems. Depending on the scale and purpose of the system, we consider here the
principles of three aspects of the replication problem: (i) replica-server location, (ii) replica
placement, and (iii) replica updates. Many more aspects exist, and in general replication is a rich
problem that includes many of the issues present in resource management and scheduling problems
in distributed systems (see Module 4): Who? What? When? For how long? etc.

Replica-server location: Like any data or compute task, replicas require physical or virtual
machines on which to run. Thus, the problem of placing these machines, such that their locations
provide the best possible service to the system and a good trade-off with other considerations, is
important. This problem is particularly important for distributed systems with a highly
decentralized administration, for which decisions taken by the largely autonomous nodes can even
interfere with each other, and for distributed systems with highly volatile clients and particularly
those with high churn, where the presence of clients in one place or another can be difficult to
predict.

Replica-server location defines the conditions of a facility location problem: for example, finding
the best K locations out of the N possible, subject to many performance, cost, and other
constraints. Many theoretical solutions exist in Operations Research.
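
As a sketch, a simple greedy heuristic for this facility location problem could look as follows;
real deployments would add cost and other constraints, and the distance matrix here is illustrative.

    import java.util.*;

    // Hypothetical greedy sketch for the facility-location view of
    // replica-server location: pick K of N sites minimizing client distance.
    public class GreedyFacilityLocation {
        // dist[site][client]: latency (or distance) from a site to a client
        static List<Integer> pickSites(double[][] dist, int k) {
            int n = dist.length, clients = dist[0].length;
            List<Integer> chosen = new ArrayList<>();
            double[] best = new double[clients]; // best distance so far, per client
            Arrays.fill(best, Double.MAX_VALUE);
            for (int round = 0; round < k; round++) {
                int bestSite = -1; double bestCost = Double.MAX_VALUE;
                for (int s = 0; s < n; s++) {
                    if (chosen.contains(s)) continue;
                    double cost = 0; // total client cost if we also open site s
                    for (int c = 0; c < clients; c++)
                        cost += Math.min(best[c], dist[s][c]);
                    if (cost < bestCost) { bestCost = cost; bestSite = s; }
                }
                chosen.add(bestSite);
                for (int c = 0; c < clients; c++)
                    best[c] = Math.min(best[c], dist[bestSite][c]);
            }
            return chosen;
        }

        public static void main(String[] args) {
            double[][] dist = {{1, 9, 8}, {7, 2, 3}, {4, 4, 4}};
            System.out.println(pickSites(dist, 2)); // prints [1, 0]
        }
    }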

An interesting problem is how new replica locations should emerge. When replica-servers are
permanent, for example, when game operators run their shared sites or web operators mirror their
websites, all that is needed is to add a statically configured machine. However, to prevent resource
waste, it would be better to allow replica-servers to be added or removed as needed, in relation to
(anticipated) load. (This is the essential case of many modern IT operations, and underlies the
need for cloud and serverless computing.) In such a situation, a question derived from traditional
systems considerations is: who should trigger adding or removing a replica-server, the distributed
system or the client? A traditional answer is both, which means that the designer must (1) consider
whether to allow the replica-server system to be elastic, adding and removing replicas as needed,
and (2) enable both system- and client-initiated elasticity.

Replica placement: Various techniques can help with placing replicas on available replica-
servers.

A simplistic approach is to place one replica on each available replica-server. The main advantage
of this approach is that each location, presumably enabled because of its good capability to service
clients, can actually do so. The main disadvantages are that this approach does not enable
replication in time (that is, multiple replicas at the same location when enough resources exist),
and that rapid changes in demand or system conditions cannot be accommodated. (Longer-term
changes can be accommodated by good management of replica-server locations.)

Another approach is to use a multi-objective solver to place replicas in the replica-server topology.
The topological space can be divided, e.g., into Voronoi cells; conditions such as poor
connectivity between adjacent divisions can be taken into account; etc. Online algorithms often
use simplifications, such as partitioning the topology only along the main axes, and greedy
approaches, such as placing servers first in the most densely populated areas.

Replica updates:

What to update? Replicas need to achieve a consistent state, but how they do so can differ by
system and, in dynamic systems, even by replica (e.g., as in [3] for an online gaming
application). Two main classes of approaches exist: (i) updating from the result computed by one
replica (the coordinating replica), and (ii) updating from the stream of input operations that,
applied identically, will lead to the same outcome and thus a consistent state across all replicas.
(Note that (i) corresponds to the passive replication described at the start of this section,
whereas (ii) corresponds to active replication.)

Passive replication typically consumes fewer compute resources per replica receiving the result.
Conversely, active replication typically consumes fewer networking resources to send the update
to all replicas. In section 2.2.7, Consistency for Online Gaming, Virtual Environments, and the
Metaverse, we see how these trade-offs are important to manage for online games.

When to perform updates? With synchronous updates, all replicas perform the same update, which
has the advantage that the system will be in a consistent state at the end of each update, but also
the drawbacks of waiting for the slowest part of the system to complete the operation and of having
to update each replica even if this is not immediately necessary.

With asynchronous updates, the source informs the other replicas of the changes, often only that
a new operation has been performed or that enough time has elapsed since the last update.
Then, replicas mark their local data as (possibly) outdated. Each replica can decide if and when to
perform the update, lazily.
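
A minimal sketch of such lazy, invalidation-based updating, with hypothetical names:

    // Hypothetical sketch of asynchronous, lazy updates: the source only
    // invalidates; each replica fetches the new data when it needs it.
    public class LazyReplica {
        private String data = "v1";
        private volatile boolean outdated = false;

        void onInvalidation() { outdated = true; } // cheap message from the source

        String read() {
            if (outdated) {           // update lazily, only on demand
                data = fetchFromSource();
                outdated = false;
            }
            return data;
        }

        String fetchFromSource() { return "v2"; /* network call in a real system */ }
    }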

Whom? Who initiates the replica update is important.

With push-based protocols, the system propagates modifications to the replicas, informing the clients
when replica updates must occur. This means replicas in the system must be stateful, able to
determine the need to propagate modifications by inspecting the previous state. This approach is
useful for applications with a high ratio of operations that do not change the state to those
that do (e.g., high read:write ratios). With this approach, the system is more expensive to
operate, and typically less scalable, than when it does not have to maintain state.

With pull-based protocols, clients ask for updates. Different approaches exist: clients could poll
the system to check for updates, but if the polling frequency is too high the system can get
overloaded, and if it is too low (i) the client may get stale information from its local state, or
(ii) the client may have to wait for a relatively long time before obtaining the updated
information from the system, leading to low performance.

As is common in distributed systems, a hybrid approach could work better. Leases are such a
hybrid: push-based protocols are used while the lease is active, and pull-based protocols are used
outside the scope of the lease.
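
A minimal sketch of lease-based reading, assuming a hypothetical client-side cache:

    // Hypothetical sketch of a lease: the server pushes updates while the
    // lease is valid; after it expires, the client falls back to pulling.
    public class LeaseClient {
        private long leaseExpiresAtMs;

        void grantLease(long durationMs) {
            leaseExpiresAtMs = System.currentTimeMillis() + durationMs;
        }

        boolean leaseActive() { return System.currentTimeMillis() < leaseExpiresAtMs; }

        String read() {
            if (leaseActive()) {
                return localCopy();  // server pushes updates, so the local copy is fresh
            }
            return pullFromServer(); // lease expired: pull on every read
        }

        String localCopy() { return "cached"; }
        String pullFromServer() { return "fresh"; /* network call in a real system */ }
    }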

Exercise of chapter two
1) Using the naming schema and naming services concepts, describe how users and names are stored
in Active Directory and accessed using LDAP for authentication and other services.
2) Read about etcd to understand what it is and how it works. Then write what you learned about
etcd, using examples, diagrams, or any other means of description.
3) The consensus algorithm Raft and the direct democratic election process are closely related.
Write the similarities, the differences, and the gaps of each.
4) List different consistency techniques and describe them. Use a clear example for each technique
listed and described.
5) List the types of data replication and discuss them.
Programming Assignment One
a) Write a fully working remote procedure call (RPC) program using Java.
b) Write a fully working remote method invocation (RMI) program using Java.

References:
[1] BBC, "Microsoft says services have recovered after widespread outage," Jan 2022.
[2] Sacheendra Talluri, "Empirical Characterization of User Reports about Cloud Failures," 2021.
[3] More on Consistency and Replication
