
MC4202 ADVANCED DATABASE TECHNOLOGY

UNIT – I DISTRIBUTED DATABASES

What Are Distributed Systems?

A distributed system is a computing environment in which various components are spread across multiple computers (or other computing devices) on a network. These devices split up the work, coordinating their efforts to complete the job more efficiently than if a single device had been responsible for the task.

How do distributed systems communicate with each other?

Distributed System Architecture
Distributed systems must have a network that connects all components (machines, hardware, or software) together so they can transfer messages to communicate with each other. The components could be connected over an IP network, by cables, or even on a circuit board.

One of the major disadvantages of distributed systems is the complexity of the underlying hardware and software arrangements. This arrangement is generally known as a topology or an overlay. This is what provides the platform for distributed nodes to communicate and coordinate with each other as needed.

1. What is a Distributed System
2. Distributed System Architectures
3. Architectural Styles
4. System Level Architecture
5. A Comparison Between Client Server and Peer to Peer Architectures
6. Middleware in Distributed Applications
7. Centralized vs. Decentralized Architectures
8. Summary on Structured and Unstructured P2P Systems

1) What is a Distributed System?
A distributed system is a software system that interconnects a collection of heterogeneous independent computers, where coordination and communication between computers only happen through message passing, with the intention of working towards a common goal. The idea behind distributed systems is to present a single coherent system to the outside world. The set of independent computers or nodes is interconnected through a Local Area Network (LAN) or a Wide Area Network (WAN).

2) Distributed System Architectures
In this blog, I would like to talk about the available distributed system architectures that we see today and how they are being utilized in our day to day applications. Distributed system architectures are bundled up with components and connectors. Components can be individual nodes or important components in the architecture, whereas connectors are the ones that connect each of these components.
• Component: A modular unit with well-defined interfaces; replaceable; reusable
• Connector: A communication link between modules which mediates coordination or cooperation among components
So the idea behind distributed architectures is to have these components presented on different platforms, where components can communicate with each other over a communication network in order to achieve specific objectives.

3) Architectural Styles
There are four different architectural styles, plus the hybrid architecture, when it comes to distributed systems. The basic idea is to organize logically different components, and distribute those components over the various machines.
• Layered Architecture
• Object Based Architecture
• Data-centered Architecture
• Event Based Architecture
• Hybrid Architecture

Layered Architecture
The layered architecture separates layers of components from each other, giving it a much more modular approach. A well known example of this is the OSI model, which uses a layered architecture for the interaction between its components. Each interaction is sequential: a layer contacts the adjacent layer, and this process continues until the request has been catered to. In certain cases, the implementation can be made so that some layers are skipped, which is called cross-layer coordination. Through cross-layer coordination, one can obtain better results due to the performance increase.
The layers on the bottom provide a service to the layers on the top. The request flows from top to bottom, whereas the response is sent from bottom to top. The advantage of using this approach is that the calls always follow a predefined path, and that each layer can be easily replaced or modified without affecting the entire architecture.

Object Based Architecture
This architecture style is based on a loosely coupled arrangement of objects. It has no specific architecture like layers. Unlike in layers, it does not have a sequential set of steps that needs to be carried out for a given call. Each of the components is referred to as an object, where each object can interact with other objects through a given connector or interface. These are much

more direct where all the different components can interact directly with other components through a direct method call.

As shown in the above image, communication between objects happens as method invocations. These are generally called Remote Procedure Calls (RPC). Some popular examples are Java RMI, Web Services and REST API calls. This has the following properties.
• This architecture style is less structured.
• component = object
• connector = RPC or RMI
When decoupling these processes in space, people wanted the components to be anonymous and replaceable. The synchronization process also needed to be asynchronous, which has led to Data Centered Architectures and Event Based Architectures.

Data Centered Architecture
• As the title suggests, this architecture is based on a data center, where the primary communication happens via a central data repository.
• This common repository can be either active or passive.
• This is more like a producer consumer problem.
• The producers produce items to a common data store, and the consumers can request data from it.
• This common repository could even be a simple database. The idea is that the communication between objects happens through this shared common storage.
• This supports different components (or objects) by providing a persistent storage space for those components (such as a MySQL database).
• All the information related to the nodes in the system is stored in this persistent storage. In event-based architectures, data is only sent and received by those components who have already subscribed.
Some popular examples are distributed file systems, producer consumer, and web based data services.

Event Based Architecture
• The entire communication in this kind of a system happens through events. When an event is generated, it will be sent to the bus system. With this, everyone else will be notified that such an event has occurred. So, if anyone is interested, that node can pull the event from the bus and use it. Sometimes these events could be data, or even URLs to resources. So the receiver can access whatever information is given in the event and process it accordingly.
• These events occasionally carry data. An advantage of this architectural style is that components are loosely coupled. So it is easy to add, remove and modify components in the system.
• One major advantage is that these heterogeneous components can contact the bus through any communication protocol. An ESB or a specific bus has the capability to handle any type of incoming request and process it accordingly.

This architectural style is based on the publisher-subscriber architecture. Between each node there is no direct communication or coordination. Instead, objects which are subscribed to the service communicate through the event bus.
The event based architecture supports several communication styles.
• Publisher-subscriber
• Broadcast
• Point-to-Point
The major advantage of this architecture is that the components are decoupled in space (loosely coupled).

4) System Level Architecture
The two major system level architectures that we use today are Client-server and Peer-to-peer (P2P). We use these two kinds of services in our day to day lives, but the difference between these two is often misinterpreted.

Client Server Architecture
The client server architecture has two major components.
• The client and
• The server.
• The Server is where all the processing, computing and data handling happens, whereas the Client is where the user can access the services and resources given by the Server (Remote Server).
• The clients can make requests to the Server, and the Server will respond accordingly.
• Generally, there is only one server that handles the remote side. But to be on the safe side, we do use multiple servers with load balancing techniques.
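The client server section notes that multiple servers are used together with load balancing techniques. The source does not name a particular technique, so the following is only a minimal sketch of one common option, round-robin dispatch. The class name `RoundRobinBalancer` and the server names are hypothetical and chosen purely for illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        # cycle() repeats the server list endlessly in the same order.
        self._rotation = cycle(list(servers))

    def route(self, request):
        # Pick the next server in the rotation and pair it with the request.
        server = next(self._rotation)
        return server, request


# Hypothetical server names, for illustration only.
balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
assignments = [balancer.route(f"req-{i}")[0] for i in range(5)]
print(assignments)
```

Round-robin spreads requests evenly but ignores how busy each server is; a real deployment might weigh servers by load or capacity instead.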


As one common design feature, the Client Server architecture has a centralized security database. This database contains security details like credentials and access details. Users can't log in to a server without the security credentials. So, it makes this architecture a bit more stable and secure than Peer to Peer. The stability comes from the security database, which can allow resource usage in a much more meaningful way. But on the other hand, the system might get slow, as the server can only handle a limited amount of workload at a given time.
Advantages:
• Easier to Build and Maintain
• Better Security
• Stable
Disadvantages:
• Single point of failure
• Less scalable

Peer to Peer (P2P)
The general idea behind peer to peer is that there is no central control in a distributed system. Each node can be either a client or a server at a given time. If the node is requesting something, it can be known as a client, and if some node is providing something, it can be known as a server. In general, each node is referred to as a Peer.

In this network, any new node has to first join the network. After joining, it can either request a service or provide a service. The initiation phase of a node (joining of a node) can vary according to the implementation of the network. There are two ways in which a new node can get to know what other nodes are providing.
• Centralized Lookup Server - The new node has to register with the centralized lookup server and mention the services it will be providing on the network. So, whenever you want to have a service, you simply have to contact the centralized lookup server and it will direct you to the relevant service provider.
• Decentralized System - A node desiring specific services must broadcast and ask every other node in the network, so that whoever is providing the service will respond.

5) A Comparison between Client Server and Peer to Peer Architectures

6) Middleware in Distributed Applications
If we look at distributed systems today, they lack uniformity and consistency. Various heterogeneous devices have taken over the world, and distributed systems cater to all these devices in a common way. One way distributed systems can achieve uniformity is through a common layer to support the underlying hardware and operating systems. This common layer is known as middleware. It provides services beyond what is already provided by operating systems, to enable various features and components of a distributed system to enhance its functionality. This layer provides certain data structures and operations that allow processes and users on far-flung machines to inter-operate and work together in a consistent way. The image given below depicts the usage of middleware to inter-connect various kinds of nodes together.

7) Centralized vs Decentralized Architectures
The two main structures that we see within distributed system overlays are Centralized and Decentralized architectures. The centralized architecture can be explained by a simple client-server architecture where the server acts as a central unit. This can also be considered as a centralized lookup table with the following characteristics.
• Low overhead
• Single point of failure
• Easy to Track
• Additional Overhead.
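The two ways a peer can discover who provides a service (registering with a centralized lookup server, or broadcasting to every node) can be contrasted with a small sketch. It runs in a single process with no networking. The names `LookupServer`, `Peer` and `broadcast_find`, and the peer and service names, are hypothetical and not taken from the source.

```python
class LookupServer:
    """Centralized registry: peers register services, others ask where they are."""

    def __init__(self):
        self._providers = {}  # service name -> list of peer names

    def register(self, peer_name, service):
        self._providers.setdefault(service, []).append(peer_name)

    def find(self, service):
        return list(self._providers.get(service, []))


class Peer:
    def __init__(self, name, services=()):
        self.name = name
        self.services = set(services)

    def provides(self, service):
        return service in self.services


def broadcast_find(peers, service):
    """Decentralized discovery: ask every peer and collect those that answer yes."""
    return [p.name for p in peers if p.provides(service)]


peers = [Peer("p1", ["storage"]), Peer("p2", ["compute"]), Peer("p3", ["storage"])]

# Centralized: each peer registers once, and a lookup is a single query.
server = LookupServer()
for peer in peers:
    for service in peer.services:
        server.register(peer.name, service)
central_result = server.find("storage")

# Decentralized: a lookup has to contact every peer in the network.
broadcast_result = broadcast_find(peers, "storage")

print(central_result, broadcast_result)
```

Both approaches return the same providers here. The difference is in cost and risk: the registry answers with one query but is a single point of failure, while the broadcast needs no central unit but touches every node on each lookup.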

When it comes to distributed systems, we are more interested in studying the overlay and unstructured network topologies that we can see today. In general, the peer to peer systems that we see today can be separated into three unique sections.
• Structured P2P: nodes are organized following a specific distributed data structure
• Unstructured P2P: nodes have randomly selected neighbors
• Hybrid P2P: some nodes are appointed special functions in a well-organized fashion

Structured P2P Architecture
The meaning of the word structured is that the system already has a predefined structure that other nodes will follow. Every structured network inherently suffers from poor scalability, due to the need for structure maintenance. In general, the nodes in a structured overlay network are formed in a logical ring, with nodes being connected to this ring. In this ring, certain nodes are responsible for certain services.
A common approach that can be used to tackle the coordination between nodes is to use distributed hash tables (DHTs). A traditional hash function converts a unique key into a hash value that will represent an object in the network. The hash function value is used to insert an object in the hash table and to retrieve it.

In a DHT, each key is assigned to a unique hash, where the random hash value needs to come from a very large address space in order to ensure uniqueness. A mapping function is used to assign objects to nodes based on the hash function value. A lookup based on the hash function value returns the network address of the node that stores the requested object.
• Hash Function: Takes a key and produces a unique hash value
• Mapping Function: Maps the hash value to a specific node in the system
• Lookup table: Returns the network address of the node represented by the unique hash value.

Unstructured P2P Systems
There is no specific structure in these systems, hence the name "unstructured networks". Due to this reason, the scalability of unstructured p2p systems is very high. These systems rely on randomized algorithms for constructing an overlay network. Unlike in structured p2p systems, there is no specific path to a certain node. It is generally random, where every unstructured system tries to maintain a random path. Due to this reason, the search for a certain file or node is never guaranteed in unstructured systems.
The basic principle is that each node is required to randomly select another node, and contact it.
• Let each peer maintain a partial view of the network, consisting of n other nodes
• Each node P periodically selects a node Q from its partial view
• P and Q exchange information and exchange members from their respective partial views

Hybrid P2P Systems
Hybrid systems are often based on both client server architectures and p2p networks. A famous example is BitTorrent, which we use every day. The torrent search engines provide a client server architecture, whereas the trackers provide a structured p2p overlay. The rest of the nodes, which are also known as leechers and seeders, become the unstructured overlay of the network, allowing it to scale itself as needed and further.

Summary on Structured and Unstructured P2P Systems
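As a recap of the structured approach, the three DHT pieces named in the text (hash function, mapping function, lookup table) can be put together in a short sketch. The source does not say how hash values are mapped to nodes, so the choice below is an assumption: a key goes to the first node at or after its hash on the ring, wrapping around at the end. The names `hash_key` and `Ring` and the node addresses are hypothetical.

```python
import hashlib
from bisect import bisect_left


def hash_key(key, bits=32):
    """Hash function: map a key to an integer in a large address space."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class Ring:
    """Toy structured overlay: nodes sit on a logical ring ordered by hash value."""

    def __init__(self, node_addresses):
        # Lookup table: node position on the ring -> network address.
        self._table = {hash_key(addr): addr for addr in node_addresses}
        self._positions = sorted(self._table)

    def node_for(self, key):
        """Mapping function (assumed rule): first node at or after the key's hash."""
        h = hash_key(key)
        # The modulo wraps keys past the last node back to the first node.
        index = bisect_left(self._positions, h) % len(self._positions)
        return self._table[self._positions[index]]


# Hypothetical node addresses, for illustration only.
addresses = ["10.0.0.1:7000", "10.0.0.2:7000", "10.0.0.3:7000"]
ring = Ring(addresses)
owner = ring.node_for("report.pdf")
print(owner)
```

Because every node computes the same hash and applies the same rule, any node can work out which peer holds a key without asking a central server. This is what separates a structured overlay from the random contact used in unstructured systems.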


DISTRIBUTED DATABASE CONCEPTS

WHAT IS DISTRIBUTED DATABASE?

A distributed database is defined as a logically related collection of shared data that is physically distributed over a computer network on different sites. The distributed DBMS is defined as the software that allows for the management of the distributed database and makes the distributed data available to the users.

A database is an ordered collection of related data. A DBMS is a software package to work upon a database.
The three topics covered are database schemas, types of databases and operations on databases.

Database and Database Management System

A database is an ordered collection of related data that is built for a specific purpose. A database may be organized as a collection of multiple tables, where a table represents a real world element or entity. Each table has several different fields that represent the characteristic features of the entity.
For example, a company database may include tables for projects, employees, departments, products and financial records. The fields in the Employee table may be Name, Company_Id, Date_of_Joining, and so forth.
A database management system is a collection of programs that enables creation and maintenance of a database. A DBMS is available as a software package that facilitates definition, construction, manipulation and sharing of data in a database. Definition of a database includes description of the structure of a database. Construction of a database involves actual storing of the data in any storage medium. Manipulation refers to retrieving information from the database, updating the database and generating reports. Sharing of data allows data to be accessed by different users or programs.

Examples of DBMS Application Areas
• Automatic Teller Machines
• Train Reservation System
• Employee Management System
• Student Information System

Examples of DBMS Packages
• MySQL
• Oracle
• SQL Server
• dBASE
• FoxPro
• PostgreSQL, etc.

Database Schemas

A database schema is a description of the database which is specified during database design and subject to infrequent alterations. It defines the organization of the data, the relationships among them, and the constraints associated with them.
Databases are often represented through the three-schema architecture or ANSI-SPARC architecture. The goal of this architecture is to separate the user application from the physical database. The three levels are −
• Internal Level having Internal Schema − It describes the physical structure, details of internal storage and access paths for the database.


• Conceptual Level having Conceptual Schema − It describes the structure of the whole database while hiding the details of physical storage of data. This illustrates the entities, attributes with their data types and constraints, user operations and relationships.
• External or View Level having External Schemas or Views − It describes the portion of a database relevant to a particular user or a group of users while hiding the rest of the database.

Types of DBMS

There are four types of DBMS.

1. Hierarchical DBMS
In hierarchical DBMS, the relationships among data in the database are established so that one data element exists as a subordinate of another. The data elements have parent-child relationships and are modelled using the “tree” data structure. These are very fast and simple.

2. Network DBMS
Network DBMS is one where the relationships among data in the database are of type many-to-many in the form of a network. The structure is generally complicated due to the existence of numerous many-to-many relationships. Network DBMS is modelled using the “graph” data structure.

3. Relational DBMS
In relational databases, the database is represented in the form of relations. Each relation models an entity and is represented as a table of values. In the relation or table, a row is called a tuple and denotes a single record. A column is called a field or an attribute and denotes a characteristic property of the entity. RDBMS is the most popular database management system.
For example − A Student Relation −

4. Object Oriented DBMS
Object-oriented DBMS is derived from the model of the object-oriented programming paradigm. They are helpful in representing both persistent data, as stored in databases, and transient data, as found in executing programs. They use small, reusable elements called objects. Each object contains a data part and a set of operations which works upon the data. The object and its attributes are accessed through pointers instead of being stored in relational table models.
For example − A simplified Bank Account object-oriented database −

Distributed DBMS

A distributed database is a set of interconnected databases that is distributed over the computer network or internet. A Distributed Database Management System (DDBMS) manages the distributed database and provides mechanisms so as to make the databases transparent to the users. In these systems, data is intentionally distributed among multiple nodes so that all computing resources of the organization can be optimally used.

Operations on DBMS

The four basic operations on a database are Create, Retrieve, Update and Delete.
• CREATE database structure and populate it with data − Creation of a database relation involves specifying the data structures, data types and the constraints of the data to be stored.
Example − SQL command to create a student table −
CREATE TABLE STUDENT (

    ROLL INTEGER PRIMARY KEY,
    NAME VARCHAR2(25),
    YEAR INTEGER,
    STREAM VARCHAR2(30)
);
• Once the data format is defined, the actual data is stored in accordance with the format in some storage medium.
Example − SQL command to insert a single tuple into the student table −
INSERT INTO STUDENT (ROLL, NAME, YEAR, STREAM)
VALUES (1, 'ANKIT JHA', 1, 'COMPUTER SCIENCE');
• RETRIEVE information from the database – Retrieving information generally involves selecting a subset of a table or displaying data from the table after some computations have been done. It is done by querying upon the table.
Example − To retrieve the names of all students of the Computer Science stream, the following SQL query needs to be executed −
SELECT NAME FROM STUDENT
WHERE STREAM = 'COMPUTER SCIENCE';
• UPDATE information stored and modify database structure – Updating a table involves changing old values in the existing table’s rows with new values.
Example − SQL command to change stream from Electronics to Electronics and Communications −
UPDATE STUDENT
SET STREAM = 'ELECTRONICS AND COMMUNICATIONS'
WHERE STREAM = 'ELECTRONICS';
• Modifying a database means changing the structure of a table. However, modification of the table is subject to a number of restrictions.
Example − To add a new field or column, say address, to the Student table, we use the following SQL command −
ALTER TABLE STUDENT
ADD ( ADDRESS VARCHAR2(50) );
• DELETE information stored or delete a table as a whole – Deletion of specific information involves removal of selected rows from the table that satisfy certain conditions.
Example − To delete all students who are currently in 4th year when they are passing out, we use the SQL command −
DELETE FROM STUDENT
WHERE YEAR = 4;
• Alternatively, the whole table may be removed from the database.
Example − To remove the student table completely, the SQL command used is −
DROP TABLE STUDENT;

A distributed database is a collection of multiple interconnected databases, which are spread physically across various locations that communicate via a computer network.

Features

• Databases in the collection are logically interrelated with each other. Often they represent a single logical database.
• Data is physically stored across multiple sites. Data in each site can be managed by a DBMS independent of the other sites.
• The processors in the sites are connected via a network. They do not have any multiprocessor configuration.
• A distributed database is not a loosely connected file system.
• A distributed database incorporates transaction processing, but it is not synonymous with a transaction processing system.

Distributed Database Management System

A distributed database management system (DDBMS) is a centralized software system that manages a distributed database in a manner as if it were all stored in a single location.

Features
• It is used to create, retrieve, update and delete distributed databases.
• It synchronizes the database periodically and provides access mechanisms by virtue of which the distribution becomes transparent to the users.
• It ensures that the data modified at any site is universally updated.
• It is used in application areas where large volumes of data are processed and accessed by numerous users simultaneously.
• It is designed for heterogeneous database platforms.
• It maintains confidentiality and data integrity of the databases.

Factors Encouraging DDBMS

The following factors encourage moving over to DDBMS −
• Distributed Nature of Organizational Units − Most organizations in the current times are subdivided into multiple units that are physically distributed over the globe. Each unit requires its own set of local data. Thus, the overall database of the organization becomes distributed.
• Need for Sharing of Data − The multiple organizational units often need to communicate with each other and share their data and resources. This demands common databases or replicated databases that should be used in a synchronized manner.
• Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) work upon diversified systems which may have common data. Distributed database systems aid both kinds of processing by providing synchronized data.
• Database Recovery − One of the common techniques used in DDBMS is replication of data across different sites. Replication of data automatically helps in data recovery if the database in any site is damaged. Users can access data from other sites while the damaged site is being reconstructed. Thus, database failure may become almost inconspicuous to users.
• Support for Multiple Application Software − Most organizations use a variety of application software, each with its specific database support. DDBMS provides a uniform functionality for using the same data among different platforms.

Advantages of Distributed Databases


Following are the advantages of distributed databases over centralized databases.
Modular Development − If the system needs to be expanded to new locations or new units, in centralized database systems, the action requires substantial efforts and disruption in the existing functioning. However, in distributed databases, the work simply requires adding new computers and local data to the new site and finally connecting them to the distributed system, with no interruption in current functions.
More Reliable − In case of database failures, the total system of centralized databases comes to a halt. However, in distributed systems, when a component fails, the system continues to function, possibly at reduced performance. Hence DDBMS is more reliable.
Better Response − If data is distributed in an efficient manner, then user requests can be met from local data itself, thus providing faster response. On the other hand, in centralized systems, all queries have to pass through the central computer for processing, which increases the response time.
Lower Communication Cost − In distributed database systems, if data is located locally or where it is mostly used, then the communication costs for data manipulation can be minimized. This is not feasible in centralized systems.

Adversities of Distributed Databases

Following are some of the adversities associated with distributed databases.

• Need for complex and expensive software − DDBMS demands complex and often expensive software to provide data transparency and co-ordination across the several sites.
• Processing overhead − Even simple operations may require a large number of communications and additional calculations to provide uniformity in data across the sites.
• Data integrity − The need for updating data in multiple sites poses problems of data integrity.
• Overheads for improper data distribution − Responsiveness of queries is largely dependent upon proper data distribution. Improper data distribution often leads to very slow response to user requests.

Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database environments, each with further sub-divisions, as shown in the following illustration.

Homogeneous Distributed Databases
In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its properties are −
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other sites to process user requests.
• The database is accessed through a single interface as if it is a single database.

Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
• Autonomous − Each database is independent and functions on its own. They are integrated by a controlling application and use message passing to share data updates.
• Non-autonomous − Data is distributed across the homogeneous nodes and a central or master DBMS co-ordinates data updates across the sites.

Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating systems, DBMS products and data models. Its properties are −
• Different sites use dissimilar schemas and software.
• The system may be composed of a variety of DBMSs like relational, network, hierarchical or object oriented.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar software.
• A site may not be aware of other sites and so there is limited co-operation in processing user requests.

Types of Heterogeneous Distributed Databases
• Federated − The heterogeneous database systems are independent in nature and integrated together so that they function as a single database system.
• Un-federated − The database systems employ a central coordinating module through which the databases are accessed.

Distributed DBMS Architectures

DDBMS architectures are generally developed depending on three parameters −

• Distribution − It states the physical distribution of data across the different sites.
• Autonomy − It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently.
• Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system components and databases.

Architectural Models

Some of the common architectural models are −

• Client - Server Architecture for DDBMS
• Peer - to - Peer Architecture for DDBMS
• Multi - DBMS Architecture


Client - Server Architecture for DDBMS
This is a two-level architecture where the functionality is divided into servers and clients. The server functions primarily encompass data management, query processing, optimization and transaction management. Client functions mainly include the user interface. However, clients also have some functions like consistency checking and transaction management.

The two different client - server architectures are −

• Single Server Multiple Client
• Multiple Server Multiple Client (shown in the following diagram)

Peer - to - Peer Architecture for DDBMS

In these systems, each peer acts both as a client and a server for imparting database services. The peers share their resources with other peers and co-ordinate their activities.
This architecture generally has four levels of schemas −
• Global Conceptual Schema − Depicts the global logical view of data.
• Local Conceptual Schema − Depicts logical data organization at each site.
• Local Internal Schema − Depicts physical data organization at each site.
• External Schema − Depicts user view of data.

Multi - DBMS Architecture
This is an integrated database system formed by a collection of two or more autonomous database systems.
Multi-DBMS can be expressed through six levels of schemas −

• Multi-database View Level − Depicts multiple user views comprising subsets of the integrated distributed database.
• Multi-database Conceptual Level − Depicts the integrated multi-database that comprises global logical multi-database structure definitions.
• Multi-database Internal Level − Depicts the data distribution across different sites and multi-database to local data mapping.
• Local database View Level − Depicts public view of local data.
• Local database Conceptual Level − Depicts local data organization at each site.
• Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS −
• Model with multi-database conceptual level.
• Model without multi-database conceptual level.

DISTRIBUTED DATA STORAGE
Distributed database storage is managed in two ways: In database replication, the systems
store copies of data on different sites. If an entire database is available on multiple sites, it is a
fully redundant database.

Distributed databases are used for horizontal scaling, and they are designed to meet the
workload requirements without having to make changes in the database application or
vertically scale a single machine.

Distributed databases resolve various issues, such as availability, fault tolerance,


throughput, latency, scalability, and many other problems that can arise from using a single
machine and a single database.
Multi - DBMS Architectures
Distributed Database Definition


A distributed database represents multiple interconnected databases spread out across
several sites connected by a network. Since the databases are all connected, they appear as a
single database to the users.

Distributed databases utilize multiple nodes. They scale horizontally and develop a distributed
system. More nodes in the system provide more computing power, offer greater availability,
and resolve the single point of failure issue.

Different parts of the distributed database are stored in several physical locations, and the
processing requirements are distributed among processors on multiple database nodes.

A centralized distributed database management system (DDBMS) manages the distributed
data as if it were stored in one physical location. DDBMS synchronizes all data operations
among databases and ensures that the updates in one database automatically reflect on
databases in other sites.

Distributed Database Features

Some general features of distributed databases are:

 Location independency - Data is physically stored at multiple sites and managed by
an independent DDBMS.
 Distributed query processing - Distributed databases answer queries in a
distributed environment that manages data at multiple sites. High-level queries are
transformed into a query execution plan for simpler management.
 Distributed transaction management - Provides a consistent distributed database
through commit protocols, distributed concurrency control techniques, and distributed
recovery methods in case of many transactions and failures.
 Seamless integration - Databases in a collection usually represent a single logical
database, and they are interconnected.
 Network linking - All databases in a collection are linked by a network and
communicate with each other.
 Transaction processing - Distributed databases incorporate transaction processing,
which is a program including a collection of one or more database operations.
Transaction processing is an atomic process that is either entirely executed or not at
all.

Distributed Database Types : There are two types of distributed databases:

 Homogeneous
 Heterogeneous

Homogeneous :

A homogeneous distributed database is a network of identical databases stored on multiple
sites. The sites have the same operating system, DDBMS, and data structure, making them
easily manageable.

Homogeneous databases allow users to access data from each of the databases seamlessly.

The following diagram shows an example of a homogeneous database:

Heterogeneous : A heterogeneous distributed database uses different schemas, operating
systems, DDBMS, and different data models.

In the case of a heterogeneous distributed database, a particular site can be completely
unaware of other sites, causing limited cooperation in processing user requests. The limitation
is why translations are required to establish communication between sites.

The following diagram shows an example of a heterogeneous database:

Distributed Database Storage


Distributed database storage is managed in two ways:

 Replication
 Fragmentation

Replication

In database replication, the systems store copies of data on different sites. If an entire
database is available on multiple sites, it is a fully redundant database.

The advantage of database replication is that it increases data availability on different sites
and allows for parallel query requests to be processed.

However, database replication means that data requires constant updates and
synchronization with other sites to maintain an exact database copy. Any changes made on
one site must be recorded on other sites, or else inconsistencies occur.

Constant updates cause a lot of server overhead and complicate concurrency control, as a lot
of concurrent queries must be checked in all available sites.

Distributed Database Advantages and Disadvantages

Advantages                    Disadvantages
Modular development           Costly software
Reliability                   Large overhead
Lower communication costs     Data integrity
Better response               Improper data distribution

 What is a Distributed Transaction?

A distributed transaction is a set of operations on data that is performed
across two or more data repositories (especially databases). It is typically
coordinated across separate nodes connected by a network, but may also span
multiple databases on a single server.

 There are two possible outcomes: 1) all operations successfully complete, or 2) none
of the operations are performed at all due to a failure somewhere in the system. In
the latter case, if some work was completed prior to the failure, that work will be
reversed to ensure no net work was done. This type of operation is in compliance
with the “ACID” (atomicity-consistency-isolation-durability) principles of databases
that ensure data integrity. ACID is most commonly associated with transactions on a
single database server, but distributed transactions extend that guarantee across
multiple databases.
 The operation known as a “two-phase commit” (2PC) is a form of a distributed
transaction. “XA transactions” are transactions using the XA protocol, which is one
implementation of a two-phase commit operation.

Fragmentation

When it comes to fragmentation of distributed database storage, the relations are
fragmented, which means they are split into smaller parts. Each of the fragments is stored
on a different site, where it is required.

The prerequisite for fragmentation is to make sure that the fragments can later be
reconstructed into the original relation without losing data.

The advantage of fragmentation is that there are no data copies, which prevents data
inconsistency.

There are two types of fragmentation:

 Horizontal fragmentation - The relation is fragmented into groups of rows (tuples),
and each group is assigned to one fragment.
 Vertical fragmentation - The relation schema is fragmented into smaller schemas,
and each fragment contains a common candidate key to guarantee a lossless join.
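
As an illustration of horizontal and vertical fragmentation and of the reconstruction prerequisite stated above, the following is a minimal Python sketch. The EMPLOYEE relation, its column names, and the helper functions are hypothetical and chosen only for the example; a real DDBMS performs these steps internally.

```python
# Hypothetical EMPLOYEE relation; "id" plays the role of the candidate key.
employees = [
    {"id": 1, "name": "Asha", "dept": "Sales", "salary": 500},
    {"id": 2, "name": "Ravi", "dept": "HR", "salary": 450},
    {"id": 3, "name": "Mala", "dept": "Sales", "salary": 520},
]


def horizontal_fragment(rows, predicate):
    """Keep whole rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def vertical_fragment(rows, key, columns):
    """Keep the key plus the chosen columns of every row."""
    return [{name: row[name] for name in (key, *columns)} for row in rows]


def join_on_key(left, right, key):
    """Rebuild full rows by matching two vertical fragments on the key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


def by_id(rows):
    """Sort rows by key so that relations can be compared as lists."""
    return sorted(rows, key=lambda row: row["id"])


# Horizontal fragmentation: groups of rows, reconstructed by union.
sales = horizontal_fragment(employees, lambda row: row["dept"] == "Sales")
others = horizontal_fragment(employees, lambda row: row["dept"] != "Sales")
rebuilt_from_rows = by_id(sales + others)

# Vertical fragmentation: both fragments carry "id", reconstructed by join.
personal = vertical_fragment(employees, "id", ["name", "dept"])
payroll = vertical_fragment(employees, "id", ["salary"])
rebuilt_from_columns = join_on_key(personal, payroll, "id")
```

In both cases the rebuilt relation equals the original one, which is the lossless reconstruction the prerequisite asks for. Dropping the shared key from one vertical fragment would make the join impossible.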


A distributed transaction spans multiple databases and guarantees data integrity.

How Do Distributed Transactions Work?

 Distributed transactions have the same processing completion requirements as
regular database transactions, but they must be managed across multiple resources,
making them more challenging to implement for database developers. The multiple
resources add more points of failure, such as the separate software systems that run
the resources (e.g., the database software), the extra hardware servers, and network
failures. This makes distributed transactions susceptible to failures, which is why
safeguards must be put in place to retain data integrity.
 For a distributed transaction to occur, transaction managers coordinate the
resources (either multiple databases or multiple nodes of a single database). The
transaction manager can be one of the data repositories that will be updated as part
of the transaction, or it can be a completely independent separate resource that is
only responsible for coordination. The transaction manager decides whether to
commit a successful transaction or roll back an unsuccessful transaction, the latter of
which leaves the database unchanged.
 First, an application submits the distributed transaction to the transaction
manager. The transaction manager then branches to each resource, which will have
its own “resource manager” to help it participate in distributed transactions.
Distributed transactions are often done in two phases to safeguard against partial
updates that might occur when a failure is encountered. The first phase involves
acknowledging intent to commit, or a “prepare-to-commit” phase. After all resources
acknowledge, they are then asked to run a final commit, and then the transaction is
completed.

COMMIT PROTOCOLS

In a local database system, for committing a transaction, the transaction manager only has to
convey the decision to commit to the recovery manager. However, in a distributed system, the
transaction manager should convey the decision to commit to all the servers in the various
sites where the transaction is being executed and uniformly enforce the decision. When
processing is complete at each site, it reaches the partially committed transaction state and
waits for all other transactions to reach their partially committed states. When it receives the
message that all the sites are ready to commit, it starts to commit. In a distributed system,
either all sites commit or none of them does.
The different distributed commit protocols are −
 One-phase commit
 Two-phase commit
 Three-phase commit

 Distributed One-phase Commit

Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a
controlling site and a number of slave sites where the transaction is being executed. The steps
in distributed commit are −
 After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site.
 The slaves wait for “Commit” or “Abort” message from the controlling site. This waiting
time is called window of vulnerability.
 When the controlling site receives “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. Then, it sends this message
to all the slaves.
 On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.

 Distributed Two-phase Commit

Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The
steps performed in the two phases are as follows −

Phase 1: Prepare Phase
 After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site. When the controlling site has received “DONE” message from all slaves,
it sends a “Prepare” message to the slaves.
 The slaves vote on whether they still want to commit or not. If a slave wants to commit,
it sends a “Ready” message.
 A slave that does not want to commit sends a “Not Ready” message. This may happen
when the slave has conflicting concurrent transactions or there is a timeout.

Phase 2: Commit/Abort Phase
 After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the
controlling site.
o When the controlling site receives “Commit ACK” message from all the slaves, it
considers the transaction as committed.
 After the controlling site has received the first “Not Ready” message from any slave −

o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message to the
controlling site.
o When the controlling site receives “Abort ACK” message from all the slaves, it
considers the transaction as aborted.

 Distributed Three-phase Commit

The steps in distributed three-phase commit are as follows −

Phase 1: Prepare Phase
The steps are same as in distributed two-phase commit.

Phase 2: Prepare to Commit Phase

 The controlling site issues an “Enter Prepared State” broadcast message.
 The slave sites vote “OK” in response.
Phase 3: Commit / Abort Phase

The steps are same as two-phase commit except that “Commit ACK”/“Abort ACK” message is
not required.

Locking-based concurrency control systems can use either one-phase or two-phase locking
protocols.
1. One-phase Locking Protocol: In this method, each transaction locks an item
before use and releases the lock as soon as it has finished using it. This locking
method provides for maximum concurrency but does not always enforce
serializability.

2. Two-phase Locking Protocol: In this method, all locking operations precede
the first lock-release or unlock operation. The transaction comprises two
phases. In the first phase, a transaction only acquires all the locks it needs and
does not release any lock. This is called the expanding or the growing phase. In
the second phase, the transaction releases the locks and cannot request any
new locks. This is called the shrinking phase.

Every transaction that follows two-phase locking protocol is guaranteed to be serializable.
However, this approach provides low parallelism between two conflicting transactions.

 Timestamp Concurrency Control Algorithms: Timestamp-based concurrency
control algorithms use a transaction’s timestamp to coordinate concurrent access to a data
item to ensure serializability. A timestamp is a unique identifier given by DBMS to a
transaction that represents the transaction’s start time.

These algorithms ensure that conflicting operations of transactions are processed in the order
dictated by their timestamps. An older transaction is ordered before a younger transaction,
since the older transaction enters the system before the younger one.
Timestamp-based concurrency control techniques generate serializable schedules such that
the equivalent serial schedule is arranged in order of the age of the participating transactions.
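
One common way to realise this idea is basic timestamp ordering, sketched below in Python. The notes do not spell out the rules, so the per-item read and write timestamps, the class names, and the abort-on-late-operation behaviour are assumptions taken from the usual textbook formulation, not a description of any particular DBMS.

```python
from dataclasses import dataclass


@dataclass
class Item:
    value: object = None
    read_ts: int = 0   # largest timestamp of any transaction that read the item
    write_ts: int = 0  # timestamp of the transaction that last wrote the item


class Abort(Exception):
    """Raised when an operation arrives too late; the transaction must restart."""


def read(item: Item, ts: int):
    # A read is too late if a younger transaction has already written the item.
    if ts < item.write_ts:
        raise Abort(f"read by T{ts} rejected, item written by T{item.write_ts}")
    item.read_ts = max(item.read_ts, ts)
    return item.value


def write(item: Item, ts: int, value) -> None:
    # A write is too late if a younger transaction has already read or written.
    if ts < item.read_ts or ts < item.write_ts:
        raise Abort(f"write by T{ts} rejected, item used by a younger transaction")
    item.value = value
    item.write_ts = ts
```

For example, if a transaction with timestamp 2 writes an item and a transaction with timestamp 1 then tries to read it, the read raises `Abort`, because accepting it would contradict the serial order "older before younger".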

Optimistic Concurrency Control Algorithm : In systems with low conflict rates, the task
of validating every transaction for serializability may lower performance. In these cases, the
test for serializability is postponed to just before commit. Since the conflict rate is low, the
probability of aborting transactions which are not serializable is also low. This approach is
called optimistic concurrency control technique.
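
The idea of postponing the serializability test until just before commit can be sketched in Python as follows. This is a single-process illustration under stated assumptions: the `Store` and `Transaction` classes and the use of per-key version numbers for validation are hypothetical choices, and the commit step is assumed to run atomically.

```python
class Conflict(Exception):
    """Raised at commit time when validation fails."""


class Store:
    def __init__(self):
        self.data = {}     # key -> committed value
        self.version = {}  # key -> number of committed writes to that key


class Transaction:
    def __init__(self, store: Store):
        self.store = store
        self.read_versions = {}  # key -> version seen at first read
        self.writes = {}         # key -> value buffered in private memory

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        self.read_versions.setdefault(key, self.store.version.get(key, 0))
        return self.store.data.get(key)

    def write(self, key, value):
        # No locks are taken; changes stay private until commit.
        self.writes[key] = value

    def commit(self):
        # Validation: everything this transaction read must be unchanged.
        for key, seen in self.read_versions.items():
            if self.store.version.get(key, 0) != seen:
                raise Conflict(key)
        # Write-back: publish the buffered changes and bump the versions.
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.version.get(key, 0) + 1
```

If two transactions both read the same key and then write it, the first to commit succeeds and the second fails validation and has to be restarted. With a low conflict rate such restarts are rare, which is the situation this technique targets.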

In this approach, a transaction’s life cycle is divided into the following three phases −
 Execution Phase − A transaction fetches data items to memory and performs
operations upon them.
 Validation Phase − A transaction performs checks to ensure that committing its
changes to the database passes the serializability test.
 Commit Phase − A transaction writes back the modified data items in memory to the disk.

CONCURRENCY CONTROL
Concurrency control in a distributed system is achieved by a program called the
scheduler. The scheduler helps to order the operations of transactions in such a way that the
resulting log is serializable. There are two types of concurrency control: the locking
approach and the non-locking approach.

VARIOUS APPROACHES FOR CONCURRENCY CONTROL.

 LOCKING BASED CONCURRENCY CONTROL PROTOCOLS

Locking-based concurrency control protocols use the concept of locking data items. A lock is a
variable associated with a data item that determines whether read/write operations can be
performed on that data item. Generally, a lock compatibility matrix is used which states
whether a data item can be locked by two transactions at the same time.

QUERY PROCESSING IN DISTRIBUTED DBMS
Query processing in a distributed database management system requires the
transmission of data between the computers in a network. A distribution strategy for a
query is the ordering of data transmissions and local data processing in a database system.

Distributed query processing is the procedure of answering queries (which means
mainly read operations on large data sets) in a distributed environment where data is
managed at multiple sites in a computer network. Query processing involves the
transformation of a high-level query (e.g., formulated in SQL) into a query execution plan
(consisting of lower-level query operators in some variation of relational algebra) as well as
the execution of this plan. The goal of the transformation is to produce a plan which is
equivalent to the original query (returning the same result) and efficient, i.e., to minimize
resource consumption like total costs or response time.
1. Costs (Transfer of data) of Distributed Query processing:
In distributed query processing, the data transfer cost is the cost of transferring
intermediate files to other sites for processing plus the cost of transferring the final
result files to the site where the result is required.
Commonly, the data transfer cost is calculated in terms of the size of the messages. By using
the below formula, we can calculate the data transfer cost:
Data transfer cost = C * Size
where C is the cost per byte of data transferred and Size is the number of bytes
transmitted.
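
The formula can be applied to compare distribution strategies for the same query. The following Python sketch does so for two hypothetical strategies. The per-byte cost and all byte counts are invented for illustration and do not come from any measurement.

```python
def data_transfer_cost(cost_per_byte: int, size_bytes: int) -> int:
    """Data transfer cost = C * Size."""
    return cost_per_byte * size_bytes


def strategy_cost(cost_per_byte: int, transfers: list[int]) -> int:
    """Total cost of a strategy: sum over all files it ships between sites."""
    return sum(data_transfer_cost(cost_per_byte, size) for size in transfers)


C = 2  # hypothetical cost units per byte

# Strategy A: ship an entire 1,000,000-byte relation to the result site.
cost_a = strategy_cost(C, [1_000_000])

# Strategy B: ship a 40,000-byte intermediate file to the other site,
# then ship the 10,000-byte final result to the result site.
cost_b = strategy_cost(C, [40_000, 10_000])

cheapest = "B" if cost_b < cost_a else "A"
```

Under these assumed sizes, strategy B costs 100,000 units against 2,000,000 for strategy A, so shipping a small intermediate file plus the final result is the cheaper plan.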
