
Distributed Databases

Submitted by

Indu Saini

(Research Scholar)

IIT Roorkee

Enrollment No. : 10926003


Contents

1. Abstract
2. Introduction to Distributed Databases
3. Types of Distributed Databases
4. Advantages of Distributed Databases
5. Disadvantages or Challenges of Distributed Databases
6. Distributed Database Design Techniques
7. Conclusion
8. References

1. ABSTRACT

This paper introduces the concept of distributed databases, their advantages
over centralized database systems, the different types of distributed
databases, and the challenges that network managers face in managing them. It
also describes the basic techniques for distributing data over a communication
network.

2. INTRODUCTION TO DISTRIBUTED DATABASES

In today’s world of universal dependence on information systems, all sorts of
people need access to companies’ databases. In addition to a company’s own
employees, these include the company’s customers, potential customers,
suppliers, and vendors of all types. It is possible for a company to have all of
its databases concentrated at one mainframe computer site, with worldwide access
to this site provided by telecommunications networks, including the Internet.
Such a centralized system and its databases can be managed in a well-contained
manner, which is an advantage, but centralization poses some problems as well.
For example, if the single site goes down, everyone is blocked from accessing
the databases until the site comes back up. Also, the communication costs from
the many far-flung PCs and terminals to the central site can be high. One
solution to such problems, and an alternative design to the centralized
database concept, is the distributed database.

The idea is that instead of having one centralized database, we spread the data
out among the sites on the distributed network (for example, offices in
different cities), each of which has its own computer and data storage
facilities. All of this distributed data is still considered to be a single
logical database. When a person or process anywhere on the distributed network
queries the database, it does not need to know where on the network the data
being sought is located. The user just issues the query, and the result is
returned. This feature is known as location transparency. Providing it can
become complex very quickly, so it must be managed by sophisticated software
known as a distributed database management system, or distributed DBMS (DDBMS).
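
Location transparency can be illustrated with a minimal Python sketch. The
`Site` and `DistributedDatabase` classes, the site names, and the sample
records are all hypothetical and exist only for illustration. A real DDBMS
would typically consult a catalog rather than probe every site as this toy
version does.

```python
from typing import Optional


class Site:
    """One node of the network, with its own local data store."""

    def __init__(self, name: str, rows: dict) -> None:
        self.name = name
        self.rows = rows  # primary key -> record held at this site


class DistributedDatabase:
    """Hypothetical front end that presents all sites as one logical database."""

    def __init__(self, sites: list) -> None:
        self.sites = sites

    def query(self, key: str) -> Optional[dict]:
        # The caller supplies only the key; the search over sites is hidden.
        for site in self.sites:
            if key in site.rows:
                return site.rows[key]
        return None


# Made-up sample data: two sites, each holding different customers.
site_a = Site("site_a", {"C001": {"name": "Customer One"}})
site_b = Site("site_b", {"C002": {"name": "Customer Two"}})
db = DistributedDatabase([site_a, site_b])

# The record is found at site_b, although the caller never names a site.
print(db.query("C002"))
```

The caller's code would stay the same if the record for `C002` were moved to
another site, which is the point of location transparency.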

A distributed database is a data collection that satisfies the following
assumptions: it resides on more than one machine with computational power; the
machines are connected by a communication network; and it is managed by a
distributed database management system that lets users feel they are working on
the entire database and lets them declare what they want, not how to obtain it.
Practical experience suggests that there are strong reasons why a distributed
system, to be feasible, has to be relational. A typical distributed database
system (DDBS) consists of processing elements (nodes), communication links
(edges), memory units, databases, and programs. These resources are
interconnected via a communication network that dictates how information flows
between nodes. Programs residing on some nodes can run using databases at other
nodes.

Fig: Distributed Databases

A distributed database management system has to support local applications at
each computational station as well as global applications that span several
machines. For application development, it has to provide a high-level query
language with the means to build distributed queries. Its transparency levels
must give the impression of a single database.

3. TYPES OF DISTRIBUTED DATABASES

Homogeneous and Heterogeneous Distributed Database Systems

A homogeneous distributed database system is a network of two or more
databases that reside on one or more machines and that all use, locally, the
same DBMS product. An application can simultaneously access or modify the data
in several databases in a single distributed environment. For a client
application, the location and platform of the databases are transparent.

In a heterogeneous system, sites may run different DBMS products, which need
not be based on the same underlying data model, and so the system may be
composed of relational, network, hierarchical, and object-oriented DBMSs.

Homogeneous systems are much easier to design and manage. This approach
provides incremental growth, making the addition of a new site to the DDBMS
easy, and allows increased performance by exploiting the parallel processing
capability of multiple sites.

Heterogeneous systems usually result where individual sites have implemented
their own databases and integration is considered at a later stage. In a
heterogeneous system, translations are required to allow communication
between different DBMSs. The typical solution used by some relational systems
that are part of a heterogeneous DDBMS is to use gateways, which convert the
language and model of each different DBMS into the language and model of the
relational system. However, the gateway approach has a serious limitation: a
gateway is in effect only a query translator from one language to another, so
it may not support transaction management.

4. ADVANTAGES OF DISTRIBUTED DATABASES

The distribution of data has potential advantages over traditional centralized
database systems:

• Reflects organizational structure: Allowing the structure of the
database to mirror the structure of the enterprise is probably the major
benefit of distributed systems. Many organizations are naturally
distributed, at least logically (into several divisions, departments,
workgroups, etc.) and very likely physically (into plants, factories,
laboratories, etc.). Thus, the data is usually distributed already as well,
because each organizational unit within the enterprise will naturally
maintain data that is relevant to its own operation.
• Improves performance and goes beyond capacity limits: Large
centralized databases can often exceed the capacity of the server
platform, resulting in hardware constraints on the database’s total size
and/or poor query performance. If the database is fragmented into
functional subsets that are spread across multiple hardware platforms but
logically still make up a single database, then the demand on data
storage and data processing for each individual database platform is less.
Because data is located at the site with the greatest processing demand for
it, database access may be faster than is achievable from a remote
centralized database. The database systems themselves are parallelized,
allowing load on the databases to be balanced among servers.
• Management of distributed data with different levels of
transparency: Transparency can be provided at various levels, and each level
requires a particular type of agreement between the participants. In the
fully transparent case, the sites must agree on the data model, the schema
interpretation, the data representation, the available functionality, and
where the data is located. In the service (non-transparent) model, there is
only agreement on the data exchange format and on the functions that are
provided by each site.
• Improves data availability and reliability: A single centralized database
often serves the needs of many different applications, and if that database
becomes unavailable, all of those applications become inoperative. A
database failure can be caused by the DBMS, the hardware, the operating
system, the network, or application software, and it can cause heavy
financial damage to a company. Distributed database systems are designed to
continue to function despite such failures. The effects of a database
failure can be eliminated, or at least reduced, by replicating the data from
the centralized database in each application-specific database. Critical
data may be replicated at different sites, making it available with higher
probability. This protects against unscheduled interruption of service by
removing a potential single point of failure. Also, if a node fails, the
system may be able to reroute the failed node’s requests to another node.
Multiple processors also open the door to improved performance. For
instance, a query can be executed in parallel at several sites.
• Combining heterogeneous data sources: Many large organizations
would like to preserve their IT investments. Over the years, many of these
companies have installed and now use two or more different database
management systems, which are often incompatible. A distributed database
can combine these sources without replacing them.
• Provides system flexibility and scalability: Distributed database
systems are much more flexible than centralized database systems, so it is
much easier to handle expansion. Increasing database size can usually be
handled by adding processing and storage power to the network. Also,
new sites can be added to the network without affecting the operations of
other sites.
• Local autonomy: A department can control the data about itself. This means
that:
   • Local data is locally owned and managed;
   • Local operations remain purely local;
   • All operations at a given site are controlled by that site.
5. DISADVANTAGES OR CHALLENGES OF DISTRIBUTED DATABASES

Distributed database systems have several disadvantages or challenges:

• Complexity: A distributed database system that hides the distributed
nature of the data from systems designers and users is inherently more
complex than a centralized database system. Besides the normal
difficulties, the design of a distributed database has to consider
fragmentation of data, allocation of fragments to specific sites, and data
replication. Replicated data adds an extra level of complexity. If the
system is not adequately designed, performance, reliability, and
availability will be unacceptable, and the advantages cited above
will become disadvantages. Also worth recalling here are the problems
arising from optimal data fragmentation and allocation, data
conflict resolution, referential integrity, deferred transaction resolution,
and so on.
• Security: In a centralized system, access to data can be easily controlled.
In a distributed DBMS, however, not only does access to replicated data
have to be controlled in multiple locations, but the network itself has to
be made secure. In the past, networks were regarded as an insecure
communication medium. Although this is still partially true, significant
developments have made networks more secure.
• Difficult to maintain integrity: In a distributed database, enforcing
integrity over a network may require too much of the network's resources
to be feasible.
• Cost: Increased complexity means that the procurement and maintenance
costs for a distributed DBMS can be higher than those for a centralized
DBMS. Furthermore, a distributed database environment usually requires
additional hardware to establish a network between sites. There are
ongoing communication costs incurred with the use of this network.
There are also additional labor costs to manage and maintain the local
DBMSs and the underlying network. Thus, total cost is a function of
network configuration, the user work load, the data allocation strategy,
and the query optimization algorithm.
• Concurrency control: An important consideration in the design of
distributed systems is concurrency control. Concurrency control is the
portion of the system that decides what actions should be taken in response
to requests by individual processes to read from and write to the database.
It is concerned with avoiding deadlocks or similar occurrences and with
maintaining the consistency of the database. Its job is to ensure that,
during the concurrent operation of any set of processes:

1. Each process sees a consistent picture of the database.

2. Each process eventually terminates.

3. The final database after all the processes terminate is consistent.

In a distributed setting, concurrency control must maintain the global
consistency of the entire distributed database and must ensure that each
process terminates.
• Lack of experience in the more complex database design: Besides the
normal difficulties of designing a centralized database, the design of a
distributed one has to take account of fragmentation of data, allocation of
fragments to specific sites, and data replication. A significant deterrent
may be the fact that we do not have the same level of experience as with
centralized DBMSs. There are also no tools or methodologies to help
developers convert a centralized DBMS into a distributed one.
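
The concurrency-control requirement that every process eventually terminates
implies that deadlocks must be avoided or detected. One common way to detect
them, which the text above does not prescribe and is used here only as an
assumed example, is to look for a cycle in a wait-for graph. The Python sketch
below assumes the wait-for information from all sites has already been merged
into one dictionary. The function name and transaction labels are hypothetical.

```python
def has_deadlock(waits_for):
    """Return True if the wait-for graph contains a cycle.

    `waits_for` maps a transaction name to the set of transactions
    it is currently waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in waits_for.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return True  # reached a node already on the path: cycle
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in waits_for)


# T1 waits for T2 and T2 waits for T1: neither can ever terminate.
print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))
# T1 waits for T2, which waits for nothing: both can finish.
print(has_deadlock({"T1": {"T2"}, "T2": set()}))
```

When a cycle is found, a system would typically abort one transaction in the
cycle so that the others can proceed.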
6. DISTRIBUTED DATABASE DESIGN TECHNIQUES

Data is distributed by partitioning the database tables into fragments using
various techniques. Replication or duplication is then used to copy the data
to various locations so that it is closer to the users who access it.

6.1. FRAGMENTATION

The main reasons for fragmenting relations are to increase the locality of
reference of the queries submitted to the database, improve the reliability and
availability of data and the performance of the system, balance storage
capacities, and minimize communication costs among sites.

Fragmentation is a design technique that divides a single relation or class of
a database into two or more partitions such that combining the partitions
yields the original relation or class without any loss of information. This
reduces the amount of irrelevant data accessed by the applications of the
database, thus reducing the number of disk accesses. Fragmentation can be
horizontal, vertical, or mixed/hybrid.

Horizontal fragmentation (HF): allows a relation or class to be partitioned
into disjoint sets of tuples or instances. An example of horizontal
fragmentation of a table is given below:

Fig: Horizontal Fragmentation
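
Horizontal fragmentation can also be sketched in Python. The `EMPLOYEES`
relation, its column names, and the choice of `branch` as the fragmentation
attribute are made up for illustration and are not taken from the figure.

```python
from collections import defaultdict

# Made-up sample relation; column names are illustrative only.
EMPLOYEES = [
    {"emp_id": 1, "name": "Emp One", "branch": "north", "salary": 100},
    {"emp_id": 2, "name": "Emp Two", "branch": "south", "salary": 120},
    {"emp_id": 3, "name": "Emp Three", "branch": "north", "salary": 90},
]


def horizontal_fragments(rows, attribute):
    """Group whole rows by the value of one attribute (one fragment per value)."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row[attribute]].append(row)
    return dict(fragments)


def reconstruct(fragments):
    """Union of all fragments, ordered by primary key."""
    merged = [row for rows in fragments.values() for row in rows]
    return sorted(merged, key=lambda row: row["emp_id"])


fragments = horizontal_fragments(EMPLOYEES, "branch")
print(sorted(fragments))                    # names of the fragments
print(reconstruct(fragments) == EMPLOYEES)  # union restores the relation
```

Each row lands in exactly one fragment, so the fragments are disjoint and
their union restores the original relation.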

Vertical fragmentation (VF): allows a relation or class to be partitioned into
disjoint sets of columns or attributes, except for the primary key, which is
repeated in every fragment so that the relation can be rebuilt. An example of
vertical fragmentation of a table is given below:

Fig: Vertical Fragmentation
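
A vertical split can be sketched in Python in the same spirit. The relation,
the column names, and the helper functions `vertical_fragment` and
`join_fragments` are hypothetical. The sketch assumes `emp_id` is the primary
key and that every key appears in both fragments.

```python
# Made-up sample relation; column names are illustrative only.
EMPLOYEES = [
    {"emp_id": 1, "name": "Emp One", "branch": "north", "salary": 100},
    {"emp_id": 2, "name": "Emp Two", "branch": "south", "salary": 120},
]


def vertical_fragment(rows, key, columns):
    """Project each row onto `columns`, always keeping the primary key."""
    return [{key: row[key], **{col: row[col] for col in columns}} for row in rows]


def join_fragments(left, right, key):
    """Rebuild full rows by matching the two fragments on the primary key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


personal = vertical_fragment(EMPLOYEES, "emp_id", ["name", "branch"])
payroll = vertical_fragment(EMPLOYEES, "emp_id", ["salary"])

print(payroll)
print(join_fragments(personal, payroll, "emp_id") == EMPLOYEES)
```

Repeating the key in both fragments is what makes the join possible. Without
it there would be no way to tell which salary belongs to which employee.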

Mixed or hybrid fragmentation (MF), which combines horizontal and vertical
fragmentation, has also been proposed. An example of hybrid or mixed
fragmentation of a table is given below:

Fig: Hybrid Fragmentation
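
As a rough Python sketch of hybrid fragmentation, the relation below is first
split horizontally by branch, and one of the resulting fragments is then split
vertically. The data, column names, and the `select` and `project` helpers are
made up for illustration. Applying the horizontal split first is an arbitrary
choice of order.

```python
# Made-up sample relation; column names are illustrative only.
EMPLOYEES = [
    {"emp_id": 1, "name": "Emp One", "branch": "north", "salary": 100},
    {"emp_id": 2, "name": "Emp Two", "branch": "south", "salary": 120},
    {"emp_id": 3, "name": "Emp Three", "branch": "north", "salary": 90},
]


def select(rows, predicate):
    """Horizontal step: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Vertical step: keep only the named columns of each row."""
    return [{col: row[col] for col in columns} for row in rows]


# Step 1: horizontal split by branch.
north = select(EMPLOYEES, lambda row: row["branch"] == "north")
south = select(EMPLOYEES, lambda row: row["branch"] == "south")

# Step 2: vertical split of the "north" fragment only; emp_id is kept in both.
north_personal = project(north, ["emp_id", "name", "branch"])
north_payroll = project(north, ["emp_id", "salary"])

# Reconstruction: join the vertical pieces, then union with the other rows.
payroll_by_id = {row["emp_id"]: row for row in north_payroll}
north_rebuilt = [{**row, **payroll_by_id[row["emp_id"]]} for row in north_personal]
rebuilt = sorted(north_rebuilt + south, key=lambda row: row["emp_id"])

print(rebuilt == EMPLOYEES)
```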

6.2. REPLICATION / DUPLICATION

Replication and duplication are two processes that ensure the distributed
databases are kept up to date.

Replication involves using specialized software that looks for changes in the
distributed databases. Once the changes have been identified, the replication
process makes all the databases look the same. Depending on the size and
number of the distributed databases, replication can be very complex and can
require a lot of time and computer resources.

Duplication, on the other hand, is not as complicated. It identifies one
database as a master and then duplicates that database. The duplication process
is normally run at a set time after hours, which ensures that each distributed
location has the same data. In the duplication process, changes are allowed
only to the master database, so that local data will not be overwritten. Both
processes can keep the data current at all distributed locations.
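
The difference between the two processes can be sketched in Python. This is a
deliberately simplified model with a single writable source. The `Database`
class, its `pending` change list, and the `duplicate` and `replicate`
functions are hypothetical names rather than the API of any real product.

```python
import copy


class Database:
    """A tiny key-value store that records its own pending changes."""

    def __init__(self):
        self.data = {}
        self.pending = []  # (key, value) changes not yet sent to other sites

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))


def duplicate(master):
    """Duplication: produce a full copy of the master database."""
    return copy.deepcopy(master.data)


def replicate(source, copies):
    """Replication: apply only the identified changes to every copy."""
    for key, value in source.pending:
        for target in copies:
            target[key] = value
    source.pending.clear()


master = Database()
live_copy = {}

master.write("widget", 10)
replicate(master, [live_copy])    # change propagated as it is identified
nightly_copy = duplicate(master)  # full copy taken at a set time

master.write("widget", 7)
replicate(master, [live_copy])

print(live_copy == master.data)     # replicated copy follows the change
print(nightly_copy == master.data)  # duplicate is stale until the next run
```

The duplicated copy is simpler to produce but lags behind the master between
runs, while the replicated copy stays current at the cost of tracking and
shipping every change.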

The term replication refers to the operation of copying and maintaining database
objects in multiple databases belonging to a distributed system. While
replication relies on distributed database technology, database replication
offers applications benefits that are not possible within a pure distributed
database environment. Replication uses distributed database technology to share
data between multiple sites, but a replicated database and a distributed
database are not the same. In a distributed database, data is available at many
locations, but a particular table resides at only one location. Replication
means that the same data is available at multiple locations.

Some of the common reasons for using replication are availability,
performance, and network load reduction. Replication improves the availability
of applications because it provides them with alternative data access options.
If one site becomes unavailable, users can continue to query or even update the
remaining locations. In other words, replication provides excellent failover
protection.
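
The availability gain from replication can be put into numbers with a small
Python sketch. It assumes that each site is up with the same probability and
that sites fail independently of one another. Neither assumption comes from
the text above, and real failures are often correlated. The function name and
the sample probability of 0.9 are illustrative only.

```python
def availability(site_availability, replicas):
    """Probability that at least one of `replicas` independent sites is up."""
    if replicas < 1:
        raise ValueError("at least one replica is required")
    all_down = (1.0 - site_availability) ** replicas
    return 1.0 - all_down


for n in (1, 2, 3):
    print(n, format(availability(0.9, n), ".4f"))
```

Under these assumptions, each additional replica multiplies the chance of a
total outage by the single-site failure probability.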

Replication provides fast, local access to shared data because it balances activity
over multiple sites. Some users can access one server while other users access
other servers, thereby reducing the load at all servers. Also, users can access
data from the replication site that has the lowest access cost, which is typically
the site that is geographically closest to them. Replication can be used to
distribute data over multiple regional locations. Then, applications can access
various regional servers instead of accessing one central server. This
configuration can reduce network load dramatically.
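
Choosing the replica with the lowest access cost can be sketched in a few
lines of Python. The site names and cost figures are made up, and how a real
system measures access cost (latency, hop count, tariff) is left open here.

```python
def pick_replica(costs, available):
    """Return the reachable replica site with the lowest access cost."""
    candidates = {site: cost for site, cost in costs.items() if site in available}
    if not candidates:
        raise LookupError("no replica is reachable")
    return min(candidates, key=candidates.get)


# Hypothetical access costs as seen from one user's location.
costs = {"local": 1.0, "regional": 5.0, "central": 20.0}

print(pick_replica(costs, {"local", "regional", "central"}))
# If the local site is down, the next cheapest replica is used instead.
print(pick_replica(costs, {"regional", "central"}))
```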

Most commonly, replication is used to improve local database performance and
to protect the availability of applications, because alternate data access
options exist. For example, an application may normally access a local database
rather than a remote server to minimize network traffic and achieve maximum
performance. Furthermore, the application can continue to function if the local
server fails, as long as other servers with replicated data remain accessible.
The replication of fragments improves the reliability and efficiency of
read-only queries but increases update cost.

7. CONCLUSION

Distributed databases have become a necessity as networks expand and
organizations perform geographically distributed operations. International
companies store their data at different sites of a computer network, possibly in
a variety of forms, ranging from flat files to hierarchical, relational, or
object-oriented databases. The network itself consists of a variety of
transmission media, network topologies, and network speeds. Design approaches
for distributed databases have to consider various factors that can affect
performance: CPU time, data transmission time, and disk I/O operation time. As
communication technology, hardware, and software protocols advance rapidly and
the prices of network equipment fall, developing distributed database systems
becomes more and more feasible. Communication networks make it feasible to
access remote data or databases, allowing the sharing of data among a
potentially large community of users. There is also a potential for increased
reliability: when one computer fails, data at other sites is still accessible.
Critical data may be replicated at different sites, making it available with
higher probability. Multiple processors also open the door to improved
performance.

8. REFERENCES

1. “Distributed Databases”, Wikipedia.org.
2. Min-Sheng Li and Deng-Jyi Chen, “The Reliability Problem in
Distributed Database Systems”, International Conference on Information,
Communications and Signal Processing (ICICS '97), Singapore, 9-12
September 1997.
3. Florin Dumitriu and Liviu Cretu, “Distributed Database
Technology. A Management Perspective”.
