
Unit – V

Distributed Database Systems

A distributed database is a database that is under the control of a central database management
system (DBMS) in which storage devices are not all attached to a common CPU. It may be
stored in multiple computers located in the same physical location, or may be dispersed over a
network of interconnected computers.

Collections of data (e.g., in a database) can be distributed across multiple physical locations. A
distributed database can reside on network servers on the Internet, on corporate intranets or
extranets, or on other company networks. Replication and distribution of databases improve
database performance at end-user worksites. [1]

To ensure that the distributed databases stay up to date, there are two processes: replication
and duplication. Replication uses specialized software that looks for changes in the distributed
databases. Once the changes have been identified, the replication process makes all the
databases look the same. Depending on the size and number of the distributed databases,
replication can be complex and can consume considerable time and computer resources.
Duplication, on the other hand, is less complicated. It identifies one database as a master and
then duplicates that database. Duplication is normally done at a set time after hours, so that
each distributed location has the same data. During the duplication process, changes are allowed
only to the master database, so that local data will not be overwritten. Both processes can keep
the data current in all distributed locations. [2]
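
The difference between the two processes can be illustrated with a minimal Python sketch. It is not taken from any particular DBMS: each database is modeled as a plain dictionary, and the function names `replicate` and `duplicate`, the site names, and the sample rows are all hypothetical. Replication is simplified here to a one-way comparison against a single source copy.

```python
import copy

def replicate(source, target):
    """Detect rows that differ between source and target, then apply only those changes."""
    changed = {key: row for key, row in source.items() if target.get(key) != row}
    removed = [key for key in target if key not in source]
    for key, row in changed.items():
        target[key] = copy.deepcopy(row)
    for key in removed:
        del target[key]
    return len(changed) + len(removed)  # number of changes applied

def duplicate(master, sites):
    """Overwrite every site's copy with a full snapshot of the master."""
    for name in sites:
        sites[name] = copy.deepcopy(master)

# Hypothetical data: a master copy and remote copies that have drifted.
master = {"P1": {"BUDGET": 150000}, "P2": {"BUDGET": 135000}}

site_a = {"P1": {"BUDGET": 100000}, "P9": {"BUDGET": 1}}
changes = replicate(master, site_a)
print(changes, site_a == master)  # 3 True

sites = {"site_b": {}, "site_c": {"P1": {"BUDGET": 0}}}
duplicate(master, sites)
print(all(db == master for db in sites.values()))  # True
```

In this sketch replication has to compare the copies row by row before it can apply anything, which is where the extra time and resources go, while duplication simply copies the master wholesale.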

Besides replication and fragmentation, there are many other distributed database design
technologies, for example local autonomy and synchronous and asynchronous distributed
database technologies. How these technologies are implemented depends on the needs of the
business and the sensitivity or confidentiality of the data to be stored in the database, and
hence on the price the business is willing to pay to ensure data security, consistency and
integrity.

Distributed Database Design

Introduction: Design Strategies

Alternative design strategies
» approaches of distribution design
– Top-down design
  » suitable when designing systems from scratch
  » mostly in homogeneous systems
  » main focus in this chapter
– Bottom-up design
  » suitable when DBs already exist at a number of sites
  » mostly in heterogeneous systems

Distribution design
» design the local conceptual schemas by distributing entities over the sites of DCS
– fragmentation
– allocation

Reasons for fragmentation

– a relation may not be a suitable unit of distribution
  » application views are usually subsets of relations, i.e., locality or proximity
– permits a number of transactions to execute concurrently, i.e., transactions that access
  different portions of a relation
  » inter-query concurrency
  » intra-query concurrency, i.e., parallel execution of a single query

Disadvantages of fragmentation
– may require extra processing, e.g., join
  » for views that cannot be defined on a single fragment
– semantic data control is more difficult
  » especially integrity enforcement

Fragmentation alternatives
– horizontal fragmentation
– vertical fragmentation

(Ex) Horizontal fragmentation

PROJ

PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1: projects with budgets less than $200,000
PROJ2: projects with budgets greater than or equal to $200,000

PROJ1

PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York

PROJ2

PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston
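
The PROJ example can be reproduced with a short Python sketch. The helper name `horizontal_fragment` and the tuple representation of rows are assumptions made for illustration. The sketch selects rows with the two budget predicates and then checks that the fragments are disjoint and that their union gives back the original relation.

```python
# PROJ rows as (PNO, PNAME, BUDGET, LOC), copied from the example.
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
    ("P5", "CAD/CAM", 500000, "Boston"),
]
BUDGET = 2  # position of the BUDGET column in each row

def horizontal_fragment(relation, predicate):
    """Return the rows of the relation that satisfy the predicate (hypothetical helper)."""
    return [row for row in relation if predicate(row)]

PROJ1 = horizontal_fragment(PROJ, lambda row: row[BUDGET] < 200000)
PROJ2 = horizontal_fragment(PROJ, lambda row: row[BUDGET] >= 200000)

# No row appears in both fragments, and the union restores PROJ.
disjoint = not (set(PROJ1) & set(PROJ2))
complete = sorted(PROJ1 + PROJ2) == sorted(PROJ)

print([row[0] for row in PROJ1])  # ['P1', 'P2']
print([row[0] for row in PROJ2])  # ['P3', 'P4', 'P5']
print(disjoint, complete)         # True True
```

Because the two predicates are complementary, every row falls into exactly one fragment. This is why a plain union is enough to reconstruct PROJ here.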

Query Processing

Introduction
SQL query processing requires that the DBMS identify and execute a strategy for retrieving the results of the query.
The SQL query determines what data is to be found, but does not define the method by which the data manager
searches the database. Hence, query optimization is necessary for high-level relational queries and provides an
opportunity for the DBMS to systematically evaluate alternative query execution strategies and to choose an optimal
strategy. In some cases the data manager cannot determine the optimal strategy. Assumptions are made which are
predicated on the actual structure of the SQL query. These assumptions can significantly affect the query
performance. This implies that certain queries can exhibit significantly different response times for relatively
innocuous changes in query syntax and structure.
For the purpose of this discussion an example medical database will be used. Figure 1 below illustrates our subject
database schema for physicians, patients, and medical services. The Physician table contains one row for every
physician in the system. Various attributes describe the physician name, address, provider number and specialty.
The Patient table contains one row for every individual in the system. Patients have attributes listing their social
security number, name, residence area, age, gender, and doctor. For simplicity, a physician can see many patients,
but a patient has only one doctor. A Services table exists which lists all the valid medical procedures which can be
performed. When a patient is ill and under the care of a physician, a row exists in the Treatment table describing
the prescribed treatment. This table contains one attribute recording the cost of the individual service and a
compound key that identifies the patient, physician, and the specific service received.
Query Processing
The steps necessary for processing an SQL query are shown in Figure 2. The SQL query statement is first parsed
into its constituent parts. The basic SELECT statement is formed from the three clauses SELECT, FROM, and
WHERE. These parts identify the various tables and columns that participate in the data selection process. The
WHERE clause is used to determine the order and precedence of the various attribute comparisons through a
conditional expression. An example query to determine the names and addresses of all patients of Doctor 1234 is
shown as query Q1 below. The WHERE clause uses a conjunctive clause which combines two attribute
comparisons. More complex conditions are possible.
Q1: SELECT Name, Address, Dr_Name
FROM Patient, Physician
WHERE Patient.Doctor = Physician.Provider AND Physician.Provider = 1234
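
To see Q1 run, the following sketch uses Python's built-in `sqlite3` module on an in-memory database. The table definitions are assumptions: only the column names that appear in Q1 come from the text, while `SSN`, `Specialty`, and all sample rows are invented for illustration. An `ORDER BY` is added so that the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema: only the columns used in Q1 are taken from the text.
    CREATE TABLE Physician (Provider INTEGER PRIMARY KEY, Dr_Name TEXT, Specialty TEXT);
    CREATE TABLE Patient (SSN TEXT PRIMARY KEY, Name TEXT, Address TEXT,
                          Doctor INTEGER REFERENCES Physician(Provider));
    INSERT INTO Physician VALUES (1234, 'Dr. A', 'Cardiology'), (5678, 'Dr. B', 'Pediatrics');
    INSERT INTO Patient VALUES
        ('001', 'Pat One', '1 First St', 1234),
        ('002', 'Pat Two', '2 Second St', 5678),
        ('003', 'Pat Three', '3 Third St', 1234);
""")

Q1 = """
    SELECT Name, Address, Dr_Name
    FROM Patient, Physician
    WHERE Patient.Doctor = Physician.Provider AND Physician.Provider = 1234
    ORDER BY Name
"""
rows = conn.execute(Q1).fetchall()
for row in rows:
    print(row)
# ('Pat One', '1 First St', 'Dr. A')
# ('Pat Three', '3 Third St', 'Dr. A')

# Ask SQLite which strategy it chose; the wording of the plan varies by version.
plan = conn.execute("EXPLAIN QUERY PLAN " + Q1).fetchall()
for step in plan:
    print(step)
```

The query text only states which rows are wanted. The `EXPLAIN QUERY PLAN` output shows the access strategy that this particular engine selected for it.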
The query optimizer has the task of determining the optimum query execution plan. The term “optimizer” is actually
a misnomer, because in many cases the optimum strategy is not found. The goal is to find a reasonably efficient
strategy for executing the query. Finding the perfect strategy is usually too time consuming and can require detailed
information on both the data storage structure and the actual data content. Usually this information is simply not
available.
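
The idea of choosing among alternative strategies can be made concrete with a toy comparison in Python. This is not how a real optimizer works internally: the two functions, the sample data, and the use of a comparison count as the "cost" are all assumptions for illustration. Both strategies answer Q1 but do different amounts of work.

```python
def join_then_filter(patients, physicians, provider):
    """Strategy A: pair every patient with every physician, testing both conditions."""
    comparisons = 0
    result = []
    for pat in patients:
        for doc in physicians:
            comparisons += 1
            if pat["Doctor"] == doc["Provider"] and doc["Provider"] == provider:
                result.append((pat["Name"], doc["Dr_Name"]))
    return result, comparisons

def filter_then_join(patients, physicians, provider):
    """Strategy B: select the requested physician first, then match patients to it."""
    comparisons = 0
    selected = []
    for doc in physicians:
        comparisons += 1
        if doc["Provider"] == provider:
            selected.append(doc)
    result = []
    for pat in patients:
        for doc in selected:
            comparisons += 1
            if pat["Doctor"] == doc["Provider"]:
                result.append((pat["Name"], doc["Dr_Name"]))
    return result, comparisons

# Hypothetical sample data: three physicians, four patients.
physicians = [{"Provider": p, "Dr_Name": f"Dr. {p}"} for p in (1234, 5678, 9012)]
patients = [{"Name": f"Patient {i}", "Doctor": d}
            for i, d in enumerate((1234, 5678, 1234, 9012))]

rows_a, cost_a = join_then_filter(patients, physicians, 1234)
rows_b, cost_b = filter_then_join(patients, physicians, 1234)
print(rows_a == rows_b, cost_a, cost_b)  # True 12 7
```

The two strategies return the same rows, yet the second needs fewer comparisons because it applies the selective condition before the join. A real optimizer estimates such costs from statistics rather than by running each plan.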
Once the execution plan is established the query code is generated. Various techniques such as memory
management, disk caching and parallel query execution can be used to improve the query performance. However,
if the plan is not correct, then the query performance cannot be optimum.
