06 Unit6
Unit 6
Structure:
6.1 Introduction
    Objectives
6.2 A Framework for Distributed Database Design
    6.2.1 Objectives of the Design of Data Distribution
    6.2.2 Top-Down and Bottom-Up Approach: Classical Design Methodologies
    Self Assessment Questions
6.3 The Design of Database Fragmentation
    6.3.1 Horizontal Fragmentation
        6.3.1.1 Primary Fragmentation
        6.3.1.2 Derived Horizontal Fragmentation
    6.3.2 Vertical Fragmentation
    6.3.3 Mixed Fragmentation
    Self Assessment Questions
6.4 The Allocation of Fragments
    Self Assessment Questions
6.5 Query Processing Problem
    Self Assessment Questions
6.6 Objectives of Query Processing
    Self Assessment Questions
6.7 Characterization of Query Processors
    Self Assessment Questions
6.8 Layers of Query Processing
    6.8.1 Query Decomposition
    6.8.2 Data Localization
    6.8.3 Global Query Optimization
    6.8.4 Local Query Optimization
6.1 Introduction
Data distribution is difficult to design and implement because of various technical and organizational issues, so an efficient design methodology is needed. From the technical aspect, the issues are the interconnection of sites and the appropriate distribution of data and applications to the sites, depending on the requirements of the applications and the need to optimize performance. From the organizational point of view, the issue of decentralization is crucial, and distributing an application has a great effect on the organization.
The increasing success of relational database technology in data processing is due, in part, to the availability of nonprocedural languages, which can significantly improve application development and end-user productivity. Query processing is important in both centralized and distributed systems. However, the query processing problem is much more difficult in distributed environments than in conventional systems. In particular, the relations involved in distributed queries may be fragmented and/or replicated, thereby inducing communication overhead costs.
Objectives:
By the end of Unit 6, learners will be able to describe:
o A framework for distributed database design
o The objectives of the design of data distribution
o The design of database fragmentation: horizontal fragmentation, vertical fragmentation and mixed fragmentation
o The various problems of query processing
o An ideal query processor
o The concept of layering in query processing
o Site of origin: the site from which the application is issued.
o The frequency of invoking the request at each site.
o The number, type and statistical distribution of accesses made by each application to each required data object.
6.2.1 Objectives of the Design of Data Distribution
In the design of data distribution, the following objectives should be considered.
o Processing Locality: The primary aim of data distribution is to maximize local references and thereby reduce remote references. This can be achieved by a redundant allocation of fragments that meets the requirements of each site. Complete locality is an extension of this idea, which simplifies the execution of the application.
o Availability and Reliability of Distributed Data: Availability is achieved by having multiple copies of the data for read-only applications. Reliability is achieved by storing multiple copies of the information, which helps in case of system crashes.
o Workload Distribution: Distributing the workload over the sites is a major goal, in order to achieve a high degree of parallelism.
o Storage Costs and Availability: Cost criteria and the availability of storage at each site should be handled intelligently for effective data distribution.
Using all of the above criteria at once may increase the design complexity. So the most important aspects are taken as objectives, depending on the requirements, and the others are treated as constraints. In the next section, let us look at a simple approach for maximizing processing locality.
6.2.2 Top-Down and Bottom-Up Approach: Classical Design Methodologies
There are two classical approaches to the design of distributed databases. They are:
Sikkim Manipal University Page No. 113
1. Top-Down Approach: This is useful when the system has to be designed from scratch. It follows these steps:
o Design of the Global Schema.
o Design of the Fragmentation Schema.
o Design of the Allocation Schema.
o Design of the Local Schemata (design of the physical databases).
2. Bottom-Up Approach: This can be used for an existing system. It is based on the integration of existing schemata into a single global schema, and requires the following:
o The selection of a common database model for describing the global schema of the database.
o The translation of each local schema into the common data model.
o The integration of the local schemata into a common global schema, i.e. the merging of common data definitions and the resolution of conflicts among different representations given to the same data.
Bottom-up design requires solving these three problems. The design steps are then the reverse of those of the top-down method.
Self Assessment Questions 6.2
1. ________ is the actual procedure of dividing the existing global relations into horizontal, vertical or mixed fragments.
2. In the objectives of the design of data distribution, ________ is an extended idea, which simplifies the execution of the application.
3. There are ________ classical approaches to the design of distributed databases.
4. The design of the Global Schema, Fragmentation Schema, Allocation Schema and Local Schema are the steps of the ________ approach.
The Vertical Partitioning Problem: Here the attribute sets of the fragments must be disjoint, except for one common attribute (typically the key), which is needed to reconstruct the relation. For example, assume that a relation S is vertically fragmented in this way into S1 and S2. This is useful when an application can be executed using either S1 or S2 alone, since keeping the complete S at that site would then be an unnecessary burden.
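The vertical partitioning of S into S1 and S2 can be sketched as two projections that share only the key, with a join on the key to reconstruct S. This is a minimal illustration only: the attribute names, the sample rows and the helper functions `project` and `join_on_key` are assumptions made for the example and do not come from the unit.

```python
# Hypothetical global relation S with key "id" (illustrative data only).
S = [
    {"id": 1, "name": "Asha", "city": "Gangtok", "salary": 40000},
    {"id": 2, "name": "Ravi", "city": "Delhi", "salary": 52000},
]


def project(relation, attributes):
    """Return the projection of a relation (list of dicts) on the given attributes."""
    return [{a: row[a] for a in attributes} for row in relation]


def join_on_key(left, right, key):
    """Reconstruct rows by joining two fragments on their common key attribute."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


# The key is the only common attribute; the other attribute sets are disjoint.
S1 = project(S, ["id", "name", "city"])
S2 = project(S, ["id", "salary"])

# Joining the fragments on the key gives back the original relation.
reconstructed = join_on_key(S1, S2, "id")
```

An application that needs only names and cities can run against S1 alone, while one that needs only salaries can run against S2.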
Two possible design approaches:
1. The Split Approach: The global relations are progressively split into fragments.
2. The Grouping Approach: The attributes are progressively aggregated to constitute fragments.
Both are heuristic approaches, as each iteration step looks for the best choice. In both cases, formulas are used to indicate the best
Figure 6.1: The different possible join graphs
The Vertical Clustering Problem: Here the attribute sets can overlap. Depending on the requirements, two different fragments of a global relation may have more than one attribute in common. This introduces replication within fragments, since the common attributes are present in more than one fragment. It is suitable mainly for read-only applications, because applications that frequently update these common attributes must refer to all the sites where the attributes are present. Therefore, vertical clustering is suggested where the overlapping attributes are not heavily updated.
6.3.3 Mixed Fragmentation
The simple ways of performing this are:
o Apply horizontal fragmentation to vertical fragments.
o Apply vertical fragmentation to horizontal fragments.
Both of these are illustrated in Figures 6.2 and 6.3.
[Figures 6.2 and 6.3: mixed fragmentation of a relation with attributes A1, A2, A3, A4, A5]
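The two orders of mixed fragmentation can be sketched on a relation with attributes A1 to A5. This is only an illustration under assumed data: the rows, the choice of A1 as the key, the split of attributes and the selection predicates are all hypothetical.

```python
# Hypothetical relation R(A1, ..., A5); A1 is assumed to be the key.
R = [
    {"A1": 1, "A2": "x", "A3": 10, "A4": "p", "A5": 0.5},
    {"A1": 2, "A2": "y", "A3": 20, "A4": "q", "A5": 1.5},
    {"A1": 3, "A2": "z", "A3": 30, "A4": "r", "A5": 2.5},
]


def vertical(relation, attributes):
    """Vertical fragment: projection on a subset of attributes."""
    return [{a: row[a] for a in attributes} for row in relation]


def horizontal(relation, predicate):
    """Horizontal fragment: selection of the rows that satisfy a predicate."""
    return [row for row in relation if predicate(row)]


# Case 1: horizontal fragmentation applied to a vertical fragment.
V1 = vertical(R, ["A1", "A2", "A3"])
V2 = vertical(R, ["A1", "A4", "A5"])
V1_low = horizontal(V1, lambda row: row["A3"] <= 10)
V1_high = horizontal(V1, lambda row: row["A3"] > 10)

# Case 2: vertical fragmentation applied to a horizontal fragment.
H1 = horizontal(R, lambda row: row["A1"] <= 2)
H2 = horizontal(R, lambda row: row["A1"] > 2)
H1_a = vertical(H1, ["A1", "A2"])
H1_b = vertical(H1, ["A1", "A3", "A4", "A5"])
```

In case 1 the fragments stored are V1_low, V1_high and V2; in case 2 they are H1_a, H1_b and H2.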
Self Assessment Questions 6.3
1. The correctness of fragmentation requires that each global relation be selected in one and only one fragment.
2. A ________ is a join between horizontally fragmented relations.
3. In the vertical partitioning problem, the approach in which the attributes are progressively aggregated to constitute fragments is called the ________.
4. ________ is suggested where overlapping attributes are not heavily updated.
o Determine the set of all sites where the benefit of allocating one copy of the fragment is higher than the cost, and allocate a copy of the fragment to each element of this set; this method selects all beneficial sites.
o Start from a non-replicated version, then progressively introduce replicated copies, starting from the most beneficial; the process terminates when no additional replication is beneficial.
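The two allocation methods above can be sketched as follows. The unit does not give formulas for benefit and cost, so the numbers in `benefit` and `cost` and the diminishing-returns model in `example_gain` are purely hypothetical assumptions chosen to make the sketch runnable.

```python
def all_beneficial_sites(benefit, cost):
    """Method 1: keep every site whose benefit of holding a copy exceeds its cost."""
    return {site for site in benefit if benefit[site] > cost[site]}


def additional_replication(sites, net_gain):
    """Method 2: start non-replicated, then add the most beneficial copy while one helps.

    net_gain(site, allocated) is a caller-supplied (hypothetical) estimate of
    benefit minus cost of adding a copy at `site`, given the copies already placed.
    """
    # Non-replicated starting point: a single copy at the best site.
    allocated = {max(sites, key=lambda s: net_gain(s, frozenset()))}
    while True:
        candidates = [s for s in sites if s not in allocated]
        if not candidates:
            return allocated
        current = frozenset(allocated)
        best = max(candidates, key=lambda s: net_gain(s, current))
        if net_gain(best, current) <= 0:
            return allocated  # no additional replication is beneficial
        allocated.add(best)


# Hypothetical figures for one fragment, in arbitrary units.
benefit = {"site1": 120, "site2": 40, "site3": 75}
cost = {"site1": 50, "site2": 60, "site3": 70}


def example_gain(site, allocated):
    """Assumed model: each copy already placed lowers the gain of a further copy."""
    return benefit[site] - cost[site] - 10 * len(allocated)


method_1 = all_beneficial_sites(benefit, cost)
method_2 = additional_replication(list(benefit), example_gain)
```

With these assumed figures the first method replicates the fragment at site1 and site3, while the second stops after the single copy at site1, because the gain of a second copy is already negative.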
Both the reliability and the availability of the system increase if there are two or three copies of a fragment, but further copies give a less than proportional increase.
Self Assessment Questions 6.4
1. In the allocation of fragments, ________ allocation is a more complex design problem, since the degree of replication is a variable of the problem.
communication operations and optimized with respect to a cost function to be minimized. This cost function refers to computing resources such as disk I/Os, CPUs, and communication networks.
The low-level query actually implements the execution strategy for the query. The transformation must achieve both correctness and efficiency. A well-defined mapping with the functional characteristics described above makes the correctness issue easy, but producing an efficient execution strategy is more complex. A relational calculus query may have many equivalent and correct transformations into relational algebra. Since each equivalent execution strategy can lead to a different consumption of computer resources, the main problem is to select the execution strategy that minimizes resource consumption.
Self Assessment Questions 6.5
1. The role of a distributed ________ is to map a high-level query on a distributed database into a sequence of database operations on relational fragments.
2. The calculus query must be decomposed into a sequence of relational operations called an ________ query.
The main objective of query processing in a distributed environment is to transform a high-level query on a distributed database, which is seen as a single database by the users, into an efficient execution strategy expressed in a low-level language on local databases.
An important aspect of query processing is query optimization. Because many execution strategies are correct transformations of the same high-level query, the one that optimizes (minimizes) resource consumption should be retained.
Good measures of resource consumption are:
o The total cost that will be incurred in processing the query. It is the sum of all the times incurred in processing the operations of the query at the various sites and in inter-site communication.
o The response time of the query. This is the time elapsed in executing the query. Since operations can be executed in parallel at different sites, the response time of a query may be significantly less than its total cost.
Obviously, the total cost should be minimized.
o In a distributed system, the total cost to be minimized includes CPU, I/O and communication costs. The I/O cost can be minimized by reducing the number of I/O operations through fast access methods to the data and efficient use of main memory. The communication cost is the time needed for exchanging data between the sites participating in the execution of the query.
o In centralized systems, only CPU and I/O costs have to be considered.
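The difference between total cost and response time can be illustrated with a small sketch. The per-site times and the communication time are hypothetical, and the model (all sites work fully in parallel, then results are shipped once) is a deliberate simplification assumed only for this example.

```python
# Hypothetical processing time at each site for one execution strategy,
# in arbitrary time units.
site_times = {"site1": 4.0, "site2": 6.0, "site3": 3.0}
# Assumed time needed to ship the partial results to the result site.
communication_time = 2.0


def total_cost(site_times, communication_time):
    """Sum of the times spent at all sites plus the communication time."""
    return sum(site_times.values()) + communication_time


def response_time(site_times, communication_time):
    """Elapsed time when the sites work in parallel and results are then shipped."""
    return max(site_times.values()) + communication_time


print(total_cost(site_times, communication_time))     # 15.0
print(response_time(site_times, communication_time))  # 8.0
```

Under these assumptions the response time is governed by the slowest site, whereas the total cost counts the work done at every site.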
Types of Optimization: Conceptually, query optimization aims to choose the point in the solution space of all possible execution strategies that leads to the minimum cost. A popular approach is exhaustive search of the solution space, in which the cost of each strategy is predicted and the strategy with the minimum cost is selected. Because the solution space can be large, heuristic techniques are also used to restrict it. In both centralized and distributed systems, a common heuristic is to minimize the size of intermediate relations. This can be done by performing unary operations first and ordering the binary operations by the increasing size of their intermediate relations.
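The heuristic of performing unary operations first can be sketched by comparing two equivalent strategies for the same query. The relations EMP and ASG, their attributes and the selection on the department are hypothetical, and the in-memory `select` and `join` helpers stand in for the real relational operators.

```python
# Hypothetical relations: 100 employees, 25 of them in department D1.
EMP = [{"eno": i, "dept": "D1" if i % 4 == 0 else "D2"} for i in range(1, 101)]
ASG = [{"eno": i, "project": f"P{i % 5}"} for i in range(1, 101)]


def select(relation, predicate):
    """Unary operation: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def join(left, right, key):
    """Binary operation: equi-join of two relations on a common attribute."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def in_d1(row):
    return row["dept"] == "D1"


# Strategy 1: join first, then select.
joined = join(EMP, ASG, "eno")         # intermediate relation of 100 rows
result_1 = select(joined, in_d1)

# Strategy 2: select first (unary operation), then join.
filtered = select(EMP, in_d1)          # intermediate relation of 25 rows
result_2 = join(filtered, ASG, "eno")
```

Both strategies return the same 25 rows, but the second one builds a much smaller intermediate relation before the join.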
Optimization Timing: A query may be optimized at different times relative to the actual time of query execution. Optimization can be done statically, before executing the query, or dynamically, as the query is executed. The main advantage of the latter method is that the actual sizes of the intermediate relations are available to the query processor, thereby minimizing the probability of a bad choice.
Statistics: The effectiveness of query optimization relies on statistics on the database. Dynamic query optimization requires statistics in order to choose the operation that has to be done first. Static query optimization requires statistics to estimate the size of intermediate relations. The accuracy of the statistics can be improved by periodic updating.
Decision Sites: Most systems use a centralized decision approach, in which a single site generates the strategy. However, the decision process could be distributed among the various sites participating in the elaboration of the best strategy. The centralized approach is simpler but requires knowledge of the complete distributed database, whereas the distributed approach requires only local information.
Exploitation of the Network Topology: The distributed query processor generally exploits the network topology. Doing so simplifies distributed query optimization, which can be treated as two separate problems: the selection of the global execution strategy, based on inter-site communication, and the selection of each local execution strategy, based on centralized query processing algorithms. With local area networks, communication costs are comparable to I/O costs.
Exploitation of Replicated Fragments: For reliability purposes it is useful to have fragments replicated at different sites. Query processors have to exploit this information, either statically or dynamically, to process the query efficiently.
Use of Semi-Joins: The semi-join operation reduces the size of the data exchanged between the sites, so that the communication cost can be reduced.
Self Assessment Questions 6.7
1. In a distributed context, the ________ is generally some form of relational algebra augmented with communication primitives.
2. Dynamic query optimization requires ________ in order to choose the operation that has to be done first.
3. For ________ purposes it is useful to have fragments replicated at different sites.
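As a minimal sketch of the semi-join idea mentioned in this section, consider joining a relation R stored at one site with a relation S stored at another. The relations, the join attribute "a" and the three-step exchange are assumptions made for illustration; a real query processor would decide from cost estimates whether the semi-join is worthwhile.

```python
# Hypothetical relations: R is stored at site 1, S at site 2; join attribute "a".
R = [{"a": i, "b": f"r{i}"} for i in range(1, 11)]      # 10 tuples at site 1
S = [{"a": i, "c": f"s{i}"} for i in (2, 4, 50, 60)]    # 4 tuples at site 2


def semi_join(relation, join_values, key):
    """Keep only the tuples whose join attribute appears in the other relation."""
    return [row for row in relation if row[key] in join_values]


# Step 1: site 2 ships only the join column of S to site 1.
s_join_values = {row["a"] for row in S}

# Step 2: site 1 reduces R and ships only the matching tuples to site 2.
R_reduced = semi_join(R, s_join_values, "a")

# Step 3: site 2 completes the join locally.
result = [{**r, **s} for r in R_reduced for s in S if r["a"] == s["a"]]
```

Only 2 of the 10 tuples of R cross the network here, at the price of first shipping the join column of S.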
optimization, and local query optimization. The first three layers are performed by a central site and use global information; the local sites do the fourth.
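[Figure: generic layering scheme for distributed query processing. Input: calculus query on distributed relations. At the control site: query decomposition (uses the global schema), data localization (uses the fragment schema and produces the fragment query), global optimization (uses statistics). At the local sites: local optimization (uses the local schema).]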
6.8.1 Query Decomposition
The first layer decomposes the distributed calculus query into an algebraic query on global relations. The information needed for this transformation is found in the global conceptual schema describing the global relations. The information about data distribution is not used here but in the next layer. Thus the techniques used by this layer are those of a centralized DBMS. Query decomposition can be viewed as four successive steps:
o The calculus query is rewritten in a normalized form that is suitable for subsequent manipulation. Normalization of a query generally involves the manipulation of the query quantifiers and of the query qualification by applying logical operator priority.
o The normalized query is analyzed semantically so that incorrect queries are detected and rejected as early as possible. Techniques to detect incorrect queries exist only for a subset of relational calculus. Typically, they use some sort of graph that captures the semantics of the query.
o The correct query (still expressed in relational calculus) is simplified. One way to simplify a query is to eliminate redundant predicates.
o The calculus query is restructured as an algebraic query. The quality of an algebraic query is defined in terms of expected performance. The traditional way to do this transformation toward a "better" algebraic specification is to start with an initial algebraic query and transform it in order to find a "good" one. The initial algebraic query is derived immediately from the calculus query by translating the predicates and the target statement into relational operations as they appear in the query. This directly translated algebraic query is then restructured through transformation rules.
The algebraic query generated by this layer is good in the sense that the worst executions are avoided.
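The simplification step can be illustrated with a small sketch that removes redundant predicates from a conjunctive qualification. It covers only two simple cases, duplicate predicates (p AND p is equivalent to p) and contradictory equalities on the same attribute; the tuple representation of predicates and the function name are assumptions made for this example, not the method prescribed by the unit.

```python
def simplify_conjunction(predicates):
    """Simplify a conjunction of (attribute, operator, value) predicates.

    Returns None when two equality predicates on the same attribute contradict
    each other (the qualification can never be true); otherwise returns the
    predicates with duplicates removed, in their original order.
    """
    kept = []
    equalities = {}
    for attribute, operator, value in predicates:
        if operator == "=":
            if attribute in equalities and equalities[attribute] != value:
                return None  # e.g. title = 'A' AND title = 'B'
            equalities[attribute] = value
        if (attribute, operator, value) not in kept:
            kept.append((attribute, operator, value))
    return kept


# Hypothetical qualification with one redundant predicate.
qualification = [
    ("title", "=", "Programmer"),
    ("salary", ">", 30000),
    ("title", "=", "Programmer"),
]
simplified = simplify_conjunction(qualification)
```

Detecting a contradiction early lets the query be answered with an empty result without accessing any relation.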
6.8.2 Data Localization
The input to the second layer is an algebraic query on distributed relations. The main role of the second layer is to localize the query's data using data distribution information. Relations are fragmented and stored in disjoint subsets called fragments, each being stored at a different site. This layer determines which fragments are involved in the query and transforms the distributed query into a fragment query. Fragmentation is defined through fragmentation rules that can be expressed as relational operations. A distributed relation can be reconstructed by applying the fragmentation rules and then deriving a program of relational algebra operations, called a localization program, which acts on fragments. Generating a fragment query is done in two steps:
o The distributed query is mapped into a fragment query by substituting each distributed relation by its reconstruction program (also called a materialization program).
o The fragment query is simplified and restructured to produce another good query. Simplification and restructuring may be done according to the same rules used in the decomposition layer.
As in the decomposition layer, the final fragment query is generally far from optimal because information regarding fragments is not utilized.
6.8.3 Global Query Optimization
The input to the third layer is a fragment query, that is, an algebraic query on fragments. The goal of query optimization is to find an execution strategy for the query that is close to optimal. An execution strategy for a distributed query can be described with relational algebra operations and communication primitives (send/receive operations) for transferring data between sites. The previous layers have already optimized the query, for example by eliminating redundant expressions. However, this optimization is independent of fragment characteristics such as cardinalities. In addition, communication operations are not yet specified. By permuting the order of operations within one fragment query, many equivalent queries may be found. Query optimization consists of finding the best ordering of operations in the fragment query, including communication operations, that minimizes a cost function. The cost function, often defined in terms of time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost, communication cost and so on. An important aspect of query optimization is join ordering, since permutations of the joins within the query may lead to improvements of orders of magnitude. One basic technique for optimizing a sequence of distributed join operations is through the semi-join operator. The main value of the semi-join in a distributed system is to reduce the size of the join operands and hence the communication cost. The output of the query optimization layer is an optimized algebraic query on fragments, with communication operations included.
6.8.4 Local Query Optimization
The last layer is performed by all the sites that have fragments involved in the query. Each sub-query executing at one site, called a local query, is then optimized using the local schema of the site. At this time, the algorithms to perform the relational operations may be chosen. Local optimization uses the algorithms of centralized systems.
Self Assessment Questions 6.8
1. How many layers are involved in mapping the distributed query into an optimized sequence of local operations?
2. The ________ layer decomposes the distributed calculus query into an algebraic query on global relations.
3. The main role of the data localization layer is to ________ the query's data using data distribution information.
4. One basic technique for optimizing a sequence of distributed join operations is through the ________ operator.
6.9 Summary
In this unit we have discussed the four phases of the design of distributed databases: global schema, fragmentation schema, allocation schema and local schema. Some important aspects of the design of the fragmentation and allocation schemas were described. We have also provided an overview of query processing in distributed DBMSs, introducing the function, objectives and goals of query processing. We have described a characterization of query processors based on their implementation choices, and proposed a generic layering scheme for describing distributed query processing.
Answers to Self Assessment Questions 6.2
3. two
4. top-down
Answers to Self Assessment Questions 6.3
3. Grouping Approach
4. Vertical clustering
Answers to Self Assessment Questions 6.4
1. redundant
Answers to Self Assessment Questions 6.5
1. query processor
2. algebraic
communication