
Distributed Database Systems

Distributed DBMS Design

Dr. M. Alam

Distributed Database Management Systems

Outline
Introduction
Design Strategies
Distribution Design Issues
Fragmentation
    Horizontal & Vertical Fragmentation
Allocation


Design Introduction
In the general setting:
Making decisions about the placement of data and programs across the sites of a computer network, and possibly designing the network itself.

In a distributed DBMS, the placement of applications entails:
Placement of the distributed DBMS software
Placement of the applications that run on the database

Dimensions of the Design Problem


[Figure: the design problem along three axes.
Level of sharing: data; data + program.
Access pattern behavior: static; dynamic.
Level of knowledge: partial information; complete information.]

Level of Sharing
No Sharing
Each application and its data execute at one site. There is no communication with other programs and no access to data at other sites.

Data Sharing
All programs are replicated at all sites, but the data files are not. User requests are handled at the home site only, and the necessary data files are moved around the network.

Data and Program Sharing
A program at one site can request a service from a program at a second site, which in turn may have to access a data file located at a third site. Program sharing is difficult (sometimes impossible) in heterogeneous systems, whereas moving data is relatively easy.


Access Pattern Behavior


User requests may be static (no change over time) or dynamic. Most real-life applications are dynamic. What is the nature and degree of this dynamism?
It depends on the relationship between distributed DB design and query processing.


Level of Knowledge
No Information
The designer has no information about how users will access the DB.

Complete Information
Access patterns can be predicted and do not deviate significantly from the predictions.

Partial Information
Actual access patterns may deviate from the predictions.


Distributed Design Strategies


Top-down Design Process
Mostly used for designing systems from scratch
Mostly used for designing homogeneous systems

Bottom-up Design Process


When the databases already exist at a number of sites


Top-down Approach
[Figure: top-down design process. Requirements Analysis yields the System Requirements (Objectives). With user input, Conceptual Design and View Design follow, linked by View Integration; they yield the Global Conceptual Schema (GCS), the External Schema Definitions and the Access Information. With further user input, Distribution Design yields the Local Conceptual Schemas. Physical Design yields the Physical Schema. Observation and Monitoring feeds back into the earlier steps.]

Top-down Approach (Cont.)


Requirements Analysis: Defines the environment of the system and elaborates the data and processing needs of all potential DB users.
System Requirements: Specify which objectives (performance, reliability and availability, economics, flexibility) the final system is expected to achieve.
View Design
Deals with defining the interfaces for end users. Results in individual External Schema Definitions (for global users).

Conceptual Design: produces the Global Conceptual Schema (GCS)
The process by which the enterprise is examined. It integrates the user views.
1. Entity analysis: determining the entities, their attributes and the relationships between them.
2. Functional analysis: concerned with determining the function

Top-down Approach (Cont.)


The activities so far are similar to centralized DB design.
Distribution Design
Designs the LCSs by distributing the entities over the sites of the distributed system. It consists of two steps: Fragmentation and Allocation.

Physical Design: Maps each LCS to the physical storage devices available at the corresponding site. Access pattern information about the fragments is its input.
Monitoring: The design and development activity needs constant monitoring and periodic adjustment.


Bottom-up Approach
Suitable for applications where databases already exist.
The design task involves integrating these into one database.
The starting point is the individual conceptual schemas.
Exists primarily in the context of heterogeneous databases.

[Figure: Multidatabase Architecture, showing External Schemas, the Global Schema, Component Schemas and Local Schemas.]

Distribution Design Issues


1. Why fragment at all?
2. How to fragment?
3. How much should be fragmented?
4. How to test correctness?
5. Allocation strategy
6. Required information


Fragmentation
Unit of distribution: the relation?
Application views are usually subsets of relations, so distributing whole relations causes extra communication: either an unnecessarily high volume of remote data access or unnecessary replication.

Can't we just distribute relations? Instead of distributing whole relations, why not divide each relation into sub-relations (fragments) and distribute the fragments?
Fragments of relations (sub-relations)
Views that cannot be defined on a single fragment will require extra processing.
Semantic data control (especially integrity enforcement) is more difficult.


Merits & Demerits of Fragmentation


Merits
1. Permits a number of transactions to execute concurrently
2. Results in parallel execution of a single query
3. Increases the level of concurrency (intra-query concurrency)
4. Increases system throughput
Demerits
1. Applications whose views are defined on more than one fragment may suffer performance degradation if they have conflicting requirements.
2. Simple tasks, such as checking for dependencies, may require chasing after data at a number of sites.
3. Difficult to manage in the case of non-exclusive fragmentation (replication)
4. Maintenance of integrity constraints becomes harder


Fragmentation Alternatives
Horizontal Fragmentation
Partitions a table along its tuples.
Performed on the basis of some predicate (condition).
Two kinds: Primary and Derived Horizontal Fragmentation.

Vertical Fragmentation
Different subsets of attributes are stored at different places


Fragmentation : Example
Original relation
Id   Name     Sal   Dept.
1    Akram    10K   D1
2    Bashir   15K   D2
3    Saleem   20K   D3

Horizontal Fragmentation (rows split: Sal ≤ 15K versus Sal > 15K)
Id   Name     Sal   Dept.
1    Akram    10K   D1
2    Bashir   15K   D2

Id   Name     Sal   Dept.
3    Saleem   20K   D3

Vertical Fragmentation (columns split: primary key retained)
Id   Name
1    Akram
2    Bashir
3    Saleem

Id   Sal   Dept.
1    10K   D1
2    15K   D2
3    20K   D3

Horizontal Fragmentation Example


PRJ1 : Projects with Bud < 30M
PRJ2 : Projects with Bud ≥ 30M

PRJ
P-ID   P-Name     Bud   LOC
P1     Bridge     20M   Karachi
P2     Flyover    40M   Lahore
P3     Building   50M   Islamabad
P4     School     30M   Karachi

PRJ1
P-ID   P-Name     Bud   LOC
P1     Bridge     20M   Karachi

PRJ2
P-ID   P-Name     Bud   LOC
P2     Flyover    40M   Lahore
P3     Building   50M   Islamabad
P4     School     30M   Karachi


Vertical Fragmentation Example


PRJ1: Information about project budgets
PRJ2: Information about project names and locations

PRJ
P-ID   P-Name     Bud   LOC
P1     Bridge     20M   Karachi
P2     Flyover    40M   Lahore
P3     Building   50M   Islamabad
P4     School     30M   Karachi

PRJ1
P-ID   Bud
P1     20M
P2     40M
P3     50M
P4     30M

PRJ2
P-ID   P-Name     LOC
P1     Bridge     Karachi
P2     Flyover    Lahore
P3     Building   Islamabad
P4     School     Karachi

Degree of Fragmentation
There is a finite number of alternatives, ranging from whole relations (no fragmentation) to individual tuples or attributes (maximal fragmentation).

The task is to find a suitable level of partitioning within this range: a compromise between not fragmenting at all and fragmenting down to individual tuples or columns.

Correctness of Fragmentation
Completeness
A decomposition of relation R into fragments R1, R2, ..., Rn is complete if and only if each data item in R can also be found in some Ri (lossless decomposition).

Reconstruction
If relation R is decomposed into fragments R1, R2, ..., Rn, then there should exist some relational operator $\nabla$ such that
$R = \nabla_{1 \le i \le n} R_i, \quad R_i \in F_R$
The operator $\nabla$ will be different for different forms of fragmentation. Reconstruction ensures that constraints defined on the data in the form of dependencies are preserved.

Disjointness
If relation R is decomposed into fragments R1, R2, ..., Rn, and data item di is in Rj, then di should not be in any other fragment Rk (k ≠ j).
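The three rules can be checked mechanically for a horizontal fragmentation. The following is a minimal sketch, assuming tuples are hashable and that union is the reconstruction operator; the function name `check_fragmentation` is illustrative and not part of the slides. The data is the small employee relation from the earlier fragmentation example, with salaries in thousands.

```python
def check_fragmentation(relation, fragments):
    """Return (complete, reconstructs, disjoint) for tuple-level fragments."""
    original = set(relation)
    parts = [set(f) for f in fragments]
    union = set().union(*parts)
    complete = original <= union          # every tuple of R is in some Ri
    reconstructs = union == original      # the union of the fragments gives back R
    # No tuple occurs in two fragments iff the sizes add up to the union size.
    disjoint = sum(len(p) for p in parts) == len(union)
    return complete, reconstructs, disjoint


# Assumed sample data: (Id, Name, Sal in thousands, Dept.)
R = {(1, "Akram", 10, "D1"), (2, "Bashir", 15, "D2"), (3, "Saleem", 20, "D3")}
R1 = {t for t in R if t[2] <= 15}
R2 = {t for t in R if t[2] > 15}

ok = check_fragmentation(R, [R1, R2])
overlapping = check_fragmentation(R, [R1, R])  # R1 is repeated inside R
```

For vertical fragments the same idea applies, but reconstruction uses a join on the key rather than a union.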

Other Fragmentation Issues


Privacy
Security
Bandwidth of connection
Reliability
Replication
Consistency
Local user needs


Allocation Alternatives
Once the database has been fragmented properly, the fragments have to be allocated to the various sites of the network.
Non-replicated Database
Partitioned: each fragment resides at only one site

Replicated Database
Fully replicated: each fragment at each site
Partially replicated: each fragment at some of the sites

Replication is advantageous if
(read-only queries) / (update queries) ≥ 1

Reference: W. W. Chu, "Optimal File Allocation in a Multiple Computer System," IEEE Transactions on Computers, pp. 885-889, October 1969.

Comparison of Replication Alternatives


                       Full Replication       Partial Replication   Partitioning
Query Processing       Easy                   Same difficulty       Same difficulty
Directory Management   Easy or nonexistent    Same difficulty       Same difficulty
Concurrency Control    Moderate               Difficult             Easy
Reliability            Very high              High                  Low
Reality                Possible application   Realistic             Possible application

Information Requirements
Database Information
Application Information
Communication Network Information
Computer System Information


Database Information
Need to consult the conceptual DB design.
Apart from the tables, we need the relationships, their cardinalities, and the owner and member tables.
Selectivity of fragments and size of a fragment.
How can DB relations be connected with each other (join)?
Directed links relate relations to each other using the equi-join operation.
Owner(L1) = PAY while Member(L1) = EMP

[Figure: links among relations.
PAY(Title, Sal) --L1--> EMP(ID, Name, Title)
EMP(ID, Name, Title) --L2--> ASIGN(ID, P-ID, Resp, Dur)
PRJ(P-ID, P-Name, Bud, LOC) --L3--> ASIGN(ID, P-ID, Resp, Dur)]

Application Information
Qualitative and quantitative information
Qualitative information: guides the fragmentation activity
User queries use predicates; at least the most important ones should be known.

Simple predicates: Given a relation R[A1, A2, ..., An], a simple predicate pj is
$p_j : A_i \ \theta \ \mathit{Value}$
where $\theta \in \{=, <, \le, >, \ge, \ne\}$, $\mathit{Value} \in D_i$, and $D_i$ is the domain of $A_i$.
For relation R we define Pr = {p1, p2, ..., pm}.
Example: P-Name = "Building", Bud ≤ 40M, ...

Minterm predicates: Given R and Pr = {p1, p2, ..., pm}, the set M = {m1, m2, ..., mz} is defined as
$M = \{\, m_i \mid m_i = \bigwedge_{p_j \in Pr} p_j^{*} \,\}, \quad 1 \le j \le m, \ 1 \le i \le z$
where $p_j^{*} = p_j$ or $p_j^{*} = \neg p_j$.
Some minterm predicates may be meaningless.

Application Information (Cont.)


Example: simple predicates P-Name = "Building" and Bud ≤ 40M
Minterm predicates
m1: (P-Name = "Building") ∧ (Bud ≤ 40M)
m2: ¬(P-Name = "Building") ∧ (Bud ≤ 40M)
m3: (P-Name = "Building") ∧ ¬(Bud ≤ 40M)
m4: ¬(P-Name = "Building") ∧ ¬(Bud ≤ 40M)
Note: Some minterm predicates may be meaningless.
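Enumerating minterm predicates is mechanical: each simple predicate is taken either in its natural or in its negated form. The sketch below is illustrative only; the function name `minterm_predicates`, the textual labels and the representation of a tuple as a dictionary (budget in millions) are assumptions, not part of the slides.

```python
from itertools import product


def minterm_predicates(simple):
    """Return {label: test} for every conjunction of natural/negated predicates."""
    names = list(simple)
    result = {}
    for signs in product([True, False], repeat=len(names)):
        label = " AND ".join(n if s else f"NOT({n})" for n, s in zip(names, signs))
        # Bind the current sign pattern as a default argument of the lambda.
        result[label] = lambda row, signs=signs: all(
            simple[n](row) == s for n, s in zip(names, signs)
        )
    return result


simple = {
    'P-Name = "Building"': lambda r: r["P-Name"] == "Building",
    "Bud <= 40": lambda r: r["Bud"] <= 40,
}
M = minterm_predicates(simple)

# Project P3 from the PRJ relation: exactly one minterm holds for any tuple.
row = {"P-ID": "P3", "P-Name": "Building", "Bud": 50}
satisfied = [label for label, test in M.items() if test(row)]
```

With m simple predicates this produces 2^m minterms, which is why contradictory ones are eliminated before fragmenting.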


Application Information (Cont.)


Quantitative information: used in the allocation models.
Two sets of data are required:

Minterm Selectivity [sel(mi)]
The number of tuples of the relation that would be accessed by a user query specified according to a given minterm predicate mi.

Access Frequency
The frequency with which user applications access data. acc(qi) gives the frequency of query qi in a given period; acc(mi) represents the access frequency of minterm mi.


Communication Network Information


Bandwidth
Latency
Communication overhead

Computer System Information

Unit cost of storing data at a site
Unit cost of processing at a site


Horizontal Fragmentation
Partitions a relation along its tuples.

Primary Horizontal Fragmentation (PHF)
Performed using predicates that are defined on the relation itself.

Derived Horizontal Fragmentation (DHF)
Partitions a relation according to predicates that are defined on another relation.


Primary Horizontal Fragmentation


Definition:
$R_j = \sigma_{F_j}(R), \quad 1 \le j \le w$
where Fj is a selection formula, which is (preferably) a minterm predicate.

Therefore, a horizontal fragment Ri of relation R consists of all the tuples of R which satisfy a minterm predicate mi.

Given a set of minterm predicates M, there are as many horizontal fragments of relation R as there are minterm predicates. The set of horizontal fragments is also referred to as the set of minterm fragments.


PHF - Example
$R_1 = \sigma_{Sal \le 15K}(R)$
$R_2 = \sigma_{Sal > 15K}(R)$

R
Id   Name     Sal   Dept.
1    Akram    10K   D1
2    Bashir   15K   D2
3    Saleem   20K   D3

R1
Id   Name     Sal   Dept.
1    Akram    10K   D1
2    Bashir   15K   D2

R2
Id   Name     Sal   Dept.
3    Saleem   20K   D3
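A primary horizontal fragment is simply a selection over the relation. The sketch below reproduces the example, assuming rows are dictionaries and salaries are stored in thousands; the helper name `select` is illustrative.

```python
def select(relation, predicate):
    """Relational selection: keep the rows for which the predicate holds."""
    return [row for row in relation if predicate(row)]


R = [
    {"Id": 1, "Name": "Akram", "Sal": 10, "Dept": "D1"},
    {"Id": 2, "Name": "Bashir", "Sal": 15, "Dept": "D2"},
    {"Id": 3, "Name": "Saleem", "Sal": 20, "Dept": "D3"},
]

R1 = select(R, lambda r: r["Sal"] <= 15)  # Sal <= 15K
R2 = select(R, lambda r: r["Sal"] > 15)   # Sal > 15K
```

Because the two predicates are complementary, every row lands in exactly one fragment.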


PHF Algorithm
Given: A relation R and the set of simple predicates Pr
Output: The set of fragments of R = {R1, R2, ..., Rw} which obey the fragmentation rules.
Preliminaries:
Pr should be complete
Pr should be minimal


Completeness of Simple Predicates


Completeness: A set of simple predicates Pr is said to be complete if and only if there is an equal probability of access by every application to any tuple belonging to any minterm fragment that is defined according to Pr.
A complete set of predicates should be used as the basis of PHF.

Example
Assume PRJ[P-ID, P-Name, Bud, LOC] has two applications defined on it.
1. Find the budgets of projects at each location. [Complete]
2. Find projects with budgets less than 40M. [Not Complete]

Minimality of Simple Predicates


If a predicate influences how fragmentation is performed (i.e., causes a fragment f to be further fragmented into, say, fi and fj), then there should be at least one application that accesses fi and fj differently. In other words, the simple predicate should be relevant in determining a fragmentation.
If all the predicates of a set Pr are relevant, then Pr is minimal.
Let mi and mj be minterm predicates that are identical in definition, except that mi contains the simple predicate Pi in its natural form while mj contains the complement of Pi. Also let fi and fj be the two fragments defined according to mi and mj. Then Pi is relevant if and only if
$\dfrac{acc(m_i)}{card(f_i)} \ne \dfrac{acc(m_j)}{card(f_j)}$
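The relevance test compares the access frequency per tuple of the two fragments. A minimal sketch follows; the function name `is_relevant` and the sample numbers are hypothetical. The comparison is done by cross-multiplication, so no division is needed.

```python
def is_relevant(acc_i, card_i, acc_j, card_j):
    """True if acc(mi)/card(fi) differs from acc(mj)/card(fj)."""
    # Cross-multiplied form of the inequality between the two ratios.
    return acc_i * card_j != acc_j * card_i


# Hypothetical figures: both fragments are accessed 5 times per tuple,
# so splitting on this predicate would not be relevant.
same_rate = is_relevant(acc_i=10, card_i=2, acc_j=20, card_j=4)

# Hypothetical figures: 5 accesses per tuple versus 1.25 per tuple.
different_rate = is_relevant(acc_i=10, card_i=2, acc_j=5, card_j=4)
```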

Example of Minimality
PRJ1 : Projects with LOC = Karachi
PRJ2 : Projects with LOC = Lahore
PRJ3 : Projects with Bud ≤ 30M
PRJ4 : Projects with Bud > 30M
Pr = {LOC = Karachi, LOC = Lahore, Bud ≤ 30M, Bud > 30M} is minimal (in addition to being complete).
However, if we add P-Name = Building, then Pr is no longer minimal, because the new predicate is not relevant with respect to Pr: no application would access the resulting fragments differently.

Horizontal Fragmentation Example


PRJ
P-ID   P-Name     Bud   LOC
P1     Bridge     20M   Karachi
P2     Flyover    40M   Lahore
P3     Building   50M   Islamabad
P4     School     30M   Karachi

PRJ1
P-ID   P-Name     Bud   LOC
P1     Bridge     20M   Karachi
P4     School     30M   Karachi

PRJ2
P-ID   P-Name     Bud   LOC
P2     Flyover    40M   Lahore

PRJ3
P-ID   P-Name     Bud   LOC
P1     Bridge     20M   Karachi
P4     School     30M   Karachi

PRJ4
P-ID   P-Name     Bud   LOC
P2     Flyover    40M   Lahore
P3     Building   50M   Islamabad

PHF- COM_MIN Algorithm


Given: A relation R and a set of simple predicates Pr
Output: A complete and minimal set of simple predicates Pr' for Pr
Rule 1: A relation or fragment is partitioned into at least two parts which are accessed differently by at least one application.
Declare: F is the set of minterm fragments


PHF-COM-MIN Algorithm (Cont.)


Initialization:
Find a pi ∈ Pr such that pi partitions R according to Rule 1
Pr' ← pi ; Pr ← Pr − pi ; F ← fi

Iteratively add predicates to Pr' until it is complete:
Find a pj ∈ Pr such that pj partitions some fk, defined according to a minterm predicate over Pr', according to Rule 1
Set Pr' ← Pr' ∪ pj ; Pr ← Pr − pj ; F ← F ∪ fj
If ∃ pk ∈ Pr' which is non-relevant, then Pr' ← Pr' − pk ; F ← F − fk


Horizontal Partitioning Algorithm


Makes use of COM_MIN to perform fragmentation.
Input: A relation R and a set of simple predicates Pr
Output: A set of minterm predicates M according to which relation R is to be fragmented
Pr' ← COM_MIN(R, Pr)
Determine the set M of minterm predicates
Determine the set I of implications among pi ∈ Pr'
Eliminate the contradictory minterms from M

Example Fragmentation of PAY


Application:
Check the salary information and determine a raise.
Assume that employee records are kept at two sites, so the application runs at two sites.
Simple predicates
p1 : SAL ≤ 7000 and p2 : SAL > 7000
Pr = {p1, p2}

Apply the COM_MIN algorithm
Initializing with i = 1 gives Pr' = {p1}. This set is complete and minimal, because p2 cannot further partition f1 (the minterm fragment with respect to p1). So Pr' = {p1}.
Minterm predicates
m1 : (SAL ≤ 7000)
m2 : ¬(SAL ≤ 7000) = (SAL > 7000)

Example Fragmentation of PAY


PAY
Title            Sal
Superintendent   5000
Designer         7000
Programmer       6000
Engineer         8500
Analyst          9000

PAY1
Title            Sal
Superintendent   5000
Designer         7000
Programmer       6000

PAY2
Title            Sal
Engineer         8500
Analyst          9000

Example Fragmentation of PRJ


Applications:
1. Find the name and budget of projects given their location; issued at three sites.
2. Access project information according to budget: one site accesses Bud ≤ 30M, the other accesses Bud > 30M.

Simple predicates for (1)
p1 : LOC = Karachi
p2 : LOC = Lahore
p3 : LOC = Islamabad
Simple predicates for (2)
p4 : Bud ≤ 30M
p5 : Bud > 30M

PRJ
P-ID   P-Name     Bud   LOC
P1     Bridge     20M   Karachi
P2     Flyover    40M   Lahore
P3     Building   50M   Islamabad
P4     School     30M   Karachi

Example Fragmentation of PRJ


Pr = Pr' = {p1, p2, p3, p4, p5}
Implications (these must be based on the database semantics, not on the current values):
p1 ⇒ ¬p2 ∧ ¬p3
p2 ⇒ ¬p1 ∧ ¬p3
p3 ⇒ ¬p1 ∧ ¬p2
p4 ⇒ ¬p5 and p5 ⇒ ¬p4

The predicates p1 to p5 in Pr' give rise to many minterm predicates, for example
p1 ∧ p2 ∧ p3 ∧ p4 ∧ p5

The contradictory minterm predicates are excluded using the implications.

Fragmentation of PRJ (Cont.)


Minterm predicates left after elimination
m1 : (LOC = Karachi) ∧ (Bud ≤ 30M)
m2 : (LOC = Karachi) ∧ (Bud > 30M)      [fragment empty]
m3 : (LOC = Lahore) ∧ (Bud ≤ 30M)       [fragment empty]
m4 : (LOC = Lahore) ∧ (Bud > 30M)
m5 : (LOC = Islamabad) ∧ (Bud ≤ 30M)    [fragment empty]
m6 : (LOC = Islamabad) ∧ (Bud > 30M)

PHF Example
PRJ1
P-ID   P-Name     Bud   LOC
P1     Bridge     20M   Karachi
P4     School     30M   Karachi

PRJ2 (empty)
P-ID   P-Name     Bud   LOC

PRJ3 (empty)
P-ID   P-Name     Bud   LOC

PRJ4
P-ID   P-Name     Bud   LOC
P2     Flyover    40M   Lahore

PRJ5 (empty)
P-ID   P-Name     Bud   LOC

PRJ6
P-ID   P-Name     Bud   LOC
P3     Building   50M   Islamabad
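The elimination step can be sketched in code: enumerate all 2^5 sign patterns over p1 to p5, discard the contradictory ones, and collect the tuples of PRJ that satisfy each remaining minterm. This is an illustrative sketch, not the slides' algorithm verbatim. It assumes that every project has exactly one of the three locations, which is stronger than the listed implications alone, and it stores budgets in millions. The names `consistent` and `phorizontal` are hypothetical.

```python
from itertools import product

PRJ = [
    {"P-ID": "P1", "P-Name": "Bridge", "Bud": 20, "LOC": "Karachi"},
    {"P-ID": "P2", "P-Name": "Flyover", "Bud": 40, "LOC": "Lahore"},
    {"P-ID": "P3", "P-Name": "Building", "Bud": 50, "LOC": "Islamabad"},
    {"P-ID": "P4", "P-Name": "School", "Bud": 30, "LOC": "Karachi"},
]

PREDICATES = {
    "p1": lambda r: r["LOC"] == "Karachi",
    "p2": lambda r: r["LOC"] == "Lahore",
    "p3": lambda r: r["LOC"] == "Islamabad",
    "p4": lambda r: r["Bud"] <= 30,
    "p5": lambda r: r["Bud"] > 30,
}
NAMES = list(PREDICATES)


def consistent(sign):
    """Assumed semantics: exactly one location and exactly one budget range hold."""
    one_location = sum(sign[p] for p in ("p1", "p2", "p3")) == 1
    one_budget = sign["p4"] != sign["p5"]
    return one_location and one_budget


def phorizontal(relation):
    """Map each surviving minterm (named by its true predicates) to its P-IDs."""
    fragments = {}
    for values in product([True, False], repeat=len(NAMES)):
        sign = dict(zip(NAMES, values))
        if not consistent(sign):
            continue  # contradictory minterm, eliminated
        key = tuple(p for p in NAMES if sign[p])
        fragments[key] = [
            r["P-ID"] for r in relation
            if all(PREDICATES[p](r) == sign[p] for p in NAMES)
        ]
    return fragments


fragments = phorizontal(PRJ)
non_empty = {k: v for k, v in fragments.items() if v}
```

Six minterms survive out of 32, and with the current data three of the corresponding fragments are empty.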

PHF Correctness
Completeness
Resulting fragmentation is guaranteed to be complete as long as selection predicates are complete. Since Pr' is complete and minimal, the selection predicates are complete

Reconstruction
If relation R is fragmented into FR = {R1, R2, ..., Rr}, then
$R = \bigcup_{R_i \in F_R} R_i$

Disjointness
Minterm predicates that form the basis of fragmentation should be mutually exclusive.

Derived Horizontal Fragmentation (DHF)


Derived Horizontal Fragmentation


Defined on a member relation of a link according to a selection operation specified on its owner.
The link between the owner and the member relations is defined as an equi-join.
An equi-join can be implemented by means of semijoins.
Given a link L where owner(L) = S and member(L) = R, the derived horizontal fragments of R are defined as
$R_i = R \ltimes S_i, \quad 1 \le i \le w$
where w is the maximum number of fragments that will be defined on R, and
$S_i = \sigma_{F_i}(S)$
where Fi is the formula according to which the primary horizontal fragment Si is defined.

DHF Example
Consider the link L1: Owner(L1) = PAY, Member(L1) = EMP.
We want to group employees on the basis of their salaries: one group with salary less than or equal to 7000 and the other with salary above 7000.
Three inputs are needed: the partition of the owner (PAY1 and PAY2), the member relation, and the set of semijoin predicates between owner and member (EMP.Title = PAY.Title).

[Figure: links among relations.
PAY(Title, Sal) --L1--> EMP(ID, Name, Title)
EMP(ID, Name, Title) --L2--> ASG(ID, P-ID, Resp, Dur)
PRJ(P-ID, P-Name, Bud, LOC) --L3--> ASG(ID, P-ID, Resp, Dur)]

DHF Example (Cont.)


Consider relations EMP and PAY. The fragmentation of EMP is
$EMP_1 = EMP \ltimes PAY_1$
$EMP_2 = EMP \ltimes PAY_2$
where
$PAY_1 = \sigma_{Sal \le 7000}(PAY)$
$PAY_2 = \sigma_{Sal > 7000}(PAY)$

PAY1
Title            Sal
Superintendent   5000
Designer         7000
Programmer       6000

PAY2
Title            Sal
Engineer         8500
Analyst          9000

Operations are performed on smaller relations.
Joins are performed in a distributed fashion.
A relation may have more than one link, in which case more than one DHF is possible.

DHF Example (Result)


EMP
ID   Name     Title
1    Akram    Superintendent
2    Bashir   Designer
3    Saleem   Programmer
4    Aslam    Engineer
5    Hafeez   Programmer
6    Liaqat   Analyst

EMP1
ID   Name     Title
1    Akram    Superintendent
2    Bashir   Designer
3    Saleem   Programmer
5    Hafeez   Programmer

EMP2
ID   Name     Title
4    Aslam    Engineer
6    Liaqat   Analyst
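The derived fragments of EMP can be computed with a semijoin against each fragment of PAY. The sketch below assumes rows are dictionaries; the helper name `semijoin` is illustrative.

```python
def semijoin(member, owner_fragment, attr):
    """Keep the member rows whose join attribute value occurs in the owner fragment."""
    keys = {row[attr] for row in owner_fragment}
    return [row for row in member if row[attr] in keys]


PAY = [
    {"Title": "Superintendent", "Sal": 5000},
    {"Title": "Designer", "Sal": 7000},
    {"Title": "Programmer", "Sal": 6000},
    {"Title": "Engineer", "Sal": 8500},
    {"Title": "Analyst", "Sal": 9000},
]
EMP = [
    {"ID": 1, "Name": "Akram", "Title": "Superintendent"},
    {"ID": 2, "Name": "Bashir", "Title": "Designer"},
    {"ID": 3, "Name": "Saleem", "Title": "Programmer"},
    {"ID": 4, "Name": "Aslam", "Title": "Engineer"},
    {"ID": 5, "Name": "Hafeez", "Title": "Programmer"},
    {"ID": 6, "Name": "Liaqat", "Title": "Analyst"},
]

# Primary horizontal fragments of the owner relation.
PAY1 = [p for p in PAY if p["Sal"] <= 7000]
PAY2 = [p for p in PAY if p["Sal"] > 7000]

# Derived horizontal fragments of the member relation.
EMP1 = semijoin(EMP, PAY1, "Title")
EMP2 = semijoin(EMP, PAY2, "Title")
```

The fragments of EMP inherit their boundaries from PAY, so EMP itself never needs a salary attribute.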

Assignment: Fragment ASG where location is Karachi or other cities.



DHF Correctness
Completeness
Difficult to define, because two relations are involved.
Let R be the member relation of a link whose owner is relation S, which is fragmented as FS = {S1, S2, ..., Sn}. Furthermore, let A be the join attribute between R and S. Then, for each tuple t of R, there should be a tuple t' of S such that t[A] = t'[A].
Referential integrity: ensures that the tuples of any fragment of the member relation are also in the owner relation. For example, there is no ASG tuple with a P-ID that is not also contained in PRJ. Similarly, every Title that appears in EMP also appears in PAY.

Reconstruction
Same as primary horizontal fragmentation.

Disjointness
Guaranteed when the join graph between the owner and the member fragments is simple.


Vertical Fragmentation


Vertical Fragmentation
More difficult than horizontal fragmentation, because more alternatives exist.
Easier to enforce functional dependencies.
Approaches: Grouping and Splitting

Grouping
Starts by assigning each attribute to one fragment.
At each step, joins some of the fragments until some criterion is satisfied.
Results in overlapping fragments.

Splitting
Starts with a relation and decides on a beneficial partitioning based on the access behavior of applications to the attributes.
Fits more naturally within the top-down design.
Generates non-overlapping fragments.


VF Information Requirements
Application Information
Attribute affinities
A measure that indicates how closely related the attributes are. It is obtained from more primitive usage data.

Attribute usage values
Given a set of queries Q = {q1, q2, ..., qq} that will run on the relation R[A1, A2, ..., An],
$use(q_i, A_j) = \begin{cases} 1 & \text{if attribute } A_j \text{ is referenced by query } q_i \\ 0 & \text{otherwise} \end{cases}$

* use(qi, •) can be defined accordingly

VF Definition of use(qi,Aj)
Consider the following 4 queries for relation PRJ:
q1: SELECT BUD FROM PRJ WHERE P-ID = Value
q2: SELECT P-NAME, BUD FROM PRJ
q3: SELECT P-NAME FROM PRJ WHERE LOC = Value
q4: SELECT SUM(BUD) FROM PRJ WHERE LOC = Value

Let A1 = P-ID, A2 = P-NAME, A3 = BUD, A4 = LOC

      A1   A2   A3   A4
q1     1    0    1    0
q2     0    1    1    0
q3     0    1    0    1
q4     0    0    1    1

VF Affinity Measure aff(Ai, Aj)


The attribute affinity measure between two attributes Ai and Aj of a relation R[A1, A2, ..., An] with respect to the set of applications Q = (q1, q2, ..., qq) is defined as follows:
$aff(A_i, A_j) = \sum_{\text{all queries that access } A_i \text{ and } A_j} (\text{query access})$
$\text{query access} = \sum_{\text{all sites}} (\text{access frequency of a query}) \times \dfrac{\text{access}}{\text{execution}}$
where ref(qi) = access/execution is the number of accesses to the attributes per execution of query qi.

VF Calculation of aff(Ai, Aj)


Assume each query in the previous example accesses the attributes once during each execution: refi(qk) = 1.
Assume the access frequencies acci(qk) are

      S1   S2   S3
q1    15   20   10
q2     5    0    0
q3    25   25   25
q4     3    0    0

Then
aff(A1, A3) = acc1(q1)*1 + acc2(q1)*1 + acc3(q1)*1 = 15*1 + 20*1 + 10*1 = 45

The attribute affinity matrix AA is

      A1   A2   A3   A4
A1    45    0   45    0
A2     0   80    5   75
A3    45    5   53    3
A4     0   75    3   78

Assignment: Make the calculations for the affinity matrix AA.
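The affinity matrix can be derived directly from the attribute usage values and the access frequencies. The sketch below assumes one access per execution (ref = 1), as in the example; the names `USE`, `ACC` and `affinity_matrix` are illustrative.

```python
# Attributes referenced by each query (the use(qi, Aj) = 1 entries).
USE = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}
# Access frequencies of each query at sites S1, S2, S3.
ACC = {
    "q1": [15, 20, 10],
    "q2": [5, 0, 0],
    "q3": [25, 25, 25],
    "q4": [3, 0, 0],
}
ATTRS = ["A1", "A2", "A3", "A4"]


def affinity_matrix(use, acc, attrs, ref=1):
    """aff(Ai, Aj): total accesses by queries that reference both Ai and Aj."""
    matrix = []
    for ai in attrs:
        row = []
        for aj in attrs:
            total = sum(
                ref * sum(acc[q])          # sum over all sites
                for q in use
                if ai in use[q] and aj in use[q]
            )
            row.append(total)
        matrix.append(row)
    return matrix


AA = affinity_matrix(USE, ACC, ATTRS)
```

The diagonal entry aff(Ai, Ai) is the total access frequency of all queries that reference Ai.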

VF Clustering Algorithm
Take the attribute affinity matrix AA and reorganize the attribute order to form clusters in which the attributes demonstrate high affinity to one another.
The Bond Energy Algorithm (BEA) has been used for clustering of entities. BEA finds an ordering of entities (in our case attributes) such that the global affinity measure AM is maximized.

$AM = \sum_{i} \sum_{j} (\text{affinity of } A_i \text{ and } A_j \text{ with their neighbors})$

$AM = \sum_{i=1}^{n} \sum_{j=1}^{n} aff(A_i, A_j) \, [\, aff(A_i, A_{j-1}) + aff(A_i, A_{j+1}) + aff(A_{i-1}, A_j) + aff(A_{i+1}, A_j) \,]$

where
$aff(A_0, A_j) = aff(A_i, A_0) = aff(A_{n+1}, A_j) = aff(A_i, A_{n+1}) = 0$

With column operations only:
$AM = \sum_{i=1}^{n} \sum_{j=1}^{n} aff(A_i, A_j) \, [\, aff(A_i, A_{j-1}) + aff(A_i, A_{j+1}) \,]$

Bond Energy Algorithm


Input: The AA matrix
Output: The clustered affinity matrix CA, which is a perturbation of AA
Initialization: Place and fix one of the columns of AA in CA.
Iteration: Place the remaining n-i columns in the remaining i+1 positions in the CA matrix. For each column, choose the placement that makes the greatest contribution to the global affinity measure.
Row order: Order the rows according to the column ordering.

Bond Energy Algorithm


Best placement? Define the contribution of placing Ak between Ai and Aj:
$cont(A_i, A_k, A_j) = 2\,bond(A_i, A_k) + 2\,bond(A_k, A_j) - 2\,bond(A_i, A_j)$

where
$bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x) \, aff(A_z, A_y)$

Calculation of Bond & Contribution


cont(A1, A4, A2) = 2 bond(A1, A4) + 2 bond(A4, A2) - 2 bond(A1, A2)

bond(A1, A4) = aff(A1, A1) aff(A1, A4) + aff(A2, A1) aff(A2, A4) + aff(A3, A1) aff(A3, A4) + aff(A4, A1) aff(A4, A4)
             = 45*0 + 0*75 + 45*3 + 0*78 = 135
bond(A4, A2) = aff(A1, A4) aff(A1, A2) + aff(A2, A4) aff(A2, A2) + aff(A3, A4) aff(A3, A2) + aff(A4, A4) aff(A4, A2)
             = 0*0 + 75*80 + 3*5 + 78*75 = 6000 + 15 + 5850 = 11865
bond(A1, A2) = aff(A1, A1) aff(A1, A2) + aff(A2, A1) aff(A2, A2) + aff(A3, A1) aff(A3, A2) + aff(A4, A1) aff(A4, A2)
             = 45*0 + 0*80 + 45*5 + 0*75 = 225

cont(A1, A4, A2) = 2*135 + 2*11865 - 2*225 = 23550

BEA Example
Consider the following AA matrix and the corresponding CA matrix in which A1 and A2 have been placed. Place A3:

AA =
      A1   A2   A3   A4
A1    45    0   45    0
A2     0   80    5   75
A3    45    5   53    3
A4     0   75    3   78

CA =
      A1   A2
      45    0
       0   80
      45    5
       0   75

Ordering (0-3-1):
cont(A0, A3, A1) = 2 bond(A0, A3) + 2 bond(A3, A1) - 2 bond(A0, A1) = 2*0 + 2*4410 - 2*0 = 8820

Ordering (1-3-2):
cont(A1, A3, A2) = 2 bond(A1, A3) + 2 bond(A3, A2) - 2 bond(A1, A2) = 2*4410 + 2*890 - 2*225 = 10150

Ordering (2-3-4):
cont(A2, A3, A4) = 1780

BEA Example
Therefore, the CA matrix takes the form

      A1   A3   A2
      45   45    0
       0    5   80
      45   53    5
       0    3   75

When A4 is placed, the final form of the CA matrix (after row organization) is

      A1   A3   A2   A4
A1    45   45    0    0
A3    45   53    5    3
A2     0    5   80   75
A4     0    3   75   78

The top-left and bottom-right blocks have a high affinity measure; the off-diagonal blocks have a low affinity measure.
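The placement procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not necessarily the exact algorithm of the slides: the first two columns (A1 and A2) are fixed in their original order, each later column is inserted at the position with the largest contribution, and ties go to the leftmost position. `None` stands for the boundary columns A0 and A(n+1), whose bonds are 0. The function names are hypothetical.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """bond(Ax, Ay) = sum over z of aff(Az, Ax) * aff(Az, Ay)."""
    if x is None or y is None:
        return 0
    return sum(row[x] * row[y] for row in aa)


def cont(aa, i, k, j):
    """Contribution of placing column k between columns i and j."""
    return 2 * bond(aa, i, k) + 2 * bond(aa, k, j) - 2 * bond(aa, i, j)


def bea_order(aa):
    """Return the column order (0-based attribute indices) chosen by BEA."""
    order = [0, 1]  # A1 and A2 are placed first
    for k in range(2, len(aa)):
        best_pos, best_cont = 0, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            c = cont(aa, left, k, right)
            if best_cont is None or c > best_cont:
                best_pos, best_cont = pos, c
        order.insert(best_pos, k)
    return order


def clustered(aa, order):
    """Reorder both columns and rows of AA according to the column order."""
    return [[aa[r][c] for c in order] for r in order]


order = bea_order(AA)
CA = clustered(AA, order)
```

On the example matrix the resulting order is A1, A3, A2, A4, which matches the clustered matrix shown above.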

VF Algorithm
Divide a set of clustered attributes A = {A1, A2, ..., An} into two (or more) sets {A1, A2, ..., Ai} and {Ai+1, ..., An} such that there are no (or minimal) applications that access both (or more than one) of the sets.
Set of applications: Q = {q1, q2, ..., qq}

[Figure: the clustered affinity matrix with attributes A1, A2, ..., Ai, Ai+1, ..., Am along both axes, split at a point on the diagonal into a top block TA (A1 to Ai) and a bottom block BA (Ai+1 to Am).]

VF Algorithm (Definitions)
AQ(qi) = set of attributes accessed by application qi
       = { Aj | use(qi, Aj) = 1 }
TQ = set of applications that access only TA
   = { qi | AQ(qi) ⊆ TA }
BQ = set of applications that access only BA
   = { qi | AQ(qi) ⊆ BA }
OQ = set of applications that access both TA and BA
   = Q − { TQ ∪ BQ }


VF Algorithm (Definitions)
CTQ = total number of accesses to attributes by applications that access only TA
$CTQ = \sum_{q_i \in TQ} \sum_{\forall S_j} ref_j(q_i) \, acc_j(q_i)$

CBQ = total number of accesses to attributes by applications that access only BA
$CBQ = \sum_{q_i \in BQ} \sum_{\forall S_j} ref_j(q_i) \, acc_j(q_i)$

COQ = total number of accesses to attributes by applications that access both TA and BA
$COQ = \sum_{q_i \in OQ} \sum_{\forall S_j} ref_j(q_i) \, acc_j(q_i)$

Then find the point along the diagonal that maximizes
$z = CTQ \cdot CBQ - COQ^2$

VF Algorithm (Problems)
1. Cluster forming in the middle of the CA matrix
Shift a row up and a column left and apply the algorithm to find the best partitioning point.
Do this for all possible shifts.
Cost: O(m^2)

2. More than two clusters
m-way partitioning
Try 1, 2, ..., m-1 split points along the diagonal and try to find the best point for each of these.
Cost: O(2^m)
Alternative: perform binary splitting iteratively until there are m partitions.


VF Example
Apply the partitioning algorithm to the CA matrix obtained for relation PRJ.

      A1   A3   A2   A4
A1    45   45    0    0
A3    45   53    5    3
A2     0    5   80   75
A4     0    3   75   78

Result: FPRJ = {PRJ1, PRJ2}
PRJ1 = {A1, A3} = {P-ID, Bud}
PRJ2 = {A1, A2, A4} = {P-ID, P-Name, LOC}
** Each Ri contains the key attribute(s) or system-assigned tuple IDs (TIDs)
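The split point can be found by evaluating z = CTQ * CBQ - COQ^2 at every position along the diagonal. The sketch below assumes the attribute usage and access frequencies of the PRJ example with one access per execution; the name `best_split` is illustrative, and the key attribute is added to the second fragment afterwards, as in the result above.

```python
USE = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}
ACC = {
    "q1": [15, 20, 10],
    "q2": [5, 0, 0],
    "q3": [25, 25, 25],
    "q4": [3, 0, 0],
}
ORDER = ["A1", "A3", "A2", "A4"]  # attribute order of the clustered matrix


def best_split(order, use, acc):
    """Return (z, TA, BA) for the diagonal split point that maximizes z."""
    best = None
    for i in range(1, len(order)):
        ta, ba = set(order[:i]), set(order[i:])
        ctq = cbq = coq = 0
        for q, attrs in use.items():
            total = sum(acc[q])   # accesses over all sites, ref = 1 assumed
            if attrs <= ta:
                ctq += total      # q accesses only TA
            elif attrs <= ba:
                cbq += total      # q accesses only BA
            else:
                coq += total      # q accesses both
        z = ctq * cbq - coq ** 2
        if best is None or z > best[0]:
            best = (z, order[:i], order[i:])
    return best


z, TA, BA = best_split(ORDER, USE, ACC)
```

Here the best split separates {A1, A3} from {A2, A4}, since only q2 and q4 (8 accesses in total) span both sets.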

VF Correctness
A relation R, defined over attribute set A and key K, generates the vertical partitioning FR = {R1, R2, ..., Rr}.

Completeness
The following should be true for A:
$A = \bigcup_{R_i \in F_R} A_{R_i}$

Reconstruction
Reconstruction can be achieved by
$R = \bowtie_{K} R_i, \quad \forall R_i \in F_R$

Disjointness
TIDs are not considered to be overlapping, since they are maintained by the system and are totally invisible to the users.
Duplicated keys are not considered to be overlapping.
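Reconstruction by a join on the key can be illustrated with the PRJ relation. This is a minimal sketch assuming rows are dictionaries and P-ID is the key; the helper names `project` and `join_on_key` are hypothetical.

```python
def project(relation, attrs):
    """Relational projection onto the given attributes."""
    return [{a: row[a] for a in attrs} for row in relation]


def join_on_key(left, right, key):
    """Equi-join of two fragments on a shared key attribute."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


PRJ = [
    {"P-ID": "P1", "P-Name": "Bridge", "Bud": 20, "LOC": "Karachi"},
    {"P-ID": "P2", "P-Name": "Flyover", "Bud": 40, "LOC": "Lahore"},
    {"P-ID": "P3", "P-Name": "Building", "Bud": 50, "LOC": "Islamabad"},
    {"P-ID": "P4", "P-Name": "School", "Bud": 30, "LOC": "Karachi"},
]

# Vertical fragments; the key P-ID is retained in both.
PRJ1 = project(PRJ, ["P-ID", "Bud"])
PRJ2 = project(PRJ, ["P-ID", "P-Name", "LOC"])

rebuilt = join_on_key(PRJ1, PRJ2, "P-ID")

# Completeness: the attributes of the fragments cover all attributes of PRJ.
covered = set(PRJ1[0]) | set(PRJ2[0])
```

The only attribute shared by the two fragments is the key, which is why they still count as disjoint.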

Hybrid Fragmentation
Horizontal or vertical fragmentation alone may not be sufficient to satisfy the requirements of user applications.
Hybrid fragmentation consists of a horizontal fragment that is vertically fragmented, or a vertical fragment that is horizontally fragmented.
In horizontal fragmentation the minimum size of a fragment is one tuple, whereas in vertical fragmentation it is one attribute per fragment.
$\sigma_p(\pi_{a_1, \ldots, a_n}(R))$ or $\pi_{a_1, \ldots, a_n}(\sigma_p(R))$

[Figure: hybrid fragmentation tree. R is horizontally fragmented (HF) into R1 and R2; R1 is vertically fragmented (VF) into R11 and R12; R2 is vertically fragmented (VF) into R21, R22 and R23.]

Hybrid Fragmentation- Example

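[Figure: two example partitioning trees for a relation R. In the first, R is vertically fragmented (VF) into R1 and R2, which are then horizontally fragmented (HF) into R11, R12, R13 and R21, R22. In the second, R is horizontally fragmented (HF) into R1, R2 and R3, and fragments are then vertically fragmented (VF).]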

Hybrid Fragmentation - Correctness


Completeness
It is complete if intermediate & leaf fragments are complete.

Disjointness
If intermediate & leaf fragments are disjoint then hybrid fragmentation is also disjoint.

Reconstruction
Start from the leaves of the partitioning tree and move upward, performing joins and unions. (See Figure)

[Figure: reconstruction tree with leaf fragments R11, R12, R21, R22 and R23.]
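Bottom-up reconstruction of a hybrid fragmentation can be sketched as follows: vertical fragments are joined on the key first, then the resulting horizontal fragments are unioned. This is a minimal sketch under assumed data; the relation `R`, its attributes, the budget predicate and the helper names are hypothetical, and each horizontal fragment is split into only two vertical fragments to keep the example short.

```python
def select(relation, pred):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if pred(t)]


def project(relation, attrs):
    """Projection: keep only the given attributes of every tuple."""
    return [{a: t[a] for a in attrs} for t in relation]


def join_on_key(left, right, key):
    """Join two vertical fragments on the shared key."""
    index = {t[key]: t for t in right}
    return [{**t, **index[t[key]]} for t in left if t[key] in index]


# Hypothetical relation (values are made up).
R = [
    {"id": 1, "name": "a", "budget": 10},
    {"id": 2, "name": "b", "budget": 300},
    {"id": 3, "name": "c", "budget": 250},
]

# Horizontal fragmentation at the root of the tree.
R1 = select(R, lambda t: t["budget"] <= 200)
R2 = select(R, lambda t: t["budget"] > 200)

# Vertical fragmentation of each horizontal fragment (key repeated).
R11, R12 = project(R1, ["id", "name"]), project(R1, ["id", "budget"])
R21, R22 = project(R2, ["id", "name"]), project(R2, ["id", "budget"])

# Reconstruction: joins at the lower level, union at the root.
rebuilt = join_on_key(R11, R12, "id") + join_on_key(R21, R22, "id")
print(sorted(rebuilt, key=lambda t: t["id"]) == R)
```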

Allocation Model


Fragment Allocation
Problem Statement
Given
$F = \{F_1, F_2, \ldots, F_n\}$: fragments
$S = \{S_1, S_2, \ldots, S_m\}$: network sites
$Q = \{q_1, q_2, \ldots, q_q\}$: applications
Find the "optimal" distribution of F to S.

Optimality
Minimal cost (usually in terms of time): communication + storage + processing (querying & update)
Performance: response time and/or throughput
Constraints: per-site constraints (storage & processing)


Information Requirements
Database information
selectivity of fragments
size of a fragment

Application information
access types and numbers
access localities

Communication network information
bandwidth
latency
communication overhead

Computer system information
unit cost of storing data at a site
unit cost of processing at a site

Allocation
File Allocation (FAP) vs Database Allocation (DAP)
The FAP model separates query processing cost into two parts: retrieval processing cost and update processing cost. In the DAP model, query processing cost consists of processing cost (PC) and transmission cost (TC).

Fragments are not individual files
Relationships have to be maintained

Access to databases is more complicated
Remote file access model is not applicable
Relationship between allocation and query processing

Cost of integrity enforcement (IE) should be considered
Cost of concurrency control (CC) should be considered

Allocation Information Requirements


Database Information
$sel_i(F_j)$: selectivity of fragment $F_j$ with respect to query $q_i$
Size of a fragment: $size(F_j) = card(F_j) \cdot length(F_j)$

Application Information
$RR_{ij}$: number of read accesses of query $q_i$ to fragment $F_j$
$UR_{ij}$: number of update accesses of query $q_i$ to fragment $F_j$
$UM\{u_{ij}\}$: a matrix indicating which queries update which fragments
$RM\{r_{ij}\}$: a similar matrix for retrievals
$O\{o(i)\}$: $o(i)$ specifies the originating site of query $q_i$

Site Information
$USC_k$: unit cost of storing data at site $S_k$
$LPC_k$: unit cost of processing at site $S_k$

Network Information
$g_{ij}$: communication cost per frame between sites $S_i$ and $S_j$
Frame size in bytes
Complex networks: channel capacity, distance between sites, protocol overhead and so on

Allocation Model
Objective: minimize the total cost of processing and storage while meeting certain constraints.

General Form
min(Total Cost)
subject to
response time constraint
storage constraint
processing constraint

Decision Variable
$x_{ij} = \begin{cases} 1 & \text{if fragment } F_i \text{ is stored at site } S_j \\ 0 & \text{otherwise} \end{cases}$

Allocation Model Total Cost


Total Cost [TOC]
$TOC = \sum_{\text{all queries}} QPC_i + \sum_{\text{all sites}} \sum_{\text{all fragments}} STC_{jk}$
where $QPC_i$ is the query processing cost of query $q_i$ and $STC_{jk}$ is the cost of storing fragment $F_j$ at site $S_k$.

Storage Cost (of fragment Fj at site Sk)
$STC_{jk} = USC_k \cdot size(F_j) \cdot x_{jk}$
(here $x_{jk} = 1$ if fragment $F_j$ is stored at site $S_k$, and 0 otherwise)

Query Processing Cost (for one query)
$QPC_i = PC_i + TC_i$ (processing component + transmission component)

Allocation Model
Query Processing Cost
Processing component
$PC_i = AC_i + IE_i + CC_i$ (access cost + integrity enforcement cost + concurrency control cost)

Access cost
$AC_i = \sum_{\text{all sites}} \sum_{\text{all fragments}} (\text{no. of update accesses} + \text{no. of read accesses}) \cdot x_{jk} \cdot \text{local processing cost at a site}$

$AC_i = \sum_{\text{all sites}} \sum_{\text{all fragments}} (u_{ij} \cdot UR_{ij} + r_{ij} \cdot RR_{ij}) \cdot x_{jk} \cdot LPC_k$

Integrity enforcement and concurrency control costs can be calculated similarly, although the unit local processing cost will be different for them. (Discussed in detail in Chapters 6 and 11.)
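The access cost of a single query can be evaluated directly from the matrices defined earlier. This is a minimal sketch, assuming the matrices are stored as nested lists with 0-based indices (query i, fragment j, site k); the function name `access_cost` and all numeric values are made up for illustration.

```python
def access_cost(i, u, r, UR, RR, x, LPC):
    """AC_i: sum over sites k and fragments j of
    (u_ij * UR_ij + r_ij * RR_ij) * x_jk * LPC_k."""
    n_frag, n_site = len(x), len(LPC)
    return sum((u[i][j] * UR[i][j] + r[i][j] * RR[i][j]) * x[j][k] * LPC[k]
               for k in range(n_site) for j in range(n_frag))


# Hypothetical instance: one query, two fragments, two sites.
u = [[1, 0]]          # query 0 updates fragment 0 only
r = [[1, 1]]          # query 0 reads both fragments
UR = [[2, 0]]         # update accesses per fragment
RR = [[10, 4]]        # read accesses per fragment
x = [[1, 0],          # fragment 0 stored at site 0
     [1, 1]]          # fragment 1 stored at sites 0 and 1
LPC = [1, 2]          # unit local processing cost per site

print(access_cost(0, u, r, UR, RR, x, LPC))
```

Note that the cost grows with every replica of a fragment, because the sum runs over all sites where $x_{jk} = 1$.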

Allocation Model
Query Processing Cost
Transmission Component
$TC_i$ = cost of processing updates + cost of processing retrievals

Cost of updates (inform all replicas)
$\sum_{\text{all sites}} \sum_{\text{all fragments}} \text{update message cost} + \sum_{\text{all sites}} \sum_{\text{all fragments}} \text{acknowledgment cost}$

Cost of retrievals
$\sum_{\text{all fragments}} \min_{\text{all sites}} (\text{cost of retrieval command} + \text{cost of sending back the result})$

Allocation Model - Constraints


Response Time
execution time of $q_i \le$ max. allowable response time for $q_i$

Storage Constraint
$\sum_{\text{all fragments}}$ storage requirement of a fragment at site $S_k \le$ storage capacity at site $S_k$

Processing Constraint
$\sum_{\text{all queries}}$ processing load of $q_i$ at site $S_k \le$ processing capacity of site $S_k$
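The "minimize total cost subject to constraints" form can be illustrated with an exhaustive search on a tiny instance. This is a minimal sketch under simplifying assumptions that are not part of the slides: no replication (each fragment is placed at exactly one site), only the storage constraint is enforced, and the cost is storage cost plus a per-access communication cost between the query's originating site and the fragment's site. The function name `best_allocation` and all numbers are hypothetical.

```python
from itertools import product


def best_allocation(sizes, USC, capacity, acc, origin, g):
    """Try every non-replicated placement and return (cost, placement).

    sizes[j]    -- size of fragment j
    USC[k]      -- unit storage cost at site k
    capacity[k] -- storage capacity of site k
    acc[i][j]   -- number of accesses of query i to fragment j
    origin[i]   -- originating site of query i
    g[a][b]     -- communication cost per access between sites a and b
    """
    n, m = len(sizes), len(USC)
    best = None
    for assign in product(range(m), repeat=n):   # m**n candidate placements
        used = [0] * m
        for j, k in enumerate(assign):
            used[k] += sizes[j]
        if any(used[k] > capacity[k] for k in range(m)):
            continue                             # storage constraint violated
        storage = sum(USC[k] * sizes[j] for j, k in enumerate(assign))
        comm = sum(acc[i][j] * g[origin[i]][assign[j]]
                   for i in range(len(acc)) for j in range(n))
        total = storage + comm
        if best is None or total < best[0]:
            best = (total, assign)
    return best


# Hypothetical instance: two fragments, two sites, two queries.
result = best_allocation(sizes=[10, 20], USC=[1, 1], capacity=[25, 25],
                         acc=[[5, 0], [0, 5]], origin=[0, 1],
                         g=[[0, 3], [3, 0]])
print(result)
```

The search space grows as m^n even without replication, which is why exhaustive search is only practical for very small instances.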

Allocation Model
Solution Methods
FAP is NP-complete
DAP is also NP-complete

Heuristics based on
Single commodity warehouse location (for FAP)
Knapsack problem
Branch and bound techniques
Network flow algorithm

NP-Complete
1. NP refers to "nondeterministic polynomial time." NP is the set of all decision problems for which the instances where the answer is "yes" have efficiently verifiable proofs of the fact that the answer is indeed "yes." More precisely, these proofs have to be verifiable in polynomial time by a deterministic Turing machine.

OR

2. NP is the set of decision problems where the "yes"-instances can be recognized in polynomial time by a non-deterministic Turing machine.

The equivalence of the two definitions follows from the fact that an algorithm on such a non-deterministic machine consists of two phases: the first consists of a guess about the solution, generated in a non-deterministic way, while the second consists of a deterministic algorithm that verifies or rejects the guess as a valid solution to the problem.
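The "guess, then verify" view of NP can be illustrated with the decision version of the 0/1 knapsack problem, chosen here only because knapsack appears among the heuristics listed earlier. This is a minimal sketch; the function names and the numbers are assumptions for illustration. Verifying a proposed selection takes time linear in its size, while a deterministic search may have to try up to 2^n subsets.

```python
from itertools import combinations


def verify(weights, values, capacity, target, chosen):
    """Polynomial-time check of a proposed 'yes' certificate:
    distinct valid items, total weight <= capacity, total value >= target."""
    return (len(set(chosen)) == len(chosen)
            and all(0 <= i < len(weights) for i in chosen)
            and sum(weights[i] for i in chosen) <= capacity
            and sum(values[i] for i in chosen) >= target)


def decide_by_search(weights, values, capacity, target):
    """Deterministic search: tries up to 2**n subsets, verifying each one."""
    n = len(weights)
    return any(verify(weights, values, capacity, target, subset)
               for size in range(n + 1)
               for subset in combinations(range(n), size))


# Hypothetical instance.
weights, values = [3, 4, 5], [4, 5, 6]
print(verify(weights, values, 7, 9, (0, 1)))       # certificate check
print(decide_by_search(weights, values, 7, 10))    # exhaustive search
```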

Attempts to reduce the solution space


Assume all candidate partitionings are known; select the best partitioning and placement for each relation.
Ignore replication at first; replication is handled in the next step (greedy algorithm).

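The second step, adding replicas greedily to a non-replicated placement, might look like the sketch below for a single fragment. The cost model is an assumption made only for this illustration and is not taken from the slides: a read from a site without a replica costs `c`, an update must be sent to every replica at another site at cost `c` each, and each replica costs `s` to store. The function names are hypothetical.

```python
def cost(replicas, reads, updates, c, s):
    """Assumed cost of holding one fragment at the given set of sites."""
    read_cost = sum(n * c for k, n in enumerate(reads) if k not in replicas)
    update_cost = sum(n * c * len(replicas - {k})
                      for k, n in enumerate(updates))
    return read_cost + update_cost + s * len(replicas)


def greedy_replicate(start, reads, updates, c, s):
    """Start from the non-replicated placement {start}; repeatedly add the
    replica that lowers the cost most, and stop when none lowers it."""
    replicas = {start}
    current = cost(replicas, reads, updates, c, s)
    while True:
        candidates = [k for k in range(len(reads)) if k not in replicas]
        if not candidates:
            break
        best_site = min(candidates,
                        key=lambda k: cost(replicas | {k}, reads, updates, c, s))
        best_cost = cost(replicas | {best_site}, reads, updates, c, s)
        if best_cost >= current:
            break
        replicas.add(best_site)
        current = best_cost
    return replicas, current


# Hypothetical instance: three sites, fragment initially at site 0.
replicas, total = greedy_replicate(start=0, reads=[10, 8, 1],
                                   updates=[1, 1, 1], c=1, s=2)
print(replicas, total)
```

A greedy step never removes a replica, so the result is not guaranteed to be optimal; it trades optimality for a much smaller search space.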
