
Distributed Database Management System

CSE-4845 Final Exam

Theories from Previous Question


Ques: Briefly describe the Distributed Query Processing Steps. [OR]
Ques: Transform a high-level query (of relational calculus/SQL) into an equivalent and more
efficient lower-level query (of relational algebra). [OR]
Ques: Discuss the normalization step of query decomposition.
Ans: Query processing: A 3-step process that transforms a high-level query (of relational
calculus/SQL) into an equivalent and more efficient lower-level query (of relational algebra).
1. Parsing and translation
- Check syntax and verify relations.
- Translate the query into an equivalent relational algebra expression.
2. Optimization
- Generate an optimal (lowest-cost) evaluation plan for the query.
3. Evaluation
- The query-execution engine takes an (optimal) evaluation plan, executes that plan, and
returns the answers to the query.

Ques: Write down the steps for transforming a query into a normalized form in the query
decomposition process.
Ans:
1. Normalization: Transform the query to a normalized form
2. Analysis: Detect and reject "incorrect" queries; possible only for a subset of relational
calculus
3. Elimination of redundancy: Eliminate redundant predicates
4. Rewriting: Transform the query into relational algebra (RA) and optimize it
Ques: Write down the properties of transaction. [OR]
Ques: Write down the ACID properties of Transaction.
Ans:
➢ Atomicity − This property states that a transaction is an atomic unit of processing, that is,
either it is performed in its entirety or not performed at all. No partial update should exist.
➢ Consistency − A transaction should take the database from one consistent state to another
consistent state. It should not adversely affect any data item in the database.
➢ Isolation − A transaction should be executed as if it is the only one in the system. There
should not be any interference from the other concurrent transactions that are
simultaneously running.
➢ Durability − If a committed transaction brings about a change, that change should be
durable in the database and not lost in case of any failure.
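Atomicity can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the account table, the transfer function, and the balances are hypothetical and exist only for this illustration. A failed transfer is rolled back as a whole, so no partial update remains.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from account `src` to account `dst` as one atomic unit."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Consistency rule of this toy example: no negative balances
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the committed state as {account id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)       # succeeds and commits
failed = transfer(conn, 1, 2, 500)  # the debit is undone by the rollback
print(ok, failed, balances(conn))
```

The second transfer debits account 1 before the check fails; the rollback removes that debit, which is the "no partial update" part of atomicity.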
Ques: Why do we need data localization? Also mention its issues.
Ans: We need data localization to
- Apply data distribution information to the algebra operations and determine which
fragments are involved
- Substitute global query with queries on fragments
- Optimize the global query
Data Localization Issues
➢ Various more advanced reduction techniques are possible to generate simpler and
optimized queries.
➢ Reduction of horizontal fragmentation (HF)
– Reduction with selection
– Reduction with join
➢ Reduction of vertical fragmentation (VF)
– Find empty relations
➢ Reduction with selection for HF
– Can be applied if fragmentation predicate is inconsistent with the query selection
predicate
➢ Reduction with join for HF
– Joins on horizontally fragmented relations can be simplified when the joined relations
are fragmented according to the join attributes
➢ Reduction with join for derived HF
– The horizontal fragmentation of one relation is derived from the horizontal
fragmentation of another relation by using semijoins.
➢ Reduction for Vertical Fragmentation
– Recall, VF distributes a relation based on projection, and the reconstruction operator
is the join.
– Similar to HF, it is possible to identify useless intermediate relations, i.e., fragments
that do not contribute to the result.
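Reduction with selection for HF can be sketched as follows. The sketch assumes each horizontal fragment is defined by a simple range predicate on one integer attribute; the Fragment class, the EMP fragments, and the bounds are hypothetical. A fragment is dropped when its predicate contradicts the query's selection predicate, because it would only yield an empty relation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:
    name: str
    low: int   # inclusive lower bound of the fragmentation predicate
    high: int  # exclusive upper bound of the fragmentation predicate

def relevant_fragments(fragments, q_low, q_high):
    """Keep fragments whose range [low, high) overlaps the query range [q_low, q_high)."""
    return [f.name for f in fragments if f.low < q_high and q_low < f.high]

# Hypothetical relation EMP, horizontally fragmented on the employee number
EMP = [
    Fragment("EMP1", 0, 100),
    Fragment("EMP2", 100, 200),
    Fragment("EMP3", 200, 300),
]

point_query = relevant_fragments(EMP, 120, 121)  # selection: eno = 120
range_query = relevant_fragments(EMP, 50, 150)   # selection: 50 <= eno < 150
print(point_query, range_query)
```

Only the surviving fragments need to be accessed, so the localized query touches fewer sites.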
Ques: Write down the steps of Two-Phase Locking Protocol.
Ans:

Ques: What do you mean by Local Recovery Management (LRM) in DDBMS? [OR]
Ques: Discuss about the architecture of the Local Recovery Management System.
Ans:
Ques: How many ways of LRM to deal with update/ write operation? Describe it.
Ans:

Ques: Write down the taxonomy of concurrency control algorithm.


Ans:
Ques: Which tool is useful to identify the deadlock?

Ques: What do you know about the failure in 2PC protocol?


Ans:
Coordinator Timeouts:
- Timeout in INITIAL
- Timeout in WAIT
- Timeout in ABORT or COMMIT
Coordinator Site Failures:
- Failure in INITIAL
- Failure in WAIT
- Failure in ABORT or COMMIT
Participant Timeouts:
- Timeout in INITIAL
- Timeout in READY
Participant Site Failures:
- Failure in INITIAL
- Failure in READY
- Failure in ABORT or COMMIT

Ques: What is reliability in DDBMS?


Ans:
Ques: Describe in-place update and out-place update.
Ans: In-Place Update
- The database log keeps enough information about the database updates.

Out-place Update

Ques: Is there any difference between Two-Phase “Locking” and Two-Phase “Commit”?
Ans: They are largely unrelated; it just happens that they have two-phase in their name.
2PL is a scheme for acquiring locks for records in a transaction; it is useful in both non-distributed
and distributed settings.
2PC is a scheme to execute a transaction across multiple machines, where each machine has some
of the records used in the transaction.
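The contrast can be made concrete with a minimal sketch of each scheme. Both functions are simplified illustrations with hypothetical names, not full protocol implementations: the first checks the two-phase rule of 2PL for one transaction (no lock is acquired after the first unlock), and the second shows the voting logic of a 2PC coordinator without logging, timeouts, or recovery.

```python
def obeys_2pl(ops):
    """ops: list of ("lock" | "unlock", item) issued by ONE transaction.

    Returns True if all locks are acquired (growing phase) before any
    lock is released (shrinking phase).
    """
    shrinking = False
    for action, _item in ops:
        if action == "unlock":
            shrinking = True
        elif action == "lock" and shrinking:
            return False  # lock requested after a release: violates 2PL
    return True

def two_phase_commit(participants):
    """participants: dict mapping a site name to a callable that returns
    True (vote commit) or False (vote abort) when asked to prepare."""
    # Phase 1: the coordinator asks every participant to prepare and collects votes
    votes = {}
    for name, prepare in participants.items():
        try:
            votes[name] = bool(prepare())
        except Exception:
            votes[name] = False  # an unreachable or failing site counts as an abort vote
    # Phase 2: commit only if every vote was "commit", otherwise abort everywhere
    decision = "COMMIT" if all(votes.values()) else "ABORT"
    return decision, votes

well_formed = obeys_2pl([("lock", "x"), ("lock", "y"), ("unlock", "x"), ("unlock", "y")])
ill_formed = obeys_2pl([("lock", "x"), ("unlock", "x"), ("lock", "y")])
all_yes, _ = two_phase_commit({"site1": lambda: True, "site2": lambda: True})
one_no, _ = two_phase_commit({"site1": lambda: True, "site2": lambda: False})
print(well_formed, ill_formed, all_yes, one_no)
```

The first function concerns the lock pattern inside a single transaction; the second concerns agreement among several machines on one outcome.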
Ques: What do you know about the data localization?
Ans: Data localization takes as input the decomposed query on global relations and applies data
distribution information to the query in order to localize its data.
To increase the locality of reference and/or parallel execution, relations are fragmented and then
stored in disjoint subsets, called fragments, each being placed at a different site.
Data localization determines which fragments are involved in the query and thereby transforms
the distributed query into a fragment query.
Ques: What do you know about query optimization?
Ans: Query Optimization
➢ Query optimization is a crucial and difficult part of the overall query processing
➢ Objective of query optimization is to minimize the following cost function:
I/O cost + CPU cost + communication cost
➢ Two different scenarios are considered:
- Wide area networks
∗ Communication cost dominates: Low bandwidth, low speed, and high protocol
overhead
∗ Most algorithms ignore all other cost components
- Local area networks
∗ Communication cost not that dominant
∗ Total cost function should be considered
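The cost function can be written as a small calculation. The breakdown of communication cost into a per-message part and a per-byte part, and all weights and plan figures below, are assumptions chosen for illustration. The same two plans rank differently under WAN-like and LAN-like weights.

```python
def total_cost(io_ops, cpu_instructions, messages, bytes_sent,
               c_io, c_cpu, c_msg, c_byte):
    """Total cost = I/O cost + CPU cost + communication cost."""
    io_cost = c_io * io_ops
    cpu_cost = c_cpu * cpu_instructions
    # Assumed split: fixed cost per message plus a cost per byte transferred
    comm_cost = c_msg * messages + c_byte * bytes_sent
    return io_cost + cpu_cost + comm_cost

# Hypothetical unit costs: communication is expensive on a WAN, cheap on a LAN
wan = dict(c_io=1, c_cpu=1, c_msg=1000, c_byte=10)
lan = dict(c_io=1, c_cpu=1, c_msg=1, c_byte=1)

# Plan A ships a lot of data but does little local work; plan B does the opposite
plan_a = dict(io_ops=500, cpu_instructions=100, messages=1, bytes_sent=2000)
plan_b = dict(io_ops=5000, cpu_instructions=100, messages=2, bytes_sent=200)

for label, weights in (("WAN", wan), ("LAN", lan)):
    print(label, total_cost(**plan_a, **weights), total_cost(**plan_b, **weights))
```

With the WAN weights plan B is cheaper because it ships less data; with the LAN weights plan A is cheaper because local I/O dominates.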

Ques: How do we identify and reject the type incorrect or semantically incorrect queries in
query decomposition.
Ans:
Type Incorrect
– Checks whether the attributes and relation names of a query are defined in the global schema
– Checks whether the operations on attributes do not conflict with the types of the attributes, e.g.,
a comparison > operation with an attribute of type string
Semantically Incorrect
– Checks whether the components contribute in any way to the generation of the result
– Only a subset of relational calculus queries can be tested for correctness, i.e., those that do not
contain disjunction and negation
– Typical data structures used to detect the semantically incorrect queries are:
∗ Connection graph (query graph)
∗ Join graph
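A common use of the query graph is a connectivity test: if the graph is not connected, some part of the query does not contribute to the result, and the query is treated as semantically incorrect. A minimal sketch, with hypothetical relation names and a RESULT node standing for the projected output:

```python
from collections import defaultdict

def is_connected(nodes, edges):
    """Return True if every node is reachable from the first one.

    nodes: relation names plus the result node
    edges: (a, b) pairs, one per join predicate or projection edge
    """
    nodes = list(nodes)
    if not nodes:
        return True
    adjacent = defaultdict(set)
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:  # depth-first traversal
        current = stack.pop()
        for neighbour in adjacent[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen >= set(nodes)

relations = ["RESULT", "EMP", "ASG", "PROJ"]

# All relations are linked by join predicates: the graph is connected
good_query = [("RESULT", "EMP"), ("EMP", "ASG"), ("ASG", "PROJ")]
# The join predicate between ASG and PROJ is missing: PROJ is disconnected
bad_query = [("RESULT", "EMP"), ("EMP", "ASG")]

print(is_connected(relations, good_query), is_connected(relations, bad_query))
```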
Ques: What do you know about Transaction?
Ans: A transaction consists of a series of operations performed on a database. The important issue
in transaction management is that if a database was in a consistent state prior to the initiation of a
transaction, then the database should return to a consistent state after the transaction is completed.

Ques: Briefly describe the states of Transaction.


Ans: States of a Transaction (mnemonic: APCFA)
1. Active: Initial state and during the execution
2. Partially Committed: After the final statement has been executed
3. Committed: After successful completion
4. Failed: After discovery that normal execution can no longer proceed
5. Aborted: After the transaction has been rolled back and the DB restored to its state prior
to the start of the transaction. The transaction can then be restarted or killed.
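The states can be read as a small state machine. The transition table below follows the usual textbook state diagram and is an assumption here, since the notes list the states but not the allowed moves; the function name is hypothetical.

```python
# Allowed moves between transaction states (assumed from the standard diagram)
TRANSITIONS = {
    "active": {"partially committed", "failed"},
    "partially committed": {"committed", "failed"},
    "failed": {"aborted"},
    "committed": set(),  # terminal state
    "aborted": set(),    # terminal state (the work may be resubmitted as a new transaction)
}

def final_state(events, state="active"):
    """Follow a list of target states; return the last state, or None if a move is illegal."""
    for target in events:
        if target not in TRANSITIONS[state]:
            return None
        state = target
    return state

print(final_state(["partially committed", "committed"]))
print(final_state(["failed", "aborted"]))
print(final_state(["committed"]))  # illegal: active cannot jump straight to committed
```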
Ques: What do you know about the formalization of transaction? Give an example.
Ans: Formalization of Transaction is arranging the transaction according to a fixed structure
Example

Ques: What is Dirty Read? Give an example.

Ans: A dirty read occurs when a transaction reads data that has not yet been committed.

For example, suppose transaction 1 updates a row. Transaction 2 reads the updated row before
transaction 1 commits the update. If transaction 1 rolls back the change, transaction 2 will have
read data that is considered never to have existed.
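The scenario can be replayed with a toy in-memory store. The Store class and its methods are hypothetical and only model committed versus uncommitted values; they are not how a real DBMS implements isolation.

```python
class Store:
    """Keeps committed values separate from each transaction's pending writes."""

    def __init__(self, data):
        self.committed = dict(data)
        self.pending = {}  # transaction id -> {key: uncommitted value}

    def write(self, txn, key, value):
        self.pending.setdefault(txn, {})[key] = value

    def read(self, key, allow_dirty):
        if allow_dirty:
            # A dirty read sees another transaction's uncommitted write
            for writes in self.pending.values():
                if key in writes:
                    return writes[key]
        return self.committed[key]

    def commit(self, txn):
        self.committed.update(self.pending.pop(txn, {}))

    def rollback(self, txn):
        self.pending.pop(txn, None)

store = Store({"x": 10})
store.write("T1", "x", 99)                     # T1 updates x but has not committed
dirty = store.read("x", allow_dirty=True)      # T2 reads the uncommitted value
clean = store.read("x", allow_dirty=False)     # a non-dirty read sees the old value
store.rollback("T1")                           # T1 rolls back
after = store.read("x", allow_dirty=True)
print(dirty, clean, after)
```

After the rollback, the value 99 that T2 read never existed in any committed state of the database.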

Ques: Write down the properties of Strict 2PL (Two Phase Locking) protocol.
Ans:
Ques: What do you know about Time-Stamp Ordering?
Ans:

Ques: What is concurrency control?


Ans: Concurrency control is provided in a database to:
- Enforce isolation among transactions.
- Preserve database consistency through consistency preserving execution of
transactions.
- Resolve read-write and write-read conflicts.
Ques: What do you know about Schedule and Conflicts?
Ans: Conflict in Concurrency Control
Schedule in Concurrency Control

A schedule is a sequence of operations from a set of transactions, possibly interleaved. It
preserves the order in which the operations appear within each individual transaction.
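The order-preservation requirement can be checked mechanically. In this sketch the transactions, operation labels, and function name are hypothetical; a candidate schedule is accepted only if, for each transaction, its operations appear in their original order.

```python
def preserves_program_order(schedule, transactions):
    """schedule: list of (txn, op) pairs in execution order.
    transactions: dict mapping txn -> list of its ops in program order.
    """
    for txn, ops in transactions.items():
        # Project the schedule onto one transaction and compare with its program
        projected = [op for t, op in schedule if t == txn]
        if projected != ops:
            return False
    return True

transactions = {
    "T1": ["R(x)", "W(x)"],
    "T2": ["R(x)", "W(x)"],
}

# Interleaved, but each transaction still reads before it writes
interleaved = [("T1", "R(x)"), ("T2", "R(x)"), ("T1", "W(x)"), ("T2", "W(x)")]
# T1's write is placed before its read, so this is not a valid schedule
reordered = [("T1", "W(x)"), ("T1", "R(x)"), ("T2", "R(x)"), ("T2", "W(x)")]

print(preserves_program_order(interleaved, transactions))
print(preserves_program_order(reordered, transactions))
```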

Example of Schedules

Ques: Write down the difference between in-place and out-place update.
Ans:
Ques: What is the concept of conceptual design and logical design?
Ans: Data Warehousing Conceptual Design
➢ It is the first step towards the design of a Data Warehouse
➢ It starts from the documentation related to the integrated database and consists of:
1. Facts definition
2. For each fact:
- attribute tree definition
- attribute tree editing
- dimensions definition
- measures definition
- hierarchies definition
- fact schemata creation
- glossary definition
Data Warehousing Logical Design
➢ Starting from the conceptual design it is necessary to determine the logical schema of data
➢ We use ROLAP (Relational On-Line Analytical Processing) model to represent
multidimensional data
➢ ROLAP uses the relational data model, which means that data is stored in relations
➢ Given the DFM representation of multidimensional data, two schemas are used:
- star schema
- snowflake schema
Ques: What is the difference between star schema and snowflake schema?
Ans: Difference between Star Schema and Snowflake Schema
➢ Definition and meaning
- Star schema: contains both dimension tables and fact tables.
- Snowflake schema: contains all three: dimension tables, fact tables, and sub-dimension tables.
➢ Type of model
- Star schema: a top-down model type.
- Snowflake schema: a bottom-up model type.
➢ Space occupied
- Star schema: makes use of more allotted space.
- Snowflake schema: makes use of less allotted space.
➢ Time taken for queries
- Star schema: execution of queries takes less time.
- Snowflake schema: execution of queries takes more time.
➢ Use of normalization
- Star schema: does not make use of normalization.
- Snowflake schema: makes use of both denormalization and normalization.
➢ Complexity of design
- Star schema: the design is very simple.
- Snowflake schema: the design is very complex.
➢ Query complexity
- Star schema: very low.
- Snowflake schema: comparatively much higher.
➢ Complexity of understanding
- Star schema: very easy to understand.
- Snowflake schema: comparatively more difficult to understand.
➢ Total number of foreign keys
- Star schema: fewer foreign keys.
- Snowflake schema: more foreign keys.
➢ Data redundancy
- Star schema: comparatively higher.
- Snowflake schema: comparatively lower.
Ques: Why and when will we use the Three Phase Commit Protocol (3PC)?
Ans: The 2PC protocol has a problem: it is blocking, which means:
- In the READY state, the participant must wait for the coordinator's decision
- If the coordinator fails, the site is blocked until recovery; independent recovery is not
possible
- The problem is that a site in the READY state may still move to either the COMMIT or the
ABORT state
To overcome this problem, the Three-Phase Commit Protocol (3PC) is used.
Ques: What do you know about anomalies? Explain the different types of anomalies with
examples.

Ques: Discuss about the query optimization issues.


Ans: Query Optimization Issues
Several issues have to be considered in query optimization
➢ Types of query optimizers
- wrt the search techniques (exhaustive search, heuristics)
- wrt the time when the query is optimized (static, dynamic)
➢ Statistics
➢ Decision sites
➢ Network topology
➢ Use of semijoins
Types of Query Optimizers wrt Search Techniques
➢ Exhaustive search
➢ Heuristics
Types of Query Optimizers wrt Optimization Timing
➢ Static
➢ Dynamic
➢ Hybrid
Statistics
➢ Relation/fragments
➢ Attribute
➢ Common assumptions
Decision sites
➢ Centralized
➢ Distributed
➢ Hybrid
Network topology
➢ Wide area networks (WAN) point-to-point
➢ Local area networks (LAN)
Use of Semijoins
➢ Reduce the size of the join operands by first computing semijoins
➢ Particularly relevant when the main cost is the communication cost
➢ Improves the processing of distributed join operations by reducing the size of data
exchange between sites
➢ However, the number of messages as well as local processing time is increased
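The trade-off can be seen in a small sketch. It assumes two hypothetical sites: site 1 holds EMP, site 2 holds ASG, and the join is computed at site 2. Instead of shipping all of EMP, site 2 first sends its join-column values to site 1, which ships back only the matching EMP tuples.

```python
def semijoin(rows, keys, attr):
    """Keep only the rows whose join attribute appears in `keys`."""
    return [row for row in rows if row[attr] in keys]

def join(left, right, attr):
    """Plain nested-loop equijoin on one attribute."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

# Hypothetical data: EMP at site 1, ASG at site 2
EMP = [{"eno": i, "name": f"emp{i}"} for i in range(1, 7)]
ASG = [{"eno": 2, "pno": "P1"}, {"eno": 5, "pno": "P2"}, {"eno": 2, "pno": "P3"}]

# Message 1 (site 2 -> site 1): only the distinct join-column values of ASG
asg_keys = {row["eno"] for row in ASG}
# Message 2 (site 1 -> site 2): the reduced EMP, i.e. EMP semijoin ASG
reduced_emp = semijoin(EMP, asg_keys, "eno")

result = join(reduced_emp, ASG, "eno")
naive_result = join(EMP, ASG, "eno")  # what shipping all of EMP would produce

print(len(EMP), "EMP tuples without the semijoin,", len(reduced_emp), "with it")
```

The join result is the same either way, but the semijoin plan ships two key values and two tuples instead of six tuples, at the price of an extra message and extra local processing.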

Ques: What do you know about the elimination of redundancy in Query Optimization?
Ans:

Ques: Mention the rules of Query Rewriting in query optimization.


Ans:
• Ordering of the operators of relational algebra is crucial for efficient query processing
• Rule of thumb: move expensive operators to the end of query processing
• Cost of RA operations:
Ques: What is locking based concurrency control algorithm?
Ans: Locking Based Concurrency Control Algorithm
Ques: What are the differences between 3PC and 2PC?
Ans: The 2PC protocol is a blocking Two-Phase commit protocol. The 3PC protocol is a non-
blocking Three-Phase commit protocol.
Ques: What are the different isolation levels?
Ans:
Serializability

➢ The serializability of schedules is used to find non-serial schedules that allow the
transaction to execute concurrently without interfering with one another.
➢ It identifies which schedules are correct when executions of the transaction have
interleaving of their operations.
➢ A non-serial schedule will be serializable if its result is equal to the result of its transactions
executed serially.
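One standard way to test a schedule is conflict serializability, a sufficient (stricter) condition for the result equivalence described above. The sketch assumes the usual definition of a conflict (two operations of different transactions on the same item, at least one of them a write) and checks that the precedence graph has no cycle. The schedules and function names are hypothetical.

```python
def conflicts(op1, op2):
    """Each op is (txn, action, item) with action "R" or "W"."""
    t1, a1, x1 = op1
    t2, a2, x2 = op2
    return t1 != t2 and x1 == x2 and "W" in (a1, a2)

def precedence_graph(schedule):
    """Edge (Ti, Tj) means an operation of Ti conflicts with a LATER operation of Tj."""
    edges = set()
    for i, earlier in enumerate(schedule):
        for later in schedule[i + 1:]:
            if conflicts(earlier, later):
                edges.add((earlier[0], later[0]))
    return edges

def is_conflict_serializable(schedule):
    """True if the precedence graph is acyclic (checked with Kahn's algorithm)."""
    edges = precedence_graph(schedule)
    nodes = {op[0] for op in schedule}
    indegree = {n: 0 for n in nodes}
    for _, target in edges:
        indegree[target] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for source, target in edges:
            if source == node:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
    return removed == len(nodes)  # leftover nodes mean there is a cycle

# T1 finishes with x before T2 touches it: equivalent to the serial order T1, T2
s_ok = [("T1", "R", "x"), ("T1", "W", "x"), ("T2", "R", "x"), ("T2", "W", "x")]
# Both read x before either writes it: edges T1->T2 and T2->T1 form a cycle
s_bad = [("T1", "R", "x"), ("T2", "R", "x"), ("T1", "W", "x"), ("T2", "W", "x")]

print(is_conflict_serializable(s_ok), is_conflict_serializable(s_bad))
```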
Example:

Two-Phase Locking Protocol


Example of Two-Phased Locking Protocol

Properties of Two-Phased Locking Protocol


Deadlock Management
A set of transactions is deadlocked if each transaction in the set is waiting for a resource held
by another transaction in the set.
Deadlock prevention:
1. When a transaction is initiated, check whether all the resources it requires are available.
2. The expected resources needed must be pre-declared.
Adv:
1. No transaction roll-back or restart
2. Requires no run-time support
Dis-adv:
1. Reduced concurrency due to pre-allocation
2. Increased overhead
3. Difficult to determine the needed resources in advance.
Deadlock Avoidance
Deadlock Detection
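A common tool for deadlock detection is the wait-for graph (WFG): there is an edge from Ti to Tj when Ti waits for a lock held by Tj, and a cycle in the graph means a deadlock. A minimal sketch with hypothetical transaction names, using a depth-first search to find a cycle:

```python
def find_deadlock(wait_for):
    """wait_for: dict mapping a transaction to the set of transactions it waits for.

    Returns the transactions on one cycle as a list, or None if there is no deadlock.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(txn):
        color[txn] = GREY
        path.append(txn)
        for other in sorted(wait_for.get(txn, ())):
            state = color.get(other, WHITE)
            if state == GREY:
                # Reached a transaction already on the current path: a cycle
                return path[path.index(other):]
            if state == WHITE:
                cycle = visit(other)
                if cycle:
                    return cycle
        path.pop()
        color[txn] = BLACK
        return None

    for txn in sorted(wait_for):
        if color.get(txn, WHITE) == WHITE:
            cycle = visit(txn)
            if cycle:
                return cycle
    return None

deadlocked = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}
no_deadlock = {"T1": {"T2"}, "T2": {"T3"}, "T3": set()}

print(find_deadlock(deadlocked), find_deadlock(no_deadlock))
```

Once a cycle is found, a system would typically choose one transaction on it as a victim and abort it; that selection policy is outside this sketch.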

Commit Protocols
Centralized Two Phase Commit Protocol

2PC Protocol and Site Failures


Problems with 2PC Protocol

Three Phase Commit Protocol


Segment 8

Data Warehouse
➢ A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection
of data in support of management’s decision-making process.
➢ The four keywords: subject-oriented, integrated, time-variant, and non-volatile distinguish
data warehouses from other data repository systems such as relational database systems,
transaction processing systems and file systems.
- Subject-oriented: Focuses on modeling and analysis of data for decision makers
[exclude data that are not useful]
- Integrated: Constructed by integrating multiple heterogeneous sources such as
relational databases, flat files, and online transaction records.
- Time-variant: Data are stored to provide information from an historic perspective
[e.g., the past 5-10 years]
- Nonvolatile: Doesn’t require transaction processing, recovery, and concurrency
control
Data Warehousing Schemas
1. Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains a large central table (fact table) and a set of smaller attendant tables
(dimension tables).
2. Snowflake schema: The snowflake schema is a variant of the star schema model, where
some dimension tables are normalized, thereby further splitting the data into additional
tables. The resulting schema graph forms a shape similar to a snowflake.
3. Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.
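A star schema can be sketched with Python's built-in sqlite3 module. The table names, columns, and rows below are hypothetical: one central fact table references two dimension tables, and a typical analytical query joins the fact table with its dimensions and aggregates a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: small, descriptive
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- Fact table: large, holds the measures plus a foreign key per dimension
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        units INTEGER,
        revenue INTEGER
    );

    INSERT INTO dim_product VALUES
        (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Mouse', 'Electronics');
    INSERT INTO dim_date VALUES (1, 2023, 1), (2, 2024, 1);
    INSERT INTO fact_sales VALUES
        (1, 1, 10, 100), (2, 1, 5, 250), (3, 1, 1, 900),
        (1, 2, 20, 200), (3, 2, 2, 1800);
""")

# Revenue per product category and year: fact table joined with both dimensions
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake variant, category would move out of dim_product into its own table, which adds one more join to the same query.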
Dimension of Data Warehouse

Where is Data Warehouse Useful?
