
With the availability of more and more resources, such as processors and disks, parallelism is
also employed to speed up query execution.

The following techniques can be used to make a query parallel:

 I/O parallelism

 Intra-query parallelism

 Inter-query parallelism

 Intra-operation parallelism

 Inter-operation parallelism

I/O parallelism

This type of parallelism partitions the relations across the disks in order to speed up their
retrieval.

The input data is divided internally, and each partition is processed simultaneously. After all
of the partitions have been processed, the results are combined. It is also known as data
partitioning.

Hash partitioning is best suited for point queries that are based on the partitioning attribute,
and it has the benefit of offering an even distribution of data across the disks.

It should be mentioned that partitioning is beneficial for sequential scans of a full table
stored on “n” disks: compared to a single-disk system, scanning the table takes around 1/n of
the time. In I/O parallelism, there are four different methods of partitioning:

Hash partitioning

A hash function is a quick mathematical operation. The partitioning attributes of each row in
the original relation are hashed, and the hash value determines the disk on which the row is
stored.

Let’s say that the data is to be partitioned across 4 drives, numbered disk0, disk1, disk2, and
disk3. A row is stored on disk3 if the function returns 3.
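As a minimal sketch (the attribute name and row layout are illustrative, not from any particular
DBMS), hash partitioning can look like this:

```python
# Sketch of hash partitioning: hash the partitioning attribute of each row,
# then take the result modulo the number of disks to pick the disk.
N_DISKS = 4

def hash_partition(row, key="id", n_disks=N_DISKS):
    """Return the disk number (0 .. n_disks-1) the row is stored on."""
    return hash(row[key]) % n_disks

# Distribute a couple of rows across the disks.
rows = [{"id": 1, "name": "aman"}, {"id": 2, "name": "naman"}]
disks = {d: [] for d in range(N_DISKS)}
for row in rows:
    disks[hash_partition(row)].append(row)
```

A point query on the partitioning attribute then only needs to touch the single disk the hash
function names, which is why hash partitioning suits point queries.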

Range partitioning

Each disk receives a contiguous range of attribute values under range partitioning. For
instance, if we are range partitioning across three disks numbered 0, 1, and 2, we may assign
tuples with a value of less than 5 to disk0, values from 5 to 40 to disk1, and values above 40
to disk2.

It has several benefits, such as placing tuples whose attribute values fall within a specified
range together on the same disk.
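Using the boundaries from the example above (values below 5 on disk0, 5 to 40 on disk1, above 40
on disk2), a range-partitioning function is a simple sketch:

```python
# Sketch of range partitioning with the boundaries from the text:
# values < 5 -> disk0, 5..40 -> disk1, values > 40 -> disk2.
def range_partition(value):
    """Map a partitioning-attribute value to a disk number."""
    if value < 5:
        return 0
    elif value <= 40:
        return 1
    return 2
```

A range query such as "all values between 10 and 30" then only needs to scan disk1, which is the
benefit range partitioning offers over hash partitioning.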

Round-robin partitioning

The tuples of the relation are scanned in any order in this method, and the ith tuple is sent to
disk number (i % n).

Therefore, the disks receive new rows of data in turn. For applications that wish to read the
full relation sequentially for each query, this strategy ensures an even distribution of tuples
across the drives.
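The i % n rule can be sketched directly (the list-of-lists stands in for the n physical disks):

```python
# Sketch of round-robin partitioning: the ith tuple goes to disk i % n,
# so the disks receive rows in turn and the spread is perfectly even.
def round_robin_partition(tuples, n_disks):
    disks = [[] for _ in range(n_disks)]
    for i, t in enumerate(tuples):
        disks[i % n_disks].append(t)
    return disks
```

Because placement ignores attribute values entirely, round-robin gives the most even spread of
the three methods but cannot direct point or range queries to a single disk.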

Schema partitioning

In schema partitioning, different tables within a database are placed on different disks.

__________________________________________________________________________________

Intra-query parallelism

Using a shared-nothing parallel architecture, intra-query parallelism refers to the processing
of a single query in parallel on many CPUs. It employs two different strategies:

First method — each CPU executes a duplicate of the same task on a small portion of the data.

Second method — the task is broken up into various subtasks, with each CPU carrying out a
separate subtask.

Inter-query parallelism

Each CPU executes numerous transactions when inter-query parallelism is used. This is known as
parallel transaction processing. To support inter-query parallelism, the DBMS leverages
transaction dispatching.

Without this form of parallelism, each query runs sequentially, which increases the overall
running time; to coordinate concurrent transactions, a variety of supporting techniques, such
as effective lock management, are also employed.
In such circumstances, the DBMS must be aware of the locks acquired by the various transactions
running on various processes. Inter-query parallelism on a shared-storage architecture works
well when simultaneous transactions do not update the same data.

Additionally, the throughput of transactions is boosted, and it is the simplest form of
parallelism in a DBMS.

Intra-operation parallelism

In this type of parallelism, we execute each individual operation of a task, such as sorting,
joins, and projections, in parallel. Intra-operation parallelism has a very high degree of
parallelism.

Database systems employ this kind of parallelism naturally. Consider the following SQL example:

SELECT * FROM vehicles ORDER BY model_number;

Since a relation might contain a large number of records, the relational operation in the above
query is sorting.

Because this operation can be performed on distinct subsets of the relation on several
processors, it takes less time to sort the data.
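The sort-in-subsets idea can be sketched as follows. This is an illustration only: worker
threads stand in for the separate processors of a parallel DBMS, and the relation is a plain
list.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(records, n_workers=4):
    """Sort subsets of the relation in parallel, then merge the sorted runs."""
    # Each worker (standing in for a processor) sorts one subset.
    chunks = [records[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        sorted_runs = list(pool.map(sorted, chunks))
    # Merging k sorted runs yields the fully sorted relation.
    return list(heapq.merge(*sorted_runs))
```

The final merge is cheap compared to the sorts themselves, which is why splitting the sort
across processors reduces the overall time.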

Inter-operation parallelism

This term refers to the concurrent execution of different operations within a query expression.
It comes in two varieties:

Pipelined parallelism — in pipelined parallelism, a second operation consumes the rows of the
first operation’s output as they are produced, before the first operation has finished
producing its whole output.

Additionally, it is feasible to run these two operations concurrently on different CPUs, so
that one operation consumes tuples while another produces them, thereby reducing the overall
response time.

It is advantageous for systems with a limited number of CPUs and avoids writing intermediate
results to disk.
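Python generators give a compact single-process sketch of the pipelining idea: the consuming
operation (a selection here) receives each tuple as soon as the producing operation (a scan)
emits it, with no intermediate result ever materialized.

```python
def scan(relation):
    """Producer: emit one tuple at a time."""
    for row in relation:
        yield row

def select_gt(rows, threshold):
    """Consumer: filter tuples as they arrive, before the scan has finished."""
    for row in rows:
        if row > threshold:
            yield row

# The selection starts consuming as soon as the scan yields its first tuple.
result = list(select_gt(scan([10, 3, 25, 7, 42]), 9))  # -> [10, 25, 42]
```

In a real parallel DBMS the producer and consumer would run on different CPUs, but the data-flow
shape is the same.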

Independent parallelism — in this form of parallelism, operations within a query expression
that are independent of one another may be executed concurrently. This approach is most helpful
when the available degree of parallelism is low.

Conclusion: A huge task is broken down into numerous smaller tasks using parallel processing,
which then runs each of the smaller tasks on various nodes and processors at the same time.

The larger task thus gets finished faster. In sequential processing, separate jobs compete for
the same resource: only task 1 can start immediately, task 1 must be finished before task 2 may
begin, and task 3 must follow suit.

During parallel processing, a greater portion of the CPU is allocated to the jobs. There is no
waiting involved because each autonomous task starts working right away on its own processor.

Concurrency Management, Task Synchronisation, Resource Sharing, Data Placement, and Network
Scaling are qualities that a parallel database system should retain.

--------------------------------------------------------------------------------------------------------------------------------------

A distributed database is defined as a logically related collection of shared data that is
physically distributed over a computer network across different sites.

Distributed DBMS :
A distributed DBMS is defined as the software that allows for the management of the distributed
database and makes the distributed data available to users.
A distributed DBMS consists of a single logical database that is divided into a number of
pieces called fragments. In a DDBMS, each site is capable of independently processing users'
requests.

Users can access the DDBMS via applications classified:

1. Local Applications –
Those applications that don’t require data from other sites are classified under the
category of Local applications.

2. Global Applications –
Those applications that require data from other sites are classified under the category of
Global applications.

Characteristics of a Distributed DBMS :

A DDBMS has the following characteristics-

1. A collection of logically related shared data.

2. The data is split into a number of fragments.

3. Fragments may be replicated.

4. Fragments are allocated to sites.

5. The data at each site is under the control of, and managed by, a DBMS.

Distributed Processing :
Distributed processing is a centralized database that can be accessed over a computer network
by different sites. Because the data is centralized, even though other users may be accessing
it from other sites, we do not consider this a DDBMS, but simply distributed processing.

Parallel DBMS :
A parallel DBMS is a DBMS that runs across multiple processors and is designed to execute
operations in parallel whenever possible. A parallel DBMS links a number of smaller machines to
achieve the same throughput as expected from a single large machine.

There are three main architectures for Parallel DBMS-

1. Shared Memory –
Shared memory is a tightly coupled architecture in which a number of processors within a
single system share system memory. It is also known as symmetric multiprocessing (SMP).
This approach is popular on platforms like personal workstations that support a few
microprocessors in parallel.

2. Shared Disk –
Shared disk is a loosely coupled architecture used for applications that are centralized
and require high availability and performance. Each processor is able to access all disks
directly, but each has its own private memory. It is also called a cluster.

3. Shared Nothing –
Shared nothing is a multiple-processor architecture in which every processor is part of a
complete system, with its own memory and disk storage (its own resources). It is also
called Massively Parallel Processing (MPP).

Features of Distributed DBMS

There is a presence of a certain number of features that make DDBMS very popular in organizing
data.

 Data Fragmentation: The overall database is divided into smaller subsets called
fragments. This fragmentation can be of three types: horizontal (divided by rows
depending upon conditions), vertical (divided by columns depending upon conditions), and
hybrid (horizontal + vertical).

 Data Replication: DDBMS maintains and stores multiple copies of the same data in its
different fragments to ensure data availability, fault tolerance, and seamless performance.

 Data Allocation: It determines if all data fragments are required to be stored in all sites or
not. This feature is used to reduce network traffic and optimize the performance.

 Data Transparency: DDBMS hides all the complexities from its users and provides
transparent access to data and applications to users.

Types of Distributed DBMS

There are six types of DDBMS, which are discussed below:

 Homogeneous: In this type of DDBMS, all the participating sites should have the exact same
DBMS software and architecture which makes all underlying systems consistent across all
sites. It provides simplified data sharing and integration.
 Heterogeneous: In this type of DDBMS, the participating sites can use different DBMS
software, data models, or architectures. This model faces integration challenges, as each
site's data representation and query language can differ from the others.

 Federated: Here, the local databases are maintained by individual sites or federations. These
local databases are connected via a middleware system that allows users to access and query
data from multiple distributed databases. The federation combines different local databases
but maintains autonomy at the local level.

 Replicated: In this type, the DDBMS maintains multiple copies of the same data fragment
across different sites. It is used to ensure data availability, fault tolerance, and seamless
performance. Users can access the data from the nearest replica if another site is down for
some reason. However, replication requires rigorous synchronization of data changes across
the copies.

 Partitioned: In a Partitioned DDBMS, the overall database is divided into distinct partitions,
and each partition is assigned to a specific site. Partitioning can be done depending on
specific conditions like date range, geographic location, and functional modules. Each site
controls its own partition and the data from other partitions should be accessed through
communication and coordination between sites.

 Hybrid: It is a combination of two or more of the five types of DDBMS discussed above.
The combination is done to address specific requirements and challenges of complex
distributed environments. A hybrid DDBMS provides more optimized performance and high
scalability.
In this example, we are going to see how horizontal fragmentation looks in a table.

Input :
STUDENT
id   name    age   salary
1    aman    21    20000
2    naman   22    25000
3    raman   23    35000
4    sonam   24    36000
Example
SELECT * FROM student WHERE salary < 35000;
SELECT * FROM student WHERE salary >= 35000;
Output
id   name    age   salary
1    aman    21    20000
2    naman   22    25000

id   name    age   salary
3    raman   23    35000
4    sonam   24    36000

This example shows how the SELECT statement is used to do vertical fragmentation. The key
column id is kept in each fragment so that the original table can be reconstructed by joining
the fragments.

Input Table :
STUDENT
id   name    age   salary
1    aman    21    20000
2    naman   22    25000
3    raman   23    35000
4    sonam   24    36000
Example
SELECT id, name FROM student;          #fragmentation 1
SELECT id, age, salary FROM student;   #fragmentation 2
Mixed or Hybrid Fragmentation

It is done by performing horizontal and vertical partitioning together. A hybrid fragment is a
group of rows and columns from a relation.
Example

This example shows how the SELECT statement is used with a column list and a WHERE clause to
produce a hybrid fragment.

SELECT id, name FROM student WHERE age = 22;

Data Replication

Data replication means a replica is made, i.e., data is copied at multiple locations to improve
the availability of data. It is used to remove inconsistency between copies of the same data in
a distributed database, so that users can do their work without interrupting the work of other
users.

Types of data replication :

Transactional Replication

It makes a full copy of the database along with the changed data. Transactional consistency is
guaranteed because the order of changes is preserved when copied from the publisher to the
subscriber database. It is used in server-to-server environments, consistently and accurately
replicating changes in the database.

Snapshot Replication

It is the simplest type: it distributes data exactly as it appears at a particular moment,
regardless of any later updates. It copies a 'snapshot' of the data. It is useful when the
database changes infrequently. It is slower than transactional replication because the data is
sent in bulk from one end to the other. It is generally used in cases where subscribers do not
need the updated data and operate in read-only mode.

Merge Replication

It combines data from several databases into a single database. It is the most complex type of
replication because both the publisher and the subscriber can make database changes. It is used
in a server-to-client environment, with changes sent from one publisher to multiple
subscribers.

Data Allocation

It is the process of deciding where exactly to store the data in the database. It also involves
deciding which data has to be stored at which particular location. The three main types of data
allocation are centralized, partitioned, and replicated.

Centralized: The entire database is stored at a single site. No data distribution occurs.

Partitioned: The database is divided into different fragments which are stored at several
sites.

Replicated: Copies of the database are stored at different locations to access the data.

Costs (Transfer of Data) of Distributed Query Processing


In distributed query processing, the data transfer cost means the cost of transferring
intermediate files to other sites for processing, plus the cost of transferring the final
result file to the site where the result is required. Let's say that a user sends a query to
site S1, which requires data from its own site and also from another site, S2. There are three
strategies to process this query:
1. We can transfer the data from S2 to S1 and then process the query.
2. We can transfer the data from S1 to S2 and then process the query.
3. We can transfer the data from S1 and S2 to a third site S3 and then process the query.
The choice depends on various factors like the sizes of the relations and of the result, the
communication cost between the sites, and the site at which the result will be used.
Commonly, the data transfer cost is calculated in terms of the size of the messages, using the
formula:
Data transfer cost = C * Size
where C refers to the cost per byte of transferring data and Size is the number of bytes
transmitted.

That is: transfer the relation at S1 to S2, join there (the size of the intermediate result
becomes the number of tuples produced by the join), and transfer the result to S3; or do the
reverse; or transfer both relations from S1 and S2 to S3 and join there.

Choose the strategy with the optimum (lowest) cost.
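The comparison can be sketched with the formula above. The byte sizes and the cost-per-byte C
below are made-up illustrative values, and the result is assumed to be needed at S3.

```python
# Compare the three data-transfer strategies using cost = C * Size.
C = 0.01                                # assumed cost per byte transferred
size_r1, size_r2 = 1_000_000, 200_000   # sizes of the relations at S1 and S2
size_join = 50_000                      # estimated size of the join result

strategies = {
    "join_at_s1": C * (size_r2 + size_join),   # ship R2 to S1, result to S3
    "join_at_s2": C * (size_r1 + size_join),   # ship R1 to S2, result to S3
    "join_at_s3": C * (size_r1 + size_r2),     # ship both relations to S3
}
best = min(strategies, key=strategies.get)
```

With these numbers, joining at S1 wins because only the small relation and the small join
result cross the network; a different size estimate for the join result could flip the choice.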

Distributed Query Optimization


Distributed query optimization requires the evaluation of a large number of query trees, each
of which produces the required result of the query. This is primarily due to the presence of
large amounts of replicated and fragmented data. Hence, the target is to find an optimal
solution instead of the best solution.

The main issues for distributed query optimization are -

 Optimal utilization of resources in the distributed system.

 Query trading.

 Reduction of the solution space of the query.
Optimal Utilization of Resources in the Distributed System

A distributed system has a number of database servers at the various sites to perform the
operations pertaining to a query. Following are the approaches for optimal resource
utilization -

Operation Shipping - In operation shipping, the operation is run at the site where the data is
stored, not at the client site. The results are then transferred to the client site. This is
appropriate for operations whose operands are available at the same site. Example: select and
project operations.
Data Shipping - In data shipping, the data fragments are transferred to the database server,
where the operations are executed. This is used for operations whose operands are distributed
across different sites. It is also appropriate in systems where the communication costs are
low and the local processors are much slower than the client server.
Hybrid Shipping - This is a combination of data and operation shipping. Here, data fragments
are transferred to high-speed processors, where the operation runs. The results are then sent
to the client site.

Distributed DBMS environments face unique challenges in concurrency control and recovery.

Multiple Copies of Data Items

Dealing with multiple copies of data items is a significant challenge in distributed DBMS
environments. Consistency among these copies is crucial for proper concurrency control, and
recovery methods are responsible for making a copy consistent with others if the site storing the
copy fails.

Failure of Individual Sites

In the event of site failure, distributed DBMS should continue to operate with its running sites if
possible. When a site recovers, its local database must be brought up-to-date with the rest of the
sites before rejoining the system.

Failure of Communication Links


The system must be able to deal with the failure of one or more of the communication links that
connect the sites. Network partitioning may occur, breaking up the sites into two or more partitions,
where the sites within each partition can communicate only with each other.

Distributed Commit

Problems can arise with committing a transaction that is accessing databases stored on multiple sites
if some sites fail during the commit process. The two-phase commit protocol is often used to deal
with this problem.

Distributed Deadlock

Deadlock may occur among several sites, so techniques for dealing with deadlocks must be
extended to take this into account.

Distributed Concurrency Control and Recovery Techniques

Distributed concurrency control and recovery techniques must deal with the challenges mentioned
above and others. In this section, we review some of the suggested techniques to handle recovery
and concurrency control in DDBMSs.

Distributed Concurrency Control Based on a Distinguished Copy of a Data Item

In distributed databases, replicated data items pose a challenge for concurrency control. To
address this issue, several concurrency control techniques have been proposed that extend the
methods used for centralized databases. These techniques designate a particular copy of each
data item as the distinguished copy, with the locks for the data item associated with that
copy. All locking and unlocking requests are sent to the site that contains the distinguished
copy.

Different methods have been proposed for choosing the distinguished copies,
including the primary site technique, primary site with backup site, and
primary copy technique. In the primary site technique, all distinguished
copies are kept at a single primary site, which acts as the coordinator site for
concurrency control. However, this approach has certain disadvantages, such
as overloading the primary site with locking requests and causing system
bottlenecks. Failure of the primary site also paralyzes the system, limiting
reliability and availability.
The primary site with backup site approach addresses the issue of primary
site failure by designating a second site as a backup site. All locking
information is maintained at both the primary and backup sites, and the
backup site takes over as the primary site in case of failure. The primary
copy technique distributes the load of lock coordination among various sites
by storing distinguished copies of different data items at different sites.
Failure of one site affects only transactions accessing locks on items whose
primary copies reside at that site.

Choosing a new coordinator site in case of failure involves an election algorithm. If no backup
site exists, or if both the primary and backup sites are down, the election process is
initiated by a site that repeatedly fails to communicate with the coordinator site. The site
proposes itself as the new coordinator and, as soon as it receives a majority of yes votes,
declares itself the new coordinator.

Distributed Concurrency Control Based on Voting

Distributed concurrency control based on voting differs from the other replicated-item methods
in that it does not rely on a distinguished copy to maintain locks. Instead, a lock request is
sent to all sites containing a copy of the data item, and each copy maintains its own lock and
can grant or deny the request.

If a transaction requesting a lock is granted the lock by a majority of the copies, it holds
the lock and informs all copies that it has been granted the lock. If a transaction does not
receive a majority of votes granting it the lock within a certain time-out period, it cancels
its request and informs all sites of the cancellation.

The voting method is considered a truly distributed concurrency control method, as the
responsibility for a decision resides with all the sites involved. Simulation studies have
shown that voting produces higher message traffic among sites than the distinguished copy
methods do. If the algorithm takes into account possible site failures during the voting
process, it becomes extremely complex.
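The majority rule at the heart of the voting method can be sketched as follows. This toy
version ignores time-outs and site failures and simulates each copy's vote as a boolean.

```python
# Sketch of the voting method's decision rule: each copy of the data item
# votes independently, and the lock is held only if a strict majority grants it.
def request_lock(votes):
    """votes: one True/False answer per copy of the data item."""
    return sum(votes) > len(votes) // 2
```

Note that with an even number of copies a split vote denies the lock, since a strict majority
is required; real implementations add the time-out and cancellation messages described above.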

Distributed Recovery
Distributed recovery in a distributed database raises several issues. One of the major
challenges is detecting whether a site is down, which often requires exchanging messages with
other sites. For example, if site X sends a message to site Y and does not receive a response,
it is difficult to determine whether the message was not delivered due to a communication
failure, whether site Y is down and could not respond, or whether site Y sent a response that
was not delivered.

Another significant problem in distributed recovery is the distributed commit. When a
transaction modifies data at multiple sites, it cannot commit until it ensures that the effects
of the transaction at each site will not be lost. To achieve this, each site must record the
local effects of the transaction permanently in its local log on disk before committing. The
two-phase commit protocol is used to ensure the atomicity of the distributed commit.
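A minimal sketch of the two-phase commit protocol referred to above: the Site objects are
hypothetical stand-ins, and real implementations add the on-disk logging and time-out handling
the text describes.

```python
class Site:
    """A toy participant site; 'healthy' simulates whether it can prepare."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.committed = False

    def prepare(self):   # phase 1: vote yes (ready to commit) or no
        return self.healthy

    def commit(self):    # phase 2: make local effects permanent
        self.committed = True

    def abort(self):
        self.committed = False

def two_phase_commit(sites):
    # Phase 1: the coordinator collects votes from every participant.
    if all(site.prepare() for site in sites):
        # Phase 2: everyone voted yes, so tell all sites to commit.
        for site in sites:
            site.commit()
        return "committed"
    # Any "no" vote (e.g. a failed site) aborts the whole transaction.
    for site in sites:
        site.abort()
    return "aborted"
```

The all-or-nothing outcome is the point: either every site commits or every site aborts, so no
site's effects are lost while others' are kept.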

An in-memory database is a database type that uses volatile memory (most often RAM or
Intel Optane) as the primary data storage. Memory stores, by design, enable minimal response
time and low latency. By eliminating the time needed to query data from disk, in-memory
databases deliver real-time responses to queries.
However, memory volatility makes these databases sensitive to crashes and downtime. Once the
system shuts down, all data held in memory is lost. Various logging mechanisms and hybrid
modeling techniques address this issue to maintain database persistence.

How to Implement an In-Memory Database?

Implementing an in-memory database requires additional workarounds to provide database
durability. Some example mechanisms to enforce durability are:

Persisting data to disk at intervals through snapshots. If snapshots are taken at regular time
intervals as scheduled jobs, the mechanism provides partial durability.

Using non-volatile memory, such as flash memory or NVRAM.

Logging all transactions to journal files with automatic recovery.

Replicating the database to other in-memory databases at different sites, or combining it with
on-disk storage.
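The snapshot mechanism from the list above can be sketched in a few lines: an in-memory dict
plays the role of the database, and the file path is illustrative. A restart reloads the last
snapshot, losing only the writes made since it was taken (hence "partial durability").

```python
import json
import os

def snapshot(store, path):
    """Persist the in-memory store to disk atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(store, f)
    os.replace(tmp, path)   # atomic rename: never leaves a half-written file

def recover(path):
    """Reload the last snapshot after a crash or restart."""
    with open(path) as f:
        return json.load(f)
```

Writing to a temporary file and renaming it into place is the usual trick for making each
snapshot all-or-nothing; a production system would combine this with the journal logging also
listed above.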

Below are some examples of where in-memory stores find applications:

 IoT and edge computing. IoT sensors stream massive amounts of data. An in-
memory database can store and perform calculations using real-time data before
sending it to an on-disk database.
 Ecommerce applications. Shopping carts, search results, session management, and
quick page loads are all possible with an in-memory database. Fast results provide the
user with a better overall experience, regardless of traffic surges.
 Gaming industry. The gaming industry uses in-memory databases for updating
leaderboards in real time, which is especially important for building engagement.
 Real-time security and fraud detection. In-memory databases help perform
complex processing and analytics in real time, making them a perfect choice for fraud
detection.

Challenges and Disadvantages


Volatility and Data Durability
 Data Loss Risks: The primary challenge with non-persistent IMDBs is
the risk of data loss.
 Strategies for Mitigation: Use of persistent IMDBs, replication, and
regular backup mechanisms.
Cost Implications
 Higher Costs: RAM is more expensive than disk storage, leading to
higher costs for large-scale data storage.
 Cost-Benefit Analysis: Necessary to weigh the performance benefits
against the cost implications.
Memory Management
 Complexity in Large Datasets: Managing large datasets in memory
can be complex and requires efficient memory management
strategies.
 Garbage Collection Overhead: In languages like Java, garbage
collection can impact performance.
