ADTHEORY1
I/O parallelism
Intra-query parallelism
Inter-query parallelism
Intra-operation parallelism
Inter-operation parallelism
I/O parallelism
This type of parallelism partitions the relations across the disks in order to speed up their retrieval.
The input data is partitioned, and each partition is processed in parallel; after all partitions are processed, the results are combined. It is also known as data partitioning.
Hash partitioning is best suited for point queries based on the partitioning attribute, and it has the benefit of distributing the data evenly across the disks.
Partitioning is also beneficial for sequential scans of a full table stored on n disks: scanning the table takes about 1/n of the time required on a single-disk system. In I/O parallelism, there are four methods of partitioning:
Hash partitioning
A hash function is a quick mathematical operation. For each row in the original relation, the hash function is applied to its partitioning attribute.
Let's say the data is to be partitioned across 4 disks, numbered disk1, disk2, disk3, and disk4. If the function returns 3 for a row, that row is stored on disk3.
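The mapping can be sketched as follows. The sample rows, the attribute name, and the use of Python's built-in `hash()` are illustrative only, not a real DBMS interface.

```python
# Sketch of hash partitioning: the disk for a row is chosen by hashing the
# partitioning attribute modulo the number of disks.

def hash_partition(rows, key, n_disks):
    """Assign each row to a disk by hashing its partitioning attribute."""
    disks = [[] for _ in range(n_disks)]
    for row in rows:
        disk_no = hash(row[key]) % n_disks  # the quick mathematical operation
        disks[disk_no].append(row)
    return disks

rows = [{"id": i, "name": "emp%d" % i} for i in range(8)]
disks = hash_partition(rows, "id", 4)  # every row lands on exactly one disk
```

Because the hash function spreads values uniformly, each of the 4 disks ends up with roughly the same number of rows, which is what makes point queries on the partitioning attribute fast: the query goes straight to one disk.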
Range partitioning
In range partitioning, each disk receives a contiguous range of attribute values. For instance, if we are range partitioning across three disks numbered 0, 1, and 2, we may assign tuples with a value of less than 5 to disk0, values from 5 to 40 to disk1, and values above 40 to disk2.
Its main benefit is that tuples whose attribute values fall within a specified range end up on the same disk, which makes range queries efficient.
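A minimal sketch of this scheme, using the boundaries from the example above (below 5, 5 to 40, above 40); the row format is illustrative:

```python
# Sketch of range partitioning: each disk owns a contiguous range of
# attribute values, located here with a binary search over the boundaries.
import bisect

def range_partition(rows, key, boundaries):
    """Send each row to the disk whose value range contains row[key]."""
    disks = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        disk_no = bisect.bisect_right(boundaries, row[key])
        disks[disk_no].append(row)
    return disks

rows = [{"v": 3}, {"v": 5}, {"v": 17}, {"v": 40}, {"v": 99}]
# boundaries [5, 41]: v < 5 -> disk0, 5 <= v <= 40 -> disk1, v > 40 -> disk2
disks = range_partition(rows, "v", [5, 41])
```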
Round-robin partitioning
In this method, the relation can be read in any order; the ith tuple is sent to disk number (i % n).
New rows of data are therefore received by the disks in turn. For applications that read the full relation sequentially for each query, this strategy assures an even distribution of tuples across the disks.
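The i % n rule above can be sketched directly; the row values are illustrative:

```python
# Sketch of round-robin partitioning: the i-th tuple goes to disk i % n,
# so the disks receive new rows in turn and stay evenly loaded.

def round_robin_partition(rows, n_disks):
    disks = [[] for _ in range(n_disks)]
    for i, row in enumerate(rows):
        disks[i % n_disks].append(row)
    return disks

disks = round_robin_partition(list(range(7)), 3)
# disks -> [[0, 3, 6], [1, 4], [2, 5]]
```

Note that, unlike hash or range partitioning, round-robin gives no help in locating a particular tuple: a point query must scan all n disks.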
Schema Partitioning
Various tables inside a database are placed on different disks using a technique called schema partitioning.
__________________________________________________________________________________
Intra-query parallelism
In intra-query parallelism, a single query is executed in parallel on multiple CPUs. This can be done in two ways:
First method — each CPU executes the same task, but on a different small subset of the data.
Second method — the query is broken up into different subtasks, with each CPU carrying out a separate subtask.
Inter-query parallelism
In inter-query parallelism, each CPU executes multiple transactions. This is known as parallel transaction processing. To support inter-query parallelism, the DBMS uses transaction dispatching.
We can also employ a variety of techniques, such as efficient lock management. Without such parallelism, each query runs sequentially, which increases the total running time.
In such circumstances, the DBMS must be aware of the locks acquired by the various transactions running on different processes. Inter-query parallelism on a shared-storage architecture works well when concurrent transactions do not access the same data.
Additionally, it boosts transaction throughput, and it is the simplest form of parallelism in a DBMS.
Intra-operation parallelism
In this type of parallelism, we execute each individual operation of a task, such as sorting, joins, projections, and so forth, in parallel. Intra-operation parallelism offers a very high degree of parallelism, and database systems naturally employ it. Consider, for example, a query that sorts a relation (the table name is illustrative):
SELECT * FROM employees ORDER BY salary;
The relational operation in such a query is sorting, and a relation might contain a large number of records. Because the operation can be performed on distinct subsets of the relation on several processors, the data takes less time to sort.
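The idea can be sketched as a split-sort-merge: split the relation into subsets, sort each subset with a separate worker, then merge the sorted runs. A thread pool stands in for the multiple processors here, and the data is illustrative:

```python
# Sketch of intra-operation parallelism for sorting: each worker sorts one
# subset of the relation, and the sorted runs are merged into one output.
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(rows, key, n_workers=4):
    chunk = max(1, len(rows) // n_workers)
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        runs = list(pool.map(lambda p: sorted(p, key=key), parts))
    return list(heapq.merge(*runs, key=key))

salaries = [42, 7, 19, 88, 3, 55, 21, 64]
assert parallel_sort(salaries, key=lambda x: x) == sorted(salaries)
```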
Inter-operation parallelism
This term refers to the concurrent execution of different operations within a query expression. It comes in two varieties:
Pipelined parallelism — in pipelined parallelism, a second operation consumes each row of the first operation's output as it is produced, before the first operation has finished producing its whole output.
It is also feasible to run the two operations concurrently on different CPUs, so that one operation consumes tuples while another produces them.
This is advantageous for systems with a limited number of CPUs, and it avoids storing intermediate results on disk.
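Pipelining can be sketched with generators, where the selection consumes each tuple as soon as the scan produces it; the table contents are illustrative:

```python
# Sketch of pipelined parallelism: select() starts consuming tuples before
# scan() has produced its whole output, so no intermediate result is stored.

def scan(table):
    for row in table:        # producer: emits one tuple at a time
        yield row

def select(rows, predicate):
    for row in rows:         # consumer: filters tuples as they arrive
        if predicate(row):
            yield row

table = [{"salary": s} for s in (20000, 25000, 35000, 36000)]
pipeline = select(scan(table), lambda r: r["salary"] < 35000)
result = list(pipeline)      # tuples flow through the pipeline one by one
```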
Independent parallelism — in this form of parallelism, operations within a query expression that are independent of one another may be carried out concurrently. It is most useful when only a lower degree of parallelism is available.
Conclusion: parallel processing breaks a large task down into numerous smaller tasks and runs the smaller tasks on various nodes and processors at the same time. The larger task thus gets finished faster.
In sequential processing, separate jobs compete for the same resource: only task 1 can start immediately, task 1 must finish before task 2 may start, and task 3 must wait in turn.
In parallel processing, the jobs receive a larger share of CPU time. There is no waiting involved, because each independent task starts working right away on its own processor.
Concurrency management, task synchronisation, resource sharing, data placement, and network scaling are qualities that a parallel database system should provide.
--------------------------------------------------------------------------------------------------------------------------------------
A distributed database is defined as a logically related collection of shared data that is physically distributed over a computer network across different sites.
Distributed DBMS :
A distributed DBMS (DDBMS) is the software that manages the distributed database and makes the distributed data available to users.
A distributed DBMS consists of a single logical database that is divided into a number of pieces called fragments. In a DDBMS, each site is capable of independently processing users' requests.
1. Local Applications –
Applications that do not require data from other sites are classified under the
category of local applications.
2. Global Applications –
Applications that require data from other sites are classified under the category of
global applications.
The data at each site is under the control of a DBMS and managed by it.
Distributed Processing :
Distributed processing is a centralized database that can be accessed over a computer network from different sites. Because the data remains centralized even though users access it from other sites, we do not consider this a DDBMS, but simply distributed processing.
Parallel DBMS :
A parallel DBMS is a DBMS that runs across multiple processors and is designed to execute operations in parallel whenever possible. A parallel DBMS links a number of smaller machines to achieve the same throughput as expected from a single large machine.
1. Shared Memory –
Shared memory is a tightly coupled architecture in which a number of processors within a
single system share system memory. It is also known as symmetric multiprocessing
(SMP). This approach is popular on platforms such as personal workstations that support a
few microprocessors in parallel.
2. Shared Disk –
Shared disk is a loosely coupled architecture used for applications that are centralized and
require high availability and performance. Each processor can access all disks directly,
but each has its own private memory. This architecture is also called clusters.
3. Shared Nothing –
Shared nothing is a multiple-processor architecture in which every processor is part of a
complete system with its own memory and disk storage (its own resources). It is
also called massively parallel processing (MPP).
A number of features make the DDBMS very popular for organizing data.
Data Fragmentation: The overall database is divided into smaller subsets called
fragments. Fragmentation can be of three types: horizontal (divided by rows
depending on conditions), vertical (divided by columns depending on conditions), and
hybrid (horizontal + vertical).
Data Replication: DDBMS maintains and stores multiple copies of the same data in its
different fragments to ensure data availability, fault tolerance, and seamless performance.
Data Allocation: It determines if all data fragments are required to be stored in all sites or
not. This feature is used to reduce network traffic and optimize the performance.
Data Transparency: DDBMS hides all the complexities from its users and provides
transparent access to data and applications to users.
There are six types of DDBMS, which are discussed below:
Homogeneous: In this type of DDBMS, all the participating sites should have the exact same
DBMS software and architecture which makes all underlying systems consistent across all
sites. It provides simplified data sharing and integration.
Heterogeneous: In this type of DDBMS, the participating sites can use different DBMS
software, data models, or architectures. This model faces integration challenges, since
each site's data representation and query language can differ from the others.
Federated: Here, the local databases are maintained by individual sites or federations. These
local databases are connected via a middleware system that allows users to access and query
data from multiple distributed databases. The federation combines different local databases
but maintains autonomy at the local level.
Replicated: In this type, the DDBMS maintains multiple copies of the same data fragment
across different sites. It is used to ensure data availability, fault tolerance, and seamless
performance. Users can access the data from the nearest replica if the original site is
down for some reason. However, replication requires high-end synchronization of data
changes across the copies.
Partitioned: In a Partitioned DDBMS, the overall database is divided into distinct partitions,
and each partition is assigned to a specific site. Partitioning can be done depending on
specific conditions like date range, geographic location, and functional modules. Each site
controls its own partition and the data from other partitions should be accessed through
communication and coordination between sites.
Hybrid: It is a combination of the other five types of DDBMS discussed
above. The combination is done to address specific requirements and challenges of complex
distributed environments. A hybrid DDBMS provides more optimized performance and high
scalability.
Horizontal Fragmentation
This example shows how horizontal fragmentation looks in a table: the SELECT statement is used with a WHERE clause to produce each fragment.
Input :
STUDENT
id  name   age  salary
1   aman   21   20000
2   naman  22   25000
3   raman  23   35000
4   sonam  24   36000
Example
SELECT * FROM STUDENT WHERE salary < 35000; #fragment 1
SELECT * FROM STUDENT WHERE salary >= 35000; #fragment 2
Output
id  name   age  salary
1   aman   21   20000
2   naman  22   25000
id  name   age  salary
3   raman  23   35000
4   sonam  24   36000
Vertical Fragmentation
Input Table :
STUDENT
id  name   age  salary
1   aman   21   20000
2   naman  22   25000
3   raman  23   35000
4   sonam  24   36000
Example
SELECT id, name FROM STUDENT; #fragment 1
SELECT id, age, salary FROM STUDENT; #fragment 2
Each vertical fragment keeps the key column id, so the original table can be reconstructed by joining the fragments.
Mixed or Hybrid Fragmentation
Hybrid fragmentation combines the two: a fragment is produced by applying a WHERE clause to a subset of the columns, for example:
SELECT id, name FROM STUDENT WHERE salary < 35000; #hybrid fragment
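The fragmentation schemes above can be sketched over the STUDENT table; the dictionaries stand in for table rows:

```python
# Sketch of horizontal and vertical fragmentation of the STUDENT table.
# Together, the fragments of each scheme reconstruct the original table.

STUDENT = [
    {"id": 1, "name": "aman",  "age": 21, "salary": 20000},
    {"id": 2, "name": "naman", "age": 22, "salary": 25000},
    {"id": 3, "name": "raman", "age": 23, "salary": 35000},
    {"id": 4, "name": "sonam", "age": 24, "salary": 36000},
]

# horizontal: divide by rows, depending on a condition
frag_low = [r for r in STUDENT if r["salary"] < 35000]
frag_high = [r for r in STUDENT if r["salary"] >= 35000]

# vertical: divide by columns, keeping the key id in every fragment
frag_names = [{"id": r["id"], "name": r["name"]} for r in STUDENT]
frag_rest = [{"id": r["id"], "age": r["age"], "salary": r["salary"]}
             for r in STUDENT]
```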
Data Replication
Data replication means that a replica is made, i.e., data is copied at multiple locations to improve its availability. It is also used to remove inconsistencies between copies of the same data in a distributed database, so that users can do their work without interrupting the work of other users.
Transactional Replication
It makes a full copy of the database along with the changed data. Transactional consistency is
guaranteed because the order of data is the same when copied from publisher to subscriber
database. It is used in server−to−server environments by consistently and accurately replicating
changes in the database.
Snapshot Replication
It is the simplest type: it distributes data exactly as it appears at a particular moment, regardless of
any later updates. It copies a 'snapshot' of the data, which is useful when the database changes
infrequently. It is slower than transactional replication because the data is sent in bulk from one end to
the other. It is generally used in cases where subscribers do not need the updated data and are
operating in read-only mode.
Merge Replication
It combines data from several databases into a single database. It is the most complex type of
replication because both the publisher and subscriber can do database changes. It is used in a
server−to−client environment and has changes sent from one publisher to multiple subscribers.
Data Allocation
It is the process to decide where exactly you want to store the data in the database. Also involves the
decision as to which data type of data has to be stored at what particular location. Three main types
of data allocation are centralized, partitioned, and replicated.
Partitioned: The database gets divided into different fragments which are stored at several sites.
Replicated: Copies of the database are stored at different locations to access the data.
For example, in distributed query processing one strategy is to transfer relation S1 to the site of S2 and join them there (the communication cost then depends on the number of tuples in the result), before transferring the result to S3; the reverse direction of transfer is also possible.
Distributed DBMS environments face unique challenges in concurrency control and recovery.
Dealing with multiple copies of data items is a significant challenge in distributed DBMS
environments. Consistency among these copies is crucial for proper concurrency control, and
recovery methods are responsible for making a copy consistent with others if the site storing the
copy fails.
In the event of site failure, distributed DBMS should continue to operate with its running sites if
possible. When a site recovers, its local database must be brought up-to-date with the rest of the
sites before rejoining the system.
Distributed Commit
Problems can arise with committing a transaction that is accessing databases stored on multiple sites
if some sites fail during the commit process. The two-phase commit protocol is often used to deal
with this problem.
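A minimal sketch of the two-phase commit idea: the coordinator asks every site to prepare, and commits only if all sites vote yes; any "no" vote aborts the transaction everywhere. The `Site` class is a stand-in, not a real DBMS interface:

```python
# Sketch of the two-phase commit protocol: phase 1 collects votes,
# phase 2 applies the unanimous decision at every site.

class Site:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):          # phase 1: vote yes/no
        return self.can_commit

    def commit(self):           # phase 2: apply the commit decision
        self.state = "committed"

    def abort(self):            # phase 2: apply the abort decision
        self.state = "aborted"

def two_phase_commit(sites):
    if all(site.prepare() for site in sites):
        for site in sites:
            site.commit()
        return "committed"
    for site in sites:
        site.abort()
    return "aborted"
```

A real implementation also writes prepare/commit records to a log at each site, so that a site that fails mid-protocol can recover its decision when it rejoins.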
Distributed Deadlock
Deadlock may occur among several sites, so techniques for dealing with deadlocks must be extended
to take this into account.
Distributed concurrency control and recovery techniques must deal with the challenges mentioned
above and others. In this section, we review some of the suggested techniques to handle recovery
and concurrency control in DDBMSs.
Different methods have been proposed for choosing the distinguished copies,
including the primary site technique, primary site with backup site, and
primary copy technique. In the primary site technique, all distinguished
copies are kept at a single primary site, which acts as the coordinator site for
concurrency control. However, this approach has certain disadvantages, such
as overloading the primary site with locking requests and causing system
bottlenecks. Failure of the primary site also paralyzes the system, limiting
reliability and availability.
The primary site with backup site approach addresses the issue of primary
site failure by designating a second site as a backup site. All locking
information is maintained at both the primary and backup sites, and the
backup site takes over as the primary site in case of failure. The primary
copy technique distributes the load of lock coordination among various sites
by storing distinguished copies of different data items at different sites.
Failure of one site affects only transactions accessing locks on items whose
primary copies reside at that site.
Distributed Recovery
Distributed recovery in a distributed database raises several
issues. One of the major challenges is detecting whether a site is down,
which often requires exchanging messages with other sites. For example, if
site X sends a message to site Y and does not receive a response, it is
difficult to determine whether the message was not delivered due to a
communication failure, whether site Y is down and could not respond, or
whether site Y sent a response that was not delivered.
An in-memory database is a database type that uses volatile memory (most often RAM or
Intel Optane) as the primary data storage. Memory stores, by design, enable minimal
response time and low latency. By eliminating the time needed to query data from disk, in-
memory databases deliver real-time responses to queries.
However, memory volatility makes these databases sensitive to crashes and downtime. Once
the database shuts down, all data held in memory is lost. Various logging mechanisms and hybrid
modeling techniques address this issue to maintain database persistence.
Common persistence mechanisms include:
Persisting data to disk at intervals through snapshots. If snapshots are set up at regular time intervals
as scheduled jobs, the mechanism brings partial durability.
Database replication to other in-memory databases on different sites, or in combination with on-disk
storage.
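The snapshot mechanism can be sketched as follows: data lives in a dict for fast access, and a scheduled job would periodically write a JSON snapshot to disk, so only changes since the last snapshot can be lost. The store class and key names are illustrative:

```python
# Sketch of snapshot persistence for an in-memory store: RAM is the
# primary storage, and periodic snapshots to disk add partial durability.
import json
import os
import tempfile

class InMemoryStore:
    def __init__(self):
        self.data = {}                 # primary storage is memory

    def put(self, key, value):
        self.data[key] = value

    def snapshot(self, path):
        with open(path, "w") as f:     # the scheduled persistence job
            json.dump(self.data, f)

    def restore(self, path):
        with open(path) as f:          # recovery after a shutdown
            self.data = json.load(f)

# demo: snapshot to a temp file and restore into a fresh store
store = InMemoryStore()
store.put("session:42", {"cart": ["book"]})
path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
store.snapshot(path)
restored = InMemoryStore()
restored.restore(path)
```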
IoT and edge computing. IoT sensors stream massive amounts of data. An in-
memory database can store and perform calculations using real-time data before
sending it to an on-disk database.
Ecommerce applications. Shopping carts, search results, session management, and
quick page loads are all possible with an in-memory database. Fast results provide the
user with a better overall experience, regardless of traffic surges.
Gaming industry. The gaming industry uses in-memory databases for updating
leaderboards in real time, which is especially important for building engagement.
Real-time security and fraud detection. In-memory databases help perform
complex processing and analytics in real time, making them a perfect choice for fraud
detection.