
- Chapter One -

Database Basics

- An organized collection of structured information, or data, typically stored electronically
in a computer system.
- It is a collection of data, stored in a logically coherent manner so as to make retrieval of data
efficient.

Database Management Systems (DBMS)


- Software systems used to store, retrieve, and run queries on data
- DBMS makes it possible for end users to create, read, update and delete data in database
- The first general-purpose DBMS was designed by Charles Bachman at General Electric in the
early 1960s and was called the Integrated Data Store.

Advantages of DBMS
- Reduced Data Redundancy – Unlike a file system, which may store multiple copies of the same
data, a DBMS prevents redundancy: there is a single database, and any change to it is reflected
immediately. Because of this, there is little chance of encountering duplicate data.
- Data independence - refers to the separation of the data from the applications that use it,
allowing changes in the data storage or organization to be made without affecting the
applications that access the data.
- Efficient data access - refers to the ability to retrieve and manipulate data from a database
quickly and effectively.
- Data integrity - refers to the accuracy, consistency, and reliability of data stored in a database.
It ensures that data is valid, complete, and has not been altered or corrupted during storage,
retrieval, or transmission.
- Data security - involves protecting sensitive information from unauthorized access,
disclosure, alteration, or destruction. Only authorized users should be allowed to access the
database and their identity should be authenticated using a username and password.

- Concurrent access and crash recovery- ensuring data consistency, reliability, and availability
in multi-user environments.
▪ Concurrent access - refers to the ability of multiple users or processes to access and
modify the same data concurrently without causing conflicts or inconsistencies. In a
multi-user database system, multiple transactions may execute simultaneously, each
performing read and write operations on the database. To ensure data consistency and
integrity, the database system employs concurrency control mechanisms to manage
concurrent access and prevent interference between transactions. Common
concurrency control techniques include locking, timestamp ordering, optimistic
concurrency control, and multi-version concurrency control (MVCC).
▪ Crash recovery - refers to the process of restoring a database to a consistent and
usable state after a system failure or crash. To ensure durability and reliability, database
management systems (DBMS) employ crash recovery mechanisms to recover from
failures and maintain data consistency.
- Reduced application development time - by providing abstraction layers, built-in functions,
data consistency features, transaction management, data modeling tools, and integration with
development frameworks, DBMS significantly reduces the time and effort required for
application development. Developers can focus on implementing business logic and user
interface components, knowing that the underlying database management tasks are handled
efficiently by the DBMS.
- Data Consistency - Data consistency is ensured in a database because there is no data
redundancy. All data appears consistently across the database, and the data is the same for all
users viewing it. Moreover, any changes made to the database are immediately reflected to all
users, so there is no data inconsistency.
- Privacy - only authorized users can access a database according to its privacy constraints.
There are levels of database access, and a user can only view the data they are allowed to.
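
To make the data security and privacy points above concrete, here is a minimal sketch using
SQL-style access control. The user name, password, and table are illustrative, and the exact
syntax (CREATE USER in particular) varies between DBMSs:

-- Create an account and grant it read-only access to one table
CREATE USER report_user IDENTIFIED BY 'S3cret!';
GRANT SELECT ON Employees TO report_user;
-- report_user can now run SELECT queries on Employees but cannot modify it;
-- access can later be withdrawn:
REVOKE SELECT ON Employees FROM report_user;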

Limitations of DBMS
- Cost – requires skilled staff and costly hardware
- Complexity
- High storage requirement
- Risk of database failure
- Cost of data conversion
Data Model
- Defines how data is connected to each other and how they are processed and stored inside the
system.
- It shows the logical structure of a database, including the relationships and constraints that
determine how data can be stored and accessed
- Determines the logical structure of a database and fundamentally determines in which manner
data can be stored, organized, and manipulated.
- A data model is a collection of high-level data description constructs that hide many low-level
storage details.

Types of data models


1. Hierarchical Models:
- Represents data in a tree-like structure, with parent-child relationships between data
elements. Each parent can have multiple children, but each child has only one parent.
- Serves data to the user in a hierarchy of data elements.
- This model only provides for one-to-one and one-to-many relationships, not many-to-many.
- Thus, its main drawback is its inability to support many-to-many relationships.
2. Network Model
- Extends the hierarchical model by allowing each child to have multiple parents, creating a
more flexible structure for representing complex relationships.
- Similar to the hierarchical model, except that it provides for many-to-many relationships.
- Shaped like several logical trees sharing common branches. This model's main drawback
is its difficulty to implement.
3. Relational Model
- Organizes data into tables (relations) consisting of rows (tuples) and columns (attributes).
- Relationships between tables are defined using primary and foreign keys, and data integrity
is enforced through integrity constraints.
- Designed to overcome the complexity and inflexibility of the network and hierarchical
models

- Popular relational database management systems include Oracle, MySQL, and Microsoft
SQL Server.
- Its drawback is poor horizontal scalability
4. Entity-Relationship (ER) Model
- An ER model is the logical representation of data as entities (objects) and the relationships
between them; a relationship is an association among entities.

- Entities have attributes that describe their properties, and relationships define how entities
are related to each other.
▪ For example, student_name and student_id describe the 'student' entity.
- A set of the same type of entities is known as an 'Entity set', and the set of the same type of
relationships is known as 'relationship set'. It is used for the conceptual design of a database.
5. Object-Oriented Model
- Extends the relational model by incorporating concepts from object-oriented
programming, such as inheritance, encapsulation, and polymorphism. Data is represented
as objects with properties and methods. Thus, it is a combination of Object-Oriented
programming and a Relational Database Model.
- Here real-world problems are represented as objects with different attributes. All objects
have multiple relationships between them.
6. Document Model:
- Stores data in a flexible, semi-structured format, such as JSON (JavaScript Object
Notation) or XML (eXtensible Markup Language). Documents can contain nested
structures and arrays, making them suitable for storing unstructured or semi-structured
data.
- The central data description construct in this model is the document, a self-describing record,
rather than the relation (a set of records) used in the relational model.
NOTE: A description of data in terms of a data model is called a schema.

Database Schema
A database schema is the skeleton structure that represents the logical view of the entire database.
It defines how the data is organized and how the relations among them are associated. It formulates
all the constraints that are to be applied to the data. Database designers create the schema to help
programmers understand the database and make it useful. A schema can be regarded as a template
for describing a table, such as a student table in a university database.
- DB Schema is a structure that represents the logical storage of the data in a database
- A database schema contains schema objects that may include tables, fields, packages, views,
relationships, primary key, foreign key.
- In practice, the data is physically stored in files that may be in unstructured form, but to retrieve
and use it, we need to put it in a structured form. To do this, a database schema is used. It
provides knowledge about how the data is organized in a database and how it is associated
with other data.
- The schema does not physically contain the data itself; instead, it gives information about the
shape of data and how it can be related to other tables or models.

A database schema includes the following:


- Consistent formatting for all data entries.
- Database objects and unique keys for all data entries.
- Tables with multiple columns, and each column contains its name and datatype.
Integrity constraint in DB Schema - conditions that the records in a relation must satisfy.
▪ For example, we could specify that every student has a unique student_ID value.
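
As a minimal sketch of how such a constraint could be declared in SQL (the table and column
names here are illustrative):

-- Enforcing the uniqueness of student_ID described above
CREATE TABLE Student (
    student_ID   INT PRIMARY KEY,       -- must be unique and non-NULL
    student_name VARCHAR(100) NOT NULL,
    age          INT CHECK (age >= 0)   -- a condition every record must satisfy
);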

A database schema has three levels: the logical/conceptual schema, the physical schema, and the view/external schema


1. Logical/Conceptual DB Schema
- Specifies all the logical constraints such as tables, views, and integrity constraints to be
applied to the stored data. Simply put, it consists of any rule needed to govern the database.
- The logical schema describes how the data is stored in the form of tables and how the attributes
of a table are connected. In the logical schema, different integrity constraints are defined in
order to maintain data quality during insertions and updates.
- Various tools are used to create a logical database schema, and these tools demonstrate the
relationships between the components of our data; this process is called ER modelling.

2. Physical DB Schema
- A physical database schema specifies how the data is stored physically on a storage system
or disk storage in the form of Files and Indices.
- Designing a database at the physical level is called a physical schema.
- Provides technical information about how the data will be stored on a storage system.
- Summarizes how the relations described in the conceptual schema are actually stored on
secondary storage devices such as disks and tapes.
3. View Schema/External Schema
- Specifies the interactions of users with the system. It is a view-level design that defines the
interaction between the end user and the database.
- The user interacts with the database through this interface without needing to know much
about how the data is actually stored.
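
As a hedged illustration, an external schema is often realized with SQL views; the table and
columns below are assumed from the earlier Student sketch:

-- A view exposing only the columns a particular group of end users should see
CREATE VIEW StudentDirectory AS
SELECT student_ID, student_name    -- other columns stay hidden
FROM Student;
-- End users query the view, not the underlying table:
-- SELECT * FROM StudentDirectory;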

Figure: Three-layer schema design

Data Independence
Database data independence refers to the ability to modify the database schema or physical storage
without affecting the application programs (app code modules) that interact with the database. It
allows changes to be made to the database structure or organization without requiring
corresponding changes to the application code, thus providing flexibility, scalability, and
maintainability in database systems. There are two main types of data independence:

1. Logical Data Independence - refers to the ability to change the logical schema (or conceptual
schema) of the database without affecting the external schema or application programs.
Changes to the logical schema include modifications to the tables, views, relationships, or data
types, but these changes are transparent to the applications that access the database. For
example, let's consider a database used by an e-commerce application. If the database
administrators decide to split the "Product" table into separate tables for "Electronics" and
"Clothing" categories, they can do so without modifying the application code. The application
continues to interact with the database using the same external schema (e.g., SELECT * FROM
Products), and the changes to the logical schema are hidden from the application.
2. Physical Data Independence - refers to the ability to change the physical storage or
organization of the database without affecting the logical schema or application programs.
Changes to the physical schema include modifications to the storage structures, indexing
methods, file organization, or hardware configurations. For example, let's consider a database
stored on a traditional hard disk drive (HDD). If the database administrators decide to migrate
the database to a solid-state drive (SSD) for performance reasons, they can do so without
modifying the application code. The application continues to interact with the database using
the same logical schema (e.g., SELECT * FROM Products), and the changes to the physical
storage are transparent to the application.

Generally logical data independence allows changes to the logical schema without impacting the
external schema or application programs, while physical data independence allows changes to the
physical storage without affecting the logical schema or application programs. These types of
data independence help decouple the database structure from the application logic, enabling easier
maintenance, evolution, and scalability of database systems.
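
One common way to achieve the logical data independence described above is to preserve the old
external schema with a view. A hedged sketch of the e-commerce example (all table and column
names are illustrative):

-- The logical schema changes: Product data is split into two tables
CREATE TABLE Electronics (ProductID INT PRIMARY KEY, Name VARCHAR(100), Price DECIMAL(10,2));
CREATE TABLE Clothing    (ProductID INT PRIMARY KEY, Name VARCHAR(100), Price DECIMAL(10,2));

-- A view recreates the old external schema, so application code is untouched
CREATE VIEW Products AS
SELECT ProductID, Name, Price FROM Electronics
UNION ALL
SELECT ProductID, Name, Price FROM Clothing;
-- The application still runs: SELECT * FROM Products;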

Queries in a DBMS
A query is a request for information from the database. It is a structured statement written in a
query language, such as SQL (Structured Query Language), that specifies the criteria for retrieving
or manipulating data stored in the database. Queries can retrieve specific data subsets, perform
calculations, filter data based on conditions, join data from multiple tables, and more.

Example of a simple SQL query:


▪ SELECT Orders.OrderID, Customers.CustomerName FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID;

Two kinds of query languages: relational algebra and relational calculus
1. Relational Algebra:
- It is a procedural query language that defines a set of operations for manipulating relations
(tables) in a database. These operations include selection, projection, union, difference,
Cartesian product, and join.
- Takes instances of relations as input and yields instances of relations as output. Relational
algebra is performed recursively on a relation, and intermediate results are also considered
relations.
The fundamental operations of relational algebra are as follows –
- Selection (σ): Selects rows from a relation that satisfy a given predicate (condition).
- Projection (π): Selects specific columns (attributes) from a relation, discarding the others.
- Union (∪): Combines the tuples (rows) of two relations, removing duplicates.
- Difference (−): Subtracts tuples present in one relation from another relation.
▪ Example: Notation: (r − s), where r and s are relations; it finds all the tuples that are
present in r but not in s.
- Cartesian Product (×): Generates a new relation by combining every tuple from one relation
with every tuple from another relation.
- Join (⋈): Combines tuples from two relations based on a common attribute in both tables.

Example of relational algebra operations:


Let's say we have two relations:
- R1 (A, B, C) with tuples {(1, 2, 3), (4, 5, 6), (7, 8, 9)}
- R2 (A, D, E) with tuples {(1, 10, 11), (4, 12, 13), (7, 14, 15)}
▪ Selection: σB=5(R1) selects rows from R1 where the value of attribute B is 5.
▪ Projection: πA, C(R1) selects only attributes A and C from relation R1.
▪ Union: R1 ∪ R2 combines the tuples from R1 and R2, resulting in {(1, 2, 3), (4, 5, 6), (7, 8, 9),
(1, 10, 11), (4, 12, 13), (7, 14, 15)}. Strictly, union and difference require union-compatible
relations (the same number and types of attributes), so this example treats R1 and R2 as if their
schemas matched.
▪ Difference: R1 − R2 subtracts tuples present in R2 from R1, resulting in {(1, 2, 3), (4, 5, 6),
(7, 8, 9)}.
▪ Cartesian Product: R1 × R2 generates a new relation with every possible combination of tuples
from R1 and R2.
▪ Join: R1 ⋈ R2 joins tuples from R1 and R2 based on the common attribute A.
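
Each of these algebra operations has a direct SQL counterpart. The sketch below assumes the
hypothetical tables R1(A, B, C) and R2(A, D, E) from the example above exist as SQL tables:

-- Selection σB=5(R1)
SELECT * FROM R1 WHERE B = 5;
-- Projection πA,C(R1)
SELECT A, C FROM R1;
-- Union (operands must be union-compatible)
SELECT A FROM R1 UNION SELECT A FROM R2;
-- Difference (EXCEPT; called MINUS in Oracle)
SELECT A FROM R1 EXCEPT SELECT A FROM R2;
-- Cartesian product R1 × R2
SELECT * FROM R1 CROSS JOIN R2;
-- Join R1 ⋈ R2 on the common attribute A
SELECT * FROM R1 INNER JOIN R2 ON R1.A = R2.A;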

2. Relational Calculus:
A non-procedural query language that describes what data to retrieve, rather than how to retrieve
it. It defines queries using mathematical logic and predicate calculus. It tells what to do but never
explains how to do it.
There are two types of relational calculus:
• Tuple Relational Calculus: Specifies the desired result by describing a set of tuples that
satisfy a given predicate. It is based on the logic of first-order predicate calculus.
• Domain Relational Calculus: Specifies the desired result by describing a set of values for
each tuple that satisfy a given predicate. It is based on the logic of domain calculus.

Data Definition Language (DDL) and Data Manipulation Language (DML)


DDL (Data Definition Language)
Database objects including tables, views, indexes, and constraints can be created, modified, and
deleted using these commands. DDL statements specify how the data is arranged and accessed
rather than altering or manipulating the data that is stored in the database.

Common DDL commands include:


• CREATE: Used to create new database objects such as tables, views, indexes, and
constraints.
• ALTER: Used to modify the structure of existing database objects, such as adding,
modifying, or dropping columns in a table.
• DROP: Used to delete or remove existing database objects from the database.

Example of DDL commands:


-- Create a new table
CREATE TABLE Employees (
Employee_ID INT PRIMARY KEY,
FirstName VARCHAR (50),
LastName VARCHAR (50),
Department VARCHAR (50)
);
-- Add a new column to an existing table
ALTER TABLE Employees
ADD Email VARCHAR (100);

-- Drop a table
DROP TABLE Employees;

DML (Data Manipulation Language)


DML commands are used to insert data into the database and to retrieve, update, and delete data
stored in the database. These commands allow users to manipulate the content of the database tables.
Common DML commands include:
▪ SELECT: Used to retrieve data from one or more tables based on specified criteria.
▪ INSERT: Used to add new rows or records into a table.
▪ UPDATE: Used to modify existing data in a table.
▪ DELETE: Used to remove rows or records from a table.

Example of DML commands:


-- Retrieve data from a table
SELECT * FROM Employees WHERE Department = 'IT';
-- Insert a new record into a table
INSERT INTO Employees (Employee_ID, FirstName, LastName, Department)
VALUES (1, 'John', 'Doe', 'Finance');
-- Update existing records in a table
UPDATE Employees SET Department = 'HR' WHERE Employee_ID = 1;
-- Delete records from a table
DELETE FROM Employees WHERE Department = 'IT';

Generally,
- DDL commands are used to define and modify the structure of the database schema, while
- DML commands are used to manipulate the data stored in the database tables.

Transaction management
Involves ensuring the integrity and consistency of data during database operations that involve
multiple steps or actions. A transaction is a logical unit of work that consists of one or more
database operations (such as inserts, updates, or deletes) that must either all succeed or all fail as
a single indivisible unit.

ACID Properties
A transaction in a database system must maintain Atomicity, Consistency, Isolation, and Durability
− commonly known as ACID properties − in order to ensure accuracy, completeness, and data
integrity.

Atomicity: A transaction is atomic, meaning it is either executed in its entirety or not at all. If any
part of the transaction fails, the entire transaction is rolled back, and the database is restored to its
state before the transaction began. There must be no state in a database where a transaction is left
partially completed. States should be defined either before the execution of the transaction or after
the execution/abortion/failure of the transaction.
Consistency: A transaction ensures that the database remains in a consistent state before and after
the transaction. The integrity constraints, such as primary key constraints and foreign key
constraints, must be maintained during and after the transaction. If the database was in a
consistent state before the execution of a transaction, it must remain consistent after the execution
of the transaction as well.
Isolation: Transactions are isolated from each other, meaning the execution of one transaction does
not interfere with the execution of other transactions. Each transaction operates as if it is the only
transaction being executed in the database. Explained more: In a database system where more
than one transactions are being executed simultaneously and in parallel, the property of isolation
states that “all the transactions will be carried out and executed as if it is the only transaction in
the system.” No transaction will affect the existence of any other transaction.
Durability: Once a transaction is committed and changes are made to the database, they are
permanent and will persist even in the event of system failures. The changes made by committed
transactions are durable and should not be lost. Explained more: The database should be durable
enough to hold all its latest updates even if the system fails or restarts. If a transaction updates a
chunk of data in a database and commits, then the database will hold the modified data. If a
transaction commits but the system fails before the data could be written on to the disk, then that
data will be updated once the system springs back into action.

Example of Transaction Management:


Consider a banking system where a customer transfers money from one account to another. The
transaction involves two steps: deducting the amount from the source account and adding the
amount to the destination account. These two operations must be performed together as a single
transaction to maintain data integrity and consistency.
-- Begin transaction
BEGIN TRANSACTION;
-- Step 1: Deduct amount from source account
UPDATE Accounts
SET Balance = Balance - 100
WHERE Account_Number = '123456';
-- Step 2: Add amount to destination account
UPDATE Accounts
SET Balance = Balance + 100
WHERE Account_Number = '789012';
-- Commit transaction if both steps succeed
COMMIT;
In this example:
• The BEGIN TRANSACTION command marks the beginning of the transaction.
• The UPDATE statements modify the balances of the source and destination accounts.
• If both UPDATE statements succeed, the COMMIT command commits the transaction,
making the changes permanent.
• If any step fails (e.g., due to insufficient funds), the transaction can be rolled back using
the ROLLBACK command, ensuring that no changes are made to the database.

Generally, in a DBMS partial transactions are not allowed, and the effect of a group of transactions
is equivalent to some serial execution of all the transactions.
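
To show the failure path as well, here is a hedged sketch of the same transfer with an explicit
rollback. The procedural syntax (IF ... ELSE, BEGIN ... END blocks) follows T-SQL style and
differs between DBMSs:

BEGIN TRANSACTION;
UPDATE Accounts SET Balance = Balance - 100 WHERE Account_Number = '123456';
-- Business rule: the source account must not go negative
IF EXISTS (SELECT 1 FROM Accounts WHERE Account_Number = '123456' AND Balance < 0)
BEGIN
    ROLLBACK;  -- undo the deduction; the database returns to its prior state
END
ELSE
BEGIN
    UPDATE Accounts SET Balance = Balance + 100 WHERE Account_Number = '789012';
    COMMIT;    -- both steps become permanent together
END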

Concurrent Execution of Transactions


- Refers to the ability of a Database Management System (DBMS) to process multiple
transactions simultaneously (at the same time), allowing them to run concurrently without
interfering with each other.
- Concurrent execution in a DBMS refers to the capacity to carry out many transactions
simultaneously in a shared database.
- Concurrent execution allows many transactions to access the same data at once, which can
bring benefits such as higher system throughput, faster response time, and better resource
utilization.

Problems with Concurrent Execution in DBMS
- It introduces challenges such as maintaining data consistency and preventing conflicts
between transactions.
- Lost update: Lost updates occur when two or more transactions read the same data item
concurrently and then update it independently. If one transaction's update overwrites changes
made by another transaction, the overwritten changes are lost.
- Dirty Reads: Occur when one transaction reads data that has been modified by another
transaction but not yet committed. If the modifying transaction is later rolled back, the changes
read by the first transaction become invalid, leading to inconsistencies.
- Non-repeatable reads - occur when a transaction reads the same data item multiple times
during its execution, but the data values change between reads due to updates made by other
transactions. This inconsistency can lead to incorrect results or unexpected behavior.
- Phantom read - When a transaction reads a group of rows that meet a given criterion, and a
subsequent transaction adds or deletes rows that meet the same criterion, this is known as a
phantom read: on a second read, the same query returns rows that were not present the first
time the initial transaction read them.
- Deadlocks - happen when transactions are blocked waiting for one another to release the
resources they hold. Deadlocks can happen when resources are not released properly or are
acquired by transactions in different orders. They can result in decreased system performance
or even bring the system to a standstill.
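
SQL exposes the trade-off between these anomalies and performance through transaction
isolation levels. A hedged sketch (the SET TRANSACTION syntax is standard, but default levels
and exact behavior vary by DBMS):

-- READ UNCOMMITTED permits dirty reads; SERIALIZABLE prevents all of the
-- anomalies above, at the cost of reduced concurrency.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;  -- rules out dirty reads
BEGIN TRANSACTION;
SELECT Balance FROM Accounts WHERE Account_Number = '123456';
-- Under REPEATABLE READ, re-reading this row would return the same value even
-- if another transaction updated and committed it in the meantime.
COMMIT;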

Concurrency control mechanisms in DBMS


- A procedure in a DBMS that manages simultaneous transactions so that they execute without
conflicting with each other; such conflicts arise in multi-user systems.
- For example, if ATMs did not use concurrency control, multiple people could not safely
withdraw money at the same time in different places. This is where concurrency control is
needed.

Mechanisms
i. Locking: A concurrency control mechanism that restricts access to data by allowing only one
transaction to modify a data item at a time. Locks are acquired before accessing data and
released after completing the transaction.

❖ Example: Consider two transactions, T1 and T2, attempting to update the same bank account
balance concurrently. To prevent conflicts, T1 acquires a lock on the account record before
updating it. Meanwhile, T2 must wait until T1 releases the lock before it can proceed with its
update.
A lock guarantees exclusive use of a data item to the current transaction: the transaction first
acquires the lock, accesses the data item, and releases the lock after it completes.
ii. Timestamp Ordering: A concurrency control mechanism that assigns a unique timestamp to
each transaction based on its start time. Transactions are executed based on their timestamps,
ensuring serializability and preventing conflicts between transactions.
❖ Example: Transaction T1 starts before T2 and updates a bank account balance. T2 starts after
T1 and attempts to read the same balance. Since T1 started before T2, T1's changes are applied
first, ensuring that T2 reads a consistent state of the data.
iii. Multi-Version Concurrency Control (MVCC): A concurrency control mechanism that
allows multiple versions of data items to exist concurrently in the database. Each transaction
operates on a snapshot of the database, ensuring that it sees a consistent view of the data even
as other transactions make changes.
❖ Example: In a MVCC-based system, when a transaction T1 reads a data item, it obtains a
consistent snapshot of the data at the time of the transaction's start. Meanwhile, another
transaction T2 updates the same data item. T1 continues to see the original version of the data
item until it commits, while T2 sees its own modified version.
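
A hedged two-session sketch of the MVCC behavior just described, using PostgreSQL-style
REPEATABLE READ snapshots (the account number and balances are illustrative):

-- Session 1 (T1): its snapshot is fixed at the first statement
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT Balance FROM Accounts WHERE Account_Number = '123456';  -- returns 500

-- Session 2 (T2): creates a new version of the row and commits while T1 is open
BEGIN;
UPDATE Accounts SET Balance = 400 WHERE Account_Number = '123456';
COMMIT;

-- Session 1 (T1): still reads the original version from its snapshot
SELECT Balance FROM Accounts WHERE Account_Number = '123456';  -- still 500
COMMIT;  -- a transaction started after this point would see 400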

More about locking in DBMS:


A formal definition of a lock:
A Lock is a variable assigned to any data item in order to keep track of the status of that data item
so that isolation and non-interference are ensured during concurrent transactions. It is a mechanism
that allows a transaction to reserve a data item for its exclusive or shared use, and prevent other
transactions from modifying or reading it.

Locking ensures the consistency and isolation of transactions, which are two of the ACID
properties of a DBMS. At its most basic, a database lock exists to prevent two or more database
users from performing any change on the same data item at the very same time.

Locking allows only one transaction to modify a data item at a time while other transactions wait
until the lock is released.

There are two types of locks: Binary Locks and Shared & Exclusive Locks
i. Binary Locks:
Remember that a lock is fundamentally a variable which holds a value. A binary lock is a
variable capable of holding only 2 possible values, i.e., a 1 (depicting a locked state) or a 0
(depicting an unlocked state). This lock is usually associated with every data item in the
database (maybe at table level, row level or even the entire database level).
ii. Shared & Exclusive Locks:
Shared Lock: Shared locks in a DBMS are placed on a resource whenever a read operation
such as a SELECT command is executed. Numerous shared locks can be set on a resource
concurrently.
- Also known as a read lock.
- Allows multiple transactions to read (i.e., SELECT query) a data item simultaneously.
- Transactions holding shared locks can only read the data item and cannot modify it.
- Multiple transactions can hold shared locks on the same data item simultaneously.
Exclusive Lock: Exclusive locks in a DBMS are placed on a resource whenever a write operation
such as UPDATE, INSERT, or DELETE is executed, and only a single exclusive lock can be placed
on a resource at a time. The first user to obtain the exclusive lock holds sole ownership of the
resource; no other user can obtain an exclusive lock on that resource until it is released.
- Also known as a write lock.
- Allows only one transaction to modify (i.e., update, insert, delete) a data item at a time.
- Transactions holding exclusive locks have exclusive access to the data item and can modify it.

- Exclusive locks are mutually exclusive, meaning that only one transaction can hold an
exclusive lock on a data item at a time.
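
A hedged sketch of how these lock modes surface in SQL. The FOR SHARE / FOR UPDATE
clauses follow PostgreSQL/MySQL syntax; other systems expose locking differently (for
example, lock hints in SQL Server):

BEGIN;
-- Shared (read) lock: other transactions may also read, but writers must wait
SELECT Balance FROM Accounts WHERE Account_Number = '123456' FOR SHARE;

-- Exclusive (write) lock: blocks other lock requests on this row
SELECT Balance FROM Accounts WHERE Account_Number = '123456' FOR UPDATE;
UPDATE Accounts SET Balance = Balance - 100 WHERE Account_Number = '123456';
COMMIT;  -- all locks held by the transaction are released here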

Log file in DBMS:


A log file, also known as a transaction log or redo log, is a file used to record all changes made
to the database during transactions. The log file plays a crucial role in ensuring data consistency,
durability, and recovery in the event of system failures.

The main purposes of a log file in a DBMS are:


1. Recovery: The log file helps ensure database durability by providing a record of all
changes made to the database. In the event of a system failure or crash, the DBMS can use
the log file to recover the database to a consistent state by replaying the transactions that
were committed before the failure occurred.
2. Rollback: The log file enables transaction rollback by storing information about the
changes made by each transaction. If a transaction is aborted or rolled back, the DBMS can
use the log file to undo the changes made by the transaction and restore the database to its
previous state.
3. Concurrency Control: The log file supports concurrency control mechanisms by
providing a means to track and manage concurrent transactions. Locking and transaction
isolation levels rely on the log file to ensure that transactions are executed in a consistent
and isolated manner.

Write-Ahead Log Protocol (WAL):


A specific technique used in DBMS to ensure durability and atomicity of transactions. In the
Write-Ahead Log strategy:
1. Writing to Log Before Data Modification: Before modifying any data in the database,
the DBMS first writes the corresponding log records to the log file on disk. This ensures
that the changes are logged before they are applied to the database itself.
2. Committing Transaction After Log Write: After the log records are successfully written
to the log file, the transaction can be committed. Once a transaction is committed, the
changes can be safely applied to the database.

3. Data Modification after Log Write: After the log records are written and the transaction
is committed, the DBMS can modify the actual data in the database. This ensures that the
transaction's changes are durable and will persist even in the event of a system failure.
By adhering to the Write-Ahead Log protocol, DBMSs can provide strong guarantees of
durability and atomicity for transactions, even in the presence of system crashes or failures. The
log file serves as a critical component in implementing the Write-Ahead Log strategy, facilitating
transaction recovery and ensuring data consistency.

A Database Administrator (DBA)


A professional responsible for the design, implementation, maintenance, and management of
databases within an organization. DBAs play a crucial role in ensuring the availability, security,
performance, and integrity of databases, which are essential components of information systems
in modern businesses.

DBA Roles:
❖ Database Design and Implementation:
- Designing and implementing database structures, schemas, and data models based on the
organization's requirements.
- Creating and configuring database instances, tables, indexes, views, stored procedures, and
other database objects.
❖ Database Maintenance and Optimization:
- Performing regular maintenance tasks such as database backups, restores, and integrity checks
to ensure data availability and reliability.
- Monitoring database performance and optimizing queries, indexes, and configurations for
improved efficiency and scalability.
- Troubleshooting and resolving performance issues, bottlenecks, and database errors.
❖ Security and Access Control:
- Implementing and maintaining database security measures such as authentication,
authorization, encryption, and auditing to protect sensitive data from unauthorized access,
modification, or disclosure.
- Managing user accounts, roles, permissions, and privileges to ensure appropriate access
controls and data confidentiality.
❖ Data Backup and Recovery:

- Developing and implementing backup and recovery strategies to safeguard against data loss
and ensure business continuity in the event of system failures, disasters, or human errors.
- Testing and validating backup procedures and recovery plans to verify their effectiveness and
reliability.
❖ Database Monitoring and Reporting:
- Monitoring database health, performance metrics, and usage statistics using monitoring tools
and dashboards.
- Generating and analyzing reports on database usage, capacity planning, growth trends, and
resource utilization to support decision-making and optimization efforts.
❖ Database Upgrades and Patch Management:
- Planning and executing database upgrades, migrations, and patch deployments to apply
software updates, bug fixes, and security patches.
- Testing new database releases and patches in development or test environments before
deploying them to production.
❖ Documentation and Knowledge Sharing:
- Documenting database configurations, procedures, standards, and best practices to ensure
consistency, compliance, and knowledge transfer within the DBA team and across the
organization.
- Providing training, guidance, and support to developers, administrators, and end users on
database-related topics and tools.
❖ Database tuning:
- The DBA is responsible for evolving the database, in particular the conceptual and physical
schemas, to ensure adequate performance as user requirements change.

Generally, Database Administrators play a critical role in managing the organization's data assets,
ensuring data integrity, availability, and security, and supporting the organization's information
technology infrastructure and business objectives.

Query optimization
The process of selecting the most efficient execution plan for a given query to minimize the time
and resources required to retrieve the desired results. It involves analyzing various query execution
strategies, accessing methods, and join algorithms to determine the most cost-effective plan for
processing the query.
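
As a hedged illustration, most DBMSs expose the optimizer's chosen plan through an EXPLAIN
command (the keyword and output format vary by system), reusing the earlier Orders query:

-- Ask the optimizer how it would execute the query, without running it
EXPLAIN
SELECT Orders.OrderID, Customers.CustomerName
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID;
-- Typical output reports the access method (index scan vs. full table scan),
-- the join algorithm (nested loop, hash, or merge join), and cost estimates.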

- Chapter Two -

Entity Relationship Model


ER - model is a conceptual data model used to describe the structure of data in a database system.
It defines the conceptual view of a database and works around real-world entities and the
associations among them.

In the ER model:
Entities: Represent real-world objects or concepts, such as people, places, things, or events. Each
entity is uniquely identifiable and is described by a set of attributes. Entities are depicted as
rectangles in ER diagrams. For example, in a school database, students, teachers, classes, and
courses offered can be considered as entities. All these entities have some attributes or properties
that give them their identity.
• Attributes: Describe properties or characteristics of entities. i.e. entities are represented by
means of their properties, called attributes. Each attribute has a name and a data type,
specifying the kind of information it can hold (e.g., text, number, date). Attributes are depicted
as ovals connected to their respective entities. For example, a student entity may have name,
class, and age as attributes. There exists a domain or range of values that can be assigned to
attributes. For example, a student's name cannot be a numeric value. It has to be alphabetic. A
student's age cannot be negative, etc.

Types of Attributes
- Simple attribute − Simple attributes are atomic values, which cannot be divided further. For
example, a student's phone number is an atomic value of 10 digits.
- Composite attribute − Composite attributes are made of more than one simple attribute. For
example, a student's complete name may have first_name and last_name.
- Derived attribute − Derived attributes are the attributes that do not exist in the physical
database, but their values are derived from other attributes present in the database. For
example, average_salary in a department should not be saved directly in the database, instead
it can be derived. As another example, age can be derived from date_of_birth.

- Single-value attribute − Single-value attributes contain a single value. For example −
Social_Security_Number.
- Multi-value attribute − Multi-value attributes may contain more than one value. For
example, a person can have more than one phone number, email_address, etc.
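
As a hedged sketch of a derived attribute: age is not stored but computed from date_of_birth at
query time. The AGE/EXTRACT functions below are PostgreSQL-style, and the table is illustrative:

-- age is derived whenever it is needed, so it can never become stale
SELECT student_name,
       EXTRACT(YEAR FROM AGE(CURRENT_DATE, date_of_birth)) AS age
FROM Student;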
• Relationships: Represent associations or connections between entities. Relationships define
how entities are related to each other and can have various cardinalities (e.g., one-to-one, one-
to-many, many-to-many). Relationships are depicted as diamond shapes connected to the
related entities. For example, an employee works_at a department, a student enrolls in a
course. Here, Works_at and enrolls are called relationships.

Database Design
A collection of steps that help create, implement, and maintain a business's data management
systems. It involves identifying the structure of the data that will be stored in the database, defining
the relationships between different data elements, and establishing the rules that govern how the
data can be manipulated and accessed.
i. Requirement analysis: The first step in database design is to gather and analyze the
requirements of the system for which the database is being designed. This involves
understanding the data that needs to be stored, the relationships between different data
elements, and the operations that need to be performed on the data.
ii. Conceptual Design: In this phase, the high-level structure of the database is defined. This
includes identifying the entities (objects or things about which data will be stored), the
attributes of each entity, and the relationships between entities.
iii. Logical Design: Once the conceptual design is complete, the next step is to translate it into
a logical data model. This involves choosing a data model (such as relational, hierarchical,
or object-oriented) and defining the tables, columns, keys, and relationships that will be
used to represent the data in the database (ChatGPT).
This phase consists of coming up with a high-level design of our database, based on the initially
gathered requirements, to structure and organize our data accordingly. A high-level overview of
the database is made on paper without considering the physical-level design; the phase proceeds
by identifying the kind of data to be stored and the relationships that will exist among those data.
Identifying entities and key attributes, and deciding which constraints are to be implemented, is
the core work of this phase. It involves techniques such as data modeling to visualize the data,
normalization to prevent redundancy, etc. (GeeksforGeeks)
iv. Normalization: Normalization is the process of organizing the data in the database to
minimize redundancy and dependency (see the sketch after this list). This typically involves
decomposing tables into smaller, more atomic tables and establishing rules to ensure that
data is stored in a consistent and efficient manner.
v. Physical Design: In the physical design phase, the logical data model is translated into a
physical implementation. This includes specifying details such as the storage structures
(e.g., tables, indexes), data types, and access paths (e.g., primary keys, foreign keys) that
will be used to store and retrieve data efficiently. Involves the implementation of the logical
design made in the previous stage. All the relationships among data and integrity
constraints are implemented to maintain consistency & generate the actual database.
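
As a brief, hedged illustration of the normalization step above (the tables are hypothetical): a
single table that repeats department data for every employee can be decomposed into two tables,
removing the redundancy:

-- Before: DepartmentName is repeated for every employee in that department
-- Employees_Flat(EmployeeID, FirstName, DepartmentID, DepartmentName)

-- After: decomposed into two tables, so each fact is stored exactly once
CREATE TABLE Departments (
    DepartmentID   INT PRIMARY KEY,
    DepartmentName VARCHAR(50)
);
CREATE TABLE Employees (
    EmployeeID   INT PRIMARY KEY,
    FirstName    VARCHAR(50),
    DepartmentID INT REFERENCES Departments(DepartmentID)
);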

Logical Design
- Defines the structure of data and what relationship exists among those data.
Components:
- Data models: Hierarchical, network, relational, ER-Model, Object Oriented Model
- Entity: objects in the real world,
- Relationships: unary, binary, ternary
- Attributes: properties of a specific entity that define its behavior
- Normalization: identifying and removing redundancy and anomalies that may arise as a result of CRUD operations.

Figure: An example of a logical design

Physical Design
The main purpose of the physical design is to actually implement the logical design that is, show
the structure of the database along with all the columns & their data types, rows, relations,
relationships among data & clearly define how relations are related to each other.
Following are the steps taken in physical design
Step 1: Entities are converted into tables or relations that consist of their properties (attributes)
Step 2: Apply integrity constraints: establish foreign key, unique key, and composite key
relationships among the data and apply various constraints.
Step 3: Entity names are converted into table names, property names are translated into attribute
names, and so on.
Step 4: Apply normalization & modify as per the requirements.
Step 5: Final Schemes are defined based on the entities & attributes derived in logical design.

Figure: An example of a physical design

Source of DB Design note: ChatGPT and Geeks for Geeks

Types of keys in DBMS:


A key refers to an attribute/a set of attributes that help us identify a row (or tuple) uniquely in a
table (or relation). A key is also used when we want to establish relationships between the different
columns and tables of a relational database. The individual values present in a key are commonly
referred to as key values.

We use a key for defining various types of integrity constraints in a database. A table represents
a collection of records for a relation, and there might be thousands of these records, some of
which might even be duplicated. Thus, we need a way to identify all of these records uniquely
and separately, i.e., without any duplicates. This problem is solved with the help of keys. The
primary motive of keys is to give every record a unique identity of its own.

Considering the following three tables:


1. Employees
- EmployeeID (PK)
- FirstName
- LastName
- DepartmentID (FK)
- Email
2. Departments
- DepartmentID (PK)
- DepartmentName
- ManagerID (FK)
3. Tasks
- TaskID (PK)
- TaskDescription
- EmployeeID (FK)
- Deadline

Records in each table:


Employees:
EmployeeID FirstName LastName DepartmentID Email
1 John Doe 101 john@example.com
2 Jane Smith 102 jane@example.com
3 Mike Johnson 101 mike@example.com

Departments:
DepartmentID DepartmentName ManagerID
101 HR 1
102 IT 2
103 Marketing 3

Tasks:
TaskID TaskDescription EmployeeID Deadline
1 Review resumes 1 2024-04-10
2 Update software 2 2024-04-15
3 Create ad campaign 3 2024-04-20

Primary Key:
- A primary key is a unique identifier for each record in a table.
- It ensures that each row in a table is uniquely identifiable.
- Primary keys cannot have NULL values or duplicate values.
- Only one primary key can exist per table.
- EmployeeID, DepartmentID and TaskID are PKs for the tables described
Super Key
- A super key is a set of one or more columns that uniquely identify each record in a table.
- It may include more columns than necessary to uniquely identify records.
- Super keys are used to define candidate keys and primary keys.
- A super key refers to any set of columns that helps us uniquely identify all the rows present in
a table; every combination of columns that can identify the rows of a table uniquely acts as a
super key. We pick the primary key of a table from the set of its super keys, making it the
table's identity attribute.
Example: Super keys of the Departments table (assuming DepartmentName and ManagerID are
unique):
- {DepartmentID}
- {DepartmentName}
- {ManagerID}
- {DepartmentID, DepartmentName}
- {DepartmentID, ManagerID}
- {DepartmentName, ManagerID}
- {DepartmentID, DepartmentName, ManagerID}

Candidate key
- A candidate key is a column or set of columns in a table that can uniquely identify each record.
- Like a primary key, it ensures the uniqueness of records, but unlike a primary key, a table can
have multiple candidate keys.
- The chosen primary key is one of the candidate keys selected to uniquely identify records in a
table.

- A candidate key is a minimal subset of a super key: it can uniquely identify each record in the
table, and no column can be removed from it without losing that property.
Example: Candidate keys of the Departments table (only the minimal super keys qualify):
- {DepartmentID}
- {DepartmentName}
- {ManagerID}
Composite sets such as {DepartmentID, DepartmentName} are super keys but not candidate keys,
because they contain more columns than necessary.

My Question to ChatGPT: So, can we say a super key is a candidate key and vice versa?
Answer from ChatGPT:
No, not in both directions: every candidate key is a super key, but not every super key is a
candidate key, because a candidate key is a minimal super key.
- A super key is any set of attributes (one or more) within a table that can uniquely identify each
tuple (row) in that table. This includes both candidate keys and other combinations of attributes
that uniquely identify each tuple.
- A super key is a set of attributes or columns that uniquely identifies each row of a table, whereas
a candidate key is a minimal such set.
- A candidate key, in other words, is a minimal super key: if any attribute is removed from it, it
will no longer uniquely identify each tuple. The set of candidate keys is therefore a subset of the
set of super keys.
- So, every candidate key is a super key because it can uniquely identify each tuple, but not every
super key is a candidate key, because some super keys contain redundant attributes and are not
minimal.

Composite Key
- To understand what a composite key is, we first need to know what a primary key is: a primary
key is a column that has a unique, non-null value for every row in an SQL table.
- A composite key is also a primary key, but the difference is that it is made by combining more
than one column to identify a particular row in the table.
- While a primary key and a composite key might do the same things, a simple primary key
consists of one column, whereas a composite key consists of two or more columns.
- The composite key refers to a set of multiple attributes that together uniquely identify every
tuple/row/record present in a table. The attributes present in the set may not be unique when
we consider them separately; thus, when we take them all together, they ensure total
uniqueness.
- It is also known as a compound key (in some usage, a compound key is one whose constituent
attributes are each foreign keys in their own right). Two or more attributes together form a
composite key that can uniquely identify a tuple in a table; the goal is to find a column
combination that can form a candidate key and hence a composite key.
Example of composite keys in the Departments table (each pair is unique, though not minimal
here, since every constituent column is itself unique):
- Composite Key: {DepartmentID, DepartmentName}
- Composite Key: {DepartmentID, ManagerID}
- Composite Key: {DepartmentName, ManagerID}

Alternate Key:
- An alternate key is a candidate key that is not selected as the primary key.
- In other words, it is a column or set of columns that could serve as the primary key but is
not chosen as such.
Example: Alternate keys are candidate keys that are not chosen as the primary key. In the provided
tables (where, judging from the sample data, each of the following columns is unique), we can
identify the following alternate keys:
1. Employees:
▪ Alternate Key: {Email}
2. Departments:
▪ Alternate Key: {DepartmentName}
▪ Alternate Key: {ManagerID}
3. Tasks:
▪ Alternate Key: {TaskDescription}
▪ Alternate Key: {EmployeeID}
▪ Alternate Key: {Deadline}

These alternate keys can uniquely identify each record in their respective tables, but they are not
selected as the primary key. Instead, another candidate key (or a composite key) is chosen as the
primary key.

Foreign Key (FK):
- A foreign key is a column or set of columns in one table that refers to the primary key in
another table.
- It establishes a relationship between two tables by enforcing referential integrity.
- Foreign keys ensure that the values in the referencing column(s) exist in the referenced
table's primary key column(s).
- They help maintain data consistency and integrity across related tables.
- Foreign keys can have NULL values, indicating that a particular record has no
corresponding entry in the referenced table.
Example: Foreign keys are attributes in a table that establish a relationship with the primary key
in another table. In the provided tables, the foreign keys are as follows:
1. Employees:
- Foreign Key: DepartmentID (references Departments.DepartmentID)
2. Departments:
- Foreign Key: ManagerID (references Employees.EmployeeID)
3. Tasks:
- Foreign Key: EmployeeID (references Employees.EmployeeID)

These foreign keys establish relationships between the tables by referencing the primary keys in
other tables. They ensure referential integrity and help maintain consistency in the data across
related tables.
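
A hedged SQL sketch of how the three example tables and their key constraints could be declared
(the data types are illustrative):

CREATE TABLE Departments (
    DepartmentID   INT PRIMARY KEY,
    DepartmentName VARCHAR(50) UNIQUE,  -- alternate/unique key
    ManagerID      INT                  -- FK added below, once Employees exists
);
CREATE TABLE Employees (
    EmployeeID   INT PRIMARY KEY,
    FirstName    VARCHAR(50),
    LastName     VARCHAR(50),
    DepartmentID INT REFERENCES Departments(DepartmentID),
    Email        VARCHAR(100) UNIQUE    -- alternate/unique key
);
-- Departments and Employees reference each other, so one foreign key must be
-- added after both tables exist:
ALTER TABLE Departments
    ADD FOREIGN KEY (ManagerID) REFERENCES Employees(EmployeeID);
CREATE TABLE Tasks (
    TaskID          INT PRIMARY KEY,
    TaskDescription VARCHAR(200),
    EmployeeID      INT REFERENCES Employees(EmployeeID),
    Deadline        DATE
);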

Unique Key:
- A unique key is a constraint in a relational database that ensures the values in a column or a
group of columns are unique across all rows in the table.
- Unlike a primary key, a unique key allows NULL values. How many NULLs are permitted
depends on the DBMS: most systems allow multiple NULLs, because NULL is not considered
equal to NULL, whereas SQL Server treats NULLs as equal for uniqueness and therefore allows
only one NULL per unique column.

The main difference between a unique key and a primary key is that a primary key uniquely
identifies each record in a table and cannot contain NULL values, whereas a unique key also
ensures uniqueness but allows NULL values.

Example of unique keys from the provided tables:
1. Employees:
- Unique Key: {Email}
2. Departments:
- Unique Key: {DepartmentName}
- Unique Key: {ManagerID}
3. Tasks:
- Unique Key: {TaskDescription}
- Unique Key: {Deadline}
These unique keys ensure that the values in the specified columns are unique across all rows in
their respective tables.

Weak Entity
A weak entity is an entity in a database that cannot be uniquely identified by its attributes alone.
Instead, it depends on a related entity called the identifying or owner entity, along with a
discriminator attribute, to distinguish between its instances. Weak entities typically have a partial
key, which is a set of attributes that uniquely identifies them only when combined with the primary
key of the owner entity.

Example: Consider a database model for a library system. In this system, we have entities such as
"Book" and "Author." Each book can have multiple authors, and an author can write multiple
books. Let's focus on the relationship between books and their chapters.
In this scenario:
- The "Book" entity is the owner entity because it can exist independently and has a primary key
(e.g., ISBN).
- The "Chapter" entity is the weak entity because it depends on the "Book" entity for its existence
and cannot be uniquely identified on its own.

Here's how the entities might look in a simplified form:

Book (Owner Entity):


+--------+-----------------------+
| ISBN | Title |
+--------+-----------------------+
| 123456 | Database Management |
| 789012 | Data Structures |
+--------+-----------------------+

Chapter (Weak Entity):


+--------+--------------+---------------------+
| BookID | ChapterNumber| Title |
+--------+--------------+---------------------+
| 123456 | 1 | Introduction |
| 123456 | 2 | Entity-Relationship |
| 789012 | 1 | Arrays |
| 789012 | 2 | Linked Lists |
+--------+--------------+---------------------+

In this example:
- Each book in the "Book" entity has a unique ISBN as its primary key.
- The "Chapter" entity depends on the "Book" entity for its identification. It uses a combination
of the "BookID" (foreign key referencing ISBN) and "ChapterNumber" to uniquely identify
each chapter within a book.
- The "BookID" attribute in the "Chapter" entity acts as the discriminator attribute,
distinguishing chapters based on the book to which they belong.
- A chapter's identity is only meaningful within the context of its associated book. For example,
there can be multiple Chapter 1s, each belonging to a different book.

Generally, a weak entity relies on another entity (the owner entity) and a discriminator attribute
for its identification. It cannot exist without its associated owner entity and is identified based on
the combination of its partial key and the primary key of the owner entity.
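
A hedged SQL sketch of this weak-entity design, where the composite primary key combines the
owner's key with the discriminator (the column types are illustrative):

CREATE TABLE Book (
    ISBN  INT PRIMARY KEY,
    Title VARCHAR(100)
);
CREATE TABLE Chapter (
    BookID        INT REFERENCES Book(ISBN),  -- the owner entity's key
    ChapterNumber INT,                        -- discriminator / partial key
    Title         VARCHAR(100),
    PRIMARY KEY (BookID, ChapterNumber)       -- identity = owner key + discriminator
);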
