DBMS Module 4
Yet it is possible to create poor table structures even in a good database design.
So how do you recognize a poor table structure, and how do you produce a good
table?
The answer to both questions involves normalization.
Normalization is a process for evaluating and correcting table structures to minimize
data redundancies, thereby reducing the likelihood of data anomalies.
Normalization works through a series of stages called normal forms.
The first three stages are described as first normal form (1NF), second normal form
(2NF), and third normal form (3NF).
From a structural point of view, 2NF is better than 1NF, and 3NF is better than 2NF.
For most purposes in business database design, 3NF is as high as we need to go in the
normalization process, and properly designed 3NF structures also meet the requirements
of fourth normal form (4NF).
Data Anomalies
Normalization is the process of splitting relations into well structured relations
that allow users to insert, delete, and update tuples without introducing database
inconsistencies.
Without normalization many problems can occur when trying to load an integrated
conceptual model into the DBMS.
These problems arise in relations that are generated directly from user views and are
called anomalies.
There are three types of anomalies: update, deletion and insertion anomalies.
For example, suppose a single table stores each employee in a company together with
that employee's department and the student group he or she participates in.
An update anomaly is a data inconsistency that results from data redundancy and a
partial update.
A deletion anomaly is the unintended loss of data due to deletion of other data.
For example, if the student group Beta Alpha Psi disbanded and was deleted from the
table above, J. Longfellow and the Accounting department would cease to exist. This
results in database inconsistencies and is an example of how combining information
that does not really belong together into one table can cause problems.
An insertion anomaly is the inability to add data to the database due to absence of
other data.
For example, assume Student_Group is defined so that null values are not allowed. If a
new employee is hired but not immediately assigned to a Student_Group then this
employee could not be entered into the database. This results in database
inconsistencies due to omission.
Update, deletion, and insertion anomalies are very undesirable in any database.
Anomalies are avoided by the process of normalization.
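The three anomalies can be made concrete with a small sketch. The rows below are hypothetical, chosen to match the Beta Alpha Psi example above; plain Python dictionaries stand in for database rows in a single, unnormalized table.

```python
# A single denormalized table mixing employee, department, and student-group facts.
employees = [
    {"emp": "J. Longfellow", "dept": "Accounting", "group": "Beta Alpha Psi"},
    {"emp": "B. Smith", "dept": "Accounting", "group": "Chess Club"},
]

# Update anomaly: renaming the Accounting department requires touching every row
# that repeats it; updating only one row leaves the data inconsistent.
employees[0]["dept"] = "Finance"  # partial update of a redundantly stored value
depts = {row["dept"] for row in employees}
# Two different names now coexist for what was one department.

# Deletion anomaly: removing the only row for Beta Alpha Psi also erases the
# unrelated fact that J. Longfellow works in Accounting.
employees = [row for row in employees if row["group"] != "Beta Alpha Psi"]
```

An insertion anomaly follows the same pattern: a new employee with no student group cannot be represented at all if the group column forbids nulls.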
The objective of normalization is to ensure that each table conforms to the concept of
well-formed relations, that is, tables that have the following characteristics:
Each table represents a single subject. For example, a course table will contain only
data that directly pertains to courses. Similarly, a student table will contain only
student data.
No data item will be unnecessarily stored in more than one table (in short, tables
have minimum controlled redundancy). The reason for this requirement is to
ensure that the data are updated in only one place.
All nonprime attributes in a table are dependent on the primary key—the entire
primary key and nothing but the primary key. The reason for this requirement is to
ensure that the data are uniquely identifiable by a primary key value.
Each table is void of insertion, update, or deletion anomalies. This is to ensure the
integrity and consistency of the data.
To accomplish the objective, the normalization process takes you through the steps that
lead to successively higher normal forms. The most common normal forms and their
basic characteristics are listed below.
First normal form (1NF): Table format, no repeating groups, and PK identified
Second normal form (2NF): 1NF and no partial dependencies
Third normal form (3NF): 2NF and no transitive dependencies
Boyce-Codd normal form (BCNF): Every determinant is a candidate key (special case of 3NF)
Fourth normal form (4NF): 3NF and no independent multivalued dependencies
From the data modeler’s point of view, the objective of normalization is to ensure that
all tables are at least in third normal form (3NF). Even higher-level normal forms exist.
However, normal forms such as the fifth normal form (5NF) and domain-key normal
form (DKNF) are not likely to be encountered in a business environment and are mainly
of theoretical interest.
More often than not, such higher normal forms usually increase joins (slowing
performance) without adding any value in the elimination of data redundancy. Some
very specialized applications, such as statistical research, might require normalization
beyond the 4NF, but those applications fall outside the scope of most business
operations. Because this book focuses on practical applications of database techniques,
the higher-level normal forms are not covered.
Functional Dependency
Functional dependency is a relationship that exists when one attribute uniquely
determines another attribute.
If R is a relation with attributes X and Y, a functional dependency between the
attributes is represented as X->Y, which specifies Y is functionally dependent on X.
Here X is a determinant set and Y is a dependent attribute. Each value of X is
associated with precisely one Y value.
Functional dependency in a database serves as a constraint between two sets of
attributes. Defining functional dependencies is an important part of relational
database design and a key input to normalization.
Full Functional Dependency: In a relation, a full functional dependency X -> Y exists
between two attribute sets X and Y when Y is functionally dependent on X and is not
functionally dependent on any proper subset of X.
Partial Functional Dependency: In a relation, there exists Partial Dependency,
when a non prime attribute (the attributes which are not a part of any candidate
key ) is functionally dependent on a proper subset of Candidate Key.
For example, let there be a relation R(Course, Sid, Sname, Fid, Schedule, Room, Marks).
Full functional dependencies: {Course, Sid} -> Sname, {Course, Sid} -> Marks, etc.
Partial functional dependencies: Course -> Schedule, Course -> Room
Six rules IR1 through IR6 (inference rules for functional dependencies):
IR1 (reflexivity): If Y is a subset of X, then X -> Y.
IR2 (augmentation): If X -> Y, then XZ -> YZ.
IR3 (transitivity): If X -> Y and Y -> Z, then X -> Z.
IR4 (decomposition): If X -> YZ, then X -> Y.
IR5 (union): If X -> Y and X -> Z, then X -> YZ.
IR6 (pseudotransitivity): If X -> Y and WY -> Z, then WX -> Z.
In the transitive rule, if X determines Y and Y determines Z, then X must also determine Z:
If X → Y and Y → Z then X → Z
Example: sid -> sname and sname -> city, therefore sid -> city
Normal forms
First normal form
A table is in 1NF when each cell holds a single (atomic) value, each column holds
values of the same kind, column names are unique, and the order of rows does not matter.
Each cell should hold a single value.
Ex (wrong):
ID NAME
1  a,b
2  c
Each data item in a column should be of the same kind (same data type).
Ex (wrong):
ID NAME
1  a
h  c
Column names must be unique.
Ex (wrong):        Ex (right):
ID NAME NAME       ID FNAME LNAME
1  A    B          1  A     B
The order in which you store the data in the table does not matter.
Table 1 (wrong, repeating group):
Rollno Name Subject
101    Akon Os, cn
Table 2 (right):
Rollno Name Subject
101    Akon Os
101    Akon Cn
103    Bkon c
Note that if the primary key is not a composite key, all non-key attributes are always
fully functionally dependent on the primary key.
A table that is in 1st normal form and has a single-attribute primary key is
automatically in 2nd normal form.
The table in this example is in 1NF, since all the attributes are single-valued, but it
is not yet in 2NF. If Student1 leaves the university and the tuple is deleted, we lose
all the information about Professor Schmid, since this attribute is functionally
dependent on the primary key IDSt. To solve this problem, we must create a new table
Professor with the attribute Professor (the name) and the key IDProf. A third table,
Grade, is necessary for combining the two relations Student and Professor and for
managing the grades. Besides the grade, it contains only the two IDs of the student and
the professor. If a student is now deleted, we do not lose the information about the professor.
Example 2: Suppose a school wants to store the data of teachers and the subjects they
teach. They create a table that looks like this: Since a teacher can teach more than one
subject, the table can have multiple rows for the same teacher.
The table is in 1NF because each attribute has atomic values. However, it is not in 2NF,
because the non-prime attribute teacher_age is dependent on teacher_id alone, which is a
proper subset of the candidate key {teacher_id, subject}. This violates the rule for
2NF, which says that no non-prime attribute may be dependent on a proper subset of any
candidate key of the table.
To make the table comply with 2NF, we can break it into two tables like this:
teacher_details table:
teacher_id teacher_age
111 38
222 38
333 40
teacher_subject table:
teacher_id subject
111 Maths
111 Physics
222 Biology
333 Physics
333 Chemistry
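As a sanity check on the decomposition above, the sketch below (plain Python tuples standing in for rows) rebuilds the original table by joining teacher_details and teacher_subject on teacher_id, using the data from the tables above.

```python
# The 1NF teacher table, with teacher_age repeated for every subject taught.
teacher = [
    (111, 38, "Maths"), (111, 38, "Physics"),
    (222, 38, "Biology"),
    (333, 40, "Physics"), (333, 40, "Chemistry"),
]

# 2NF decomposition: teacher_age depends on teacher_id alone, so it moves out.
teacher_details = sorted({(tid, age) for tid, age, _ in teacher})
teacher_subject = sorted({(tid, sub) for tid, _, sub in teacher})

# A natural join on teacher_id reconstructs the original rows exactly (lossless).
rejoined = sorted(
    (tid, age, sub)
    for tid, age in teacher_details
    for tid2, sub in teacher_subject
    if tid == tid2
)
assert rejoined == sorted(teacher)
```

Note that the age 38 is now stored twice only because two different teachers happen to share it, not because one fact is duplicated.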
A bank uses the following relation: Vendor(ID, Name, Account_No, Bank_Code_No, Bank)
The attribute ID is the identification key. All attributes are single valued (1NF). The table is
also in 2NF.
The following dependencies exist:
ID is a prime attribute; the other attributes Name, Account_No, Bank_Code_No, and Bank
are non-prime attributes.
1. Name, Account_No, and Bank_Code_No are functionally dependent on ID (ID -> Name,
Account_No, Bank_Code_No).
2. Bank is functionally dependent on Bank_Code_No (Bank_Code_No -> Bank), and
Bank_Code_No is itself a non-prime attribute.
The table in this example is in 1NF and 2NF. But there is a transitive dependency
between Bank_Code_No and Bank, because Bank_Code_No is not the primary key of this
relation. To get to the third normal form (3NF), we have to put the bank name in a
separate table together with the bank code (clearing number) to identify it.
In this table Rollno is a prime attribute, and state, city, and sname are non-prime attributes.
To bring the table into 3NF, divide it into two tables.
Boyce and Codd Normal Form is a higher version of the Third Normal form or an
extension to the third normal form, and is also known as 3.5 Normal Form.
This form deals with certain type of anomaly that is not handled by 3NF. A 3NF table which
does not have multiple overlapping candidate keys is said to be in BCNF.
For a table to be in BCNF, following conditions must be satisfied:
R must be in 3rd Normal Form
For each functional dependency ( X → Y ), X should be a super Key.
(Note: a non-prime attribute should never determine a prime attribute.)
Super key: a combination of one or more attributes that uniquely identifies a tuple
in a table.
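The BCNF condition can be checked mechanically with attribute closures. The sketch below is a minimal illustration, not a library routine; the three-attribute schema and its FDs are a hypothetical stand-in for an enrolment table like the one in the example that follows.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under the functional dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(relation, fds):
    """A relation is in BCNF iff the left side of every FD is a super key."""
    return all(closure(lhs, fds) == set(relation) for lhs, _ in fds)

# Hypothetical enrolment schema: A = student_id, B = subject, P = professor,
# with {student_id, subject} -> professor and professor -> subject.
R = "ABP"
fds = [("AB", "P"), ("P", "B")]
# P -> B holds, but P is not a super key, so the relation is not in BCNF.
```

Decomposing on the violating dependency (moving professor and subject into their own table, as below) removes the violation.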
Example
Below we have a college enrolment table with columns student_id, subject, and
professor; one of its rows, for instance, is (103, C#, P.Chash).
One student can enroll for multiple subjects. For example, student with student_id 101,
has opted for subjects - Java & C++
For each subject, a professor is assigned to the student.
And, there can be multiple professors teaching one subject like we have for Java.
In the table above student_id, subject together form the primary key, because
using student_id and subject, we can find all the columns of the table.
One more important point to note here is that one professor teaches only one subject,
but one subject may have two different professors.
Hence, there is a dependency between professor and subject here: the subject depends on
the professor name (professor -> subject).
This table satisfies the 1st Normal form because all the values are atomic, column names
are unique and all the values stored in a particular column are of same domain.
This table also satisfies the 2nd Normal Form, as there is no Partial Dependency.
And, there is no Transitive Dependency; hence the table also satisfies the 3rd Normal
Form.
However, the dependency professor -> subject violates BCNF, because professor is not a
super key. To satisfy BCNF, we decompose the table into a Student table
student_id p_id
101 1
101 2
and so on...
And, Professor Table
p_id professor subject
1 P.Java Java
2 P.Cpp C++
and so on...
1. For a dependency A → B, if for a single value of A multiple values of B exist, then
the table may have a multi-valued dependency.
2. Also, a table should have at least 3 columns for it to have a multi-valued dependency.
3. And, for a relation R(A,B,C), if there is a multi-valued dependency between A and B,
then B and C should be independent of each other.
If all these conditions are true for any relation (table), it is said to have multi-valued
dependency.
Example
Below we have a college enrolment table:
s_id course hobby
1 Science Cricket
1 Maths Hockey
2 C# Cricket
2 Php Hockey
In the table above, the student with s_id 1 has opted for two courses, Science and
Maths, and has two hobbies, Cricket and Hockey. The two records for this student give
rise to two more records, as shown below: because the student has two hobbies, each
hobby must be listed alongside both of the courses.
s_id course hobby
1 Science Cricket
1 Maths Hockey
1 Science Hockey
1 Maths Cricket
And, in the table above, there is no relationship between the columns course and hobby.
They are independent of each other.
So there is a multi-valued dependency, which leads to unnecessary repetition of data
and other anomalies as well.
To satisfy the 4th Normal Form, we can decompose the above relation into 2 tables.
CourseOpted Table
s_id course
1 Science
1 Maths
2 C#
2 Php
Hobbies Table,
s_id hobby
1 Cricket
1 Hockey
2 Cricket
2 Hockey
Now this relation satisfies the fourth normal form.
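The losslessness of this 4NF decomposition can be verified with a short sketch; the data below are the course and hobby facts from the tables above, with Python tuples standing in for rows.

```python
from itertools import product

# Courses and hobbies are independent multivalued facts about each student.
courses = {1: ["Science", "Maths"], 2: ["C#", "Php"]}
hobbies = {1: ["Cricket", "Hockey"], 2: ["Cricket", "Hockey"]}

# A single table must list every (course, hobby) combination per student,
# repeating each individual fact.
single_table = sorted(
    (sid, c, h) for sid in courses for c, h in product(courses[sid], hobbies[sid])
)

# The 4NF decomposition stores each fact exactly once per table ...
course_rows = [(sid, c) for sid in courses for c in courses[sid]]
hobby_rows = [(sid, h) for sid in hobbies for h in hobbies[sid]]

# ... and a natural join on s_id rebuilds the combinations losslessly.
rejoined = sorted(
    (sid, c, h) for sid, c in course_rows for sid2, h in hobby_rows if sid == sid2
)
assert rejoined == single_table
```

With more courses or hobbies per student, the single table grows multiplicatively while the decomposed tables grow only additively.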
A table can also have functional dependency along with multi-valued dependency. In that
case, the functionally dependent columns are moved in a separate table and the multi-
valued dependent columns are moved to separate tables.
If we can decompose a table further to eliminate redundancy and anomalies, then when
we re-join the decomposed tables by means of candidate keys, we should not lose the
original data, and no new record set should arise. In simple words, joining two or
more decomposed tables should neither lose records nor create new records.
In the above table, John takes both the Computer and Math classes for Semester 1, but
he does not take the Math class for Semester 2. In this case, the combination of all
these fields is required to identify valid data.
Suppose we add a new semester, Semester 3, but do not yet know the subject or who will
be teaching it, so we would have to leave Lecturer and Subject as NULL. But all three
columns together act as the primary key, so we cannot leave the other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2
& P3:
P1
SEMESTER SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3
SEMESTER LECTURER
Semester 1 Anshika
Semester 1 John
Semester 2 Akash
Semester 1 Praveen
The closure of an attribute set is the set of all attributes it functionally
determines; computing closures is how we find all candidate keys of a given relation.
The following example explains how to find a closure set.
Example 1. R(ABCD) FD { A->B , B->C , C->D}
Sol: A+ = ABCD (all attributes of the relation are present, so A can be a candidate key)
B+ = BCD (all attributes are NOT present, so B cannot be a candidate key)
C+ = CD
D+ = D
AB+ = ABCD (all attributes are present, but AB is still not considered a candidate
key. Why?)
RULE: a candidate key must be minimal. Here the single attribute A is already a
candidate key, so AB, which combines A with the extra attribute B, is only a super
key; it cannot be a candidate key.
In a second example, for a relation that also contains an attribute E: E+ = EC (this
cannot be a candidate key, as not all attributes are present). This indicates that no
single attribute can be the primary key there, so we need to go with combinations of
attributes along with E:
AE+ = ABECD, BE+ = BECDA, CE+ = CE, DE+ = DEACB
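The closure computation used above can be sketched in a few lines of Python. The relation and FDs are exactly those of Example 1; the function is a minimal illustration rather than an optimized algorithm.

```python
def closure(attrs, fds):
    """X+ : every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is already in the closure, pull in the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return "".join(sorted(result))

# Example 1: R(ABCD) with F = {A -> B, B -> C, C -> D}
fds = [("A", "B"), ("B", "C"), ("C", "D")]
assert closure("A", fds) == "ABCD"   # A is a candidate key
assert closure("B", fds) == "BCD"    # B is not
assert closure("AB", fds) == "ABCD"  # a super key, but not minimal
```

The loop simply keeps applying FDs until no new attribute can be added, which is the fixed point X+.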
Properties of normalization
Lossless-join decomposition: this property says that the problem of extra or missing
tuples does not occur after decomposition. Let's explain with an example. We have a relation R.
R
A B C
1 2 1
2 2 2
3 3 2
The relation R is divided randomly into R1(AB) and R2(BC).
R1 R2
A B B C
1 2 2 1
2 2 2 2
3 3 3 2
Having divided the relation R, we need some common attribute between the two tables in
order to join and query them; here B is the common attribute.
A natural join first performs the cross product of the two tables and then, from that
cross product, retains only the tuples that agree on the common attribute.
After applying the natural join to the above cross product, we get the following result R11:
R11
A B C
1 2 1
1 2 2
2 2 1
2 2 2
3 3 2
Compare this with the original table R.
Observe that in R11, for A = 1 there are two values of C (1 and 2), but in the original
table A = 1 corresponds to only one value of C. This indicates that extra tuples are
added after joining.
So this is called a lossy decomposition. Lossy does not mean that we are losing data;
lossy is in terms of tuples: after joining the tables there should be no extra tuples.
Data inconsistency occurs, because not all the values in the new table R11 are valid.
We should always get a lossless decomposition when we join tables. For that there is a rule:
Rule: common attribute should be a candidate key or super key of either R1 or R2 or both.
So in the above relation R, attribute A is a candidate key (primary key) because it
does not have duplicate values, and therefore A should be the common attribute between
the tables.
R1 R2
A B A C
1 2 1 1
2 2 2 2
3 3 3 2
If we apply natural join for the above tables. We get lossless decomposition, so no redundancy.
Conditions for a lossless decomposition are:
1) R1 U R2 = R (ex: AB U AC = ABC)
2) R1 ∩ R2 ≠ ϕ (ex: AB ∩ AC = A)
3) The common attribute (R1 ∩ R2) should be a candidate key or super key of R1 or R2.
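These conditions can be exercised directly. The sketch below projects the example relation R two ways and joins the parts back, showing the spurious tuple in the lossy split on B and an exact reconstruction in the lossless split on A; Python sets of tuples stand in for relations.

```python
# Relation R from the example above (columns A, B, C).
R = {(1, 2, 1), (2, 2, 2), (3, 3, 2)}

def project(rel, cols):
    """Projection onto the given column positions, dropping duplicate rows."""
    return {tuple(t[c] for c in cols) for t in rel}

def join(r1, r2, i1, i2):
    """Natural join: match column i1 of r1 with column i2 of r2."""
    return {t1 + t2[:i2] + t2[i2 + 1:] for t1 in r1 for t2 in r2 if t1[i1] == t2[i2]}

# Lossy split on B: B is not a key of either part.
lossy = join(project(R, (0, 1)), project(R, (1, 2)), 1, 0)   # R1(A,B) ⋈ R2(B,C)
assert (1, 2, 2) in lossy and lossy != R   # a spurious tuple appears

# Lossless split on A: A is a candidate key of both parts.
lossless = join(project(R, (0, 1)), project(R, (0, 2)), 0, 0)  # R1(A,B) ⋈ R2(A,C)
assert lossless == R
```

The only difference between the two runs is which attribute the parts share, which is exactly what the rule above predicts.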
The decomposition of a relation R with FDs F into R1 and R2 with FDs F1 and F2,
respectively, is said to be dependency preserving if (F1 U F2)+ = F+.
That is, a relation R is divided into R1 and R2, and similarly the functional
dependencies are divided into F1 and F2; when rejoined (R1 U R2 = R and F1 U F2
equivalent to F), we should get back the original table and dependencies.
Example 1: Let a relation R(A,B,C,D) and set a FDs F = { A -> B , A -> C , C -> D}
are given.
R1 = (A, B, C) with FDs F1 = {A -> B, A -> C}, and R2 = (C, D) with FDs F2 = {C ->
D}.
1. Put the FDs in a standard form: obtain a collection G of equivalent FDs with a single
attribute on right side.
2. Minimize the left side of each FD: for each FD in G, check each attribute in the left side to
see if it can be deleted while preserving equivalence to F+.
3. Delete redundant FDs : check each remaining FD in G to see if it can be deleted while
preserving equivalence to F+.
Solution: suppose we find A+ = ABC; this indicates that attribute A alone can determine
all the attributes. Then in an FD such as AB -> C, the attribute B is extraneous, and
the FD is reduced to its minimal form A -> C.
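The three steps can be sketched as follows. The input FD set is hypothetical (the example's full FD set is not shown above), chosen so that AB -> C has an extraneous attribute and A -> C is redundant; this is an illustration, not an optimized algorithm.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def minimal_cover(fds):
    # Step 1: standard form, a single attribute on each right side.
    g = [(lhs, r) for lhs, rhs in fds for r in rhs]
    # Step 2: drop extraneous attributes from left sides.
    done = False
    while not done:
        done = True
        for i, (lhs, rhs) in enumerate(g):
            for a in lhs:
                rest = lhs.replace(a, "")
                if rest and rhs in closure(rest, g):
                    g[i] = (rest, rhs)   # a was extraneous on the left
                    done = False
                    break
    # Step 3: drop redundant FDs (those implied by the remaining ones).
    i = 0
    while i < len(g):
        lhs, rhs = g[i]
        rest = g[:i] + g[i + 1:]
        if rhs in closure(lhs, rest):
            g = rest
        else:
            i += 1
    return g

# Hypothetical input: AB -> C reduces to B -> C, then A -> C and one copy drop out.
result = minimal_cover([("A", "B"), ("B", "C"), ("AB", "C"), ("A", "C")])
```

Each step preserves equivalence to the original set, which is exactly the invariant the procedure above requires.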
Equivalence of FDs:
We are given a relation R along with two sets of FDs, X and Y, and need to find whether
these sets of functional dependencies are equivalent. Let's solve an example.
Solution: X covers Y
(tip: by using X functional dependencies find the closure set for Y)
A+ = ABC
B+ = BC
Check from these closure set, whether Y functional dependencies are determined or not.
Yes, A+ = ABC determines A->B , A->C. B+ = BC determines B->C.
So we can say X covers Y.
Similarly, Y covers X
(tip: by using Y functional dependencies find the closure set for X)
A+ = ABC
B+ = BC
Check from these closure set, whether X functional dependencies are determined or not.
Yes, A+ = ABC determines A->B B+ = BC determines B->C.
So we can say Y covers X.
Solution: X covers Y
AB+ = ABCD
C+ = CD
Check from these closure sets whether the Y functional dependencies are determined or not.
Yes: AB+ = ABCD determines AB -> C and AB -> D, and C+ = CD determines C -> D.
So we can say X covers Y.
Y covers X
AB+ = ABCD
B+ = B
C+ = CD
Check from these closure sets whether the X functional dependencies are determined or not.
Yes: AB+ = ABCD determines AB -> CD; B+ = B determines nothing; C+ = CD determines C -> D.
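The cover test used in both solutions reduces to closure computations, sketched below. The FD sets X and Y here are hypothetical stand-ins in the spirit of the first worked example (the original problem statements are not shown above).

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def covers(f, g):
    """True if FD set f implies every FD in g (f covers g)."""
    return all(set(rhs) <= closure(lhs, f) for lhs, rhs in g)

def equivalent(f, g):
    """Two FD sets are equivalent when each covers the other."""
    return covers(f, g) and covers(g, f)

# Hypothetical sets: X carries a redundant A -> C, so X and Y are equivalent.
X = [("A", "B"), ("A", "C"), ("B", "C")]
Y = [("A", "B"), ("B", "C")]
```

For each FD in one set, we simply check that its right side lands inside the left side's closure under the other set.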
Denormalization
Denormalization is a database optimization technique in which we add redundant
data to one or more tables. This can help us avoid costly joins in a relational
database.
Note- that denormalization does not mean not doing normalization. It is an
optimization technique that is applied after doing normalization.
In a traditional normalized database, we store data in separate logical tables and
attempt to minimize redundant data. We may strive to have only one copy of each
piece of data in database.
For example, in a normalized database, we might have a Courses table and a
Teachers table.Each entry in Courses would store the teacherID for a Course but not
the teacherName. When we need to retrieve a list of all Courses with the Teacher
name, we would do a join between these two tables.
In some ways, this is great; if a teacher changes his or her name, we only have to
update the name in one place.
The drawback is that if tables are large, we may spend an unnecessarily long time
doing joins on tables.
Denormalization, then, strikes a different compromise. Under denormalization, we
decide that we’re okay with some redundancy and some extra effort to update the
database in order to get the efficiency advantages of fewer joins.
Pros of Denormalization:-
1. Retrieving data is faster since we do fewer joins
2. Queries to retrieve can be simpler (and therefore less likely to have bugs),
since we need to look at fewer tables.
Cons of Denormalization:-
1. Updates and inserts are more expensive.
2. Denormalization can make update and insert code harder to write.
3. Data may be inconsistent.
4. Data redundancy necessitates more storage.
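The trade-off can be sketched in a few lines; the teacher and course rows below are hypothetical, with Python dictionaries standing in for tables.

```python
# Normalized: teacher names live only in the teachers table; reading a course's
# teacher name requires a join (a dictionary lookup stands in for one here).
teachers = {7: "Ada"}
courses_norm = [{"course": "DBMS", "teacher_id": 7}]
name_via_join = teachers[courses_norm[0]["teacher_id"]]

# Denormalized: the name is copied into each course row; reads skip the join.
courses_denorm = [{"course": "DBMS", "teacher_id": 7, "teacher_name": "Ada"}]
name_direct = courses_denorm[0]["teacher_name"]

# A rename in the normalized design touches exactly one place ...
teachers[7] = "Grace"
# ... while the denormalized design must propagate it to every copied row.
for row in courses_denorm:
    if row["teacher_id"] == 7:
        row["teacher_name"] = "Grace"
```

Forgetting the propagation loop is precisely how the inconsistency listed in the cons above arises.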
3NF requires that every non-key attribute is fully and non-transitively dependent on each
candidate key; there is no such requirement for key attributes.
BCNF requires that every attribute is fully and non-transitively dependent on each
candidate key. This can also be phrased as: the relation satisfies 3NF and, in addition,
every key attribute is fully and non-transitively dependent on each candidate key.
The above means that every BCNF relation is also in 3NF, but not every 3NF relation is
in BCNF. Hence BCNF is stronger than 3NF.