Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 20

UNIT-3

Syllabus: Schema Refinement (Normalization): Purpose of Normalization or schema refinement,


concept of functional dependency, normal forms based on functional dependency(1NF, 2NF and 3 NF),
concept of surrogate key, Boyce-codd normal form(BCNF), Lossless join and dependency preserving
decomposition, Fourth normal form(4NF), Fifth Normal Form (5NF).
…………………………………………………………………………………………………………….

Schema Refinement:
Schema is a relation with attribute set like R(A,B,C,D).Eg. student(sid,sname,cid).
A schema refinement process means to avoid redundancy in the database. Redundancy means same data
placed in 2 different locations.
Redundancy can be in 3 types.
1. Table Level Redundancy:
If a file with the name student.txt is stored in C: Drive and another file with the same name student.txt
is stored in D: Drive then this type of redundancy called Table level Redundancy.
2. Row Level Redundancy:
If a single table contains different tuples with the same values called Row Level Redundancy.
Eg: student table
Sid Sname branch
S1 A CSE
S1 A CSE
S3 B IT
3. Column Level Redundancy:
If a single table contains different columns/fields with the same values called Column Level or Subset of
Column Level Redundancy.
Eg: student table
Sid Sname branch
S1 A CSE
S2 A CSE
S3 B IT

Purpose of Schema Refinement:


In the database to resolve this redundancy we use Normalization or schema refinement.

Worker_id Worker_name Worker_dept Worker_address


65 Ramesh ECT001 Jaipur

65 Ramesh ECT002 Jaipur

73 Amit ECT002 Delhi

76 Vikas ECT501 Pune

76 Vikas ECT502 Pune

79 Rajesh ECT669 Mumbai


In the above table, we have four columns which describe the details about the workers like their name,
address, department and their id. The above table is not normalized, and there is definitely a chance of
anomalies present in the table.
There can be three types of an anomaly in the database:
Updation / Update Anomaly
When we update some rows in the table, and if it leads to the inconsistency of the table then this
anomaly occurs. This type of anomaly is known as an updation anomaly. In the above table, if we want
to update the address of Ramesh then we will have to update all the rows where Ramesh is present. If
during the update we miss any single row, then there will be two addresses of Ramesh, which will lead
to inconsistent and wrong databases.
Insertion Anomaly
If there is a new row inserted in the table and it creates the inconsistency in the table then it is called the
insertion anomaly. For example, if in the above table, we create a new row of a worker, and if it is not
allocated to any department then we cannot insert it in the table so, it will create an insertion anomaly.
Deletion Anomaly
If we delete some rows from the table and if any other information or data which is required is also
deleted from the database, this is called the deletion anomaly in the database. For example, in the above
table, if we want to delete the department number ECT669 then the details of Rajesh will also be deleted
since Rajesh's details are dependent on the row of ECT669. So, there will be deletion anomalies in the
table.

To remove this type of anomalies, we will normalize the table or split the table or join the tables. There
can be various normalized forms of a table like 1NF, 2NF, 3NF, BCNF etc. we will apply the different
normalization schemes according to the current form of the table.
Example 2:

Stu_id Stu_name Stu_branch Stu_club

2018nk01 Shivani Computer science literature


2018nk01 Shivani Computer science dancing

2018nk02 Ayush Electronics Videography

2018nk03 Mansi Electrical dancing

2018nk03 Mansi Electrical singing

2018nk04 Gopal Mechanical Photography


In the above table, we have listed students with their name, id, branch and their respective clubs.
Updation / Update Anomaly
In the above table, if Shivani changes her branch from Computer Science to Electronics, then we will
have to update all the rows. If we miss any row, then Shivani will have more than one branch, which
will create the update anomaly in the table.
Insertion Anomaly
If we add a new row for student Ankit who is not a part of any club, we cannot insert the row into the
table as we cannot insert null in the column of stu_club. This is called insertion anomaly.
Deletion Anomaly
If we remove the photography club from the college, then we will have to delete its row from the table.
But it will also delete the table of Gopal and his details. So, this is called deletion anomaly and it will
make the database inconsistent.

Anomalies:
If the database is redundant we face 3 problems/Anomalies.
1. Update anomalies: If one copy of such repeated data is updated, an inconsistency is created unless
all copies are similarly updated.
Example: STAFF table

For example Room H940 has been improved, it is now of Room size= 500. For updating a single entity,
we have to update all other columns where room=H940.
2. Insertion anomalies: It may not be possible to store some information unless some other
information is stored as well.
Example: STAFF table
For example we have built a new room(B123) but it has not yet been timetabled for any courses or
members of staff.
Deletion anomalies: It may not be possible to delete some information without losing some other
information as well.
Example: STAFF table

For example if we remove the entity, course_no from the table, the details of room C320 get deleted.
Which implies the corresponding course will also get deleted.
Functional Dependency:
The functional dependency is a relationship that exists between two attributes. It typically exists
between the primary key and non-key attribute within a table.
X → Y
The left side of FD is known as a determinant, the right side of the production is known as a dependent.
Example:
Assume we have an employee table with attributes: Emp_Id, Emp_Name, Emp_Address.
Here Emp_Id attribute can uniquely identify the Emp_Name attribute of employee table because if we
know the Emp_Id, we can tell that employee name associated with it.
Functional dependency can be written as:
Emp_Id → Emp_Name
We can say that Emp_Name is functionally dependent on Emp_Id.
Types of Functional Dependency:
There are 2 types of functional dependencies.
1. Trivial Functional Dependency
2. Non-Trivial Functional Dependency
Trivial Functional Dependency:
 A → B has trivial functional dependency if B is a subset of A.
 The following dependencies are also trivial like: A → A, B → B
Example:
Consider a table with two columns Employee_Id and Employee_Name.
{Employee_id, Employee_Name} → Employee_Id is a trivial functional dependency as Employee_
Id is a subset of {Employee_Id, Employee_Name}.
Also, Employee_Id → Employee_Id and Employee_Name → Employee_Name are trivial dependenc
ies too.
Non-Trivial functional dependency:
 A → B has a non-trivial functional dependency if B is not a subset of A.
 When A intersection B is NULL, then A → B is called as complete non-trivial.
Example:
ID → Name, Name → DOB
Armstrong's axioms Properties/Inference Rules:
The term Armstrong axioms refer to the sound and complete set of inference rules or axioms,
introduced by William W. Armstrong, that is used to test the logical implication of functional
dependencies. If F is a set of functional dependencies then the closure of F, denoted as F+, is the set of
all functional dependencies logically implied by F. Armstrong’s Axioms are a set of rules, that when
applied repeatedly, generates a closure of functional dependencies .
Armstrong's axioms are used to conclude functional dependencies on a relational database.
The Functional dependency has 6 types of inference rule:
1. Reflexive Rule (IR1):
In the reflexive rule, if Y is a subset of X, then X determines Y.
If X ⊇ Y then X → Y
Example:
X = {a, b, c, d, e}
Y = {a, b, c}
2.Augmentation Rule (IR2):
The augmentation is also called as a partial dependency. In augmentation, if X determines Y, then XZ
determines YZ for any Z.
If X → Y then XZ → YZ
Example:
For R(ABCD), if A → B then AC → BC
3. Transitive Rule (IR3):
In the transitive rule, if X determines Y and Y determine Z, then X must also determine Z.
If X → Y and Y → Z then X → Z
4. Union Rule (IR4):
Union rule says, if X determines Y and X determines Z, then X must also determine Y and Z.
If X → Y and X → Z then X → YZ
Proof:
X → Y (given)
X → Z (given)
X → XY (using IR2 on 1 by augmentation with X. Where XX = X)
XY → YZ (using IR2 on 2 by augmentation with Y)
X → YZ (using IR3 on 3 and 4)
5. Decomposition Rule (IR5):
Decomposition rule is also known as project rule. It is the reverse of union rule.
This Rule says, if X determines Y and Z, then X determines Y and X determines Z separately.
If X → YZ then X → Y and X → Z
Proof:
X → YZ (given)
YZ → Y (using IR1 Rule)
X → Y (using IR3 on 1 and 2)
6. Pseudo transitive Rule (IR6):
In Pseudo transitive Rule, if X determines Y and YZ determines W, then XZ determines W.
If X → Y and YZ → W then XZ → W
Proof:
X → Y (given)
WY → Z (given)
WX → WY (using IR2 on 1 by augmenting with W)
WX → Z (using IR3 on 3 and 2)
Surrogate key:
It is an artificial key which aims to uniquely identify each record is called a surrogate key. This kind of
partial key in dbms is unique because it is created when you don't have any natural primary key.
Surrogate key is usually an integer. A surrogate key is a value generated right before the record is
inserted into a table.
Example 1:
Fname Lastname Start Time End Time

Anne Smith 09:00 18:00

Jack Francis 08:00 17:00

Anna McLean 11:00 20:00

Shown Willam 14:00 23:00

Above, given example, shown shift timings of the different employee. In this example, a surrogate key is
needed to uniquely identify each employee.
Surrogate keys in sql are allowed when
 No property has the parameter of the primary key.
 In the table when the primary key is too big or complicated.
Example 2:
ProductPrice Table:
Key ProductID Price

505_92 1987 200

698_56 1256 170

304_57 1898 250

458_66 1666 110

Above, the surrogate key is Key in the ProductPricetable.


Some other examples of a Surrogate Key:
 Counter can also be shown as Surrogate Key.
 System date/time stamp
 Random alphanumeric string.

Keys:

Keys play an important role in the relational database.


It is used to uniquely identify any record or row of data from the table. It is also used to establish and
identify relationships between tables.
For example, ID is used as a key in the Student table because it is unique for each student. In the
PERSON table, passport_number, license_number, SSN are keys since they are unique for each person.

Types of keys:

1. Primary key:
It is the first key used to identify one and only one instance of an entity uniquely. An entity can contain
multiple keys, as we saw in the PERSON table. The key which is most suitable from those lists becomes
a primary key.
In the EMPLOYEE table, ID can be the primary key since it is unique for each employee. In the
EMPLOYEE table, we can even select License_Number and Passport_Number as primary keys since
they are also unique.
For each entity, the primary key selection is based on requirements and developers.
2. Candidate key:
A candidate key is an attribute or set of attributes that can uniquely identify a tuple.
Except for the primary key, the remaining attributes are considered a candidate key. The candidate keys
are as strong as the primary key.
For example: In the EMPLOYEE table, id is best suited for the primary key. The rest of the attributes,
like SSN, Passport_Number, License_Number, etc., are considered a candidate key.

3. Super Key:
Super key is an attribute set that can uniquely identify a tuple. A super key is a superset of a candidate
key.
For example: In the above EMPLOYEE table, for(EMPLOEE_ID, EMPLOYEE_NAME), the name of
two employees can be the same, but their EMPLYEE_ID can't be the same. Hence, this combination can
also be a key.

The super key would be EMPLOYEE-ID (EMPLOYEE_ID, EMPLOYEE-NAME), etc.

4. Foreign key:
Foreign keys are the column of the table used to point to the primary key of another table.
Every employee works in a specific department in a company, and employee and department are two
different entities. So we can't store the department's information in the employee table. That's why we
link these two tables through the primary key of one table.
We add the primary key of the DEPARTMENT table, Department_Id, as a new attribute in the
EMPLOYEE table.
In the EMPLOYEE table, Department_Id is the foreign key, and both the tables are related.

5. Alternate key:
There may be one or more attributes or a combination of attributes that uniquely identify each tuple in a
relation. These attributes or combinations of the attributes are called the candidate keys. One key is
chosen as the primary key from these candidate keys, and the remaining candidate key, if it exists, is
termed the alternate key. In other words, the total number of the alternate keys is the total number of
candidate keys minus the primary key. The alternate key may or may not exist. If there is only one
candidate key in a relation, it does not have an alternate key.
For example, employee relation has two attributes, Employee_Id and PAN_No, that act as candidate
keys. In this relation, Employee_Id is chosen as the primary key, so the other candidate key, PAN_No,
acts as the Alternate key.

6. Composite key:
Whenever a primary key consists of more than one attribute, it is known as a composite key. This key is
also known as Concatenated Key.

For example, in employee relations, we assume that an employee may be assigned multiple roles, and
an employee may work on multiple projects simultaneously. So the primary key will be composed of all
three attributes, namely Emp_ID, Emp_role, and Proj_ID in combination. So these attributes act as a
composite key since the primary key comprises more than one attribute.

Attribute closure: Attribute closure of an attribute set can be defined as set of attributes which can be
functionally determined from it.
To find attribute closure of an attribute set-
 add elements of attribute set to the result set.
 recursively add elements to the result set which can be functionally determined from the elements of
result set.
Normalization:
 Normalization is the process of organizing the data in the database.
 Normalization is used to minimize the redundancy from a relation or set of relations. It is also used
to eliminate the undesirable characteristics like Insertion, Update and Deletion Anomalies.
 Normalization divides the larger table into the smaller table and links them using relationship.
 The normal form is used to reduce redundancy from the database table.
Types of Normal Forms:
There are the six types of normal forms:
1. 1NF
2. 2NF
3. 3NF
4. BCNF
5. 4NF
6. 5NF
First Normal Form (1NF):
 A relation will be 1NF if it contains an atomic value.
 It states that an attribute of a table cannot hold multiple values. It must hold only single-valued
attribute.
 First normal form disallows the multi-valued attribute, composite attribute, and their combinations.
Example: Relation EMPLOYEE is not in 1NF because of multi-valued attribute EMP_PHONE.
EMPLOYEE table:
EMP_ID EMP_NAME EMP_PHONE EMP_STATE
14 John 7272826385, UP
9064738238
20 Harry 8574783832 Bihar
12 Sam 7390372389, Punjab
8589830302

The decomposition of the EMPLOYEE table into 1NF has been shown below:
EMP_ID EMP_NAME EMP_PHONE EMP_STATE
14 John 7272826385 UP
14 John 9064738238 UP
20 Harry 8574783832 Bihar
12 Sam 7390372389 Punjab
12 Sam 8589830302 Punjab

Partial Dependency:
Partial Dependency occurs when a non-prime attribute is functionally dependent on part of a candidate
key.

Second Normal Form (2NF):


 In the 2NF, relational must be in 1NF.
 In the second normal form, all non-key attributes are fully functional dependent on the primary key
Example:
<StudentProject>

In the above table, we have partial dependency


• The prime key attributes are StudentID and ProjectNo, and
As stated, the non-prime attributes i.e. StudentName and ProjectName should be functionally
dependent on part of a candidate key, to be Partial Dependent.
• The StudentName can be determined by StudentID, which makes the relation Partial Dependent.
• The ProjectName can be determined by ProjectNo, which makes the relation Partial Dependent.
• Therefore, the <StudentProject> relation violates the 2NF in Normalization and is considered a bad
database design.
• To remove Partial Dependency and violation on 2NF, decompose the tables

Third Normal Form (3NF):


 A relation will be in 3NF if it is in 2NF and not contain any transitive partial dependency.
 3NF is used to reduce the data duplication. It is also used to achieve the data integrity.
 If there is no transitive dependency for non-prime attributes, then the relation must be in third
normal form.
A relation is in third normal form if it holds atleast one of the following conditions for every non-trivial
function dependency X → Y.
1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.
Example:
EMPLOYEE_DETAIL table:
EMP_ID EMP_NAME EMP_ZIP EMP_STATE EMP_CITY
222 Harry 201010 UP Noida
333 Stephan 02228 US Boston
444 Lan 60007 US Chicago
555 Katharine 06389 UK Norwich
666 John 462007 MP Bhopal
Super key in the table above:
{EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME, EMP_ZIP}....so on
Candidate key: {EMP_ID}
Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime.
Here, EMP_STATE & EMP_CITY dependent on EMP_ZIP and EMP_ZIP dependent on EMP_ID. The
non-prime attributes (EMP_STATE, EMP_CITY) transitively dependent on super key(EMP_ID). It
violates the rule of third normal form.
That's why we need to move the EMP_CITY and EMP_STATE to the new <EMPLOYEE_ZIP> table,
with EMP_ZIP as a Primary key.

EMPLOYEE table:
EMP_ID EMP_NAME EMP_ZIP
222 Harry 201010
333 Stephan 02228
444 Lan 60007
555 Katharine 06389
666 John 462007
EMPLOYEE_ZIP table:
EMP_ZIP EMP_STATE EMP_CITY
201010 UP Noida
02228 US Boston
60007 US Chicago
06389 UK Norwich
462007 MP Bhopal

Boyce Codd normal form (BCNF):


 BCNF is the advance version of 3NF. It is stricter than 3NF.
 A table is in BCNF if every functional dependency X → Y, X is the super key of the table.
 For BCNF, the table should be in 3NF, and for every FD, LHS is super key.
Example: Let's assume there is a company where employees work in more than one department.
EMPLOYEE table:
EMP_ID EMP_COUNTRY EMP_DEPT DEPT_TYPE EMP_DEPT_NO
264 India Designing D394 283
264 India Testing D394 300
364 UK Stores D283 232
364 UK Developing D283 549
In the above table Functional dependencies are as follows:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate key: {EMP-ID, EMP-DEPT}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone are keys.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 India
364 UK
EMP_DEPT table:
EMP_DEPT DEPT_TYPE EMP_DEPT_NO
Designing D394 283
Testing D394 300
Stores D283 232
Developing D283 549
EMP_DEPT_MAPPING table:
EMP_ID EMP_DEPT
264 Designing
264 Testing
364 Stores
364 Developing
Functional dependencies:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Now, this is in BCNF because left side part of both the functional dependencies is a key.

Fourth normal form (4NF):


 A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued dependency.
 For a dependency A → B, if for a single value of A, multiple values of B exists, then the relation
will be a multi-valued dependency.
Example

STUDENT Table:
STU_ID COURSE HOBBY
21 Computer Dancing
21 Math Singing
34 Chemistry Dancing
74 Biology Cricket
59 Physics Hockey
The given STUDENT table is in 3NF, but the COURSE and HOBBY are two independent entity.
Hence, there is no relationship between COURSE and HOBBY.
In the STUDENT relation, a student with STU_ID, 21 contains two courses, Computer and Math and
two hobbies, Dancing and Singing. So there is a Multi-valued dependency on STU_ID, which leads to
unnecessary repetition of data.
So to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE:
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY:
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey

Fifth normal form (5NF):


 A relation is in 5NF if it is in 4NF and not contains any join dependency and joining should be
lossless.
 5NF is satisfied when all the tables are broken into as many tables as possible in order to avoid
redundancy.
 5NF is also known as Project-join normal form (PJ/NF).
Example:
SUBJECT LECTURER SEMESTER
Computer Anshika Semester 1
Computer John Semester 1
Math John Semester 1
Math Akash Semester 2
Chemistry Praveen Semester 1
In the above table, John takes both Computer and Math class for Semester 1 but he doesn't take Math
class for Semester 2. In this case, combination of all these fields required to identify a valid data.
Suppose we add a new Semester as Semester 3 but do not know about the subject and who will be taking
that subject so we leave Lecturer and Subject as NULL. But all three columns together acts as a primary
key, so we can't leave other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2 & P3:
P1:
SEMESTER SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
P2:
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3:
SEMSTER LECTURER
Semester 1 Anshika
Semester 1 John
Semester 1 John
Semester 2 Akash
Semester 1 Praveen

Decomposition of Relation:
Decomposition of a relation is done when a relation in relational model is not in appropriate normal
form. Relation R is decomposed into two or more relations if decomposition is lossless join as well as
dependency preserving.
Lossless Decomposition:
 If the information is not lost from the relation that is decomposed, then the decomposition will be
lossless.
 The lossless decomposition guarantees that the join of relations will result in the same relation as it
was decomposed.
 The relation is said to be lossless decomposition if natural joins of all the decomposition give the
original relation.
Example:
EMPLOYEE_DEPARTMENT table:
EMP_ID EMP_NAME EMP_AGE EMP_CITY DEPT_ID DEPT_NAME
22 Denim 28 Mumbai 827 Sales
33 Alina 25 Delhi 438 Marketing
46 Stephan 30 Bangalore 869 Finance
52 Katherine 36 Mumbai 575 Production
60 Jack 40 Noida 678 Testing
The above relation is decomposed into two relations EMPLOYEE and DEPARTMENT

EMPLOYEE table:
EMP_ID EMP_NAME EMP_AGE EMP_CITY
22 Denim 28 Mumbai
33 Alina 25 Delhi
46 Stephan 30 Bangalore
52 Katherine 36 Mumbai
60 Jack 40 Noida
DEPARTMENT table:
DEPT_ID EMP_ID DEPT_NAME
827 22 Sales
438 33 Marketing
869 46 Finance
575 52 Production
678 60 Testing

Lossless Join Decomposition:


If we decompose a relation R into relations R1 and R2,
 Decomposition is lossy if R1 ⋈ R2 ⊃ R
 Decomposition is lossless if R1 ⋈ R2 = R
To check for lossless join decomposition using FD set, following conditions must hold:
1. Union of Attributes of R1 and R2 must be equal to attribute of R. Each attribute of R must be
either in R1 or in R2.
Att(R1) U Att(R2) = Att(R)
2. Intersection of Attributes of R1 and R2 must not be NULL.
Att(R1) ∩ Att(R2) ≠ Φ
3. Common attribute must be a key for at least one relation (R1 or R2)
Att(R1) ∩ Att(R2) ->Att(R1) or Att(R1) ∩ Att(R2) ->Att(R2)
For Example, A relation R (A, B, C, D) with FD set{A->BC} is decomposed into R1(ABC) and
R2(AD) which is a lossless join decomposition as:
1. First condition holds true as Att(R1) U Att(R2) = (ABC) U (AD) = (ABCD) = Att(R).
2. Second condition holds true as Att(R1) ∩ Att(R2) = (ABC) ∩ (AD) ≠ Φ
3. Third condition holds true as Att(R1) ∩ Att(R2) = A is a key of R1(ABC) because A->BC is
given.
Example:
In the above two relations are joined on the common column "EMP_ID", then the resultant relation will
look like:
Employee ⋈ Department:
EMP_ID EMP_NAME EMP_AGE EMP_CITY DEPT_ID DEPT_NAME
22 Denim 28 Mumbai 827 Sales
33 Alina 25 Delhi 438 Marketing
46 Stephan 30 Bangalore 869 Finance
52 Katherine 36 Mumbai 575 Production
60 Jack 40 Noida 678 Testing
Hence, the decomposition is Lossless join decomposition.
Dependency Preserving Decomposition:
If we decompose a relation R into relations R1 and R2, All dependencies of R either must be a part of
R1 or R2 or must be derivable from combination of FD’s of R1 and R2.
For Example, A relation R (A, B, C, D) with FD set{A->BC} is decomposed into R1(ABC) and
R2(AD) which is dependency preserving because FD A->BC is a part of R1(ABC).

You might also like