on
Automated Database Normalization Up-To Third Normal Form
Using Data Dictionary
For Partial Fulfillment of the Requirements for the Degree of Master of Science in
Computer Science Awarded by Pokhara University
Submitted by:
Ashok Chand
MSc.CS
Roll No: 14531
NEPAL COLLEGE OF
INFORMATION TECHNOLOGY
Balkumari, Lalitpur,Nepal.
September 2018
DECLARATION
I hereby declare that the work presented in the thesis entitled "Automated Database
Normalization Up-To Third Normal Form Using Data Dictionary", submitted to
Nepal College of Information Technology, Pokhara University, is my original work,
carried out in partial fulfillment of the requirements for the degree of Master of Science in
Computer Science (MSc.CS).
……………………….
Ashok Chand
Date: 6 September 2018
ACKNOWLEDGEMENT
First of all, I would like to express my sincere gratitude to my respected supervisor, Assoc.
Prof. Roshan Chitrakar, PhD, for his advice, support, guidance and valuable time for
discussion, which provided ideas and impetus in each and every phase of my thesis work.
I would like to thank respected Mr. Saroj Shakya, who gave me full support from
proposing the topic to this stage. I will always be grateful for his insight and
guidance. I would especially like to thank Mr. Sanjeev Kumar Pandey for his valuable
suggestions, guidance and support in every stage of my thesis work. I also thank Mr.
Madan Kadariya for all his suggestions in the initial phase of my dissertation.
My sincere admiration goes to Mr. Shashidhar Ram Joshi, Mr. Niranjan Khakurel,
Mr. Kumar Pudashaini, and Mr. Sanjay Kushwaha for providing me with such broad
knowledge and inspiration over the period of two years. I would also like to thank
all the teachers and staff members of Nepal College of Information Technology.
I would especially like to thank all my friends for their support in each and every
challenging step of this thesis work.
Finally, I would like to express my profound gratitude to my dearest parent, sister and
brothers, who support and encourage me in every moment of my life to achieve my
goal.
Ashok Chand
September 2018
ABSTRACT
TABLE OF CONTENTS
1. INTRODUCTION .......................................................................................................... 1
1.1 Database Normalization ............................................................................................ 1
1.2 Anomalies in Databases ............................................................................................ 1
1.3 Normal forms ............................................................................................................ 2
1.3.1. First Normal form ............................................................................................. 2
1.3.2. Second Normal form ......................................................................................... 3
1.3.3. Third Normal form ............................................................................................ 4
1.4 Motivation ................................................................................................................. 6
2. PROBLEM STATEMENT ............................................................................................. 7
3. OBJECTIVES ................................................................................................................. 8
4. LITERATURE REVIEW ............................................................................................... 9
4.1. Dependency Graph Diagram.................................................................................... 9
4.2. Dependency Matrix ................................................................................................ 10
4.3. Directed Graph Matrix ........................................................................................... 11
4.4. Closure dependency ............................................................................................... 12
5. METHODOLOGY ....................................................................................................... 15
5.1 Research Model ...................................................................................................... 15
5.1.1 Literature Review............................................................................................. 15
5.1.2 Problem Formulation: ...................................................................................... 16
5.1.3 Design of Algorithms:...................................................................................... 16
Working Model ............................................................................................................. 18
5.1.4 Implementation: ............................................................................................... 18
5.1.5 Analysis............................................................................................................ 18
5.2. Data Analysis ......................................................................................................... 19
6. EXPERIMENTS ........................................................................................................... 20
6.1. Tools and Environment .......................................................................................... 20
6.1.1. Matplotlib........................................................................................................ 20
6.1.2. Networkx......................................................................................................... 21
6.2. Validation Testing - workflow ............................................................................... 21
7. RESULTS AND DISCUSSION ................................................................................... 23
7.1. Output Analysis ..................................................................................................... 23
7.2. Findings and Discussion ........................................................................................ 27
8. VALIDATION OF SYSTEM ....................................................................................... 30
8.1. Time Complexity ................................................................................................... 30
8.2. Space Complexity .................................................................................................. 30
9. CONCLUSION AND FUTURE WORKS ................................................................... 32
10. REFERENCES AND BIBLIOGRAPHY ................................................................... 33
11. APPENDIX (SOURCE CODE) ................................................................................. 35
LIST OF TABLE
LIST OF FIGURE
1. INTRODUCTION
The three types of anomalies described here are update, insertion and deletion anomalies.
An insertion anomaly is a failure to place information about a new database entry into all the
places in the database where information about the new entry needs to be stored. In a
properly normalized database, information about a new entry needs to be inserted into only
one place in the database. In an inadequately normalized database, information about a new
entry may need to be inserted into more than one place and, human fallibility being what
it is, some of the needed additional insertions may be missed.
A deletion anomaly is a failure to remove information about an existing database entry when
it is time to remove that entry. In a properly normalized database, information about an old
entry that is to be removed needs to be deleted from only one place in the database. In an
inadequately normalized database, information about that old entry may need to be deleted
from more than one place.
An update anomaly involves modifications that may be additions, deletions, or both. Thus
“update anomalies” can be either of the kinds discussed above.
All three kinds of anomalies are highly undesirable, since their occurrence constitutes
corruption of the database. Properly normalized databases are much less susceptible to
corruption than non-normalized databases.
The first three normal forms are described in this topic: 1NF, 2NF, 3NF.
Definition: A relation is said to be in First Normal Form (1NF) if and only if each attribute
of the relation is atomic with a primary key defined for every row. More simply, to be in
1NF, each row must be a unique Tuple.
Example: The following table is NOT in First Normal Form:
Table 1: Manager-Employee Table
Manager  Employees
Renee    Mike

Manager  Employee
Jim      Susan
Jim      Rob
Jim      Beth
Mary     Alice
Mary     John
Mary     Asim
Renee    Mike
Joe      Alan
Joe      Tim
Definition: In order to be in Second Normal Form, a relation must first fulfill the
requirements to be in First Normal Form. Additionally, each non-key attribute in the
relation must be fully functionally dependent upon the whole primary key, not on only a
part of it.
Example: The following relation is in First Normal Form, but not Second Normal Form:
Table 3: Customer-info Table in 1NF
In the table above, the order number serves as the primary key. Notice that the customer
and total amount are dependent upon the order number -- this data is specific to each order.
However, the contact person is dependent upon the customer. An alternative way to
accomplish this would be to create two tables:
Table 4: Customer-info(1) in 2NF
The creation of two separate tables eliminates the dependency problem experienced in the
previous case. In the first table, contact person is dependent upon the primary key –
customer name. The second table only includes the information unique to each order.
Someone interested in the contact person for each order could obtain this information by
performing a JOIN operation.
For the third normal form, the following criterion must be fulfilled:
Remove columns that are not fully dependent upon the primary key.
Table 6: Order-info Table
Order Number  Customer Number  Unit Price  Quantity  Total
1             241              $10         2         $20
2             842              $9          20        $180
Our first requirement is that the table must satisfy the requirements of 1NF and 2NF. Are
there any duplicative columns? No. Do we have a primary key? Yes, the order number.
Therefore, we satisfy the requirements of 1NF. Are there any subsets of data that apply to
multiple rows? No, so we also satisfy the requirements of 2NF.
Now, are all of the columns fully dependent upon the primary key? The customer number
varies with the order number and it doesn't appear to depend upon any of the other fields.
What about the unit price? This field could be dependent upon the customer number in a
situation where we charged each customer a set price. However, looking at the data above,
it appears we sometimes charge the same customer different prices. Therefore, the unit
price is fully dependent upon the order number. The quantity of items also varies from
order to order, so we're OK there.
What about the total? It looks like we might be in trouble here. The total can be derived by
multiplying the unit price by the quantity, therefore it's not fully dependent upon the
primary key. We must remove it from the table to comply with the third normal form:
Table 7: Order-info Table in 3NF
Order Number  Customer Number  Unit Price  Quantity
1             241              $10         2
2             842              $9          20
3             919              $19         1
4             919              $12         10
Now our table is in 3NF. But, what about the total? This is a derived field and it's best not
to store it in the database at all. We can simply compute it "on the fly" when performing
database queries. For example, we might have previously used this query to retrieve order
numbers and totals:
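The queries themselves are not reproduced in this text. As a minimal sketch of the idea, assuming a hypothetical table named orders whose columns mirror Table 7 (the table and column names are illustrative and are not taken from the thesis), the total can be computed at query time with Python's built-in sqlite3 module:

```python
import sqlite3

# Hypothetical schema mirroring Table 7; all names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_number INTEGER PRIMARY KEY, "
    "customer_number INTEGER, "
    "unit_price INTEGER, "
    "quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 241, 10, 2), (2, 842, 9, 20), (3, 919, 19, 1), (4, 919, 12, 10)],
)

# The total is derived on the fly instead of being stored as a column.
totals = conn.execute(
    "SELECT order_number, unit_price * quantity AS total "
    "FROM orders ORDER BY order_number"
).fetchall()
print(totals)
conn.close()
```

Because the total is recomputed from unit price and quantity on every query, it cannot drift out of step with the stored values.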
1.4 Motivation
Database normalization has always been a field of interest for set theorists and computer
scientists. With the growth of data in every field, optimization of data representation is a
must. Database normalization is, in some sense, a form of optimization. Data redundancy
is handled by normalization, and it is a good approach to handling the anomalies that occur
in a database.
The automation of database normalization has not been studied much and is an interesting
field for research. Detecting transitive dependencies is one of the steps needed to bring
a table into third normal form.
2. PROBLEM STATEMENT
Various methods for automated database normalization can be found in the literature. The
method proposed here is the use of a data dictionary for database normalization. Whether it
is practical to use a data dictionary is the main research concern of this work.
3. OBJECTIVES
4. LITERATURE REVIEW
For the detection of transitive dependencies, Amir Hassan Bahmani has presented a
method in [1]. According to [1], the following procedure yields the transitive
dependencies:
Table 8: Dependency Matrix(DM) for Example 1 in [1]
     A  B  C  D  E  F  G
A    2  1  1  1  0  0  0
C    0  0  2  1  1  0  0
D    0  0  0  2  0  0  1
EF   0  0  0  1  2  2  1
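The matrix in Table 8 can be rebuilt mechanically from the functional dependencies. The sketch below assumes the encoding that the table itself suggests (2 when the column attribute is part of the row's determinant, 1 when it depends on that determinant, 0 otherwise) and uses the FD set {A → BCD, C → DE, EF → DG, D → G}. The function name dependency_matrix is hypothetical and is not taken from [1].

```python
def dependency_matrix(fds):
    """Build a determinant-by-attribute matrix from (determinant, dependents) pairs."""
    attributes = sorted({a for det, deps in fds for a in det + deps})
    matrix = {}
    for det, deps in fds:
        row = []
        for attr in attributes:
            if attr in det:
                row.append(2)   # attribute is part of the determinant
            elif attr in deps:
                row.append(1)   # attribute depends on the determinant
            else:
                row.append(0)   # no direct relationship
        matrix[det] = row
    return attributes, matrix


fds = [("A", "BCD"), ("C", "DE"), ("EF", "DG"), ("D", "G")]
attributes, dm = dependency_matrix(fds)
print("   ", " ".join(attributes))
for det, row in dm.items():
    print(f"{det:<3}", " ".join(str(v) for v in row))
```

Each distinct determinant contributes one row, so the matrix has as many rows as determinants and as many columns as simple attributes.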
Figure 3: Algorithm flowchart for DG Matrix
Another approach to automated database normalization is presented by Chetneti Srisa-an in
[2]. The paper describes a complete automated relational database normalization method
that first produces a directed graph and a spanning tree, and then proceeds to generate the
2NF and 3NF normal forms.
The paper uses two structures, the Functional Dependency Graph (DG) and the Spanning Tree
Graph (STG), to manipulate the dependencies among the attributes of a relation.
Example 1 in [2] is considered here:
Fds= {A → BCD, C → DE, EF → DG, D → G}
After applying DG and STG, the database attributes form a forest in which every individual
tree represents a table that is in third normal form.
Figure 5: Spanning Tree Graph (STG)
5. METHODOLOGY
5.1.2 Problem Formulation:
The next part of the research is problem formulation. After the literature survey, an
attempt has been made in this thesis to detect transitive dependencies
using a data dictionary. This proposal was the exploratory part of the overall research because
its outcome was uncertain: the research might have failed, since nothing
was predefined about the data dictionary representation of a graph.
5.1.3 Design of Algorithms:
Algorithms were designed for database normalization using a data dictionary for graph
representation. The time complexity of these algorithms is quadratic. The
algorithms are presented below.
In this method, the Graph is stored in the form of incoming and outgoing attributes of a
node.
Algorithm 1: INOUT Data Dictionary.
Let us consider the example Fds= {A → BCD, C → DE, EF → DG, D → G}.
These Fds will be stored in the form of incoming and outgoing attributes as follows:
1: A: {in:Φ; out:B}, A: {in:Φ; out:C}, A: {in:Φ; out:D}
2: B: {in:A; out:Φ}
3: C: {in:A; out:D}, C: {in:A; out:E}
4: D: {in:C; out:G}, D: {in:A; out:G}
5: E: {in:C; out:Φ}
6: F: {in:Φ; out:Φ}
7: EF: {in:Φ; out:D}, EF: {in:Φ; out:G}
After the Fds are stored, a simple algorithm is used to obtain the transitive dependencies.
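The storage scheme and the detection step can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation: the function names are hypothetical, each node keeps one set of incoming and one set of outgoing nodes, and a dependency X → Z is reported as transitive only when some Y with X → Y and Y → Z exists (paths of length two), which is an assumption about the detection rule.

```python
from collections import defaultdict


def parse_fds(fds):
    """Split strings such as "A -> BCD" into (determinant, dependent) edges."""
    edges = []
    for fd in fds:
        left, right = (side.strip() for side in fd.split("->"))
        for attr in right:
            edges.append((left, attr))
    return edges


def build_inout(edges):
    """Store the graph as incoming and outgoing node sets per node."""
    inout = defaultdict(lambda: {"in": set(), "out": set()})
    for src, dst in edges:
        inout[src]["out"].add(dst)
        inout[dst]["in"].add(src)
    return dict(inout)


def transitive_edges(inout):
    """Return edges (X, Z) for which some Y gives X -> Y and Y -> Z (assumed rule)."""
    found = set()
    for node, info in inout.items():
        for mid in info["out"]:
            for dst in inout[mid]["out"]:
                if dst in info["out"]:   # X -> Z is also stated directly
                    found.add((node, dst))
    return found


fds = ["A -> BCD", "C -> DE", "EF -> DG", "D -> G"]
inout = build_inout(parse_fds(fds))
tds = transitive_edges(inout)
print(sorted(tds))
```

For this input the sketch flags A → D (through C) and EF → G (through D). Longer chains would need a reachability search rather than the two-step lookup used here.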
Figure 7: Flowchart Getting Transitive Dependencies from INOUT Dictionary
Next, all the FDs are traced from the given input and stored in a variable called
ALL_FD. The following method is then applied to remove the TDs and bring the database
tables into 3NF:
Algorithm 2: Identify and Remove Transitive dependency
Normal= []
for i in ALL_FD:
    temp= i
    for j in TD:
        if j in temp:
            remove j.out from temp
    add temp to Normal
Output: Normal.
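A runnable reading of Algorithm 2 is sketched below. The data layout is an assumption made for illustration: ALL_FD is represented as a dictionary from each determinant to the set of its dependents, and TD as a set of (determinant, dependent) pairs. The function name remove_transitive is hypothetical.

```python
def remove_transitive(all_fd, td):
    """Drop every dependent that is listed as transitive for its determinant."""
    normal = {}
    for det, deps in all_fd.items():
        kept = {d for d in deps if (det, d) not in td}
        if kept:                      # skip determinants left with no dependents
            normal[det] = kept
    return normal


all_fd = {"A": {"B", "C", "D"}, "C": {"D", "E"}, "EF": {"D", "G"}, "D": {"G"}}
td = {("A", "D"), ("EF", "G")}
normal = remove_transitive(all_fd, td)
for det in sorted(normal):
    print(det, "->", "".join(sorted(normal[det])))
```

Each remaining determinant with its dependents then corresponds to one candidate table.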
Working Model
5.1.4 Implementation:
The implementation phase consists of writing the computer programs in the Python
language. The details of the environment and tools are described in the next section.
5.1.5 Analysis
In the analysis phase, the run time of the computer model prepared for this research is
analyzed for different inputs.
5.2. Data Analysis
Data analysis was done on the run time of three different types of transitive dependencies:
Linear Transitive Dependency, Circular Transitive Dependency and Hybrid Transitive
Dependency.
Linear Transitive Dependency is a TD of the form A → B → C.
Circular Transitive Dependency is of the form A → B → C → A, and Tree Type Transitive
Dependency is of the form A → B → C, B → D.
Hybrid Transitive Dependency is a TD where the dependencies form a circle as well as
a line. For example, A → B, B → C, C → D, C → A.
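Test inputs of each kind can be generated programmatically, which is convenient when timing the modules on inputs of growing size. The helper names below are hypothetical, and single-letter attribute names limit the sketch to 26 attributes. The hybrid generator follows the example above by adding one edge from the second-to-last attribute back to the first.

```python
import string


def linear_fds(n):
    """A -> B -> C -> ... as a list of (determinant, dependent) pairs."""
    names = string.ascii_uppercase[:n]
    return [(names[i], names[i + 1]) for i in range(n - 1)]


def circular_fds(n):
    """A linear chain closed by an edge from the last attribute to the first."""
    names = string.ascii_uppercase[:n]
    return linear_fds(n) + [(names[-1], names[0])]


def hybrid_fds(n):
    """A linear chain plus a back edge from the second-to-last attribute (n >= 3)."""
    names = string.ascii_uppercase[:n]
    return linear_fds(n) + [(names[n - 2], names[0])]


print(linear_fds(3))
print(circular_fds(3))
print(hybrid_fds(4))
```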
6. EXPERIMENTS
In the experiment with the proposed model, the functional dependencies (FDs) of a relation
are first required as input. These FDs are used by the INOUT data dictionary algorithm. Then,
a simple algorithm is used to find the partial dependencies and remove them. After the
partial dependencies are removed, the database tables are in 2NF. Then, a simple algorithm
is used to find the transitive dependencies (TDs) and remove them. After the TDs are
removed, the database tables are in 3NF, which is the required output.
Program execution method:
1. Give the functional dependencies as input.
2. Run the SecondNormalform module.
3. Obtain the second normal form graph.
4. Use this second normal form graph as input for the ThirdNormalform module.
5. Run the ThirdNormalform module.
6. Obtain the third normal form graph.
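The partial-dependency step can be illustrated with the textbook definition: a dependency X → a is partial when a proper subset of the composite determinant X already determines a. The sketch below applies that definition directly to the stated FDs. It is not the logic of the SecondNormalform module, and the function name is hypothetical.

```python
def partial_dependencies(fds):
    """Return (determinant, attribute, subset) triples for partial dependencies.

    fds is a list of (determinant, dependents) strings such as ("AB", "CDEF").
    Only FDs that are stated explicitly are compared; no closure is computed.
    """
    partial = set()
    for det, deps in fds:
        if len(det) < 2:
            continue                          # single attributes cannot be split
        for sub, sub_deps in fds:
            if set(sub) < set(det):           # proper subset of the determinant
                for attr in set(deps) & set(sub_deps):
                    partial.add((det, attr, sub))
    return partial


result = partial_dependencies([("AB", "CDEF"), ("A", "EF")])
print(sorted(result))
```

For the input AB → CDEF, A → EF the attributes E and F are reported as partially dependent on AB, because A alone determines them.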
6.1.1. Matplotlib
6.1.2. Networkx
NetworkX is a Python library for studying graphs and networks. NetworkX is free Software
released under the BSD new license [6]. NetworkX is suitable for operation on large real-
world graphs: e.g., graphs in excess of 10 million nodes and 100 million edges [7]. Due to
its dependence on a pure-Python "dictionary of dictionary" data structure, NetworkX is a
reasonably efficient, very scalable, highly portable framework for network and social
network analysis [8].
Unit Testing:
Unit testing is a level of software testing in which individual units or components of the
software are tested. The purpose is to validate that each unit of the software performs as
designed. A unit is the smallest testable part of any software.
This system has four main units: SecondNormalform, ThirdNormalform,
GraphVisualization for second normal form and GraphVisualization for third normal form.
First, the second normal form unit was developed and tested to confirm that it works
properly. Then the third normal form unit was developed and tested. The graph visualization
units for second and third normal form were also developed and tested.
Integration Testing:
Integration testing is a level of software testing in which individual units are combined and
tested as a group. The purpose of this level of testing is to expose faults in the interaction
between integrated units. In this system, two or more units were combined and tested
together: the SecondNormalform unit was combined with the GraphVisualization unit for
second normal form and the combined model was tested, and the ThirdNormalform unit was
combined with the GraphVisualizationThirdNormalform unit and the combined model was tested.
System Testing:
System testing is a level of software testing in which the complete, integrated software is
tested. The purpose of this test is to evaluate the system's compliance with the specified
requirements.
7. RESULTS AND DISCUSSION
The Table created from the above graph will be as follows:
Hence, the partial dependency of D with BC has been removed by creating A as primary
key (the starting node of the directed graph is considered the primary key). But still there is a
normal form. The Third Normal form module identifies the transitive dependencies (if
any exist) from the graph presented above and removes them. After removal of Transitive
Here we have a forest of graphs. We will choose the graphs from this forest in such a way
that the union of all the nodes of the graphs gives all the attributes present in the FDs and each
graph has at least one member in another graph.
So we will choose (A → B, A → C) and (C → D). Therefore, we will have two tables as
follows:
C[Primary key] D
Example 2:
Functional Dependency: Fds= ["A → BCD", "C → D", "EF → DG", "D → G"]
Output of Module SecondNormalform at
Output: [['A', 'B', 1], ['A', 'C', 1], ['A', 'D', 1], ['C', 'D', 1], ['EF', 'E', 0], ['EF', 'F', 0],
['EF', 'D', 1], ['EF', 'G', 1], ['D', 'G', 1]]
Output: [['A', 'B', 1], ['A', 'C', 1], ['A', 'D', 1], ['C', 'D', 1], ['EF', 'E', 0], ['EF', 'F', 0],
['EF', 'D', 1], ['EF', 'G', 1], ['D', 'G', 1]]
Output: [(('A', 'B', 1), ('A', 'C', 1), ('C', 'D', 1), ('A', 'D', 1), ('D', 'G', 1)), (('EF', 'D',
1), ('D', 'G', 1), ('EF', 'G', 1))], which is the directed graph for the attributes after the
removal of partial dependencies. The directed graph of functional dependencies is
used to create the tables in second normal form. Figures are presented on the next page.
Figure 12: Directed Graph for Second Normal form of Example 2
There will be two tables for second normal form for this example. A will be the primary
key for Table 1 and EF will be the primary key for Table 2. D will act as foreign key in
Table 1. The tables are listed below:
There are many transitive dependencies in the above two tables, which will be removed
by the Third Normal form module to bring these tables into third normal form. The
directed graph created after applying this module is shown in the figure on the next page:
Figure 13: Directed Graph for Third Normal form of Example 2
6 9 7 12
7 10 8 15
8 12 10 17
9 12 11 18
10 15 10 17
11 14 12 17
12 17 14 19
13 16 15 17
14 20 18 24
15 19 17 21
16 21 20 25
17 23 19 25
Figure 14: Time Comparison of different Transitive Dependencies
8. VALIDATION OF SYSTEM
Validation was done by comparing the method of Amir Hassan Bahmani with the proposed model.
When we go through the algorithms presented in this thesis, we find that none of
the modules and sub-modules has complexity greater than O(n^2). This is obtained from
the analysis of the source code presented in the appendix.
Amir Hassan Bahmani has used a Dependency Matrix and a Directed Graph Matrix.
The dependency matrix requires n*m units of space, where n is the number of determinant
keys and m is the number of simple keys. Similarly, the representation of the directed graph
matrix requires n*n units of space. Therefore, the total space required is (m+n)*n units.
The model presented in this thesis (the data dictionary model) uses a linear array that
holds information about each key (both determinant and simple keys).
This information consists of the in and out degree of the graph formed by the functional
dependencies of the keys. If the maximum in and out degree of a graph formed
by functional dependencies is C (where C is a constant), then the total space required for the
model used in this thesis is m*C.
Theoretically, the value of C is not a universal constant: it changes with each
new database table. C can be assumed to be constant for a particular database, but it does
not remain constant when switching databases.
Generally, the value of C is less than n because it is very rare for a determinant key
to be functionally related to all of the simple keys. And, in general,
m<n in real databases because determinant keys
(fundamentally a primary key or composite key) have fewer attributes than the
remaining attributes of a database table, which are not part of determinant keys.
From the above discussion it can be concluded that C<n and m<n in practical cases of
database design. This implies that m*C < (m+n)*n.
Therefore, we conclude that the space complexity of the model presented in this thesis is
better than that of Amir Hassan Bahmani's method.
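The two estimates can be compared numerically for a concrete FD set. The sketch below is an illustration under stated assumptions: n is taken as the number of distinct determinant keys and m as the number of simple keys, as defined above, and C is computed as the largest in-degree or out-degree found in the dependency graph, which is one possible reading of "maximum in and out degree". The function name space_units is hypothetical.

```python
def space_units(fds):
    """Compare the two space estimates for (determinant, dependents) pairs."""
    determinants = {det for det, _ in fds}
    simple_keys = {a for det, deps in fds for a in det + deps}
    out_degree, in_degree = {}, {}
    for det, deps in fds:
        for attr in deps:
            out_degree[det] = out_degree.get(det, 0) + 1
            in_degree[attr] = in_degree.get(attr, 0) + 1
    n, m = len(determinants), len(simple_keys)
    # Assumed reading of C: the largest in-degree or out-degree of any node.
    c = max(list(out_degree.values()) + list(in_degree.values()))
    return {"matrix": (m + n) * n, "dictionary": m * c}


fds = [("A", "BCD"), ("C", "DE"), ("EF", "DG"), ("D", "G")]
estimate = space_units(fds)
print(estimate)
```

For the FD set {A → BCD, C → DE, EF → DG, D → G} this gives n = 4, m = 7 and C = 3, so the matrix representation needs (7+4)*4 = 44 units while the dictionary estimate is 7*3 = 21 units.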
9. CONCLUSION AND FUTURE WORKS
In this research, a data dictionary has been used to automate the database normalization
process. The attributes of tables in a database are represented as variables and their
functional dependencies are provided by the user. This research covers
database normalization up to third normal form. Tests were made for different kinds of
transitive dependencies, termed Linear, Circular and Hybrid dependencies.
The performance of the module was analyzed for these types of transitive
dependencies, and the findings are that the average time taken by Circular TDs was the
minimum and Hybrid TDs took the maximum time.
This research was bounded within third normal form. Further extensions can be made by
automating the normalization process for BCNF, 4NF, 5NF and so on. The
Python NetworkX module is strongly recommended for getting the optimum results
from the graphs.
10. REFERENCES AND BIBLIOGRAPHY
[3] A Note on Exploratory research; aweshkar Vol. XVII Issue 1 March 2014 WeSchool
[6] NetworkX first public release (NX-0.2), From: Aric Hagberg, Date: 12 April 2005,
Python-announce-list mailing list
[7] Aric Hagberg, Drew Conway, "Hacking social networks using the Python
programming language (Module II – Why do SNA in NetworkX)",
Sunbelt 2010: International Network for Social Network Analysis.
[8] Aric A. Hagberg, Daniel A. Schult, Pieter J. Swart, Exploring Network Structure,
Dynamics, and Function using NetworkX, Proceedings of the 7th Python in Science
conference (SciPy 2008), G. Varoquaux, T. Vaught, J. Millman (Eds.), pp. 11–15.
[12] https://en.wikipedia.org/wiki/Database_normalization
[13] Yazici, A., and Z. Karakaya, Normalizing Relational Database Schemas Using
Mathematica, LNCS, Springer-Verlag, Vol.3992, pp. 375-382,2006
11. APPENDIX (SOURCE CODE)
Module: SecondNormalform
# Functional dependencies, written as "determinant-->dependents".
Fds = ["A-->BC", "BC-->D", "C-->D"]
# Alternative inputs:
#Fds = ["AB-->CDEF", "A-->EF"]
#Fds = ["A-->BCD", "C-->DE", "EF-->DG", "D-->G"]
#Fds = ["A-->BCD", "D-->E", "F-->E", "E-->G", "FE-->G"]
#Fds = ["A-->BCDEF", "BCD-->GH", "D-->H"]
#Fds = ["A-->BD", "B-->C", "B-->D", "D-->E"]
#Fds = ["A-->BCD", "BCD-->E", "B-->E"]


def intersection(x, y):
    # True when every attribute of edge y also occurs in edge x.
    return set(y[0] + y[1]) <= set(x[0] + x[1])


directed_graph = []  # entries are [node, node, weight]
for fd in Fds:
    X, Y = fd.split("-->")
    if len(X) > 1:
        # Link a composite determinant to its own attributes with weight 0.
        for j in X:
            directed_graph.append([X, j, 0])
    for j in Y:
        directed_graph.append([X, j, 1])
print(directed_graph)

# Mark an edge as a partial dependency (weight 0.5) when the attributes of
# another weighted edge are all contained in it.
for i in range(len(directed_graph)):
    for j in range(len(directed_graph)):
        if i != j and directed_graph[i][2] != 0 and directed_graph[j][2] != 0:
            if intersection(directed_graph[i], directed_graph[j]):
                directed_graph[i][2] = 0.5
print(directed_graph)

# Keep only the full dependencies.
final_ans = [edge for edge in directed_graph if edge[2] == 1]

# A starting element is a determinant that no remaining edge points to.
starting_elements = []
for i in final_ans:
    if all(i[0] != j[1] for j in final_ans):
        starting_elements.append(i[0])
starting_elements = list(set(starting_elements))

# Build one table per starting element: its own edges, each followed by the
# edges that leave the dependent attribute.
tables = []
for i in starting_elements:
    temp = []
    for j in final_ans:
        if i == j[0]:
            temp.append(j)
            for k in final_ans:
                if j[1] == k[0]:
                    temp.append(k)
    tables.append(temp)

final_tables = [tuple(tuple(j) for j in i) for i in tables]

# All nodes that appear in the remaining edges.
list_of_elems = list({node for edge in final_ans for node in edge[:2]})
print(final_tables)
Module: ThirdNormalform
Module: GraphVisualization
Module: GraphVisualizationThirdNormalform