

GASPAR, VANESSA C.
2-BSAIS-3
MON (6:00 PM - 9:00 PM)

ASSIGNMENT NO. 2

1. What is data normalization?


It’s safe to say that we live in the era of big data. Collecting, storing, and analyzing information has
become a top priority for organizations, which means that companies are building and utilizing
databases to handle all that data. In the ongoing effort to use big data, you may have come across the
term “data normalization.” Understanding this term and knowing why it is so important to business
operations today can give a company a real advantage as they go further in-depth with big data in the
future.
Data Normalization
So what is normalized data in the first place? A data normalization definition isn't hard to find, but
settling on a specific one can be a bit tricky. Taking into account all the different explanations out there,
data normalization is essentially a process in which data within a database is reorganized so that users
can properly utilize that database for further queries and analysis.
There are some goals in mind when undertaking the data normalization process. The first one is to get
rid of any duplicate data that might appear within the data set. The process goes through the database
and eliminates any redundancies that occur. Redundancies can adversely affect analysis of the data
since they are values the analysis does not actually need. Expunging them from the database helps to clean up
the data, making it easier to analyze. The other goal is to logically group data together. You want data
that relates to each other to be stored together. This will occur in a database which has undergone data
normalization. If pieces of data depend on each other, they should be stored close to each other within the data set.
With that general overview in mind, let’s take a closer look at the process itself. While the process can
vary depending on the type of database you have and what type of information you collect, it usually
involves several steps. One such step is eliminating duplicate data as discussed above. Another step is
resolving any conflicting data. Sometimes, datasets will have information that conflicts with each other,
so data normalization is meant to address this conflicting issue and solve it before continuing. A third
step is formatting the data. This takes data and converts it into a format that allows further processing
and analysis to be done. Finally, data normalization consolidates data, combining it into a much more
organized structure.
Consider the state of big data today and how much of it consists of unstructured data. Organizing it
and turning it into a structured form is needed now more than ever, and data normalization helps with
that effort.
The Importance of Data Normalization
Now that you know the basics of what normalizing data means, you may wonder why it's so important to do
so. Put in simple terms, a properly designed and well-functioning database should undergo data
normalization in order to be used successfully. Data normalization gets rid of a number of anomalies
that can make analysis of the data more complicated. Some of those anomalies can crop up from
deleting data, inserting more information, or updating existing information. Once those errors are
worked out and removed from the system, further benefits can be gained through other uses of the
data and data analytics.
It is usually through data normalization that the information within a database can be formatted in such
a way that it can be visualized and analyzed. Without it, a company can collect all the data it wants, but
most of it will simply go unused, taking up space and not benefiting the organization in any meaningful
way. And when you consider how much money businesses are willing to invest in gathering data and
designing databases, not making the most of that data can be a serious detriment.

2. What are the different levels of normalization?

The inventor of the relational model, Edgar Codd, proposed the theory of normalization with the
introduction of First Normal Form, and he continued to extend the theory with Second and Third Normal
Form. Later he joined with Raymond F. Boyce to develop the theory of Boyce-Codd Normal Form.

The theory of data normalization in SQL is still being developed further. For example, there are discussions
even on a 6th Normal Form. However, in most practical applications, normalization achieves its best in
3rd Normal Form. The evolution of normalization theory runs from 1NF through 2NF, 3NF and BCNF up to 4NF, 5NF and the proposed 6NF, as discussed below.

Database Normalization Examples -

Assume a video library maintains a database of movies rented out. Without any normalization, all
information is stored in one table as shown below.
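(The original table illustration is not reproduced in this copy. A plausible sketch of such an unnormalized table, with illustrative member names and movie titles, might look like this:)

Full Names  | Physical Address | Movies Rented                                    | Salutation
Janet Jones | First Street 253 | Pirates of the Caribbean, Clash of the Titans    | Ms.
Robert Phil | 3rd Street 34    | Forgetting Sarah Marshall, Daddy's Little Girls  | Mr.
Robert Phil | 5th Avenue       | Clash of the Titans                              | Mr.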

Here you can see that the Movies Rented column has multiple values.


Database Normal Forms

Now let's move into 1st Normal Form.

I. 1NF (First Normal Form) Rules

 Each table cell should contain a single value.


 Each record needs to be unique.

Here is the above table converted to 1NF:

1NF Example
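(Again, the original figure is missing from this copy; a sketch of the same illustrative data in 1NF, with one movie per row, would be:)

Full Names  | Physical Address | Movies Rented             | Salutation
Janet Jones | First Street 253 | Pirates of the Caribbean  | Ms.
Janet Jones | First Street 253 | Clash of the Titans       | Ms.
Robert Phil | 3rd Street 34    | Forgetting Sarah Marshall | Mr.
Robert Phil | 3rd Street 34    | Daddy's Little Girls      | Mr.
Robert Phil | 5th Avenue       | Clash of the Titans       | Mr.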

Before we proceed let's understand a few things --

What is a KEY?

A KEY is a value used to identify a record in a table uniquely. A KEY could be a single column or a
combination of multiple columns.

Note: Columns in a table that are NOT used to identify a record uniquely are called non-key columns.

What is a Primary Key?

A primary key is a single-column value used to identify a database record uniquely.

It has the following attributes:

 A primary key cannot be NULL


 A primary key value must be unique
 The primary key values should rarely be changed
 The primary key must be given a value when a new record is inserted.

What is a Composite Key?


A composite key is a primary key composed of multiple columns used to identify a record uniquely.

In our database, we have two people with the same name, Robert Phil, but they live in different places.
Hence, we require both Full Name and Address to identify a record uniquely. That is a composite key.
Let's move into the second normal form, 2NF.

II. 2NF (Second Normal Form) Rules

 Rule 1- Be in 1NF
 Rule 2- Single Column Primary Key (that is, no non-key column should depend on only part of the key)

It is clear that we can't move our simple database forward into 2nd Normal Form unless we
partition the table above.

We have divided our 1NF table into two tables, viz. Table 1 and Table 2. Table 1 contains member
information. Table 2 contains information on movies rented.

We have introduced a new column called Membership_id, which is the primary key for Table 1. Records
can be uniquely identified in Table 1 using the membership id.
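(The two resulting tables are not pictured in this copy; continuing the illustrative data from above, they might look like this:)

Table 1 (members)
Membership_ID | Full Names  | Physical Address | Salutation
1             | Janet Jones | First Street 253 | Ms.
2             | Robert Phil | 3rd Street 34    | Mr.
3             | Robert Phil | 5th Avenue       | Mr.

Table 2 (movies rented)
Membership_ID | Movies Rented
1             | Pirates of the Caribbean
1             | Clash of the Titans
2             | Forgetting Sarah Marshall
2             | Daddy's Little Girls
3             | Clash of the Titans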

Database - Foreign Key

In Table 2, Membership_ID is the Foreign Key

A foreign key references the primary key of another table! It helps connect your tables.
 A foreign key can have a different name from its primary key
 It ensures rows in one table have corresponding rows in another
 Unlike the Primary key, they do not have to be unique. Most often they aren't
 Foreign keys can be null even though primary keys can not 

Why do you need a foreign key?

Suppose a novice inserts an invalid record into Table 2.


You will only be able to insert values into your foreign key that exist in the unique key in the parent
table. This enforces referential integrity.

The above problem can be overcome by declaring the membership id from Table 2 as a foreign key
referencing the membership id from Table 1.
Now, if somebody tries to insert a value in the membership id field that does not exist in the parent
table, an error will be shown!
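A minimal sketch of how this could be declared in SQL (the table and column names follow the example above; exact syntax varies slightly between database engines):

CREATE TABLE Table1 (
    Membership_ID INT PRIMARY KEY,       -- uniquely identifies each member
    Full_Name     VARCHAR(100) NOT NULL,
    Address       VARCHAR(200) NOT NULL,
    Salutation    VARCHAR(10)
);

CREATE TABLE Table2 (
    Membership_ID INT,                   -- foreign key pointing back to Table1
    Movie_Rented  VARCHAR(200),
    FOREIGN KEY (Membership_ID) REFERENCES Table1 (Membership_ID)
);

-- This insert would fail, because membership id 99 does not exist in the parent table:
-- INSERT INTO Table2 (Membership_ID, Movie_Rented) VALUES (99, 'Avatar');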

What are transitive functional dependencies?


A transitive functional dependency is when changing a non-key column might cause another
non-key column to change.

Consider Table 1: changing the non-key column Full Name may change Salutation.

Let's move into 3NF

III. 3NF (Third Normal Form) Rules

 Rule 1- Be in 2NF
 Rule 2- Has no transitive functional dependencies

To move our 2NF table into 3NF, we again need to divide our table.

3NF Example

We have again divided our tables and created a new table which stores Salutations. 

There are no transitive functional dependencies, and hence our table is in 3NF.

In Table 3, Salutation ID is the primary key, and in Table 1, Salutation ID is a foreign key referencing the primary key in Table 3.
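(A sketch of the resulting 3NF tables, again using the illustrative data from earlier, might be:)

Table 1 (members)
Membership_ID | Full Names  | Physical Address | Salutation_ID
1             | Janet Jones | First Street 253 | 2
2             | Robert Phil | 3rd Street 34    | 1
3             | Robert Phil | 5th Avenue       | 1

Table 3 (salutations)
Salutation_ID | Salutation
1             | Mr.
2             | Ms.
3             | Mrs.
4             | Dr.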
Now our little example is at a level that cannot be decomposed further to attain higher normal
forms; in fact, it already satisfies them. Separate efforts to move into the next
levels of normalization are normally needed only in complex databases. However, we will briefly discuss the
next levels of normalization in the following sections.

IV. Boyce-Codd Normal Form (BCNF)

Even when a database is in 3rd Normal Form, anomalies can still result if it has more than
one candidate key.

BCNF is sometimes also referred to as 3.5 Normal Form.

V. 4NF (Fourth Normal Form) Rules

If no database table instance contains two or more independent, multivalued facts describing the
relevant entity, then it is in 4th Normal Form.

VI. 5NF (Fifth Normal Form) Rules

A table is in 5th Normal Form only if it is in 4NF and it cannot be decomposed into any number of smaller
tables without loss of data.

VII. 6NF (Sixth Normal Form) Proposed

6th Normal Form is not yet standardized; however, it has been discussed by database experts for some
time. Hopefully, we will have a clear and standardized definition for 6th Normal Form in the near
future.

That's all for normalization!

3. What is the difference between normalized data and unnormalized data? Provide examples
for each.

Normalization and denormalization are methods used in databases. The terms are easy to tell apart:
normalization is a technique for minimizing insertion, deletion and update anomalies by
eliminating redundant data. On the other hand, denormalization is the inverse process of
normalization, where redundancy is deliberately added to the data to improve the performance of a specific
application.
Normalization prevents disk space wastage by minimizing or eliminating redundancy.


Comparison Chart

BASIS FOR COMPARISON | NORMALIZATION | DENORMALIZATION
Basic | Normalization is the process of creating a set schema to store non-redundant and consistent data. | Denormalization is the process of combining the data so that it can be queried speedily.
Purpose | To reduce data redundancy and inconsistency. | To achieve faster execution of queries by introducing redundancy.
Used in | OLTP systems, where the emphasis is on making inserts, deletes and updates fast and on storing quality data. | OLAP systems, where the emphasis is on making search and analysis fast.
Data integrity | Maintained | May not be retained
Redundancy | Eliminated | Added
Number of tables | Increases | Decreases
Disk space | Optimized usage | Wasted

Definition of Normalization
Normalization is the method of arranging the data in the database efficiently. It involves constructing
tables and setting up relationships between those tables according to certain rules. Redundancy and
inconsistent dependencies can be removed using these rules in order to make the design more
flexible.

Redundant data wastes disk space, increases data inconsistency and slows down DML queries. If the
same data is present in more than one place and that data is updated, then the
change must be reflected in all locations. Inconsistent data makes searching for and accessing data
harder.

There are various reasons for performing normalization: to avoid redundancy, update
anomalies and unnecessary coding, to keep the data in a form that can accommodate change more easily
and accurately, and to enforce data constraints.

Normalization includes the analysis of functional dependencies between attributes. Relations
(tables) containing anomalies are decomposed to generate well-structured relations. This analysis helps in deciding
which attributes should be grouped together in a relation.

Normalization is based on the concept of normal forms. A relation (table) is said to be in a
normal form if it fulfils a certain set of constraints. There are six commonly defined normal forms: 1NF, 2NF, 3NF,
BCNF, 4NF and 5NF. Normalization should eliminate redundancy, but not at the cost of integrity.

Definition of Denormalization

Denormalization is the inverse process of normalization, where the normalized schema is converted
into a schema which has redundant information. Performance is improved by using redundancy while
keeping the redundant data consistent. The reason for performing denormalization is
the overhead produced in the query processor by an over-normalized structure.

Denormalization can also be defined as the method of storing the join of higher normal form relations
as a base relation in a lower normal form. It reduces the number of tables and of complicated
table joins, because a higher number of joins can slow down the process. There are various
denormalization techniques, such as storing derivable values, pre-joining tables, hardcoding values and
keeping details with the master.

The denormalization approach emphasizes the idea that placing all the data in one place
eliminates the need to search multiple tables to collect it. The basic strategy in denormalization
is to select the most important process and examine the modifications that will ultimately improve
its performance. The most basic alteration is adding attributes to an existing table to reduce the
number of joins.

Key Differences Between Normalization and Denormalization


1. Normalization is the technique of dividing the data into multiple tables to reduce data
redundancy and inconsistency and to achieve data integrity. On the other hand, denormalization
is the technique of combining the data into a single table to make data retrieval faster.
2. Normalization is used in OLTP systems, where the emphasis is on making inserts, deletes and
updates fast. In contrast, denormalization is used in OLAP systems, where the emphasis is
on making search and analysis fast.
3. Data integrity is maintained in the normalization process, while in denormalization data integrity
is harder to retain.
4. Redundant data is eliminated when normalization is performed, whereas denormalization
increases the redundant data.
5. Normalization increases the number of tables and joins. In contrast, denormalization reduces
the number of tables and joins.
6. Disk space is wasted in denormalization because the same data is stored in different places. By
contrast, disk space usage is optimized in a normalized database.

When to denormalize a database

What is database denormalization? Before diving into the subject, let’s emphasize that normalization
still remains the starting point, meaning that you should first of all normalize a database’s structure. The
essence of normalization is to put each piece of data in its appropriate place; this ensures data integrity
and facilitates updating. However, retrieving data from a normalized database can be slower, as queries
need to address many different tables where different pieces of data are stored. Updating, by
contrast, gets faster, as each piece of data is stored in a single place.

The majority of modern applications need to be able to retrieve data in the shortest time possible. And
that’s when you can consider denormalizing a relational database. As the name suggests,
denormalization is the opposite of normalization. When you normalize a database, you organize data to
ensure integrity and eliminate redundancies. Database denormalization means you deliberately put the
same data in several places, thus increasing redundancy.

“Why denormalize a database at all?” you may ask. The main purpose of denormalization is to
significantly speed up data retrieval. However, denormalization isn’t a magic pill. Developers should use
this tool only for particular purposes:

#1 To enhance query performance

Typically, a normalized database requires joining a lot of tables to answer queries; but the more joins, the
slower the query. As a countermeasure, you can add redundancy to a database by copying values
from a parent table into a child table, thereby reducing the number of joins required for a query, as sketched below.
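A minimal sketch of this idea in SQL, using the email-service tables (Users, Messages) that appear in later examples; the user_name and subject columns are assumptions made for illustration:

-- Copy the parent value (user_name from Users) into the child table (Messages):
ALTER TABLE Messages ADD COLUMN user_name VARCHAR(100);

UPDATE Messages
SET user_name = (SELECT u.user_name
                 FROM Users u
                 WHERE u.user_id = Messages.user_id);

-- A message listing no longer needs to join Users:
SELECT message_id, subject, user_name
FROM Messages;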

#2 To make a database more convenient to manage


A normalized database doesn’t have calculated values that are essential for applications. Calculating
these values on-the-fly would require time, slowing down query execution.

You can denormalize a database to provide calculated values. Once they’re generated and added to
tables, downstream programmers can easily create their own reports and queries without having in-
depth knowledge of the app’s code or API.

#3 To facilitate and accelerate reporting

Often, applications need to provide a lot of analytical and statistical information. Generating reports
from live data is time-consuming and can negatively impact overall system performance.

Denormalizing your database can help you meet this challenge. Suppose you need to provide a total
sales summary for one or many users; a normalized database would aggregate and calculate all invoice
details multiple times. Needless to say, this would be quite time-consuming, so to speed up this process,
you could maintain the year-to-date sales summary in a table storing user details.
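A rough sketch of how such a summary column might be maintained; the Invoices table and its columns (amount, invoice_date), as well as the ytd_sales column and the hardcoded year start, are assumptions for illustration:

-- Denormalized year-to-date summary kept on the user row:
ALTER TABLE Users ADD COLUMN ytd_sales DECIMAL(12,2) DEFAULT 0;

-- Refresh the summary (for example after loading new invoices):
UPDATE Users
SET ytd_sales = (SELECT COALESCE(SUM(i.amount), 0)
                 FROM Invoices i
                 WHERE i.user_id = Users.user_id
                   AND i.invoice_date >= '2024-01-01');  -- start of the current year

-- The report now reads one row per user instead of aggregating all invoices:
SELECT user_id, ytd_sales FROM Users;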

Database denormalization techniques

Now that you know when you should go for database denormalization, you’re probably wondering how
to do it right. There are several denormalization techniques, each appropriate for a particular situation.
Let’s explore them in depth:

Storing derivable data

If you need to execute a calculation repeatedly during queries, it’s best to store the results of it. If the
calculation contains detail records, you should store the derived calculation in the master table.
Whenever you decide to store derivable values, make sure that denormalized values are always
recalculated by the system.

Here are situations when storing derivable values is appropriate:

 When you frequently need derivable values

 When you don’t alter source values frequently

Advantages | Disadvantages
No need to look up source values each time a derivable value is needed | Running data manipulation language (DML) statements against the source data requires recalculation of the derivable data
No need to perform a calculation for every query or report | Data inconsistencies are possible due to data duplication

Example

As an example of this denormalization technique, let’s suppose we’re building an email messaging
service. Having received a message, a user gets only a pointer to this message; the pointer is stored in
the User_messages table. This is done to prevent the messaging system from storing multiple copies of
an email message in case it’s sent to many different recipients at a time. But what if a user deletes a
message from their account? In this case, only the respective entry in the User_messages table is
actually removed. So to completely delete the message, all User_messages records for it must be
removed.

Denormalization of data in one of the tables can make this much simpler: we can add
a users_received_count column to the Messages table to keep a record of how many User_messages pointers are kept for a specific
message. When a user deletes this message (read: removes the pointer to the actual message),
the users_received_count column is decremented by one. Naturally, when
the users_received_count equals zero, the actual message can be deleted completely.
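A sketch of this logic in SQL; the user_id and message_id columns and the literal values are assumptions, and in practice these statements would run inside one transaction or a trigger:

-- User 7 deletes their pointer to message 42:
DELETE FROM User_messages
WHERE user_id = 7 AND message_id = 42;

-- Keep the denormalized counter in sync:
UPDATE Messages
SET users_received_count = users_received_count - 1
WHERE message_id = 42;

-- Once no pointers remain, the message itself can be removed:
DELETE FROM Messages
WHERE message_id = 42 AND users_received_count = 0;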

Using pre-joined tables

To pre-join tables, you need to add to one table a non-key column that bears no direct business value. This way,
you can avoid joining tables and therefore speed up queries. Yet you must ensure that the
denormalized column gets updated every time the master column value is altered.
This denormalization technique can be used when you have to make lots of queries against many
different tables – and as long as stale data is acceptable.

Advantages | Disadvantages
No need to use multiple joins | DML is required to keep the denormalized column updated
You can put off updates as long as stale data is tolerable | An extra column requires additional working and disk space

Example

Imagine that users of our email messaging service want to access messages by category. Keeping the
name of a category right in the User_messages table can save time and reduce the number of necessary
joins.

In the denormalized User_messages table, we introduce a category_name column to store information about
which category each record in the User_messages table is related to. Thanks to denormalization, only a
query on the User_messages table is required to enable a user to select all messages belonging to a
specific category. Of course, this denormalization technique has a downside − this extra column may
require a lot of storage space.
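A short sketch of the resulting query; the message_id, subject and user_id columns and the 'Work' category are assumptions for illustration:

-- With the denormalized category_name column, no join to a Categories table is needed:
SELECT message_id, subject, category_name
FROM User_messages
WHERE user_id = 7
  AND category_name = 'Work';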

Using hardcoded values


If there’s a reference table with constant records, you can hardcode them into your application. This
way, you don’t need to join tables to fetch the reference values.

However, when using hardcoded values, you should create a check constraint to validate values against
the reference values. This constraint must be rewritten each time a new value is required.
This data denormalization technique should be used if values are static throughout the lifecycle of your
system and as long as the number of these values is quite small. Now let's have a look at the pros and
cons of this technique:

Advantages | Disadvantages
No need to implement a lookup table | Recoding and restating are required if lookup values are altered
No joins to a lookup table |

Example

Suppose we need to find out background information about users of an email messaging service, for
example the kind, or type, of user. We’ve created a User_kinds table to store data on the kinds of users
we need to recognize.

The values stored in this table aren’t likely to be changed frequently, so we can apply hardcoding. We
can add a check constraint to the column or build the check constraint into the field validation for the
application where users sign in to our email messaging service.
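A minimal sketch of such a check constraint; the user_kind column and the kind values ('admin', 'regular', 'guest') are assumptions, since the actual values are not given in the text:

-- Hardcode the allowed kinds instead of joining to the User_kinds lookup table:
ALTER TABLE Users
ADD CONSTRAINT chk_user_kind CHECK (user_kind IN ('admin', 'regular', 'guest'));

-- Note: this constraint must be rewritten whenever a new kind is introduced.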

Keeping details with the master

There can be cases when the number of detail records per master is fixed or when detail records are
queried with the master. In these cases, you can denormalize a database by adding detail columns to
the master table. This technique proves most useful when there are few records in the detail table.
Advantages | Disadvantages
No need to use joins | Increased complexity of DML
Saves space |

Example

Imagine that we need to limit the maximum amount of storage space a user can get. To do so, we need
to implement restraints in our email messaging service − one for messages and another for files. Since
the amount of allowed storage space for each of these restraints is different, we need to track each
restraint individually. In a normalized relational database, we could simply introduce two different tables
− Storage_types and Storage_restraints − that would store records for each user.

Instead, we can go a different way and add denormalized columns to the Users table:

message_space_allocated

message_space_available

file_space_allocated

file_space_available

In this case, the denormalized Users table stores not only the actual information about a user but the
restraints as well, so in terms of functionality the table doesn’t fully correspond to its name.
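A sketch of the denormalized columns in SQL; the column types are assumptions, and the ADD COLUMN syntax varies slightly between engines:

-- Instead of separate Storage_types / Storage_restraints detail tables:
ALTER TABLE Users ADD COLUMN message_space_allocated INT;
ALTER TABLE Users ADD COLUMN message_space_available INT;
ALTER TABLE Users ADD COLUMN file_space_allocated INT;
ALTER TABLE Users ADD COLUMN file_space_available INT;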

Repeating a single detail with its master


When you deal with historical data, many queries need a specific single record and rarely require other
details. With this database denormalization technique, you can introduce a new foreign key column for
storing this record with its master. When using this type of denormalization, don’t forget to add code
that will update the denormalized column when a new record is added.

Advantages | Disadvantages
No need to create joins for queries that need a single record | Data inconsistencies are possible as a record value must be repeated

Example
Often, users send not only messages but attachments too. The majority of messages are sent either
without an attachment or with a single attachment, but in some cases users attach several files to a
message.

We can avoid a table join by denormalizing the Messages table through adding
the first_attachment_name column. Naturally, if a message contains more than one attachment, only
the first attachment will be taken from the Messages table, while the other attachments will be stored in a
separate Attachments table and will therefore require table joins. In most cases, however, this
denormalization technique will be really helpful.
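A brief sketch; the message_id, subject and attachment_name columns are assumptions for illustration:

ALTER TABLE Messages ADD COLUMN first_attachment_name VARCHAR(255);

-- The common case (zero or one attachment) needs no join:
SELECT message_id, subject, first_attachment_name
FROM Messages
WHERE message_id = 42;

-- Only messages with several attachments still need the detail table:
SELECT a.attachment_name
FROM Attachments a
WHERE a.message_id = 42;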
Adding short-circuit keys

If a database has over three levels of master detail and you need to query only records from the lowest
and highest levels, you can denormalize your database by creating short-circuit keys that connect the
lowest-level grandchild records to higher-level grandparent records. This technique helps you reduce
the number of table joins when queries are executed.

Advantages | Disadvantages
Fewer tables are joined during queries | Need to use more foreign keys
 | Need extra code to ensure consistency of values

Example

Now let’s imagine that an email messaging service has to handle frequent queries that require data from
the Users and Messages tables only, without addressing the Categoriestable. In a normalized database,
such queries would need to join the Users and Categoriestables.

To improve database performance and avoid such joins, we can add a primary or unique key from
the Users table directly to the Messages table. This way we can provide information about users and
messages without querying the Categories table, which means we can do without a redundant table
join.
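A sketch of the short-circuit key in SQL; the user_id, user_name and subject columns and the constraint name are assumptions for illustration:

-- Short-circuit key: store the Users key directly on Messages, skipping Categories:
ALTER TABLE Messages ADD COLUMN user_id INT;

ALTER TABLE Messages
ADD CONSTRAINT fk_messages_users FOREIGN KEY (user_id) REFERENCES Users (user_id);

-- Lowest level joined straight to the highest level, no Categories join needed:
SELECT u.user_name, m.subject
FROM Users u
JOIN Messages m ON m.user_id = u.user_id;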
Drawbacks of database denormalization

Now you’re probably wondering: to denormalize or not to denormalize?

Though denormalization seems like the best way to increase performance of a database and,
consequently, an application in general, you should resort to it only when other methods prove
inefficient. For instance, often insufficient database performance can be caused by incorrectly written
queries, faulty application code, inconsistent index design, or even improper hardware configuration.

Denormalization sounds tempting and extremely efficient in theory, but it comes with a number of
drawbacks that you must be aware of before going with this strategy:

 Extra storage space

When you denormalize a database, you have to duplicate a lot of data. Naturally, your database
will require more storage space.

 Additional documentation

Every single step you take during denormalization must be properly documented. If you change
the design of your database sometime later, you’ll need to revise all rules you created before:
you may not need some of them or you may need to upgrade particular denormalization rules.

 Potential data anomalies

When denormalizing a database, you should understand that you get more data that can be
modified. Accordingly, you need to take care of every single case of duplicate data. You should
use triggers, stored procedures, and transactions to avoid data anomalies.

 More code

When denormalizing a database you modify select queries, and though this brings a lot of
benefits it has its price − you need to write extra code. You also need to update values in new
attributes that you add to existing records, which means even more code is required.

 Slower operations

Database denormalization may speed up data retrievals but at the same time it slows down
updates. If your application needs to perform a lot of write operations to the database, it may
show slower performance than a similar normalized database. So make sure to implement
denormalization without damaging the usability of your application.

Database denormalization tips

As you can see, denormalization is a serious process that requires a lot of effort and skill. If you want to
denormalize databases without any issues, follow these useful tips:

1. Instead of trying to denormalize the whole database right away, focus on particular parts that
you want to speed up.
2. Do your best to learn the logical design of your application really well to understand what parts
of your system are likely to be affected by denormalization.

3. Analyze how often data is changed in your application; if data changes too often, maintaining
the integrity of your database after denormalization could become a real problem.

4. Take a close look at what parts of your application are having performance issues; often, you
can speed up your application by fine-tuning queries rather than denormalizing the database.

5. Learn more about data storage techniques; picking the most relevant can help you do without
denormalization.

Reasons for Database Normalization

There are three main reasons to normalize a database.  The first is to minimize duplicate data, the
second is to minimize or avoid data modification issues, and the third is to simplify queries. 

As we go through the various states of normalization we’ll discuss how each form addresses these
issues, but to start, let’s look at some data which hasn’t been normalized and discuss some potential
pitfalls. 

I think once you understand the issues, you'll better appreciate normalization. Consider the following
table:

Note: The primary key columns are underlined
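(The table image is not included in this copy. A hypothetical reconstruction based on the columns and values mentioned below, with EmployeeID as the key, might be; all names other than John Hunt are illustrative:)

EmployeeID | SalesPerson | SalesOffice | OfficeNumber | Customer1        | Customer2 | Customer3
1003       | Mary Smith  | Chicago     | 312-555-1212 | Ford             | GM        |
1004       | John Hunt   | New York    | 212-555-1212 | General Electric | Nestle    |
1005       | Martin Hap  | Chicago     | 312-555-1212 | Boeing           |           |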


The first thing to notice is this table serves many purposes including:

1. Identifying the organization’s salespeople

2. Listing the sales offices and phone numbers

3. Associating a salesperson with a sales office


4. Showing each salesperson’s customers
As a DBA this raises a red flag.  In general I like to see tables that have one purpose.  Having the table
serve many purposes introduces many challenges; namely, data duplication, data update issues,
and increased effort to query data.

Data Duplication and Modification Anomalies

Notice that for each SalesPerson we have listed both the SalesOffice and OfficeNumber, so the sales
office data is duplicated.  Duplicated information presents two problems:

1. It increases storage and decreases performance.

2. It becomes more difficult to maintain data changes.


For example:

Consider if we move the Chicago office to Evanston, IL.  To properly reflect this in our table, we need to
update the entries for all the SalesPersons currently in Chicago.  Our table is a small example, but you
can see that if it were larger, this could potentially involve hundreds of updates.

These situations are modification anomalies. Database normalization fixes them. There are three
modification anomalies that can occur:

Insert Anomaly

There are facts we cannot record until we know information for the entire row.  In our example we
cannot record a new sales office until we also know the sales person.  Why?  Because in order to create
the record, we need to provide a primary key.  In our case this is the EmployeeID.

Update Anomaly
In this case we have the same information in several rows. For instance if the office number changes,
then there are multiple updates that need to be made.  If we don’t update all rows, then inconsistencies
appear.

Deletion Anomaly

Deletion of a row causes removal of more than one set of facts.  For instance, if John Hunt retires, then
deleting that row causes us to lose information about the New York office.

Search and Sort Issues

The last reason we'll consider is making it easier to search and sort your data.  In the SalesStaff table, if
you want to search for a specific customer such as Ford, you would have to write a query like:

SELECT SalesOffice
FROM SalesStaff
WHERE Customer1 = 'Ford' OR
      Customer2 = 'Ford' OR
      Customer3 = 'Ford'

Clearly, if the customer were somehow in one column, our query would be simpler.  Also, consider if you
want to run a query and sort by customer.
Our current table makes this tough; you would have to use three separate UNION queries (a sketch
follows below). You can eliminate or reduce these anomalies by separating the data into different tables.
This puts the data into tables serving a single purpose.
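Here is roughly what the three-way UNION mentioned above might look like, just to show how awkward the repeating Customer columns make sorting:

SELECT SalesOffice, Customer1 AS Customer FROM SalesStaff
UNION
SELECT SalesOffice, Customer2 FROM SalesStaff
UNION
SELECT SalesOffice, Customer3 FROM SalesStaff
ORDER BY Customer;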

The process of redesigning the table in this way is database normalization.
