
UNIVERSITY OF CALOOCAN CITY

Biglang Awa St. Grace Park East, Caloocan City

CHAPTER 1:

THE DATABASE ENVIRONMENT AND

DEVELOPMENT PROCESS

Researched and presented by:

Acogido, Neil Angeli


Gregorio, Juvelyn


Definitions

 Data- A given fact: a number, a statement, or a picture. Stored
representations of meaningful objects and events. Meaningful facts, text,
graphics, images, sound, and video segments. A collection of individual
responses from marketing research.

(1) Structured: numbers, text, dates

(2) Unstructured: images, video, documents

 Database- organized collection of logically related data.

 Information- data that have meaning within a context. Data processed to

increase knowledge in the person using the data.

 Metadata- data that describes the properties and context of user data.

Data that describes data.

 Database System- collection of electronic data.  Central repository of

shared data. Stored in a standardized, convenient form. Requires a

Database Management System (DBMS)


CONVENTIONAL FILE PROCESSING

Limitation of File Processing

 Program- Data Dependence- All programs maintain metadata for each file

they use.

 Duplication of Data- Different systems/ programs have separate copies of

the same data.

 Limited Data Sharing – No centralized control of data.

 Lengthy Development times- Programmers must design their own file

formats.

 Excessive Program Maintenance- 80% of information systems budget.

Problems with Data Dependency

 Non-standard file formats, lack of coordination and central control.

 Each application programmer must maintain his/ her own data

 Each application program needs to include code for the metadata of each
file.

 Each application program must have its own processing routines for

reading, inserting, updating, and deleting data.

Problems with Data Redundancy

 Duplicate data; changes to data in one file could cause inconsistencies.


 Waste of space to have duplicate data.

 Causes more maintenance headaches.

 Compromises in data integrity

THE DATABASE APPROACH

Requires a Database Management System (DBMS), which is used to create, maintain,
and provide controlled access to user databases. The database is a central repository of shared
data. Data are managed by a controlling agent and stored in a standardized,
convenient form.

Database Management System

A database management system manages

data resources like an operating system manages

hardware resources.

Elements of Database Approach

 Data Models- Graphical diagram capturing

the nature and relationship of data.

 Relational Databases- database technology involving tables representing
entities and primary/foreign keys representing relationships.

 Entities- Noun form describing a person, place, object, event, or concept.

 Relationships- one-to-many, many-to-many, one-to-one.


Advantages of the Database Approach

 Program-data independence
 Planned data redundancy
 Improved data consistency
 Improved data sharing
 Increased application development productivity
 Enforcement of standards
 Improved data quality
 Improved data accessibility and responsiveness
 Reduced program maintenance
 Improved decision support


Database Approach vs. Traditional File System

Costs and Risk of the Database Approach

 New, specialized personnel

Frequently, organizations that adopt the database approach need to

hire or train individuals to design and implement databases. This personnel

increase seems to be expensive, but an organization should not minimize the


need for these specialized skills. Installing such a system may also require

upgrades to the hardware and data communications systems in the

organization.  

 Installation and management cost and complexity

A multi-user database management system is large and complex

software that has a high initial cost. It requires trained personnel to install and

operate, and also has annual maintenance costs.

 Conversion costs

The term “legacy systems” is used to refer to older applications in an

organization that are based on file processing. The cost of converting these

older systems to modern database technology may seem prohibitive to an

organization.

 Need for explicit backup and recovery

A shared database must be accurate and available at all times. This

raises the need to have backup copies of data for restoring a database when

damage occurs.   A modern database management system normally

automates recovery tasks. 

 Organizational conflict

A database requires an agreement on data definitions and ownership

as well as responsibilities for accurate data maintenance. 


Components of Database Management System

A DBMS has several components, each performing very significant tasks in

the database management system environment. Below is a list of components

within the database and its environment.

 Software

This is the set of programs used to control and manage the overall

database. This includes the DBMS software itself, the Operating System,

the network software being used to share the data among users, and the

application programs used to access data in the DBMS.

 Hardware

Consists of a set of physical electronic devices such as computers, I/O

devices, storage devices, etc., this provides the interface between

computers and the real-world systems.

 Data

The DBMS exists to collect, store, process, and access data, which is the most
important component. The database contains both the actual or

operational data and the metadata.

 Procedures

These are the instructions and rules that explain how to use the DBMS
and how to design and run the database, using documented
procedures to guide the users who operate and manage it.


 Database Access Language

This is used to access the data to and from the database, to enter new

data, update existing data, or retrieve required data from databases. The

user writes a set of appropriate commands in a database access

language, submits these to the DBMS, which then processes the data and

generates and displays a set of results in a user-readable form. (A short SQL sketch illustrating this component and the data dictionary appears after this list of components.)

 Query Processor

This transforms the user queries into a series of low level instructions.

This reads the online user’s query and translates it into an efficient series

of operations in a form capable of being sent to the run time data manager

for execution.

 Data Manager

Also called the cache manager, this is responsible for the handling of data in

the database, providing a recovery to the system that allows it to recover

the data after a failure.

 Database Engine

The core service for storing, processing, and securing data, this provides

controlled access and rapid transaction processing to address the

requirements of the most demanding data consuming applications. It is

often used to create relational databases for online transaction processing

or online analytical processing data.

 Data Dictionary

This is a reserved space within a database used to store information about


the database itself. A data dictionary is a set of read-only tables and views,

containing the different information about the data used in the enterprise

to ensure that database representation of the data follow one standard as

defined in the dictionary.

 Report Writer

Also referred to as the report generator, it is a program that extracts

information from one or more files and presents the information in a

specified format. Most report writers allow the user to select records that

meet certain conditions and to display selected fields in rows and

columns, or also format the data into different charts.
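To make two of the components above concrete: a user of a relational DBMS issues commands in a database access language, most commonly SQL, and the DBMS describes its own tables in the data dictionary. A minimal sketch follows; the table and column names are hypothetical, and the exact data dictionary views (here the SQL-standard INFORMATION_SCHEMA) vary by product.

    -- database access language: enter, update, and retrieve data
    INSERT INTO employee (employee_id, employee_name, department)
    VALUES (101, 'Juan Dela Cruz', 'Accounting');

    UPDATE employee
    SET department = 'Finance'
    WHERE employee_id = 101;

    SELECT employee_id, employee_name
    FROM employee
    WHERE department = 'Finance';

    -- data dictionary: read-only views holding data about the data
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'employee';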

Four Types of Database Management Systems

 Relational Database Management System

A relational database (RDB) is a collective set of multiple data sets

organized by tables, records and columns. RDBs establish a well-defined

relationship between database tables. Tables communicate and share

information, which facilitates data searchability, organization and

reporting. RDBs use Structured Query Language (SQL), which is a

standard user application that provides an easy programming interface for

database interaction. RDB is derived from the mathematical function

concept of mapping data sets and was developed by Edgar F. Codd.

RDBs organize data in different ways. Each table is known as a

relation, which contains one or more data category columns. Each table


record (or row) contains a unique data instance defined for a

corresponding column category. One or more data or record

characteristics relate to one or many records to form functional

dependencies. These are classified as follows:

 One to One: One table record relates to another record in another

table.

 One to Many: One table record relates to many records in another

table.

 Many to One: More than one table record relates to another table

record.

 Many to Many: More than one table record relates to more than one

record in another table.

RDB performs "select", "project" and "join" database operations,

where select is used for data retrieval, project identifies data attributes,

and join combines relations. RDBs have many other advantages,

including:

 Easy extendability, as new data may be added without

modifying existing records. This is also known as scalability.

 New technology performance, power and flexibility with

multiple data requirement capabilities.


 Data security, which is critical when data sharing is based on

privacy. For example, management may share certain data privileges

and access and block employees from other data, such as

confidential salary or benefit information.

These relations form functional dependencies within the database.

Some common examples of relational databases include MySQL,

Microsoft SQL Server, and Oracle.
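A minimal SQL sketch of the select, project, and join operations described above (table and column names are illustrative only, not taken from the text):

    -- "select" (restrict rows) and "project" (choose columns)
    SELECT student_name, student_address      -- project: only two attributes
    FROM student
    WHERE year_level = 1;                     -- select: only matching rows

    -- "join" combines two relations on a common key
    SELECT s.student_name, c.course_title
    FROM student s
    JOIN enrollment e ON e.student_id = s.student_id
    JOIN course c     ON c.course_id  = e.course_id;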

 Hierarchical Database Systems

Hierarchical database model resembles a tree structure, similar to a

folder architecture in your computer system. The relationships between

records are pre-defined in a one-to-many manner, between 'parent' and

'child' nodes. They require the user to pass through a hierarchy in order to access

needed data. Due to limitations, such databases may be confined to

specific uses.

 Network Database Systems

Network database models also have a hierarchical structure.

However, instead of using a single-parent tree hierarchy, this model

supports many to many relationships, as child tables can have more than

one parent.

 Object-Oriented Database Systems


In object-oriented databases, the information is represented as

objects, with different types of relationships possible between two or more

objects. Such databases use an object-oriented programming language

for development.

Systems Development Life Cycle

The SDLC is a complete set of steps that a team of information systems professionals,

including database designers and programmers, follow in an organization

to specify, develop, maintain, and replace information systems. According

to Gillis (2019), the systems development life cycle (SDLC) is a

conceptual model used in project management that describes the stages


involved in an information system development project, from an initial

feasibility study through maintenance of the completed application. Gillis

(2019) added that the SDLC can be applied to technical and non-technical
systems, and that in most use cases a system is an IT technology such as

hardware and software. Project and program managers typically take part

in SDLC, along with system and software engineers, development teams

and end users.

 PLANNING—ENTERPRISE MODELING

The database development process begins with a review of

the enterprise modeling components that were developed during

the information systems planning process. During this step,

analysts review current databases and information systems;

analyze the nature of the business area that is the subject of the

development project; and describe, in general terms, the data

needed for each information system under consideration for

development. They determine what data are already available in

existing databases and what new data will need to be added to

support the proposed new project. Only selected projects move into

the next phase based on the projected value of each project to the

organization.

 PLANNING—CONCEPTUAL DATA MODELING


For an information systems project that is initiated, the

overall data requirements of the proposed information system must

be analyzed. This is done in two stages. First, during the Planning

phase, the analyst develops a diagram similar to Figure 1-3a, as

well as other documentation, to outline the scope of data involved

in this particular development project without consideration of what

databases already exist. Only high-level categories of data

(entities) and major relationships are included at this point. This

step in the SDLC is critical for improving the chances of a

successful development process. The better the definition of the

specific needs of the organization, the closer the conceptual model

should come to meeting the needs of the organization, and the less

recycling back through the SDLC should be needed.

 ANALYSIS—CONCEPTUAL DATA MODELING

During the Analysis phase of the SDLC, the analyst

produces a detailed data model that identifies all the organizational

data that must be managed for this information system. Every data

attribute is defined, all categories of data are listed, every business

relationship between data entities is represented, and every rule

that dictates the integrity of the data is specified. It is also during

the Analysis phase that the conceptual data model is checked for

consistency with other types of models developed to explain other


dimensions of the target information system, such as processing

steps, rules for handling data, and the timing of events.

 DESIGN—LOGICAL DATABASE DESIGN

Logical database design approaches database development

from two perspectives. First, the conceptual schema must be

transformed into a logical schema, which describes the data in

terms of the data management technology that will be used to

implement the database. For example, if relational technology will

be used, the conceptual data model is transformed and

represented using elements of the relational model, which include

tables, columns, rows, primary keys, foreign keys, and constraints.
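A minimal sketch of this transformation, assuming a conceptual model with CUSTOMER and ORDER entities in a one-to-many relationship (all names here are illustrative, not taken from the text):

    CREATE TABLE customer (
        customer_id   INTEGER     PRIMARY KEY,   -- entity identifier becomes the primary key
        customer_name VARCHAR(60) NOT NULL       -- NOT NULL is a simple constraint
    );

    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        order_date  DATE    NOT NULL,
        customer_id INTEGER NOT NULL,
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)  -- the 1:M relationship
    );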

 DESIGN—PHYSICAL DATABASE DESIGN AND DEFINITION

A physical schema is a set of specifications that describe

how data from a logical schema are stored in a computer’s

secondary memory by a specific database management system.

There is one physical schema for each logical schema. Physical

database design requires knowledge of the specific DBMS that will

be used to implement the database. In physical database design

and definition, an analyst decides on the organization of physical

records, the choice of file organizations, the use of indexes, and so

on.
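For example, one common physical design decision is adding an index to speed retrieval on a frequently searched column. A hedged sketch, continuing the illustrative customer table above (available index options and syntax details differ by DBMS):

    -- secondary index to support frequent lookups by customer name
    CREATE INDEX idx_customer_name ON customer (customer_name);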

 IMPLEMENTATION—DATABASE IMPLEMENTATION


In database implementation, a designer writes, tests, and

installs the programs/scripts that access, create, or modify the

database. The designer might do this using standard programming
languages, special database processing languages, or special-purpose
nonprocedural languages to produce stylized

reports and displays, possibly including graphs. Also, during

implementation, the designer will finalize all database

documentation, train users, and put procedures into place for the

ongoing support of the information system (and database) users.

The last step is to load data from existing information sources (files

and databases from legacy applications plus new data now

needed). Loading is often done by first unloading data from existing

files and databases into a neutral format (such as binary or text

files) and then loading these data into the new database. Finally,

the database and its associated applications are put into production

for data maintenance and retrieval by the actual users. During

production, the database should be periodically backed up and

recovered in case of contamination or destruction.
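A minimal sketch of the loading step, assuming the legacy data have already been unloaded and staged in a table named legacy_customer (a hypothetical name):

    -- copy and reshape legacy rows into the new database table
    INSERT INTO customer (customer_id, customer_name)
    SELECT old_cust_no, old_cust_name
    FROM legacy_customer;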

 MAINTENANCE—DATABASE MAINTENANCE

The database evolves during database maintenance. In this step, the

designer adds, deletes, or changes characteristics of the structure of a

database in order to meet changing business conditions, to correct errors in

database design, or to improve the processing speed of database



applications. The designer might also need to rebuild a database if it

becomes contaminated or destroyed due to a program or computer system

malfunction. This is typically the longest step of database development,

because it lasts throughout the life of the database and its associated

applications. Each time the database evolves, view it as an abbreviated

database development process in which conceptual data modeling, logical

and physical database design, and database implementation occur to deal

with proposed changes.

Prototyping and Agile-development approaches

 Prototyping

i. It is an information-gathering technique useful for supplementing

the traditional SDLC; however, both agile methods and human–

computer interaction share roots in prototyping. When systems

analysts use prototyping, they are seeking user reactions,

suggestions, innovations, and revision plans to make improvements

to the prototype, and thereby modify system plans with a minimum

of expense and disruption. The four major guidelines for developing

a prototype are to (1) work in manageable modules, (2) build the

prototype rapidly, (3) modify the prototype, and (4) stress the user

interface.

ii. Although prototyping is not always necessary or desirable, it should

be noted that there are three main, interrelated advantages to using


it: (1) the potential for changing the system early in its development,

(2) the opportunity to stop development on a system that is not

working, and (3) the possibility of developing a system that more

closely addresses users’ needs and expectations. Users have a

distinct role to play in the prototyping process and systems analysts

must work systematically to elicit and evaluate users’ reactions to

the prototype.

iii. One particular use of prototyping is rapid application development

(RAD). It is an object-oriented approach with three phases:

requirements planning, the RAD design workshop, and

implementation.

 Agile modeling

i. It is a software development approach that defines an overall

plan quickly, develops and releases software quickly, and then

continuously revises software to add additional features. The

values of the agile approach that are shared by the customer as

well as the development team are communication, simplicity,

feedback, and courage. Agile activities include coding, testing,

listening, and designing. Resources available include time, cost,

quality, and scope.

ii. Agile core practices distinguish agile methods, including a type

of agile method called extreme programming (XP), from other

systems development processes. The four core practices of the


agile approach are (1) short releases, (2) 40-hour workweek, (3)

onsite customer, and (4) pair programming. The agile

development process includes choosing a task that is directly

related to a customer-desired feature based on user stories,

choosing a programming partner, selecting and writing

appropriate test cases, writing the code, running the test cases,

debugging it until all test cases run, implementing it with the

existing design, and integrating it into what currently exists.

Roles of an individual in Databases

 Data Administrators

The database and the DBMS are corporate resources that must be

managed like any other resource. The Data Administrator (DA) is

responsible for defining data elements, data names and their relationship

with the database. They are also known as Data Analyst.

 Database Administrators (DBA)

A Database Administrator (DBA) is an IT professional who works

on creating, maintaining, querying, and tuning the database of the

organization. They are also responsible for maintaining data security and

integrity. A DBA has many responsibilities. A good performing database is

in the hands of DBA.

DBA Responsibilities


 The life cycle of a database starts from designing and implementing it and
proceeds to its administration. A database for any kind of requirement needs
to be designed properly so that it works without any issues.

 Once all the design is complete, it needs to be installed. Once this

step is complete, users start using the database. The database

grows as the data grows in the database. When the database

becomes huge, its performance comes down.

 Also, accessing data from the database becomes a challenge.
This administration and maintenance of the database is taken care
of by the Database Administrator (DBA).

 Installing and upgrading the DBMS Servers

The DBA is responsible for installing a new DBMS server for
new projects. The DBA is also responsible for upgrading these servers as
new versions come to market or as requirements change.

 Design and implementation

The DBA should be able to decide on proper memory management,
file organization, error handling, and log maintenance for the database.

 Performance Tuning

Since the database is huge and will have lots of tables, data,
constraints, and indices, there will be variations in performance
from time to time. It is the responsibility of the DBA to tune the
database performance.


 Backup & Recovery

Proper backup and recovery programs need to be developed
and maintained by the DBA. This is one of the main

responsibilities of DBA. Data should be backed up regularly so that

if there is any crash, it should be recovered without much effort and

data loss.

 Documentation

The DBA should document all installation, backup,
recovery, and security procedures, and should keep various reports about
database performance.

 Security

DBA is responsible for creating various database users and

roles, and giving them different levels of access rights.
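As an illustration (role, user, and table names are hypothetical, and exact syntax varies slightly by DBMS), a DBA might create a role and grant it limited access rights like this:

    CREATE ROLE sales_clerk;                                -- a role groups privileges
    GRANT SELECT, INSERT ON customer_order TO sales_clerk;  -- limited rights only
    GRANT sales_clerk TO jdoe;                              -- assign the role to a user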

 Database Designers

 Logical Database Designers

The logical database designer is concerned with identifying

the data (that is, the entities and attributes), the relationships

between the data, and the constraints on the data that is to be

stored in the database.

The logical database designer must have a thorough and

complete understanding of the organization’s data and any

constraints on this data.


 Physical Database Designers

The physical database designer decides how the logical

database design is to be physically realized.

 mapping the logical database design into a set of

tables and integrity constraints.

 selecting specific storage structures and access

methods for the data to achieve good performance.

 Application Developers

Once the database has been implemented, the application programs that
provide the required functionality for the end users must be implemented.
This is the responsibility of the application developers. They are the
developers who interact with the database by means of DML queries. These
DML queries are written in application programs in languages such as C,
C++, Java, or Pascal.

 End Users

The end-users are the ‘clients’ for the database, which has been

designed and implemented, and is being maintained to serve their

information needs.


 Sophisticated Users: The sophisticated end-user is familiar with the
structure of the database and the facilities offered by the DBMS.

 Naive Users: These are the users who use an existing application
to interact with the database. For example, online library systems,
ticket booking systems, ATMs, etc.

The three schemas

 Internal Level/Schema

The internal schema defines the physical storage structure of the

database. The internal schema is a very low-level representation of the

entire database. It contains multiple occurrences of multiple types of

internal record. In ANSI terms, it is also called a “stored record.”

Facts about Internal schema:

 The internal schema is the lowest level of data abstraction

 It helps you to keep information about the actual representation of

the entire database. Like the actual storage of the data on the disk

in the form of records

 The internal view tells us what data is stored in the database and

how

 It never deals with the physical devices. Instead, internal schema

views a physical device as a collection of physical pages

 Conceptual Schema/Level


The conceptual schema describes the Database structure of the

whole database for the community of users. This schema hides

information about the physical storage structures and focuses on

describing data types, entities, relationships, etc.

This logical level comes between the user level and physical

storage view. However, there is only a single conceptual view of a single

database.

Facts about Conceptual schema:

 Defines all database entities, their attributes, and their

relationships

 Security and integrity information

 In the conceptual level, the data available to a user must be

contained in or derivable from the physical level

 External Schema/Level

An external schema describes the part of the database which

a specific user is interested in. It hides the unrelated details of the database

from the user. There may be “n” number of external views for each

database. Each external view is defined using an external schema, which

consists of definitions of various types of external record of that specific

view. An external view is just the content of the database as it is seen by

some particular user. For example, a user from the sales

department will see only sales related data.



Facts about external schema:

 An external level is only related to the data which is viewed

by specific end users.

 This level includes some external schemas.

 External schema level is nearest to the user

 The external schema describes the segment of the database

which is needed for a certain user group and hides the remaining

details of the database from that user group.
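An external schema is often implemented as a view. A minimal sketch, assuming a base table named customer_order with an amount column (illustrative names only):

    -- the sales department sees only sales-related data
    CREATE VIEW sales_view AS
    SELECT order_id, order_date, amount
    FROM customer_order;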

 Goal of 3 level/schema of Database

Objectives of using Three Schema Architecture:

 Every user should be able to access the same data but able to see

a customized view of the data.

 The user need not deal directly with physical database storage

detail.

 The DBA should be able to change the database storage structure

without disturbing the user’s views

 The internal structure of the database should remain unaffected

when changes are made to the physical aspects of storage.

 Advantages of the Database Schema

 You can manage data independently of the physical storage, and
migration to new graphical environments is faster


 DBMS Architecture allows you to make changes on the

presentation level without affecting the other two layers

 As each tier is separate, it is possible to use different sets of

developers

 It is more secure as the client doesn’t have direct access to the

database business logic

 In case of failure of one tier, there is no data loss, because you are always
secure by accessing another tier

 Disadvantages of the Database Schema

 A complete DB schema is a complex structure which is difficult for
everyone to understand

 Difficult to set up and maintain; also, the physical separation of the

tiers can affect the performance of the Database

UNIVERSITY OF CALOOCAN CITY
Biglang Awa St. Grace Park East, Caloocan City

CHAPTER 2:

MODELING DATA IN THE

ORGANIZATION


Researched and presented by:

Almadin, Catherine M.
Laxamana, Marlon F.

DATA MODELING 

What is data modeling?

Data modeling is the process of creating a simple diagram of a complex software

system, using text and symbols to represent the way data will flow. The diagram

can be used to ensure efficient use of data as a blueprint for the construction of

new software or for reengineering a legacy application.

Data modeling is an important skill for data scientists and others involved with

data analysis. Traditionally, data models were built during the analysis and

design phases of a project to ensure that the requirements for a new application

are understood. A data model can become the basis for building a more detailed

data schema. 


Data modeling is the process of creating a visual representation of either a whole

information system or parts of it to communicate connections between data

points and structures. The goal is to illustrate the types of data used and stored

within the system, the relationships among these data types, the ways the data

can be grouped and organized and its formats and attributes.

Data models are built around business needs. Rules and requirements are

defined upfront through feedback from business stakeholders so they can be

incorporated into the design of a new system or adapted in the iteration of an

existing one.

Data can be modeled at various levels of abstraction. The process begins by

collecting information about business requirements from stakeholders and end

users. These business rules are then translated into data structures to formulate

a concrete database design. A data model can be compared to a roadmap, an

architect’s blueprint or any formal diagram that facilitates a deeper understanding

of what is being designed.

Ideally, data models are living documents that evolve along with changing

business needs. They play an important role in supporting business processes

and planning IT architecture and strategy. Data models can be shared with

vendors, partners, and/or industry peers.

Why use a Data Model?

The primary goals of using a data model are:


 Ensures that all data objects required by the database are accurately

represented. Omission of data will lead to creation of faulty reports and

produce incorrect results.

 A data model helps design the database at the conceptual, physical and

logical levels.

 Data Model structure helps to define the relational tables, primary and

foreign keys and stored procedures.

 It provides a clear picture of the base data and can be used by database

developers to create a physical database.

 It is also helpful to identify missing and redundant data.

 Though the initial creation of a data model is labor- and time-consuming, in
the long run it makes upgrading and maintaining your IT infrastructure
cheaper and faster.

Data modeling is an essential step in the process of creating any complex

software. It helps developers understand the domain and organize their work

accordingly.

Higher Quality

Just as architects consider blueprints before constructing a building, you should

consider data before building an app. On average, about 70 percent of software

development efforts fail, and a major source of failure is premature coding. A

data model helps define the problem, enabling you to consider different

approaches and choose the best one.


Reduced cost

You can build applications at lower cost via data models. Data modeling typically

consumes less than 5-10 percent of a project budget, and can reduce the 65-75

percent of budget that is typically devoted to programming. Data modeling

catches errors and oversights early, when they are easy to fix. This is better than

fixing errors once the software has been written or – worse yet – is in customer

hands.

Clearer scope

A data model provides a focus for determining scope. It provides something

tangible to help business sponsors and developers agree over precisely what is

included with the software and what is omitted. Business users can see what the

developers are building and compare it with their understanding. Models promote

consensus among developers, customers and other stakeholders.

A data model also promotes agreement on vocabulary and jargon. The model

highlights the chosen terms so that they can be driven forward into software

artifacts. The resulting software becomes easier to maintain and extend.

Faster performance

A sound model simplifies database tuning. A well-constructed database typically

runs fast, often quicker than expected. To achieve optimal performance, the

concepts in a data model must be crisp and coherent. Then the proper rules

must be used for translating the model into a database design.


When performance is poor, it is seldom a problem of the database software (Oracle, SQL Server, MySQL,

etc.) – but, rather, that the database is being used improperly. Once that problem

is fixed, the performance is just fine. Modeling provides a means to understand a

database so that you are able to tune it for fast performance.

Better documentation

Models document important concepts and jargon, providing a basis for long-term

maintenance. The documentation will serve you well through staff turnover.

Today, most application vendors can provide a data model of their application

upon request. That is because the IT industry recognizes that models are

effective at conveying important abstractions and ideas in a concise and

understandable manner.

Fewer application errors

A data model causes participants to crisply define concepts and resolve

confusion. As a result, application development starts with a clear vision.

Developers can still make detailed errors as they write application code, but they

are less likely to make deep errors that are difficult to resolve.

Fewer data errors

Data errors are worse than application errors. It is one thing to have an

application crash, necessitating a restart. It is another thing to corrupt data in a

large database.


A data model not only improves the conceptual quality of an application, it also

lets you leverage database features that improve data quality. Developers can

weave constraints into the fabric of a model and the resulting database. For

example, every table should normally have a primary key. The database can

enforce other unique combinations of fields. Referential integrity can ensure that

foreign keys are bona fide and not dangling.

Managed risk

You can use a data model to estimate the complexity of software, and gain

insight into the level of development effort and project risk. You should consider

the size of a model, as well as the intensity of inter-table connections.

Robert Hillard wrote an excellent book, “Information-Driven Business” in which he

equates a data model to a mathematical graph. He uses the graph as a basis for

assessing software complexity. An application database with heavily

interconnected tables is more complex and therefore prone to more risk of

development failure.

A good start for data mining

The documentation inherent in a model serves as a starting point for analytical

data mining. You can take day-to-day business data and load it into a dedicated

database, known as a “data warehouse.” Data warehouses are constructed

specifically for the purpose of data analysis, leveraging that data from routine

operations.


Why should you consider data modeling in your business?

The better your data modeling, the more business benefits you receive in terms
of productivity, efficiency, customer satisfaction, profitability, and a
better understanding of your core business needs. However, you have to
carefully consider the discovered data types to avoid over-modeling, which adds
cost and slows development.

BUSINESS RULE

A business rule is a statement that describes a business policy or procedure.

Business rules are usually expressed at the atomic level -- that is, they cannot be

broken down any further. It imposes some form of constraint on a specific aspect

of the database, such as the elements within a field specification for a particular

field or the characteristics of a given relationship. You base a business rule on

the way the organization perceives and uses its data, which you determine from

the way the organization functions or conducts its business.

Business rules, the foundation of data models, are derived from policies,

procedures, events, functions, and other business objects, and they state

constraints on the organization. Business rules represent the language and

fundamental structure of an organization (Hay, 2003). Business rules formalize

the understanding of the organization-by-organization owners, managers, and

leaders with that of information systems architects.


Business rules are important in data modeling because they govern how data are

handled and stored. Examples of basic business rules are data names and

definitions.
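Many atomic business rules of this kind can be enforced declaratively in the database itself. A hedged sketch using hypothetical names and rules:

    CREATE TABLE course_section (
        section_id   INTEGER PRIMARY KEY,
        course_code  CHAR(8) NOT NULL,                       -- rule: every section belongs to a course
        term         CHAR(6) NOT NULL,
        max_students INTEGER NOT NULL
                     CHECK (max_students BETWEEN 1 AND 60),  -- rule: class size limit
        UNIQUE (course_code, term)                           -- rule: one section per course per term
    );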

SCOPE OF BUSINESS RULE 

  We are concerned with business rules that impact only an organization’s

databases. Most organizations have a host of rules and/or policies that fall

outside this definition. For example, the rule “Friday is business casual dress

day” may be an important policy statement, but it has no immediate impact on

databases. In contrast, the rule “A student may register for a section of a course

only if he or she has successfully completed the prerequisites for that course” is

within our scope because it constrains the transactions that may be processed

against the database. It causes any transaction that attempts to register a

student who does not have the necessary prerequisites to be rejected. Some

business rules cannot be represented in common data modeling notation; those

rules that cannot be represented in a variation of an entity-relationship diagram

are stated in natural language, and some can be represented in the relational

data model.

Business rules can be applied to computing systems and are designed to help an

organization achieve its goals. Software is used to automate business rules using

business logic.

Business rules can also be generated by internal or external necessity. For

example, a business can come up with business rules that are self-imposed to


meet leadership’s own goals, or in the pursuit of compliance with external

standards. Experts also point out that while there is a system of strategic

processes governing business rules, the business rules themselves are not

strategic, but simply directive in nature.

ENTITY RELATIONSHIP MODEL

The ER model defines entity sets, not individual entities; entity sets are described in
terms of their attributes.

An entity-relationship model (e-r model) is a detailed, logical representation of the

data for an organization or for a business area. The E-R model is expressed in

terms of entities in the business environment, the relationships (or associations)

among those entities, and the attributes (or properties) of both the entities and

their relationships. An E-R model is normally expressed as an entity-relationship


diagram (e-r diagram, or erD), which is a graphical representation of an E-R

model.

Entity-Relationship Model is the diagrammatical representation of a database

structure which is called an ER diagram. The ER diagram is considered a

blueprint of a database which has mainly two components i.e., relationship set,

and entity set. The ER diagram is used to represent the relationship exists

among the entity set. The entity set is considered as a group of entities of similar

type which contains attributes. According to the database system management

system the entity is considered as a table and attributes are columns of a table.

So, the ER diagram shows the relationship among tables in the database. The

entity is considered a real-world object which is stored physically in the database.

The entities have attributes that help to uniquely identify the entity. The entity set

can be considered as a collection of similar types of entities.

Why do we use the Entity diagram?

The entity diagram is used to represent the database in the diagram form. It

helps to properly understand the database. All the necessary details of the

database can be represented in the form of the ER diagram. The entities

represent all the tables of the database, attributes are the columns of tables and

the relationship represented the association among the tables of a database.


The figure represents the ER diagram of the college student database. The
student, college, mechanical, electronics, and computer science are entities, and
"enrolls in" and "specialized in" are the relationships. The attributes are name,
age, gender, DOB, affiliation, and address.

Components of Entity-Relationship Model

The ER model is used as a conceptual view of the database. The ER model

consists of real-world entities and the related associations that exist between them.

The ER model gives the complete idea of a database used for any application

and it is very easy to understand. The below section contains information about

the components of the ER diagram.

1. Entity

An entity is a person, a place, an object, an event, or a concept in the user

environment about which the organization wishes to maintain data. Thus, an

entity has a noun name. Some examples of each of these kinds of entities follow:

Person: Employee, Student, Patient; Place: Store, Warehouse, State; Object:

Machine, Building, Automobile; Event: Sale, Registration, Renewal; Concept:


Account, Course, Work Center. All types of entities have some attributes or
properties which help give a proper idea of the entity. The entity set can
be considered as a collection of similar types of entities. An entity set may
contain entities whose attributes hold similar types of values. For example,
the employee set will contain information from all employees. The entity sets do
not need to be disjoint.

An entity is an object or event in our environment that we want to keep track of. A

person is an entity. So is a building, a piece of inventory sitting on a shelf, a

finished product ready for sale, and a sales meeting (an event). An attribute is a

property or characteristic of an entity.

 Weak entity: A weak entity is an entity that cannot be uniquely identified
by its own attributes and which requires a relationship with some
other entity. This type of entity is known as a weak entity. In the ER
diagram, a double rectangle is used for representing a weak entity. For
example, a bank account on its own is considered a weak entity because
it cannot be determined which bank the account belongs to.

An entity type whose existence depends on some other entity type.

(Some data modeling software, in fact, use the term dependent entity.) A

weak entity type has no business meaning in an E-R diagram without the

entity on which it depends. The entity type on which the weak entity type


depends is called the identifying owner (or simply owner for short). A weak

entity type does not typically have its own identifier. Generally, on an E-R

diagram, a weak entity type has an attribute that serves as a partial

identifier.

 Strong Entity- A strong entity type is one that exists independently of

other entity types. (Some data modeling software, in fact, use the term

independent entity.) Examples include Student, Employee, Automobile,

and Course. Instances of a strong entity type always have a unique

characteristic (called an identifier)—that is, an attribute or a combination of

attributes that uniquely distinguish each occurrence of that entity.
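As a minimal sketch (the table and column names are assumptions, following the bank account example above), a strong entity and a weak entity that depends on it might be realized as tables like this, with the weak entity's primary key combining the owner's key and its own partial identifier:

    CREATE TABLE bank (                          -- strong entity: identified on its own
        bank_id   INTEGER PRIMARY KEY,
        bank_name VARCHAR(60) NOT NULL
    );

    CREATE TABLE bank_account (                  -- weak entity: depends on its identifying owner
        bank_id    INTEGER  NOT NULL REFERENCES bank (bank_id),
        account_no CHAR(12) NOT NULL,            -- partial identifier
        balance    DECIMAL(12,2) DEFAULT 0,
        PRIMARY KEY (bank_id, account_no)        -- owner key + partial identifier
    );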

ENTITY TYPE VS. ENTITY INSTANCE

There is an important distinction between entity types and entity instances.

An entity type is a collection of entities that share common properties or

characteristics. Each entity type in an E-R model is given a name. Because the

name represents a collection (or set) of items, it is always singular. We use

capital letters for names of entity type(s). In an E-R diagram, the entity name is

placed inside the box representing the entity type. 

It is the fundamental building block for describing the structure of data with the

Entity Data Model. In a conceptual model, entity types are constructed from

properties and describe the structure of top-level concepts, such as customers


and orders in a business application. In the same way that a class definition in a

computer program is a template for instances of the class, an entity type is a

template for entities.

An entity instance is a single occurrence of an entity type. An entity type is

described just once (using metadata) in a database, whereas many instances of

that entity type may be represented by data stored in the database. For example,

there is one EMPLOYEE entity type in most organizations, but there may be

hundreds (or even thousands) of instances of this entity type stored in the

database. We often use the single term entity rather than entity instance when

the meaning is clear from the context of our discussion.

It is a manifestation of an entity within that category. For example, Cell could be

the entity type, but Cell_1 , Cell_2 , and Cell_3 would represent the actual items

within the network.

In simple words:

ENTITY- A person, a place, an object, an event, or a concept in the user

environment about which the organization wishes to maintain data

ENTITY TYPE- A collection of entities that share common properties or

characteristics

ENTITY INSTANCE- A single occurrence of an entity type.
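In relational terms, the distinction can be sketched like this (hypothetical names): the CREATE TABLE statement describes the entity type once, as metadata, while each inserted row is one entity instance.

    -- entity type: described once using metadata
    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        employee_name VARCHAR(60) NOT NULL
    );

    -- entity instances: many occurrences stored as rows
    INSERT INTO employee VALUES (101, 'Reyes, Ana');
    INSERT INTO employee VALUES (102, 'Santos, Ben');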



2. Attributes

The entities are represented using some properties and these properties are

known as attributes. All the attributes have some value. For example- the

employee entity can have the following attributes – employee name, employee

age, and employee contact details. Each attribute has a
domain of values that can be allocated to it. For example, the
employee’s name cannot be assigned a numeric value; the employee’s
name should always be alphabetic. The employee’s age cannot be a negative
number; it should always be positive.

Attributes are facts or description of entities. They are also often nouns and

become the columns of the table. For example, for entity student, the attributes

can be first name, last name, email, address, and phone numbers.

Types of Attribute

The types of attributes are given below:

1. Simple attribute: The simple attribute can be considered as an atomic value
that can’t be further segregated. For example, the employee phone
number cannot be further segregated into some other attribute. A simple
attribute is an attribute that cannot be broken down into smaller components that are meaningful


for the organization. For example, all the attributes associated with

AUTOMOBILE are simple: Vehicle ID, Color, Weight, and Horsepower

2. Composite attribute: The composite attribute contains more than one

attribute in the group. For example, the employee’s name attribute can be

considered as a composite attribute as the employee’s name can be

further segregated into a first name and last name.

Composite attributes provide considerable flexibility to users, who can

either refer to the composite attribute as a single unit or else refer to

individual components of that attribute. Thus, for example, a user can

either refer to Address or refer to one of its components, such as Street

Address. The decision about whether to subdivide an attribute into its

component parts depends on whether users will need to refer to those

individual components, and hence, they have organizational meaning. Of

course, the designer must always attempt to anticipate possible future

usage patterns for the database.

3. Derived attribute: The derived attribute is the type of attribute which does
not exist in the database physically; rather, its values are derived from
other data that are present in the database physically. For example, the
average salary of employees is a derived attribute, as it is not directly
stored in the database. The value can be derived from other attributes
present in the database physically. (A short SQL sketch of derived and
multivalued attributes appears after this list of attribute types.)


an attribute whose values can be calculated from related attribute values

(plus possibly data not in the database, such as today’s date, the current

time, or a security code provided by a system user). We indicate a derived

attribute in an E-R diagram by using square brackets around the attribute

name, as shown in Figure 2-8 for the Years Employed attribute. Some E-R

diagramming tools use a notation of a forward slash (/) in front of the

attribute name to indicate that it is derived. (This notation is borrowed from

UML for a virtual attribute.)

4. Single-valued attribute: A single-valued attribute contains only one value
for a given entity. For example, a Social Security number.

5. Multi-valued attribute: A multi-valued attribute is an attribute which
contains more than one value. For example, an employee can have more than
one email id and phone number. A multivalued attribute is an attribute that

may take on more than one value for a given entity (or relationship)

instance. In this text, we indicate a multivalued attribute with curly brackets

around the attribute name, as shown for the Skill attribute in the

EMPLOYEE. In Microsoft Visio, once an attribute is placed in an entity,

you can edit that attribute (column), select the Collection tab and choose

one of the options. (Typically, Multiset will be your choice, but one of the

other options may be more appropriate for a given situation.) Other E-R

diagramming tools may use an asterisk (*) after the attribute name, or you


may have to use supplemental documentation to specify a multivalued

attribute.
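Two of the attribute types above map naturally onto relational constructs. A minimal sketch with hypothetical names: a derived attribute is typically computed in a view rather than stored, and a multivalued attribute such as Skill is typically moved into its own table keyed by the owning entity (an EMPLOYEE table keyed by employee_id is assumed to exist).

    -- derived attribute: line_total is computed from stored attributes, not stored itself
    CREATE VIEW order_line_totals AS
    SELECT order_id,
           product_id,
           quantity * unit_price AS line_total
    FROM order_line;

    -- multivalued attribute: one row per employee/skill pair
    CREATE TABLE employee_skill (
        employee_id INTEGER     NOT NULL REFERENCES employee (employee_id),
        skill       VARCHAR(40) NOT NULL,
        PRIMARY KEY (employee_id, skill)
    );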

Primary Key

A primary key, or identifier, is an attribute or a set of attributes that uniquely

identifies an instance of the entity. For example, for a student entity, student

number is the primary key since no two students have the same student number.

We can have only one primary key in a table. It identifies uniquely every row and

it cannot be null.

Foreign key

A foreign key (sometimes called a referencing key) is a key used to link two
tables together. Typically, you take the primary key field from one table and insert
it into the other table, where it
becomes a foreign key (it remains a primary key in the original table). We can

have more than one foreign key in a table.


How many entities are there in this diagram and what are they?

There are seven entities: STUDENT, COURSE, INSTRUCTOR, SEAT, CLASS,

SECTION and PROFESSOR.

What are the attributes for entity STUDENT?

The attributes for Entity STUDENT are: student_id, student_name and

student_address

What is the primary key for STUDENT?

The primary key for STUDENT is: student_id

What is the primary key for COURSE?

Not a trick question! There is only one primary key, but it is made up of two

attributes. This is called a compound key.

What foreign keys do STUDENT and COURSE contain?

STUDENT and COURSE contain no foreign keys in this diagram. This might

suggest that there are problems with the design... among them are the many-to-many
relationships here. This usually requires that we create a separate table to

describe the relationship. This type of table usually connects foreign ids to each

other.

In this case, let's add an entity called REGISTRATION in the middle of the "takes"
relationship. Since students probably sit in different seats for each course they are
registered in, let's relate SEAT to REGISTRATION instead of STUDENT:
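A sketch of how the REGISTRATION entity resolves the many-to-many relationship between STUDENT and COURSE (the column names and COURSE's compound key are assumptions for illustration):

    CREATE TABLE registration (
        student_id    INTEGER NOT NULL REFERENCES student (student_id),
        course_number CHAR(8) NOT NULL,
        course_term   CHAR(6) NOT NULL,
        seat_no       CHAR(4),                              -- SEAT now relates to REGISTRATION
        PRIMARY KEY (student_id, course_number, course_term),
        FOREIGN KEY (course_number, course_term)
            REFERENCES course (course_number, course_term)  -- COURSE's compound primary key
    );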

3. Relationship

The relationship is another type of component of the ER diagram which is used

to show the dependency among the entities of the database. In the ER diagram,

the relationship is represented by a diamond-shaped box. All the relationship

which exist between the entities is connected by a line which shows in the ER

diagram.

There are different type of relationship which are discussed below:

One-to-one: In this relationship, one instance of an entity is related to exactly one
instance of another entity. For example, an individual has a passport and the
passport is allocated to one individual.


Many-to-one: In this relationship, many instances of an entity are linked to
one instance of another entity. For example, many students study
in one college.

One-to-many: When one instance of an entity is linked to more than one instance of
another entity, it is a one-to-many relationship. For example, one customer places
multiple orders.

Many-to-many: When many instances of one entity are linked to many instances of
another entity, it is known as a many-to-many relationship. For example, students
can have multiple projects, and a project can be
allocated to multiple students.

DEGREE OF A RELATIONSHIP


The degree of a relationship is the number of entity types that participate in that relationship. Thus, the relationship Completes in Figure 2-11 is of degree 2, because there are two entity types: EMPLOYEE and COURSE. The three most common relationship degrees in E-R models are unary (degree 1), binary (degree 2), and ternary (degree 3). Higher-degree relationships are possible, but they are rarely encountered in practice, so we restrict our discussion to these three cases. Examples of unary, binary, and ternary relationships appear in Figure 2-12. (Attributes are not shown in some figures for simplicity.) As you look at Figure 2-12, understand that any particular data model represents a specific situation, not a generalization. For example, consider the Manages relationship in Figure 2-12a. In some organizations, it may be possible for one employee to be managed by many other employees (e.g., in a matrix organization). It is important when you develop an E-R model that you understand the business rules of the particular organization you are modeling.


UNARY RELATIONSHIP

A unary relationship is a relationship between the instances of a single entity

type. (Unary relationships are also called recursive relationships.) Three

examples are shown in Figure 2-12a. In the first example, Is Married To is shown

as a one-to-one relationship between instances of the PERSON entity type.

Because this is a one-to-one relationship, this notation indicates that only the

current marriage, if one exists, needs to be kept about a person. What would

change if we needed to retain the history of marriages for each person? See

Review Question 2-20 and Problem and Exercise 2-34 for other business rules

and their effect on the Is Married To relationship representation. In the second

example, Manages is shown as a one-to-many relationship between instances of

the EMPLOYEE entity type. Using this relationship, we could identify, for

example, the employees who report to a particular manager. The third example is

one case of using a unary relationship to represent a sequence, cycle, or priority

list. In this example, sports teams are related by their standing in their league

(the Stands After relationship). (Note:  In these examples, we ignore whether

these are mandatory- or optional-cardinality relationships or whether the same

entity instance can repeat in the same relationship instance; we will introduce

mandatory and optional cardinality in a later section of this chapter.)

 Figure 2-13 shows an example of another unary relationship, called a bill-

of-materials structure. Many manufactured products are made of assemblies, which in turn are composed of subassemblies and parts, and so on. As shown in


Figure 2-13a, we can represent this structure as a many-to-many unary

relationship. In this figure, the entity type ITEM is used to represent all types of

components, and we use Has Components for the name of the relationship type

that associates lower-level items with higher-level items. 

Two occurrences of this bill-of-materials structure are shown in Figure 2-13b.

Each of these diagrams shows the immediate components of each item as well

as the quantities of that component. For example, item TX100 consists of item

BR450 (quantity 2) and item DX500 (quantity 1). You can easily verify that the

associations are in fact many-to-many. Several of the items have more than one

component type (e.g., item MX300 has three immediate component types:

HX100, TX100, and WX240). Also, some of the components are used in several

higher-level assemblies. For example, item WX240 is used in both item MX300

and item WX340, even at different levels of the bill-of-materials. The many-to-

many relationship guarantees that, for example, the same subassembly structure

of WX240 (not shown) is used each time item WX240 goes into making some

other item. 


The presence of the attribute Quantity on the relationship suggests that the analyst consider converting the relationship Has Components to an associative entity. Figure 2-13c shows the entity type BOM STRUCTURE, which forms an association between instances of the ITEM entity type. A second attribute (named Effective Date) has been added to BOM STRUCTURE to record the date when this component was first used in the related assembly. Effective dates are often needed when a history of values is required. Other data model structures can be used for unary relationships involving such hierarchies.
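As a hedged sketch of how this bill-of-materials structure might be stored (the table and column names are assumptions, and the effective dates are made up for illustration), one ITEM table is paired with a self-referencing BOM STRUCTURE table whose rows carry the Quantity and Effective Date intersection data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE item (item_no TEXT PRIMARY KEY, description TEXT);

    -- Each row links a higher-level item to one of its immediate components,
    -- so the same ITEM entity type appears on both sides (a unary relationship).
    CREATE TABLE bom_structure (
        parent_item    TEXT REFERENCES item(item_no),
        component_item TEXT REFERENCES item(item_no),
        quantity       INTEGER NOT NULL,
        effective_date TEXT,
        PRIMARY KEY (parent_item, component_item)
    );
""")

for item_no in ("TX100", "BR450", "DX500"):
    conn.execute("INSERT INTO item (item_no) VALUES (?)", (item_no,))

# Item TX100 consists of item BR450 (quantity 2) and item DX500 (quantity 1),
# as in the Figure 2-13b example; the dates are purely hypothetical.
conn.executemany("INSERT INTO bom_structure VALUES (?, ?, ?, ?)",
                 [("TX100", "BR450", 2, "2024-01-01"),
                  ("TX100", "DX500", 1, "2024-01-01")])
```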

BINARY RELATIONSHIP

A binary relationship is a relationship between the instances of two entity types

and is the most common type of relationship encountered in data modeling.


Figure 2-12b shows three examples. The first (one-to-one) indicates that an

employee is assigned one parking place, and that each parking place is assigned

to one employee. The second (one-to-many) indicates that a product line may

contain several products, and that each product belongs to only one product line.

The third (many-to-many) shows that a student may register for more than one

course, and that each course may have many student registrants.

 CONCEPTS IN ACTION

2-A THE WALT DISNEY COMPANY

The Walt Disney Company is world-famous for its many entertainment ventures

but it is especially identified with its theme parks. First there was Disneyland in

Los Angeles, then the mammoth Walt Disney World in Orlando. These were

followed by parks in Paris and Tokyo, and one now under development in Hong

Kong. The Disney theme parks are so well run that they create a wonderful

feeling of natural harmony with everyone and everything being in the right place

at the right time. When you're there, it's too much fun to stop to think about how

all this is organized and carried off with such precision. But, is it any wonder to

learn that databases play a major part?

One of the Disney theme parks' interesting database applications keeps track of

all of the costumes worn by the workers or “cast members” in the parks. The

system is called the Garment Utilization System or GUS (which was also the

name of one of the mice that helped Cinderella sew her dress!). Managing these

costumes is no small task. Virtually all of the cast members, from the actors and


dancers to the ride operators, wear some kind of costume. Disneyland in Los

Angeles has 684,000 costume parts (each costume is typically made up of

several garments), each of which is uniquely bar-coded, for its 46,000 cast

members. The numbers in Orlando are three million garments and 90,000 cast

members. Using bar-code scanning, GUS tracks the life cycle of every garment.

This includes the points in time when a garment is in the storage facility, is

checked out to a cast member, is in the laundry, or is being repaired (in house or

at a vendor). In addition to managing the day-to-day movements of the

costumes, the system also provides a rich data analysis capability. The industrial

engineers in Disney's business planning group use the accumulated data to

decide how many garments to keep in stock and how many people to have

staffing the garment checkout windows based on the expected wait times. They

also use the data to determine whether certain fabrics or the garments made by

specific manufacturers are not holding up well through a reasonable number of

uses or of launderings. 

GUS, which was inaugurated at Disneyland in Los Angeles in 1998 and then

again at Walt Disney World in Orlando in 2002, replaced a manual system in

which the costume data was written on index cards. It is implemented in

Microsoft's SQL Server DBMS and runs on a Compaq server. It is also linked to

an SAP personnel database to help maintain the status of the cast members. If

GUS is ever down, the process shifts to a Palm Pilot-based backup system that

can later update the database. In order to keep track of the costume parts and

cast members, not surprisingly, there is a relational table for costume parts with


one record for each garment and there is a table for cast members with one

record for each cast member. The costume parts records include the type of

garment, its size, color, and even such details as whether its use is restricted to a

particular cast member and whether it requires a special laundry detergent.

Correspondingly, the cast member records include the person's clothing sizes

and other specific garment requirements.

Ultimately, GUS's database precision serves several purposes in addition to its

fundamental managerial value. The Walt Disney Company feels that consistency

in how its visitors or “guests” look at a given ride gives them an important comfort

level. Clearly, GUS provides that consistency in the costuming aspect. In

addition, GUS takes the worry out of an important part of each cast member's

workday. One of Disney's creeds is to strive to take good care of its cast

members so that they will take good care of Disney's guests. Database

management is a crucial tool in making this work so well.

FIGURE 2.2 A binary relationship

Cardinality

One-to-One Binary Relationship Figure 2.3 shows three binary relationships of

different cardinalities, representing the maximum number of entities that can be

involved in a particular relationship. Figure 2.3a shows a one-to-one (1-1) binary


relationship, which means that a single occurrence of one entity type can be

associated with a single occurrence of the other entity type and vice versa. A

particular salesperson is assigned to one office. Conversely, a particular office (in

this case they are all private offices!) has just one salesperson assigned to it.

Note the “bar” or “one” symbol on either end of the relationship in the diagram

indicating the maximum one cardinality. The way to read these diagrams is to

start at one entity, read the relationship on the connecting line, pick up the

cardinality on the other side of the line near the second entity, and then finally

reach the other entity. Thus, Figure 2.3a, reading from left to right, says, “A

salesperson works in one (really at most one, since it is a maximum) office.” The

bar or one symbol involved in this statement is the one just to the left of the office

entity box. Conversely, reading from right to left, “An office is occupied by one

salesperson.”

FIGURE 2.3 Binary relationships with cardinalities

One-to-Many Binary Relationship Associations can also be multiple in nature. Figure 2.3b shows a


one-to-many (1-M) binary relationship between salespersons and customers.

The “crow's foot” device attached to the customer entity box represents the

multiple association. Reading from left to right, the diagram indicates that a

salesperson sells to many customers. (Note that “many,” as the maximum

number of occurrences that can be involved, means a number that can be 1, 2,

3, …n. It also means that the number is not restricted to being exactly one, which

would require the “one” or “bar” symbol instead of the crow's foot.) Reading from

right to left, Figure 2.3b says that a customer buys from only one salesperson.

This is reasonable, indicating that in this company each salesperson has an

exclusive territory and thus each customer can be sold to by only one

salesperson from the company.

Many-to-Many Binary Relationship Figure 2.3c shows a many-to-many (M-M)

binary relationship among salespersons and products. A salesperson is

authorized to sell many products; a product can be sold by many salespersons.

By the way, in some circumstances, in either the 1-M or M-M case, “many” can

be either an exact number or have a known maximum value. For example, a

company rule may set a limit of a maximum of ten customers in a sales territory.

Then the “many” in the 1-M relationship of Figure 2.3b can never be more than

10 (a salesperson can have many customers but not more than 10). Sometimes

people include this exact number or maximum next to or even instead of the

crow's foot in the E-R diagram.

Modality


Figure 2.4 shows the addition of the modality, the minimum number of entity

occurrences that can be involved in a relationship. In our particular salesperson

environment, every salesperson must be assigned to an office. On the other

hand, a given office might be empty or it might be in use by exactly one

salesperson. This situation is recorded in Figure 2.4a, where the “inner” symbol,

which can be a zero or a one, represents the modality—the minimum—and the

“outer” symbol, which can be a one or a crow's foot, represents the cardinality—

the maximum. Reading Figure 2.4a from left to right tells us that a salesperson

works in a minimum of one and a maximum of one office, which is another way of

saying exactly one office. Reading from right to left, an office may be occupied by

or assigned to a minimum of no salespersons (i.e. the office is empty) or a

maximum of one salesperson.

Similarly, Figure 2.4b indicates that a salesperson may have no customers or

many customers. How could a salesperson have no customers? (What are we

paying her for?!?) Actually, this allows for the case in which we have just hired a

new salesperson and have not as yet assigned her a territory or any customers.

On the other hand, a customer is always assigned to exactly one salesperson.

We never want customers to be without a salesperson—how would they buy

anything from us when they need to? We never want to be in a position of losing

sales! If a salesperson leaves the company, the company's procedures require

that another salesperson or, temporarily, a sales manager be immediately

assigned the departing salesperson's customers. Figure 2.4c says that each

salesperson is authorized to sell at least one or many of our products and each


product can be sold by at least one or many of our salespersons. This includes

the extreme, but not surprising, case in which each salesperson is authorized to

sell all the products and each product can be sold by all the salespersons.

FIGURE 2.4 Binary relationships with cardinalities (maximums) and modalities (minimums)

More About Many-to-Many Relationships

Intersection Data Generally, we think of attributes as facts about entities. Each

salesperson has a salesperson number, a name, a commission percentage, and

a year of hire. At the entity occurrence level, for example, one of the

salespersons has salesperson number 528, the name Jane Adams, a

commission percentage of 15 %, and the year of hire of 2003. In an E-R diagram,

these attributes are written or drawn together with the entity, as in Figure 2.1 and

the succeeding figures. This certainly appears to be very natural and obvious.

Are there ever any circumstances in which an attribute can describe something

other than an entity?


Consider the many-to-many relationship between salespersons and products

in Figure 2.4c. As usual, salespersons are described by their salesperson

number, name, commission percentage, and year of hire. Products are described

by their product number, name, and unit price. But, what if there is a requirement

to keep track of the number of units (call it “quantity”) of a particular product that

a particular salesperson has sold? Can we add the quantity attribute to the

product entity box? No, because for a particular product, while there is a single

product number, product name, and unit price, there would be lots of “quantities,”

one for each salesperson selling the product. Can we add the quantity attribute to

the salesperson entity box? No, because for a particular salesperson, while there

is a single salesperson number, salesperson name, commission percentage, and

year of hire, there will be lots of “quantities,” one for each product that the

salesperson sells. It makes no sense to try to put the quantity attribute in either

the salesperson entity box or the product entity box. While each salesperson has

a single salesperson number, name, commission percentage, and year of hire,

each salesperson has many “quantities,” one for each product he sells. Similarly,

while each product has a single product number, product name, and unit price,

each product has many “quantities,” one for each salesperson who sells that

product. But an entity box in an E-R diagram is designed to list the attributes that

simply and directly describe the entity, with no complications involving other

entities. Putting quantity in either the salesperson entity box or the product entity

box just will not work.


The quantity attribute doesn't describe either the salesperson alone or the

product alone. It describes the combination of a particular salesperson and a

particular product. In general, we can say that it describes the combination of a

particular occurrence of one entity type and a particular occurrence of the other

entity type. Let's say that since salesperson number 137 joined the company, she

has sold 170 units of product number 24013. The quantity 170 doesn't make

sense as a description or characteristic of salesperson number 137 alone. She

has sold many different kinds of products. To which one does the quantity 170

refer? Similarly, the quantity 170 doesn't make sense as a description or

characteristic of product number 24013 alone. It has been sold by many different

salespersons.

In fact, the quantity 170 falls at the intersection of salesperson number 137 and

product number 24013. It describes the combination of or the association

between that particular salesperson and that particular product and it is known

as intersection data. Figure 2.5 shows the many-to-many relationship between

salespersons and products with the intersection data, quantity, represented in a

separate box attached to the relationship line. That is the natural place to draw it.

Pictorially, it looks as if it is at the intersection between the two entities, but there

is more to it than that. The intersection data describes the relationship between

the two entities. We know that an occurrence of the Sells relationship specifies

that salesperson 137 has sold some of product 24013. The quantity 170 is an

attribute of this occurrence of that relationship, further describing this occurrence

of the relationship. Not only do we know that salesperson 137 sold some of


product 24013 but we know how many units of that product that salesperson

sold.

FIGURE 2.5 Many-to-many binary relationship with intersection data

 
The Unique Identifier in Many-to-Many Relationships Since, as we have just seen, a many-to-many relationship can appear to be a kind of an entity,

complete with attributes, it also follows that it should have a unique identifier, like

other entities. (If this seems a little strange or even unnecessary here, it will

become essential later in the book when we actually design databases based on

these E-R diagrams.) In its most basic form, the unique identifier of the many-to-

many relationship or the associative entity is the combination of the unique

identifiers of the two entities in the many-to-many relationship. So, the unique

identifier of the many-to-many relationship of Figure 2.5 or, as shown in Figure

2.6, of the associative entity, is the combination of the Salesperson Number and

Product Number attributes.

Sometimes, an additional attribute or attributes must be added to this

combination to produce uniqueness. This often involves a time element. As

currently constructed, the E-R diagram in Figure 2.6 indicates the quantity of a

particular product sold by a particular salesperson since the salesperson joined


the company. Thus, there can be only one occurrence of SALES combining a

particular salesperson with a particular product. But if, for example, we wanted to

keep track of the sales on an annual basis, we would have to include a year

attribute and the unique identifier would be Salesperson Number, Product

Number, and Year. Clearly, if we want to know how many units of each product

were sold by each salesperson each year, the combination of Salesperson

Number and Product Number would not be unique because for a particular

salesperson and a particular product, the combination of those two values would

be the same each year! Year must be added to produce uniqueness, not to

mention to make it clear in which year a particular value of the Quantity attribute

applies to a particular salesperson-product combination.

The third and last possibility occurs when the nature of the associative entity is

such that it has its own unique identifier. For example, a company might specify a

unique serial number for each sales record. Another example would be the

many-to-many relationship between motorists and police officers who give traffic

tickets for moving violations. (Hopefully it's not too many for each motorist!) The

unique identifier could be the combination of police officer number and motorist

driver's license number plus perhaps date and time. But, typically, each traffic

ticket has a unique serial number and this would serve as the unique identifier.
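A brief, hedged sketch of the associative entity discussed above (the names are assumptions): Quantity is stored as intersection data, and Year is included in the composite primary key so that annual sales figures remain unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE salesperson (salesperson_number INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product     (product_number     INTEGER PRIMARY KEY, name TEXT);

    -- Quantity describes one salesperson/product combination, not either entity
    -- alone; adding year keeps one row per salesperson, product, and year.
    CREATE TABLE sales (
        salesperson_number INTEGER REFERENCES salesperson(salesperson_number),
        product_number     INTEGER REFERENCES product(product_number),
        year               INTEGER,
        quantity           INTEGER NOT NULL,
        PRIMARY KEY (salesperson_number, product_number, year)
    );
""")
```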

 TERNARY RELATIONSHIP

A ternary relationship is a simultaneous relationship among the instances of

three entity types. A typical business situation that leads to a ternary relationship

is shown in Figure 2-12c. In this example, vendors can supply various parts to


warehouses. The relationship Supplies is used to record the specific parts that

are supplied by a given vendor to a particular warehouse. Thus, there are three

entity types: VENDOR, PART, and WAREHOUSE. There are two attributes on

the relationship Supplies: Shipping Mode and Unit Cost. For example, one

instance of Supplies might record the fact that vendor X can ship part C to

warehouse Y, that the shipping mode is next-day air, and that the cost is $5 per

unit.

Don’t be confused: A ternary relationship is not the same as three binary

relationships. For example, Unit Cost is an attribute of the Supplies relationship

in Figure 2-12c. Unit Cost cannot be properly associated with any one of the

three possible binary relationships among the three entity types, such as that

between PART and WAREHOUSE. 

Thus, for example, if we were told that vendor X can ship part C for a unit cost of

$8, those data would be incomplete because they would not indicate to which

warehouse the parts would be shipped. As usual, the presence of an attribute on

the relationship Supplies in Figure 2-12c suggests converting the relationship to

an associative entity type. Figure 2-14 shows an alternative (and preferable)

representation of the ternary relationship shown in Figure 2-12c. In Figure 2-14,

the (associative) entity type SUPPLY SCHEDULE is used to replace the Supplies

relationship from Figure 2-12c. Clearly, the entity type SUPPLY SCHEDULE is of

independent interest to users. However, notice that an identifier has not yet been

assigned to SUPPLY SCHEDULE. This is acceptable. If no identifier is assigned

to an associative entity during E-R modeling, an identifier (or key) will be


assigned during logical modeling (discussed in Chapter 4). This will be a

composite identifier whose components will consist of the identifier for each of

the participating entity types (in this example, PART, VENDOR, and

WAREHOUSE). Can you think of other attributes that might be associated with

SUPPLY SCHEDULE?

As noted earlier, we do not label the lines from SUPPLY SCHEDULE to the three

entities. This is because these lines do not represent binary relationships. To

keep the same meaning as the ternary relationship of Figure 2-12c, we cannot

break the Supplies relationship into three binary relationships, as we have

already mentioned. So, here is a guideline to follow: Convert all ternary (or

higher) relationships to associative entities, as in this example. Song et al. (1995)

shows that participation constraints (described in a following section on


cardinality constraints) cannot be accurately represented for a ternary

relationship, given the notation with attributes on the relationship line. However,

by converting to an associative entity, the constraints can be accurately

represented. Also, many E-R diagram drawing tools, including most CASE tools,

cannot represent ternary relationships. So, although not semantically accurate,

you must use these tools to represent the ternary or higher order relationship

with an associative entity and three binary relationships, which have a mandatory

association with each of the three related entity types.
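Following the guideline above, a minimal sketch (with assumed names) of the SUPPLY SCHEDULE associative entity replaces the ternary Supplies relationship with one foreign key per participating entity type plus the relationship's own attributes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE vendor    (vendor_id    INTEGER PRIMARY KEY, vendor_name    TEXT);
    CREATE TABLE part      (part_id      INTEGER PRIMARY KEY, part_name      TEXT);
    CREATE TABLE warehouse (warehouse_id INTEGER PRIMARY KEY, warehouse_name TEXT);

    -- One row records that a given vendor supplies a given part to a given
    -- warehouse; Shipping Mode and Unit Cost describe that three-way combination.
    CREATE TABLE supply_schedule (
        vendor_id     INTEGER REFERENCES vendor(vendor_id),
        part_id       INTEGER REFERENCES part(part_id),
        warehouse_id  INTEGER REFERENCES warehouse(warehouse_id),
        shipping_mode TEXT,
        unit_cost     REAL,
        PRIMARY KEY (vendor_id, part_id, warehouse_id)  -- composite identifier assigned at logical design
    );
""")
```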

Convert many-to-many Relationships into one-to-many Relationships

Entities in a many-to-many relationship must be linked in a special way, that is

through a third entity, called a composite entity, also known as an associative entity. A composite entity has only one function: to provide an indirect link between two entities in an M:N relationship.


In the language of tables, a composite entity is termed a linking table. A composite entity has no key attribute of its own; rather, it receives the key attributes from each of the two entities it links and combines them to form a composite key attribute. In the language of tables, a composite key attribute is termed a composite primary key.

The following graphic illustrates a composite entity that now indirectly links the

STUDENT and CLASS entities:

Create a composite entity called STUDENT CLASSES from a STUDENT entity

and CLASS entity. 

The M:N relationship between STUDENT and CLASS has been dissolved into

two one-to-many relations:

1. The 1:N relationship between STUDENT and STUDENT CLASSES reads

this way: for one instance of STUDENT, there exists zero, one, or many

instances of STUDENT CLASSES; but for one instance of STUDENT

CLASSES, there exists zero or one instance of STUDENT.

2. The 1:N relationship between CLASS and STUDENT CLASSES reads

this way: For one instance of CLASS, there exists zero, one, or many


instances of STUDENT CLASSES; but for one instance of STUDENT

CLASSES, there exists zero or one instance of CLASS.

Sometimes, but by no means always, the composite entity will “swipe”

attributes from one or both entities it links, because those attributes would be

more logically placed in the composite entity. In the case of STUDENT

CLASSES, however, none of the non-key attributes from STUDENT or

CLASS should be removed to the composite entity. The designer makes this

decision on a case-by-case basis. The next lesson describes types of

participation in relationships.


CHAPTER 3:


THE ENHANCED E-R MODEL

Researched and presented by:

Antolino Jr, Mike F.


Mendador, Jonnabelle

Definitions

 Entity–relationship model (or ER model) - describes interrelated things of

interest in a specific domain of knowledge. A basic ER model is composed

of entity types (which classify the things of interest) and specifies


relationships that can exist between entities (instances of those entity

types).

 Supertype - an entity type that has a relationship (a parent-to-child relationship) with one or more subtypes and contains the attributes that are common to its subtypes.

 Subtypes - subgroups of the supertype entity; each subtype has its own unique attributes, which differ from one subtype to another.

 Generalization - works on the principle of a bottom-up approach.

 Specialization - a top-down approach where a higher-level entity is specialized into two or more lower-level entities.

 Disjointness constraints - You will need to decide whether a supertype

instance may simultaneously be a member of two or more subtypes.

 Disjoint rule - an instance of a supertype may not simultaneously be a

member of two (or more) subtypes.

 Overlapping Rule - an instance of a supertype may simultaneously be a

member of two (or more) subtypes.

 Completeness constraints - decide whether a supertype instance must

also be a member of at least one subtype.

 Total Specialization Rule -   Each entity instance of a supertype must also

be a member of some subtype.

 Partial Specialization Rule - An entity instance of a supertype may or may

not belong to any subtype.


 Supertype/Subtype Hierarchy - a structure that comprises a combination of supertype/subtype relationships.

 Subtype Discriminator - is an attribute of a supertype whose values

determine the target subtype or subtypes.

 Universal data model - is a generic or template data model that can be

reused as

a starting point for a data modeling project.

ENHANCED E-R MODEL

The basic E-R model described in the previous chapter was first introduced

during the mid-1970s. It has been suitable for modeling most common business

problems and has enjoyed widespread use. However, the business environment

has changed dramatically since that time. Business relationships are more

complex, and as a result, business data are much more complex as well. For


example, organizations must be prepared to segment their markets and to

customize their products, which places much greater demands on organizational

databases. To cope better with these changes, researchers and consultants

have continued to enhance the E-R model so that it can more accurately

represent the complex data encountered in today’s business environment. The

term enhanced entity-relationship (EER) model is used to identify the model that

has resulted from extending the original E-R model with these new modeling

constructs. These extensions make the EER model semantically similar to

object-oriented data modeling

SUPERTYPE AND SUBTYPE

Recognize when to use supertype / subtype relationship in data modelling

At times, a few entities in a data model may share some common properties (attributes) while also having one or more distinct attributes of their own. Based on these attributes, such entities are categorized as supertype and subtype entities.

A supertype is an entity type that has a relationship (a parent-to-child relationship) with one or more subtypes and contains the attributes that are common to its subtypes.

Subtypes are subgroups of the supertype entity; each subtype has its own unique attributes, which differ from one subtype to another.


Supertypes and Subtypes are parent and child entities respectively and the

primary keys of supertype and subtype are always identical.

When designing a data model for PEOPLE, you can have a supertype entity of PEOPLE, and its subtype entities can be vendor, customer, and employee. The PEOPLE entity will have attributes like Name, Address, and Telephone Number, which are common to its subtypes, and you can design the employee, vendor, and customer entities with their own unique attributes. Based on this scenario, the employee entity can be further classified into different subtype entities such as HR employee and IT employee. Here, employee is the supertype for the entities HR employee and IT employee, but it is in turn a subtype of the PEOPLE entity.

Let us illustrate supertype/subtype relationships with a simple yet common

example. Suppose that an organization has three basic types of employees:

hourly employees, salaried employees, and contract consultants. The following

are some of the important attributes for each of these types of employees:

 Hourly employees: Employee Number, Employee Name, Address, Date Hired, Hourly Rate

 Salaried employees: Employee Number, Employee Name, Address, Date Hired, Annual Salary, Stock Option

 Contract consultants: Employee Number, Employee Name, Address, Date Hired, Contract Number, Billing Rate

Notice that all of the employee types have several attributes in common:

Employee Number, Employee Name, Address, and Date Hired. In addition, each

type has one or more attributes distinct from the attributes of other types (e.g.,

Hourly Rate is unique to hourly employees). If you were developing a conceptual

data model in this situation, you might consider three choices:

1. Define a single entity type called EMPLOYEE. Although conceptually simple,

this approach has the disadvantage that EMPLOYEE would have to contain all of

the attributes for the three types of employees. For an instance of an hourly

employee (for example), attributes such as Annual Salary and Contract Number

would not apply (optional attributes) and would be null or not used. When taken

to a development environment, programs that use this entity type would

necessarily need to be quite complex to deal with the many variations.

2. Define a separate entity type for each of the three entities. This approach

would fail to exploit the common properties of employees, and users would have

to be careful to select the correct entity type when using the system.

3. Define a supertype called EMPLOYEE with subtypes HOURLY EMPLOYEE, SALARIED EMPLOYEE, and CONSULTANT. This approach exploits the common properties of all employees, yet it recognizes the distinct properties of each type.

The below figure shows a representation of the EMPLOYEE supertype with its three subtypes, using enhanced E-R notation. Attributes shared by all employees are associated with the EMPLOYEE entity type. Attributes that are peculiar to each subtype are included with that subtype only.
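One common way to carry this choice into tables is sketched below (a hedged illustration with assumed names, not the only possible design): the supertype table holds the shared attributes, and each subtype table reuses the same primary key and adds only its distinct attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Attributes common to all employees live once, in the supertype.
    CREATE TABLE employee (
        employee_number INTEGER PRIMARY KEY,
        employee_name   TEXT,
        address         TEXT,
        date_hired      TEXT
    );
    -- Each subtype keeps only its distinct attributes and shares the same key.
    CREATE TABLE hourly_employee (
        employee_number INTEGER PRIMARY KEY REFERENCES employee(employee_number),
        hourly_rate     REAL
    );
    CREATE TABLE salaried_employee (
        employee_number INTEGER PRIMARY KEY REFERENCES employee(employee_number),
        annual_salary   REAL,
        stock_option    TEXT
    );
    CREATE TABLE consultant (
        employee_number INTEGER PRIMARY KEY REFERENCES employee(employee_number),
        contract_number TEXT,
        billing_rate    REAL
    );
""")
```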

Purpose of the Supertypes and Subtypes

- Supertypes and subtypes occur frequently in the real world:

 food order types (eat in, to go)

 grocery bag types (paper, plastic)

 payment types (check, cash, credit)

- You can typically associate ‘choices’ of something with supertypes and

subtypes.

- For example, what will be the method of payment – cash, check or credit card?

- Understanding real world examples helps us understand how and when to

model them.

Subdivide an Entity

 Sometimes it makes sense to subdivide an entity into subtypes.

 This may be the case when a group of instances has special properties,

such as attributes or relationships that exist only for that group.

 In this case, the entity is called a “supertype” and each group is called a

“subtype”.

Subtype Characteristics


A subtype:

 Inherits all attributes of the supertype

 Inherits all relationships of the supertype

 Usually has its own attributes or relationships

 Is drawn within the supertype

 Never exists alone

 May have subtypes of its own

Always More Than One Subtype

 When an ER model is complete, subtypes never stand alone. In other

words, if an entity has a subtype, a second subtype must also exist.

 A single subtype is exactly the same as the supertype.

 This idea leads to the two subtype rules:

 Exhaustive: Every instance of the supertype is also an instance of one of

the subtypes. All subtypes are listed without omission.

 Mutually Exclusive: Each instance of a supertype is an instance of only

one possible subtype.

At the conceptual modeling stage, it is good practice to include an OTHER

subtype to make sure that your subtypes are exhaustive — that you are handling

every instance of the supertype.

Subtypes Always Exist


Any entity can be subtyped by making up a rule that subdivides the

instances into groups.

- But being able to subtype is not the issue—having a reason to subtype is the

issue.

- When a need exists within the business to show similarities and differences

between instances, then subtype.

Correctly Identifying Subtypes

When modeling supertypes and subtypes, you can use three questions to

see if the subtype is correctly identified:

1. Is this subtype a kind of supertype?

2. Have I covered all possible cases? (exhaustive)

3. Does each instance fit into one and only one subtype? (mutually

exclusive)

SPECIALIZATION AND GENERALIZATION

Specialization and generalization as techniques for defining supertype /

subtype relationships.

Generalization

Generalization works on the principle of a bottom-up approach: lower-level entities are combined to form a higher-level entity. This process can be repeated further to create even more general entities.

In the generalization process, common properties are drawn from particular entities, and thus we can create a generalized entity. We can summarize the generalization process as follows: it combines subclasses to form a superclass.

An example of generalization is shown in below figure. In the upper figure, three

entity types have been defined: CAR, TRUCK, and MOTORCYCLE. At this

stage, the data modeler intends to represent these separately on an E-R

diagram. However, on closer examination, we see that the three entity types

have a number of attributes in common: Vehicle ID (identifier), Vehicle Name

(with components Make and Model), Price, and Engine Displacement. This fact

(reinforced by the presence of a common identifier) suggests that each of the three entity types is really a version of a more general entity type. This more general entity type (named VEHICLE), together with the resulting supertype/subtype relationships, is shown in Figure b. The entity CAR has the specific attribute No Of Passengers, whereas TRUCK has two specific attributes: Capacity and Cab Type. Thus, generalization has allowed us to group entity types along with their common attributes and at the same time preserve specific attributes that are peculiar to each subtype.


Notice that the entity type MOTORCYCLE is not included in the relationship. Is

this simply an omission? No. Instead, it is deliberately not included because it

does not satisfy the conditions for a subtype discussed earlier. Comparing the

two figures  you will notice that the only attributes of MOTORCYCLE are those

that are common to all vehicles; there are no attributes specific to motorcycles.

Furthermore, MOTORCYCLE does not have a relationship to another entity type.

Thus, there is no need to create a MOTORCYCLE subtype.

    The fact that there is no MOTORCYCLE subtype suggests that it must be

possible to have an instance of VEHICLE that is not a member of any of its

subtypes.

Specialization

We can say that specialization is the opposite of generalization. In specialization, an entity is broken down into smaller parts to simplify it further. In other words, in specialization a particular entity gets divided into sub-entities on the basis of its characteristics. Inheritance also takes place in specialization.

An example of specialization is shown in Figure 3-5. Figure 3-5a shows an

entity type named PART, together with several of its attributes. The identifier is

Part No, and other attributes are Description, Unit Price, Location, Qty On Hand,

Routing Number,and Supplier. (The last attribute is multivalued and composite

because there may be more than one supplier with an associated unit price for a


part.)

In discussions with users, we discover that there are two possible sources for parts: Some are manufactured internally, whereas others are purchased from outside suppliers. Further, we discover that some parts are obtained from both sources. In this case, the choice depends on factors such as manufacturing capacity, unit price of the parts, and so on.

    Some of the attributes in Figure 3-5a apply to all parts, regardless of source.

However,others depend on the source. Thus, Routing Number applies only to

manufactured parts, whereas Supplier ID and Unit Price apply only to purchased

parts. These factors suggest that PART should be specialized by defining the

subtypes MANUFACTURED PART and PURCHASED PART (Figure 3-5b).

    In Figure 3-5b, Routing Number is associated with MANUFACTURED PART.

The data modeler initially planned to associate Supplier ID and Unit Price with

PURCHASED PART. However, in further discussions with users, the data

modeler suggested instead that they create a SUPPLIER entity type and an

associative entity linking PURCHASED PART with SUPPLIER. This associative

entity (named SUPPLIES in Figure 3-5b) allows users to more easily associate

purchased parts with their suppliers. Notice that the attribute Unit Price is now


associated with the associative entity so that the unit price for a part may vary

from one supplier to another. In this example, specialization has permitted a

preferred representation of the problem domain.
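A hedged sketch of the specialization just described (table and column names are assumptions): MANUFACTURED PART and PURCHASED PART share the PART key, and the SUPPLIES associative entity carries Unit Price so that the price of the same purchased part can vary by supplier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE part (
        part_no     TEXT PRIMARY KEY,
        description TEXT,
        location    TEXT,
        qty_on_hand INTEGER
    );
    CREATE TABLE manufactured_part (
        part_no        TEXT PRIMARY KEY REFERENCES part(part_no),
        routing_number TEXT
    );
    CREATE TABLE purchased_part (
        part_no TEXT PRIMARY KEY REFERENCES part(part_no)
    );
    CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY, supplier_name TEXT);

    -- SUPPLIES links purchased parts to suppliers; Unit Price lives here.
    CREATE TABLE supplies (
        part_no     TEXT    REFERENCES purchased_part(part_no),
        supplier_id INTEGER REFERENCES supplier(supplier_id),
        unit_price  REAL,
        PRIMARY KEY (part_no, supplier_id)
    );
""")
```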

Figure 3-5a

 Figure 3-5b

COMPLETENESS AND DISJOINT CONSTRAINTS

Completeness constraints and disjointness constraints in modelling

supertype / subtype relationships.

Disjointness constraints - You will need to decide whether a supertype

instance may simultaneously be a member of two or more subtypes. It has two

rules. The disjoint rule forces subclasses to have disjoint sets of entities. The

overlap rule forces a subclass (also known as a supertype instance) to have

overlapping sets of entities.

DISJOINT RULE

            


An instance of a supertype may not simultaneously be a member of two (or more) subtypes.

OVERLAP RULE

            

An instance of a supertype may simultaneously be a member of two (or more) subtypes.

Completeness constraints - decide whether a supertype instance must also be

a member of at least one subtype. The total specialization rule demands that

every entity in the superclass belong to some subclass. Just as with a regular

ERD, total specialization is symbolized with a double line connection between

entities. The partial specialization rule allows an entity to not belong to any of the

subclasses. It is represented with a single line connection.

TOTAL SPECIALIZATION RULE

       

Each entity instance of a supertype must also be a member of some subtype.

PARTIAL SPECIALIZATION RULE


An entity instance of a supertype may or may not belong to any subtype.

SUPERTYPE AND SUBTYPE HIERARCHY

A supertype entity in one relationship may be a subtype entity in another relationship. When a structure comprises a combination of supertype/subtype relationships, that structure is called a supertype/subtype hierarchy, or generalization hierarchy.

Generalization can also be described in terms of inheritance, which specifies that

all the attributes of a supertype are propagated down the hierarchy to entities of a

lower type. Generalization may occur when a generic entity, which we call the

supertype entity, is partitioned by different values of a common attribute.

SUBTYPE DISCRIMINATOR

A subtype discriminator is an attribute of the supertype that indicates an entity's

subtype. The attribute's values are what determine the target subtype.

Disjoint subtypes - simple attributes that must have alternative values to indicate

any possible subtypes.


Overlapping subtypes - composite attributes whose subparts pertain to various

subtypes. Each subpart has a Boolean value that indicates whether or not the

instance belongs to the associated subtype.

SUBTYPE DISCRIMINATION: DISJOINT SUBTYPES

 Specialization and Disjoint

 Employee: Hourly, Salaried, Consultant

 Employee Type = the discriminator

 Code: “H” = Hourly

 Code: “S” = Salaried

 Code: “C” = Consultant

SUBTYPE DISCRIMINATION: OVERLAPPING SUBTYPES

- More than one subtype may apply

- The discriminator components are Manufactured? and Purchased?

- Where are these values to be stored?


The code will be:
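The figure that originally followed this line is not reproduced here. As a hedged sketch of the two discriminator styles just described (the column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Disjoint subtypes: one simple discriminator attribute with alternative values.
    CREATE TABLE employee (
        employee_number INTEGER PRIMARY KEY,
        employee_type   TEXT CHECK (employee_type IN ('H', 'S', 'C'))  -- Hourly / Salaried / Consultant
    );

    -- Overlapping subtypes: a composite discriminator, one yes/no flag per subtype.
    CREATE TABLE part (
        part_no         TEXT PRIMARY KEY,
        is_manufactured INTEGER NOT NULL DEFAULT 0,  -- 1 = also a MANUFACTURED PART
        is_purchased    INTEGER NOT NULL DEFAULT 0   -- 1 = also a PURCHASED PART
    );
""")
```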

ENTITY CLUSTER

 EER diagrams are difficult to read when there are too many entities and relationships.

 Solution: Group entities and relationships into entity clusters.

 Entity Cluster: A set of one or more entity types and associated relationships grouped into a single abstract entity type.


After clustering, the diagram collapses into a single abstract entity, the Manufacturing Cluster.


PACKAGED DATA MODEL

a. The age of the data modeler as engineer is dawning.

b. A key strategy for long-term success and a game changer for data modeling.

c. Involves acquiring a packaged or predefined data model.

d. The model is NOT fixed; it can be customized to fit the business rules.

e. Provides a best-practices data model for the chosen industry or functional area.

f. Not inexpensive, although simpler data model patterns can be found in publications.

g. Data model patterns are to data models what code patterns are to programs (just a good starting point for success).

ADVANTAGES OF A PACKAGED DATA MODEL

 Use proven model components

 Save time and cost

 Less likelihood of data model errors

 Easier to evolve and modify over time

 Aid in requirements determination

 Easier to read

 Supertype/subtype hierarchies promote reuse

 Many-to-many relationships enhance model flexibility


 Vendor-supplied data model fosters integration with vendor’s applications

 Universal models support inter-organizational systems

CHAPTER 4:

LOGICAL DATABASE DESIGN AND THE


RELATIONAL MODEL

Researched and presented by:

Baccol, Jonalyn G.
Mequin, Mary Joyce M.


1. List five properties of relations.

PROPERTIES OF RELATIONS 

We have defined relations as two-dimensional tables of data. However, not all

tables are relations. Relations have several properties that distinguish them from

non-relational tables. We summarize these properties next:

 1. Each relation (or table) in a database has a unique name.

 2. An entry at the intersection of each row and column is atomic (or single

valued).

There can be only one value associated with each attribute on a specific row of a

table; no multivalued attributes are allowed in a relation.

 3. Each row is unique; no two rows in a relation can be identical.

 4. Each attribute (or column) within a table has a unique name.

 5. The sequence of columns (left to right) is insignificant. The order of the

columns in a relation can be changed without changing the meaning or use of the

relation; the sequence of rows (top to bottom) is insignificant. As with columns,

the order of the rows of a relation may be changed or stored in any sequence.

REMOVING MULTIVALUED ATTRIBUTES FROM TABLES 

The second property of relations listed in the preceding segment states that no

multivalued attributes are allowed in a relation. Thus, a table that contains one or


more multivalued attributes is not a relation. For example, Figure 1(a) shows the

employee data from the EMPLOYEE1 relation extended to include courses that

may have been taken by those employees. Because a given employee may have

taken more than one course, Course Title and Date Completed are multivalued

attributes. For example, the employee with EmpID 100 has taken two courses. If

an employee has not taken any courses, the Course Title and Date Completed

attribute values are null. (See the employee with EmpID 190 for an example.)

We show how to eliminate the multivalued attributes in Figure 1(b) by filling the

relevant data values into the previously vacant cells of Figure 1(a). As a result,

the table in

Figure 1(b) has only single-valued attributes and now satisfies the atomic

property of relations. The name EMPLOYEE2 is given to this relation to

distinguish it from EMPLOYEE1. However, as you will see, this new relation does

have some undesirable properties.


Figure 1 Eliminating multivalued attributes 
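A small sketch of the same flattening in code (the employee names and course titles below are placeholders, not the values from the textbook figure): the repeating course information is pushed down into one row per course so that every cell holds a single value.

```python
# Data shaped like Figure 1(a): for EmpID 100 the Course Title and Date Completed
# attributes are multivalued, so this structure is not yet a relation.
employee1_extended = [
    {"EmpID": 100, "Name": "Employee A", "courses": [("Course X", "2023-06-19"),
                                                     ("Course Y", "2023-10-07")]},
    {"EmpID": 190, "Name": "Employee B", "courses": []},   # no courses taken
]

# Flatten into EMPLOYEE2-style rows: every cell is atomic (single valued);
# employees with no courses get nulls, as described for EmpID 190.
employee2 = []
for emp in employee1_extended:
    for title, completed in (emp["courses"] or [(None, None)]):
        employee2.append((emp["EmpID"], emp["Name"], title, completed))

for row in employee2:
    print(row)
```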

2. State two essential properties of a candidate key.

CANDIDATE KEYS 

A candidate key is an attribute, or combination of attributes, that uniquely

identifies a row in a relation. A candidate key must satisfy the following

properties (Dutka and Hanson, 1989), which are a subset of the six properties

of a relation previously listed:

 1. Unique identification: For every row, the value of the key must uniquely

identify that row. This property implies that each non key attribute is

functionally dependent on that key.

 2. Non redundancy: No attribute in the key can be deleted without

destroying the property of unique identification.

Figure 2 Representing functional dependencies


We represent the functional dependencies for a relation using the notation shown

in Figure 2. Figure 2(a) shows the representation for EMPLOYEE1. The

horizontal line in the figure portrays the functional dependencies. A vertical line

drops from the primary key (EmpID) and connects to this line. Vertical arrows

then point to each of the nonkey attributes that are functionally dependent on the

primary key.

For the relation EMPLOYEE2 (Figure 1(b)), notice that (unlike EMPLOYEE1)

EmpID does not uniquely identify a row in the relation. For example, there are

two rows in the table for EmpID number 100. There are two types of functional

dependencies in this relation:

 1. EmpID → Name, Dept Name, Salary

 2. EmpID, Course Title → Date Completed

The functional dependencies indicate that the combination of EmpID and Course

Title is the only candidate key (and therefore the primary key) for EMPLOYEE2.

In other words, the primary key of EMPLOYEE2 is a composite key. Neither

EmpID nor Course Title uniquely identifies a row in this relation and therefore

(according to property 1) cannot by itself be a candidate key. Examine the data in

Figure 1(b) to verify that the combination of EmpID and Course Title does

uniquely identify each row of EMPLOYEE2. We represent the functional

dependencies in this relation in Figure 2(b). Notice that Date Completed is the


only attribute that is functionally dependent on the full primary key consisting of

the attributes EmpID and Course Title.

We can summarize the relationship between determinants and candidate keys as

follows: A candidate key is always a determinant, whereas a determinant may or

may not be a candidate key. For example, in EMPLOYEE2, EmpID is a

determinant but not a candidate key. A candidate key is a determinant that

uniquely identifies the remaining (nonkey) attributes in a relation. A determinant

may be a candidate key (such as EmpID in EMPLOYEE1), part of a composite

candidate key (such as EmpID in EMPLOYEE2), or a nonkey attribute. We will

describe examples of this shortly.

As a preview to the following illustration of what normalization accomplishes,

normalized relations have as their primary key the determinant for each of the

nonkeys, and within that relation there are no other functional dependencies.

(Determinant: The attribute on the left side of the arrow in a functional

dependency)
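As a minimal SQL sketch of these keys (data types and lengths are assumptions, not part of the original relations), EMPLOYEE1 has EmpID alone as its primary key, while EMPLOYEE2 needs the composite key:

CREATE TABLE Employee1 (
    EmpID    INTEGER PRIMARY KEY,   -- EmpID alone is the candidate (and primary) key
    Name     VARCHAR(50),
    DeptName VARCHAR(30),
    Salary   DECIMAL(10,2)
);

CREATE TABLE Employee2 (
    EmpID         INTEGER,
    Name          VARCHAR(50),
    DeptName      VARCHAR(30),
    Salary        DECIMAL(10,2),
    CourseTitle   VARCHAR(60),
    DateCompleted DATE,
    PRIMARY KEY (EmpID, CourseTitle)  -- composite key: EmpID alone does not identify a row
);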

3. Give a concise definition of each of the following: First normal form,

second normal form and third normal form.

A normal form is a state of a relation that requires that certain rules regarding

relationships between attributes (or functional dependencies) are satisfied. We

describe these rules briefly in this section and illustrate them in detail in the

following sections:


 1. First normal form. Any multivalued attributes (also called repeating groups)

have been removed, so there is a single value (possibly null) at the intersection

of each row and column of the table (as in Figure 1(b)).

 2. Second normal form. Any partial functional dependencies have been

removed

(i.e., nonkey attributes are identified by the whole primary key).

 3. Third normal form. Any transitive dependencies have been removed (i.e.,

nonkey attributes are identified by only the primary key).

 4. Boyce-Codd normal form. Any remaining anomalies that result from

functional dependencies have been removed (because there was more than one

possible primary key for the same nonkeys).

 5. Fourth normal form. Any multivalued dependencies have been removed.

 6. Fifth normal form. Any remaining anomalies have been removed.

Up to the Boyce-Codd normal form, normalization is based on the analysis of functional dependencies.


A functional dependency is a constraint between two attributes or two sets of


attributes.
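To make the definitions concrete, here is a hedged sketch (table names and data types are assumptions) of how the EMPLOYEE2 relation from the previous section could be brought into second and third normal form by removing the partial dependency of Name, Dept Name, and Salary on EmpID alone:

-- 2NF/3NF decomposition of EMPLOYEE2: every nonkey attribute now depends on the whole key.
CREATE TABLE Employee (
    EmpID    INTEGER PRIMARY KEY,
    Name     VARCHAR(50),
    DeptName VARCHAR(30),
    Salary   DECIMAL(10,2)
);

CREATE TABLE CourseCompletion (
    EmpID         INTEGER REFERENCES Employee(EmpID),
    CourseTitle   VARCHAR(60),
    DateCompleted DATE,
    PRIMARY KEY (EmpID, CourseTitle)
);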
4. Briefly describe four problems that may arise when merging relations.

1. Synonyms

   Two or more attributes with different names but same meaning

   Is an alias or alternate name for a table, view, sequence, or other schema

object

 They are used mainly to make it easy for users to access database

objects owned by other users

 Provides an alternative name for another database object, referred to as

the base object, that can exist on a local or remote server

   Choose either of the two attribute names and eliminate the other synonym

or use a new attribute name to replace both synonyms

For example: 

ITEM (Item No, Color, Supplier Code)

SUPPLIER (Supplier ID, Supplier Name)

Here, Supplier Code and Supplier ID name the same attribute, so one of the two names should be chosen (or a new name used for both) when the relations are merged.

2. Homonyms

 Attributes with same name but different meanings

 A single attribute may have more than one meaning

 Homonyms are those fields of data that have different values but have

similar names.


 The name of the attribute will be the same but the attribute refers to

different things.

For example: Consider STUDENT and CUSTOMER tables in the same database. In STUDENT, F Name might stand for the first name of the student's father, while in CUSTOMER, F Name can be the first name of the customer. The attribute name is the same, but the meanings differ.

                                   

  STUDENT          CUSTOMER

3. Transitive dependencies

 Even if relations are in 3rd Normal Form prior to merging, they may not be

after merging

 An indirect relationship between values in the same table that causes a

functional dependency. 

 To achieve the normalization standard of Third Normal Form (3NF), you

must eliminate any transitive dependency.

 Remove transitive dependencies by creating 3 NF relations


For example:
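Consider a hypothetical relation EMPLOYEE(EmpID, Name, DeptName, DeptLocation), where EmpID → DeptName and DeptName → DeptLocation, so DeptLocation depends on the key only indirectly. A hedged SQL sketch (names and types are assumptions) of the two 3NF relations that remove the transitive dependency:

CREATE TABLE Department (
    DeptName     VARCHAR(30) PRIMARY KEY,
    DeptLocation VARCHAR(40)
);

CREATE TABLE Employee (
    EmpID    INTEGER PRIMARY KEY,
    Name     VARCHAR(50),
    DeptName VARCHAR(30) REFERENCES Department(DeptName)  -- no transitive dependency remains
);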

4. Supertype/subtype relationships

 May be hidden prior to merging

 Is a generic entity type that has a relationship with one or more subtypes

 Is meaningful to the organization and that shares common attributes or

relationships distinct from other subgroups.

 If there are two or more different types of a relation but they contain some

characteristics common to all

For example:

Patient 1 (Patient No., Name, Address)

Patient 2 (Patient No., Room No.)

These can be merged under a PATIENT supertype with two subtypes:

 INPATIENT (Date Admitted)

 OUTPATIENT (Date Treated)
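One hedged SQL sketch of this structure (assuming Room No. applies to inpatients; names and data types are illustrative) places the common attributes in the supertype table and subtype-specific attributes in their own tables that share the supertype's key:

CREATE TABLE Patient (
    PatientNo INTEGER PRIMARY KEY,
    Name      VARCHAR(50),
    Address   VARCHAR(80)
);

CREATE TABLE Inpatient (
    PatientNo    INTEGER PRIMARY KEY REFERENCES Patient(PatientNo),
    RoomNo       VARCHAR(10),
    DateAdmitted DATE
);

CREATE TABLE Outpatient (
    PatientNo   INTEGER PRIMARY KEY REFERENCES Patient(PatientNo),
    DateTreated DATE
);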

5. Transform an ER (EER) diagram into a logically equivalent set of

relations (tables)

           An Entity–relationship model (ER model) describes the structure of a

database with the help of a diagram, which is known as Entity Relationship

Diagram (ER Diagram). An ER model is a design or blueprint of a database that

can later be implemented as a database. 



           Entity Relationship (ER) Model, when conceptualized into diagrams, gives

a good overview of entity-relationship, which is easier to understand. ER

diagrams can be mapped to relational schema, that is, it is possible to create

relational schema using ER diagram. We cannot import all the ER constraints

into a relational model, but an approximate schema can be generated.

           An ER diagram shows the relationship among entity sets. An entity set is

a group of similar entities and these entities can have attributes. In terms of

DBMS, an entity is a table or attribute of a table in a database, so by showing

relationship among tables and their attributes, ER diagram shows the complete

logical structure of a database. 

Facts about ER Diagram Model 

 ER model allows you to draw Database Design 

 It is an easy-to-use graphical tool for modeling data 

 Widely used in Database Design 

 It is a GUI representation of the logical structure of a Database 

 It helps you to identify the entities which exist in a system and the

relationships between those entities 

Why use ER Diagrams? 

Here are the prime reasons for using an ER diagram:

 Helps you to define terms related to entity relationship modeling 


 Provide a preview of how all your tables should connect, what fields are

going to be on each table 

 Helps to describe entities, attributes, relationships 

 ER diagrams are translatable into relational tables which allows you to

build databases quickly 

 ER diagrams can be used by database designers as a blueprint for

implementing data in specific software applications

           For us to understand the transformation of ER diagrams, let us first define logical design and the relational model.

Logical design

 Logical design is an entity design without regard to a relational database

management system. 

 Logical design is the same, regardless of the DBMS

 Limitations or features of a particular DBMS should not be considered

 A logical design is a conceptual, abstract design. You do not deal with the

physical implementation details yet; you deal only with defining the types

of information that you need.

 The process of logical design involves arranging data into a series of

logical relationships called entities and attributes. 


Relational Database Model

 Data represented as a set of related tables or relations 

 Relations: 

 A named, two-dimensional table of data. Each relation consists of a set of

named columns and an arbitrary number of unnamed rows 

 Properties 

 Entries in cells are simple 

 Entries in columns are from the same set of values 

 Each row is unique 

 The sequence of columns can be interchanged without changing the

meaning or use of the relation 

 The rows may be interchanged or stored in any sequence

 Well-Structured Relation

 A relation that contains a minimum amount of redundancy and allows

users to insert, modify and delete the rows without errors or

inconsistencies

A simple ER Diagram

          In the following diagram we have two entities, Student and College, and their relationship. The relationship between Student and College is many to one, as a college can have many students; however, a student cannot study in multiple colleges at the same time. Student entities have attributes such as Stu_Id, Stu_Name & Stu_Addr, and College entities have attributes such as Col_ID & Col_Name.

          Here are the geometric shapes and their meaning in an E-R Diagram. We

will discuss these terms in detail in the next section (Components of an ER

Diagram) of this guide so don’t worry too much about these terms now, just go

through them once.

 Rectangle: Represents Entity sets. 

 Ellipses: Attributes 

 Diamonds: Relationship Set 

 Lines: They link attributes to Entity Sets and Entity sets to Relationship

Set 

 Double Ellipses: Multivalued Attributes 

 Dashed Ellipses: Derived Attributes 

 Double Rectangles: Weak Entity Sets 

 Double Lines: Total participation of an entity in a relationship set


As  shown in the above diagram, an ER diagram has three main


components: 

A. Entity

B. Attribute 

C. Relationship

Conversion of ER Diagram to Relational model

A.  Entity 

 An entity is an object or component of data. 

 An entity is represented as a rectangle in an ER diagram. 

 Is an object that can exist ( a single thing, person, object, place)

 Set is a group of similar entities and these entities can have attributes

For example:  In the following ER diagram we have two entities Student and

College, and these two entities have a many-to-one relationship, as many students

study in a single college. We will read more about relationships later, for now

focus on entities.


Mapping strong entity (2 cases)

           For each strong entity set, create a new independent relational table that includes all of its attributes as columns. For composite attributes, include only the component attributes. There are two cases:

1. Case: For Strong Entity Set with Only Simple Attributes

 A strong entity set with only simple attributes will require only one table in

the relational model. 

 Attributes of the table will be the attributes of the entity set. The primary

key of the table will be the key attribute of the entity set.

2. Case: For Strong Entity Set With Composite Attributes


 A strong entity set with any number of composite attributes will require

only one table in relational model.

 During conversion, the simple (component) attributes of each composite attribute are taken into account, not the composite attribute itself.
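A brief SQL sketch of both cases, using a hypothetical STUDENT entity (names and data types are assumptions): the key and simple attributes become columns, and a composite Address attribute contributes only its components.

CREATE TABLE Student (
    StudentID INTEGER PRIMARY KEY,  -- key attribute of the entity set
    Name      VARCHAR(50),          -- simple attribute
    Street    VARCHAR(60),          -- components of the composite attribute Address
    City      VARCHAR(40),
    ZipCode   CHAR(10)
);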


Mapping weak entity

 Convert every weak entity set into a table: take the discriminator attribute of the weak entity set, take the primary key of the owning strong entity set as a foreign key, and then declare the combination of the discriminator attribute and the foreign key as the primary key.

 Weak entity set always appears in association with identifying

relationships with total participation constraint.

 Weak entities are represented with double rectangular box in the ER

Diagram and the identifying relationships are represented with double

diamond. Partial Key attributes are represented with dotted lines. 

 Weak entities cannot be identified by the values of their own attributes alone

 There is no primary key made from its own attributes 

 An entity can be identified by a combination of their attributes

(“discriminator”) and the relationship they have with another entity set

(“identifying relationship”)


 A weak entity is a type of entity which doesn't have its key attribute. It can

be identified uniquely by considering the primary key of another entity. For

that, weak entity sets need to have total participation in an identifying relationship.
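As a hedged illustration (the EMPLOYEE and DEPENDENT names are hypothetical), the weak entity's table takes the owner's primary key as a foreign key and combines it with the discriminator to form its own primary key:

CREATE TABLE Employee (
    EmpID INTEGER PRIMARY KEY,
    Name  VARCHAR(50)
);

CREATE TABLE Dependent (
    EmpID         INTEGER REFERENCES Employee(EmpID),  -- owner's key taken as a foreign key
    DependentName VARCHAR(50),                         -- discriminator (partial key)
    BirthDate     DATE,
    PRIMARY KEY (EmpID, DependentName)
);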

B. Attribute

 An attribute describes the property of an entity. 

 The information about the entity that needs to be stored

 An attribute is represented as Oval in an ER diagram. 

There are four types of attributes:

1. Key attribute

2. Composite attribute

3. Multivalued attribute

4. Derived attribute

1. Key attributes

 A key attribute can uniquely identify an entity from an entity set.

 Used to establish relationships between the different tables and columns

of a relational database. 

 a set of attributes that help to uniquely identify a tuple (or row) in a

relation (or table). 

For example: Student roll numbers can uniquely identify a student from a set of students. A key attribute is represented by an oval, the same as other attributes; however, the text of a key attribute is underlined.

2. Composite attribute 

 An attribute that is a combination of other attributes is known as a

composite attribute. 

 is an attribute where the values of that attribute can be further subdivided

into meaningful sub-parts

 There are values that are to be stored in an attribute that can be further

divided into meaningful values (sub-values).


For example: In student entities, the student address is a composite attribute, as an address is composed of other attributes such as pin code, state, and country.

3. Multivalued attribute

 An attribute that can hold multiple values is known as a multivalued

attribute. It is represented with double ovals in an ER Diagram. 

 For every multi-valued attribute, we will make a new table where we will

take the primary key of the main table as a foreign key and multi-valued

attribute as a primary key.

 A strong entity set with any number of multi valued attributes will require

two tables in relational model.

 One table will contain all the simple attributes with the primary key.

 Other table will contain the primary key and all the multi valued attributes.

For example: A person can have more than one phone number, so the phone number attribute is multivalued.
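A minimal sketch of the two resulting tables (the PERSON names and data types are assumptions):

CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    Name     VARCHAR(50)
);

CREATE TABLE PersonPhone (
    PersonID    INTEGER REFERENCES Person(PersonID),  -- primary key of the main table as a foreign key
    PhoneNumber VARCHAR(20),                          -- the multivalued attribute
    PRIMARY KEY (PersonID, PhoneNumber)
);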

4. Derived attribute

 A derived attribute is one whose value is dynamic and derived from

another attribute. It is represented by a dashed oval in an ER Diagram. 


  an attribute whose values are calculated from other attributes.

 are the attributes that do not exist in the physical database,

For example: A person's age is a derived attribute, as it changes over time and can be derived from another attribute (date of birth).

C. Relationship

Cardinality: Defines the numerical attributes of the relationship between two

entities or entity sets. 

           A relationship is represented by a diamond shape in ER diagram; it shows

the relationship among entities. There are four types of cardinal relationships: 

1. One to One

2. One to Many

3. Many to One

4. Many to Many

1. One-to-one relationships

 In a one-to-one relationship, one record in a table is associated with one

and only one record in another table. 

 When a single instance of an entity is associated with a single instance of

another entity then it is called one to one relationship.


For example 1

 In a school database, each student has only one student ID, and each

student ID is assigned to only one person.

 In this example, the key field in each table, Student ID, is designed to

contain unique values. In the Students table, the Student ID field is the

primary key; in the Contact Info table, the Student ID field is a foreign key.

 This relationship returns related records when the value in the Student ID

field in the Contact Info table is the same as the Student ID field in the

Students table.

Example 2: An employee can work in at most one department, and a department

can have at most one employee.


For example 3: a person has only one passport and a passport is given to one

person.

2. One-to-many relationship

 When a single instance of an entity is associated with more than one instance of another entity, it is called a one-to-many relationship.

 In a one-to-many relationship, one record in a table can be associated

with one or more records in another table. 

For example: each customer can have many sales orders. A customer can

place many orders but an order cannot be placed by many customers.

3. Many-to-one relationship

 When more than one instance of an entity is associated with a single instance of another entity, it is called a many-to-one relationship.

For example: many students can study in a single college but a student cannot

study in many colleges at the same time.

4. Many-to-many relationship

 When more than one instance of an entity is associated with more than one instance of another entity, it is called a many-to-many relationship.

 A many-to-many relationship occurs when multiple records in a table are

associated with multiple records in another table. 

 A many-to-many relationship exists between customers and products:

customers can purchase various products, and products can be

purchased by many

customers.

For example 1: a student can be assigned to many projects and a project can be assigned to many students.
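A hedged SQL sketch of this M:N relationship (table and column names are assumptions): the relationship itself becomes an associative (junction) table whose primary key combines the keys of the two entities.

CREATE TABLE Student (
    StudentID INTEGER PRIMARY KEY,
    Name      VARCHAR(50)
);

CREATE TABLE Project (
    ProjectID INTEGER PRIMARY KEY,
    Title     VARCHAR(60)
);

CREATE TABLE StudentProject (
    StudentID INTEGER REFERENCES Student(StudentID),
    ProjectID INTEGER REFERENCES Project(ProjectID),
    PRIMARY KEY (StudentID, ProjectID)   -- each student/project pairing recorded once
);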

6. Create relational tables that incorporate entity integrity and referential

integrity constraints.


Constraints

   Are the rules enforced on the data columns of a table

   Here is used to limit the type of data that can go into a table. 

   May apply to each attribute or they may apply to relationships between

tables

 This ensures the accuracy and reliability of the data in the database.

 Constraints could be either on a column level or a table level. The column

level constraints are applied only to one column, whereas the table level

constraints are applied to the whole table.

Integrity constraints

 A set of rules that the database is not permitted to violate.

 Ensure that changes (update, deletion, insertion) made to the database by

authorized users do not result in a loss of data consistency.

 Integrity constraints guard against accidental damage to the database.

 An important functionality of DBMS

Example: A blood type group must be A, B, AB or O only cannot have any other

values.

Types of integrity constraints:

1. Entity integrity 

 Focuses on Primary keys.


 Each table should have a primary key and each record must be unique

and not null.

 This makes sure that records in a table are not duplicated and remain

intact during insert, update and retrieval.

 Describes a condition in which all tuples within a table are uniquely

identified by their primary key. The unique value requirement prohibits a null

primary key value, because nulls are not unique.

 To ensure entity integrity, it is required that every table has a primary key.

Neither the PK nor any part of it can contain null values. This is because

null values for the primary key mean we cannot identify some rows. 


2. Referential integrity 

 Focuses on foreign keys.

 Specified between two tables

 Null  


 Is the total absence of a value in a certain field and means that the

field value is unknown

 Null is not the same as a zero value for a numerical field or space

value

 Implies that a database field value has not been stored

 Foreign keys are designed to keep relationships between records of a

table to records of another table.

 Referential integrity requires that a foreign key must have a matching

primary key or it must be null. This constraint is specified between two tables

(parent and child); it maintains the correspondence between rows in these

tables.  It means the reference from a row in one table to another table must

be valid.

 Referential integrity can be enforced by working with primary and foreign

keys. Each foreign key must have a matching primary key so that reference

from one table to another must always be valid.

Example 1 

Rule 1: You can’t delete from a primary table if matching records exist in a

related table.

Rule 2: You can’t change a primary key value in the primary table if that record has related records.


Example 2

Rule 3: You can’t insert a value in the foreign key field of the related table that doesn’t exist in the primary key of the primary table.
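A hedged sketch of tables that enforce these rules (CUSTOMER and ORDERS are hypothetical names; the exact ON DELETE/ON UPDATE syntax varies slightly by DBMS):

CREATE TABLE Customer (
    CustomerID INTEGER NOT NULL PRIMARY KEY,   -- entity integrity: non-null, unique key
    Name       VARCHAR(50)
);

CREATE TABLE Orders (
    OrderID    INTEGER NOT NULL PRIMARY KEY,
    OrderDate  DATE,
    CustomerID INTEGER,                        -- must match a Customer row or be null
    FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID)
        ON DELETE RESTRICT                     -- Rule 1: a referenced customer cannot be deleted
        ON UPDATE RESTRICT                     -- Rule 2: a referenced key value cannot be changed
);

An INSERT into Orders with a CustomerID that does not exist in Customer would be rejected by the foreign key constraint itself, which corresponds to Rule 3.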


Key Terms


Alternate key: all candidate keys not chosen as the primary key

Candidate key: a simple or composite key that is unique (no two rows in a table may have the same value) and minimal (every column is necessary)

Characteristic entities: entities that provide more information about another

table

Composite attributes: attributes that consist of a hierarchy of attributes

Composite key: composed of two or more attributes, but it must be minimal

Dependent entities: these entities depend on other tables for their meaning

Derived attributes: attributes that contain values calculated from other attributes

Derived entities: see dependent entities

EID: employee identification (ID)

Entity: a thing or object in the real world with an independent existence that can

be differentiated from other objects

Entity relationship (ER) data model: also called an ER schema, are

represented by ER diagrams. These are well suited to data modeling for use with

databases.

Entity relationship schema: see entity relationship data model

Entity set: a collection of entities of an entity type at a point of time

Entity type: a collection of similar entities

Foreign key (FK): an attribute in a table that references the primary key in

another table OR it can be null

Independent entity: as the building blocks of a database, these entities are what

other tables are based on


Kernel: see independent entity

Key: an attribute or group of attributes whose values can be used to uniquely

identify an individual entity in an entity set

Multivalued attributes: attributes that have a set of values for each entity

N-ary: multiple tables in a relationship

Null: a special symbol, independent of data type, which means either unknown

or inapplicable; it does not mean zero or blank

Recursive relationship: see unary relationship

Relationships: the associations or interactions between entities; used to

connect related information between tables

Relationship strength:  based on how the primary key of a related entity is

defined

Secondary key: an attribute used strictly for retrieval purposes

Simple attributes: drawn from the atomic value domains

SIN: social insurance number

Single-valued attributes: see simple attributes

Stored attribute: saved physically to the database

Ternary relationship: a relationship type that involves many to many

relationships between three tables.


CHAPTER 5:

PHYSICAL DATABASE DESIGN AND


PERFORMANCE


Researched and presented by:

Ramada, Julie Mae 


Beñeras, Jhasper M.

PHYSICAL DATABASE DESIGN

The physical design of your database optimizes performance while ensuring data

integrity by avoiding unnecessary data redundancies. The task of building the

physical design is a job that truly never ends. You need to continually monitor the

performance and data integrity as time passes. Many factors necessitate periodic

refinements to the physical design.


Physical database design does not include implementing files and databases

(i.e., creating them and loading data into them). Physical database design

produces the technical specifications that programmers, database administrators,

and others involved in information systems construction will use during the

implementation phase.

Purpose––translate the logical description of data into the technical specifications for storing and retrieving data.

Goal––create a design for storing data that will provide adequate performance and ensure database integrity, security and recoverability.

Because physical design is related to how data are physically stored, we need to

consider a few underlying concepts about physical storage. One goal of physical

design is optimal performance and storage space utilization. Physical design

includes data structures and 

file organization, keeping in mind that the database software will communicate

with your computer’s operating system.

PHYSICAL DESIGN PROCESS

Designing physical files and databases requires certain information that

should have been collected and produced during prior systems development

phases. The information needed for physical file and database design includes

these requirements:


• Normalized relations, including estimates for the range of the

number of rows in each table

• Definitions of each attribute, along with physical specifications such

as maximum possible length

• Descriptions of where and when data are used in various ways

(entered, retrieved, deleted, and updated, including typical frequencies of

these events)

• Expectations or requirements for response time and data security,

backup, recovery, retention, and integrity

• Descriptions of the technologies (database management systems)

used for implementing the database.

Physical database design requires several critical decisions that will affect

the integrity and performance of the application system. These key decisions

include the following:

• Choosing the storage format (called data type) for each attribute

from the logical data model. The format and associated parameters are chosen

to maximize data 

integrity and to minimize storage space.

• Giving the database management system guidance regarding how

to group attributes from the logical data model into physical records. You will

discover that 

although the columns of a relational table as specified in the logical design are a 

natural definition for the contents of a physical record, this does not always form 

the foundation for the most desirable grouping of attributes in the physical

design.

• Giving the database management system guidance regarding how

to arrange similarly structured records in secondary memory (primarily hard

disks), using 

a structure (called a file organization) so that individual and groups of records

can 

be stored, retrieved, and updated rapidly. Consideration must also be given to 

protecting data and recovering data if errors are found.

• Selecting structures (including indexes and the overall database

architecture) for storing and connecting files to make retrieving related data more

efficient.

• Preparing strategies for handling queries against the database that

will optimize performance and take advantage of the file organizations and

indexes that you 

have specified. Efficient database structures will be beneficial only if queries 

and the database management systems that handle those queries are tuned to 

intelligently use those structures.


DATA VOLUME AND USAGE ANALYSIS

Data volume and frequency-of-use statistics are important inputs to the physical

database design process, particularly in the case of very largescale database

implementations. Thus, it is beneficial to maintain a good understanding of the

size and usage patterns of the database throughout its life cycle. 

Estimates of database size are used to select physical storage devices and

storage costs estimation and estimates of usage paths or pattern are used to

select file organization and access methods. Plans for the use of indexes, and

plan a strategy for database distribution.

Why do we need to estimate?

Data volume and usage estimation is crucial for the proper administration of

databases. As you all know, we need a storage space to store and maintain our

database. In order to make the proper storage size decision for our database we

need to estimate the data volume and usage.

What happens if we don't estimate?

The consequences of NOT estimating data volume and usage frequency is

severe. Think about an e-tailer (web-based retailer). Let's assume that the e-

tailer's management  chose a database storage space using the cost as the sole

criterion. Since the e-tailer wants to save bucks from the initial set-up costs, they

chose the smallest storage space available by the vendor. After a serious

advertising campaign using web and other media , they started their online


operations. Everything was going fine, until one day they found out that their web

site crashed due to data overload and a high level of usage frequency. Now the

company ended up having:

 upset customers, who are waiting for their orders (most probably the

customer would switch to another provider)

 a bill from the vendor in order to fix the issue (the bill of course includes

the additional storage space. Because, right now the company deems it

necessary to have the proper amount of database storage space)

 lost business because the web site is down

An easy way to show the statistics about data volumes and usage is by adding

notation to the EER diagram that represents the final set of normalized relations

from logical database design. 


Figure 5-1 shows the EER diagram (without attributes) for  a simple inventory

database for Pine Valley Furniture Company. This EER diagram represents the

normalized relations constructed during logical database design for the original

conceptual data model of this situation depicted in Figure 3-5b.

Both data volume and access frequencies are shown in Figure 5-1. For

example, 

there are 3,000 PARTs in this database. The supertype PART has two subtypes, 

MANUFACTURED (40 percent of all PARTs are manufactured) and

PURCHASED (70  percent are purchased; because some PARTs are of both

subtypes, the percentages sum to more than 100 percent). The analysts at Pine

Valley estimate that there are typically 150 SUPPLIERs, and Pine Valley


receives, on average, 40 SUPPLIES instances from each SUPPLIER, yielding a

total of 6,000 SUPPLIES. The dashed arrows represent access frequencies. So,

for example, across all applications that use this database, there are on average

20,000 accesses per hour of PART data, and these yield, based on subtype

percentages, 14,000 accesses per hour to PURCHASED PART data.

There are an additional 6,000 direct accesses to PURCHASED PART data. Of

this total of 20,000 accesses to PURCHASED PART, 8,000 accesses then also

require SUPPLIES data and of these 8,000 accesses to SUPPLIES, there are

7,000 subsequent accesses to SUPPLIER data. For online and Web-based

applications, usage maps should show the accesses per second. Several usage

maps may be needed to show vastly different usage patterns for different times

of day. Performance will also be affected by network specifications. The volume

and frequency statistics are generated during the systems analysis phase of the

systems development process when systems analysts are studying current and

proposed data processing and business activities. The data volume statistics

represent the size of the business and should be calculated assuming business

growth over a period of at least several years. The access frequencies are

estimated from the 

timing of events, transaction volumes, the number of concurrent users, and

reporting and querying activities. Because many databases support ad hoc

accesses, and such accesses may change significantly over time, and known

database access can peak and dip over a day, week, or month, the access


frequencies tend to be less certain and less even than the volume statistics.

Fortunately, precise numbers are not necessary. What is crucial is the relative

size of the numbers, which will suggest where the greatest attention needs to be

given during physical database design in order to achieve the best possible

performance. For example, in Figure 5-1, notice the following: 

• There are 3,000 PART instances, so if PART has many attributes and

some, like description, are quite long, then the efficient storage of PART might be

important.

  • For each of the 4,000 times per hour that SUPPLIES is accessed via

SUPPLIER, PURCHASED PART is also accessed; thus, the diagram would

suggest possibly combining these two co-accessed entities into a database table

(or file). This act of combining normalized tables is an example of

denormalization, which we discuss later in this chapter.

 • There is only a 10 percent overlap between MANUFACTURED and

PURCHASED parts, so it might make sense to have two separate tables for

these entities and redundantly store data for those parts that are both

manufactured and purchased; such planned redundancy is acceptable if

purposeful. Further, there are a total of 20,000 accesses an hour of

PURCHASED PART data (14,000 from access to 

PART and 6,000 independent access of PURCHASED PART) and only 8,000

accesses of MANUFACTURED PART per hour. Thus, it might make sense to

organize tables for MANUFACTURED and PURCHASED PART data differently


due to the significantly different access volumes. Such volume and frequency estimates can also be helpful input to subsequent physical database design steps.

DESIGNING FIELDS

A field is the smallest unit of application data recognized by system software,

such as 

a programming language or database management system. A field corresponds

to a 

simple attribute in the logical data model, and so in the case of a composite

attribute, a 

field represents a single component.

Basic Decisions in specifying a Field:

  Specification of the type of data used to represent values of the field

 Data integrity controls built into the database

 Describe the mechanisms that the DBMS should use to handle missing

values for the field. 

  Specify the Display Format

CHOOSING DATA TYPES


As a typical company’s amount of data has grown exponentially it’s become even

more critical to optimize data storage. The size of your data doesn’t just impact

storage size 

and costs, it also affects query performance. A key factor in determining the size

of your data is the data type you select.

Selecting a data type involves four objectives that will have different relative levels of importance for different applications:

1. Represent all possible values.

2. Improve data integrity.

3. Support all data manipulations.

4. Minimize storage space.

 If the data is numeric, favor SMALLINT, INTEGER, BIGINT, or DECIMAL

data types. DECFLOAT and FLOAT are also options for very large

numbers.

 If the data is character, use CHAR or VARCHAR data types.

 If the data is date and time, use DATE, TIME, and TIMESTAMP data

types.

 If the data is multimedia, use GRAPHIC, VARGRAPHIC, BLOB, CLOB, or

DBCLOB data types.
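A short sketch applying these guidelines to a hypothetical order table (column names, lengths, and precision are assumptions; the available types vary by DBMS):

CREATE TABLE CustomerOrder (
    OrderID     INTEGER       NOT NULL PRIMARY KEY,  -- whole numbers: INTEGER
    OrderDate   DATE,                                -- calendar date: DATE
    PlacedAt    TIMESTAMP,                           -- date and time: TIMESTAMP
    TotalAmount DECIMAL(10,2),                       -- exact money values: DECIMAL
    Comments    VARCHAR(255)                         -- variable-length text: VARCHAR
);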


CODING TECHNIQUES

Some attributes have a sparse set of values or are so large that, given data
volumes, considerable storage space will be consumed. A field with a limited
number of possible values can be translated into a code that requires less space.
Consider the example of the ProductFinish field illustrated in Figure 5-2. Products
at Pine Valley Furniture come in only a limited number of woods: Birch, Maple,
and Oak. By creating a code or translation table, each ProductFinish field value
can be replaced by a code, a cross-reference to the lookup table, similar to a
foreign key. This will decrease the amount of space for the ProductFinish field
and hence for the PRODUCT file. There will be additional space for the
PRODUCT FINISH lookup table, and when the ProductFinish field value is
needed, 


an extra access (called a join) to this lookup table will be required. If the

ProductFinish field is infrequently used or if the number of distinct ProductFinish

values is very large, the relative advantages of coding may outweigh the costs.

Note that the code table would not appear in the conceptual or logical model. The

code table is a physical construct to achieve data processing performance

improvements, not a set of data with business value.
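A hedged sketch of the coding idea (table and column names follow the ProductFinish discussion but are otherwise assumptions):

CREATE TABLE ProductFinish (
    FinishCode CHAR(1) PRIMARY KEY,     -- the compact code stored in PRODUCT
    FinishName VARCHAR(20)              -- e.g., Birch, Maple, Oak
);

CREATE TABLE Product (
    ProductID   INTEGER PRIMARY KEY,
    Description VARCHAR(60),
    FinishCode  CHAR(1) REFERENCES ProductFinish(FinishCode)
);

-- Retrieving the finish name requires the extra access (join) mentioned above:
SELECT p.ProductID, f.FinishName
FROM Product p
JOIN ProductFinish f ON p.FinishCode = f.FinishCode;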

CONTROLLING DATA INTEGRITY

Default Value - A default value is the value a field will assume unless a user

enters an explicit value for an instance of that field. Assigning a default value to a

field can reduce data entry time because entry of a value can be skipped. It can

also help to reduce data entry errors for the most common value.

Range control - A range control limits the set of permissible values a field may

assume. The range may be a numeric lower-to-upper bound or a set of specific

values. Range controls must be used with caution because the limits of the range

may change over time. A combination of range controls and coding led to the

year 2000 problem that many organizations faced, in which a field for year was

represented by only the numbers 00 to 99. It is better to implement any range

controls through a DBMS because range controls in applications may be

inconsistently enforced. It is also more difficult to find and change them in

applications than in a DBMS. 


Null value control - A null value was defined in Chapter 4 as an empty value.

Each primary key must have an integrity control that prohibits a null value. Any

other required field may 

also have a null value control placed on it if that is the policy of the organization. 

Referential integrity - The term referential integrity was defined in Chapter 4.

Referential integrity on a field is a form of range control in which the value of that

field must exist as the value in some field in another row of the same or (most

commonly) a different table. That is, the range of legitimate values comes from

the dynamic contents of a field in a database table, not from some pre-specified

set of values. Note that referential integrity only guarantees that some existing

cross-referencing value is used, not that it is the correct one. A coded field will

have referential integrity with the primary key of the associated lookup table.
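A hedged sketch combining the four controls on a hypothetical PRODUCT table, independent of the earlier example (names, sizes, and limits are assumptions):

CREATE TABLE ProductLine (
    ProductLineID INTEGER PRIMARY KEY,
    LineName      VARCHAR(30)
);

CREATE TABLE Product (
    ProductID     INTEGER NOT NULL PRIMARY KEY,       -- null value control on the key
    ProductName   VARCHAR(60) NOT NULL,               -- null value control on a required field
    StandardPrice DECIMAL(10,2) DEFAULT 0.00,         -- default value
    OnHand        INTEGER CHECK (OnHand >= 0),        -- range control
    ProductLineID INTEGER REFERENCES ProductLine(ProductLineID)  -- referential integrity
);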

HANDLING MISSING DATA

• Substitute an estimate of the missing value. For example, for a missing sales

value when computing monthly product sales, use a formula involving the mean

of the existing monthly sales values for that product indexed by total sales for

that month across all products. Such estimates must be marked so that users

know that these are not actual values (see the SQL sketch after this list).

• Track missing data so that special reports and other system elements cause

people to resolve unknown values quickly. This can be done by setting up a

trigger in the database definition. A trigger is a routine that will automatically


execute when some event occurs or time period passes. One trigger could log

the missing entry to a file when a null or other missing value is stored, and

another trigger could run periodically to create a report of the contents of this log

file. 

• Perform sensitivity testing so that missing data are ignored unless knowing a

value might significantly change results (e.g., if total monthly sales for a particular

salesperson are almost over a threshold that would make a difference in that

person’s compensation). This is the most complex of the methods mentioned and

hence requires the most sophisticated programming. Such routines for handling

missing data may be written in application programs. All relevant modern DBMSs

now have more sophisticated programming capabilities, such as case

expressions, user-defined functions, and triggers, so that such logic can be

available in the database for all users without application-specific programming.
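For the first strategy, a hedged example of substituting an estimate at query time (the ProductSales table and its columns are hypothetical): a missing monthly value is replaced by that product's average, and the row is marked so users know the value is estimated.

SELECT p.ProductID,
       p.SalesMonth,
       COALESCE(p.SalesAmount,
                (SELECT AVG(p2.SalesAmount)
                 FROM ProductSales p2
                 WHERE p2.ProductID = p.ProductID)) AS SalesAmountUsed,
       CASE WHEN p.SalesAmount IS NULL THEN 'estimated' ELSE 'actual' END AS ValueStatus
FROM ProductSales p;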

FILE ORGANIZATION

File organization refers to the way data is stored in a file. File organization is very

important because it determines the methods of access, efficiency, flexibility and storage

devices to use. 

Some factors to consider the file organization: 

a) Fast data retrieval 

b) High throughput for processing data input and maintenance transactions 

c) Efficient use of storage space 


d) Protection from failures or data loss 

e) Minimizing need for reorganization 

f) Accommodating growth 

g) Security from unauthorized use 

Types of file organization

1. Sequential 2) Indexed 3) Hashed

Sequential file organizations - In a sequential file organization, the records in the

file are stored in sequence according to a primary key value (see Figure 5-7a).

To locate a particular record, a program must normally scan the file from the

beginning until the desired record is located. A common example of a sequential

file is the alphabetical list of persons in the white pages of a telephone directory

(ignoring any index that may be included with the directory). 

Indexed file organizations - contains records ordered by a record key. A record

key uniquely identifies a record and determines the sequence in which it is

accessed with respect to other records.

Each record contains a field that contains the record key. A record key for a

record might be, for example, an employee number or an invoice number.

An indexed file can also use alternate indexes, that is, record keys that let you

access the file using a different logical arrangement of the records. For example,


you could access a file through employee department rather than through

employee number.

The possible record transmission (access) modes for indexed files are

sequential, random, or dynamic. When indexed files are read or written

sequentially, the sequence is that of the key values.


Hashed file organization - Hash file organization uses the computation of a hash function on some field of the records. The hash function's output determines the location of the disk block where the record is to be placed.

For example, suppose that an organization has a set of approximately 1,000

employee records to be stored on magnetic disk. A suitable prime number would

be 997, because it is close to 1,000. Now consider the record for employee

12,396. When we divide this number by 997, the remainder is 432. Thus, this

record is stored at location 432 in the file.

CLUSTER FILE ORGANIZATION

In this method, two or more tables that are frequently joined to get results are stored in the same file, called a cluster. These files will have two or more tables in the same data block, and the key columns which map these tables are stored only once. This method hence reduces the cost of searching for various records in different files. All the records are found in one place, hence making searches efficient.

DENORMALIZING AND PARTITIONING DATA

      Modern database management systems have an increasingly important role

in determining how the data are actually stored on the storage media. The

efficiency of database processing is, however, significantly affected by how the

logical relations are structured as database tables. The purpose of this section is

to discuss denormalization as a mechanism that is often used to improve efficient


processing of data and quick access to stored data. It first describes the best-

known denormalization approach: combining several logical tables into one

physical table to avoid the need to bring related data back together when they

are retrieved from the database. Then the section will discuss another form of

denormalization called partitioning, which also leads to differences between the

logical data model and the physical tables, but in this case one relation is

implemented as multiple tables.

    Denormalization is the process of transforming normalized relations into

nonnormalized physical record specifications. We will review various forms of,

reasons for, and cautions about denormalization in this section. In general,

denormalization may partition a relation into several physical records, may

combine attributes from several relations together into one physical record, or

may do a combination of both.


Denormalization is the process of adding precomputed redundant data to an

otherwise normalized relational database to improve read performance of the

database. Normalizing a database involves removing redundancy so only a

single copy exists of each piece of information. Denormalizing a database

requires that the data has first been

normalized. With denormalization, the database administrator selectively adds

back specific instances of redundant data after the data structure has been

normalized. A denormalized database should not be confused with a database

that has never been normalized. Using normalization in SQL, a database will

store different but related types of data in separate logical tables, called relations.


When a query combines data from multiple tables into a single result table, it is

called a join. The performance of such a join in the face of complex queries is

often the occasion for the administrator to explore the denormalization

alternative.

Another approach is to denormalize the logical data design. With care this can

achieve a similar improvement in query response, but at a cost—it is now the

database designer's responsibility to ensure that the denormalized database

does not become inconsistent. This is done by creating rules in the database

called constraints, that specify how the redundant copies of information must be

kept synchronized, which may easily make the de-normalization procedure

pointless. It is the increase in logical complexity of the database design and the

added complexity of the additional constraints that make this approach

hazardous. Moreover, constraints introduce a trade-off, speeding up reads

(SELECT in SQL) while slowing down writes 

(INSERT, UPDATE, and DELETE). This means a denormalized database under

heavy write load may offer worse performance than its functionally equivalent

normalized counterpart. In a traditional normalized database, we store data in

separate logical tables and attempt to minimize redundant data. We may strive to

have only one 

copy of each piece of data in database. For example, in a normalized database,

we might have a Courses table and a Teachers table. Each entry in Courses

would store the teacherID for a Course but not the teacherName. When we need


to retrieve a list of all Courses with the Teacher’s name, we would do a join

between these two tables. In some ways, this is great; if a teacher changes his or

her name, we only have to update the name in one place. The drawback is that if

tables are large, we may spend an unnecessarily long time doing joins on tables.

Denormalization, then, strikes a different compromise. Under denormalization,

we decide that we’re okay with some redundancy and some extra effort to update

the database in order to get the efficiency advantages of fewer joins.
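A hedged sketch of the two designs, assuming Courses(CourseID, CourseName, TeacherID) and Teachers(TeacherID, TeacherName) as described above:

-- Normalized design: the teacher's name is found with a join.
SELECT c.CourseName, t.TeacherName
FROM Courses c
JOIN Teachers t ON c.TeacherID = t.TeacherID;

-- Denormalized alternative: the name is copied into the course table,
-- so reads avoid the join but every name change must be applied here too.
CREATE TABLE CoursesDenormalized (
    CourseID    INTEGER PRIMARY KEY,
    CourseName  VARCHAR(60),
    TeacherID   INTEGER,
    TeacherName VARCHAR(50)   -- redundant copy of Teachers.TeacherName
);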

It is into this world of normalization with its order and useful arrangement of data

that the issue of denormalization is raised. Denormalization is the evaluated

introduction of instability into the stabilized (normalized) data structure.

If one went to such great lengths to arrange the data in normal form, why would

one change it? In order to improve performance is almost always the answer. In

the relational database environment, denormalization can mean fewer objects,

fewer joins, and faster access paths. These are all very valid reasons for

considering it. It is an 

evaluative decision however and should be based on the knowledge that the

normalized model shows no bias to either update or retrieval but gives advantage

to neither. Overall, denormalization should be justified and documented so future

additions to the database or increased data sharing can address the

denormalization issues. If necessary, the database might have to be


renormalized and then denormalized with new information.

The Reason for Denormalization

Only one valid reason exists for denormalizing a relational design – to enhance

performance. However, there are several indicators which will help to identify

systems and tables which are potential denormalization candidates. These are:

 ●Many critical queries and reports exist which rely upon data from more

than one table. Oftentimes these requests need to be processed in an online environment.

●Repeating groups exist which need to be processed in a group instead of

individually.

●Many calculations need to be applied to one or many columns before

queries can be successfully answered.

●Tables need to be accessed in different ways by different users during

the same timeframe.

●Many large primary keys exist which are clumsy to query and consume a

large amount of DASD when carried as foreign key columns in related

tables.

●Certain columns are queried a large percentage of the time. Consider

60% or greater to be a cautionary number flagging denormalization as an

option.

Advantages of Database denormalization:


 Increased query execution speed. As there is no need to use joins

between tables, it is possible to extract the necessary information from

one table, which automatically increases the speed of query execution.

Additionally, this solution saves memory.

 Writing queries is much easier. If the table is properly reorganized for the

most common needs, you can extract data from only one table and not

waste time looking for join keys. However, one should remember about

data redundancy and update the query accordingly.

 No need to obtain data from dictionary tables where the values are

constant over time. Tables with country dictionaries are good examples. If

a company operates in a fixed number of world markets, it seems

unnecessary to make continuous joins with the dictionary table with

countries. In this case, it is worth adding a column with the name of the

country to, for example, a sales table.

 Ability to add aggregate data, which can be used for more efficient

reporting. Certain statistics, such as the number of sales actions, average

sales, etc., are very necessary to analyze various areas of the company’s

operation. Therefore, it may be easier to define key statistics and include

them in one table than to retrieve them by joining multiple tables.

 Reduction of the number of tables in a relational database. In case of a

complex relational database architecture, obtaining data from the multiple

tables can be tricky. If the database is properly denormalized, the number


of these tables can be effectively reduced and, consequently, the

database architecture can be simplified.

Disadvantages of Database denormalization:

 Increased processing size. Due to data redundancy and possible data

duplication, the size of query processing increases.

 Increased table sizes. As a result of the denormalization of the database,

the table may significantly increase its size, which may be associated with

the load on the storage space.

 Increased costs of updating tables and inserts. In a table where data has

undergone redundancy due to the database denormalization, data update

may be a problem. For example, let’s assume that an additional column

that contains data about customer’s address has been added. Updating

this data can be burdensome and costly if the customer changes the

address. If the database is normalized, updating can only be done in the

dictionary table at a much lower cost. It is similar with inserts. Due to the

redundancy of data as a result of joining multiple tables, obtaining many

data for one table may be burdensome.

 Data may be inconsistent. Before executing the query, it is necessary to

get to know the table thoroughly and to take into account data duplication.

The query that will extract the necessary data without a risk of data

inconsistency should be comprehensively prepared.


Partitioning

A reserved part of a storage drive (hard disk, SSD) that is treated as a separate

drive. Even a single drive that takes all the storage space is assigned a partition.

For example, early Windows PCs came with the entire disk partitioned as drive

C:. New Windows PCs often come with the storage drive partitioned into C: and

D:. The main drive is C:and D: contains a recovery system in the event Windows

has to be re-installed. In addition, users may wish to have several drives for

organizational purposes, and utility programs come with every computer for

adding and modifying partitions. See primary partition, extended partition, basic

disk and dynamic disk.

On Microsoft operating systems, a hard disk is divided into drives. The first partition, called the primary partition, generally holds drive "C:", which is the active partition that boots the OS. Extended partitions can be added, such as "D:" and "E:"; these can contain more than one logical drive and are used for other storage such as programs, data files, CD-ROM, or USB drives.

A Unix-like OS such as Linux, and some older versions of Mac OS X, uses multiple partitions on a disk, including a secondary-storage area used for swapping or paging. This type of partition scheme allows directories defined by the Filesystem Hierarchy Standard (FHS), or home directories, to be assigned their own file systems. A typical Linux system has at least two partitions: one holding the file system that is mounted at "/" (the root directory) and a swap partition. Generally, an unlimited number of partitions can be created in a Linux OS. A Mac OS X system uses one partition for the whole file system; it uses a swap file within the file system instead of a swap partition.

The partitioning can be done by either building separate smaller databases (each

with its own tables, indices, and transaction logs), or by splitting selected

elements, for example just one table.

 Horizontal partitioning involves putting different rows into different


tables. For example, customers with ZIP codes less than 50000 are stored in
CustomersEast, while customers with ZIP codes greater than or equal to
50000 are stored in CustomersWest. The two partition tables are then
CustomersEast and CustomersWest, while a view with a union might be
created over both of them to provide a complete view of all customers (a sketch follows this list).
 Vertical partitioning involves creating tables with fewer columns and
using additional tables to store the remaining columns. Generally, this practice is known as normalization.[1] However, vertical partitioning extends


further and partitions columns even when already normalized. This type of
partitioning is also called "row splitting", since rows get split by their columns,
and might be performed explicitly or implicitly. Distinct physical machines
might be used to realize vertical partitioning: Storing infrequently used or very
wide columns, taking up a significant amount of memory, on a different
machine, for example, is a method of vertical partitioning. A common form of
vertical partitioning is to split static data from dynamic data, since the former
is faster to access than the latter, particularly for a table where the dynamic
data is not used as often as the static. Creating a view across the two newly
created tables restores the original table with a performance penalty, but
accessing the static data alone will show higher performance. A columnar
database can be regarded as a database that has been vertically partitioned
until each column is stored in its own table.
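As a rough sketch of the horizontal partitioning example above (the table and column definitions are assumed for illustration), the two partition tables can be presented as one logical table through a view:

CREATE TABLE CustomersEast (customer_id INT PRIMARY KEY, name VARCHAR(50), zip_code INT);
CREATE TABLE CustomersWest (customer_id INT PRIMARY KEY, name VARCHAR(50), zip_code INT);

-- A view with a union provides a complete view of all customers.
CREATE VIEW Customers AS
  SELECT * FROM CustomersEast   -- rows with zip_code < 50000
  UNION ALL
  SELECT * FROM CustomersWest;  -- rows with zip_code >= 50000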

Indexing makes columns faster to query by creating pointers to where data is

stored within a database.

Imagine you want to find a piece of information that is within a large database. To

get this information out of the database the computer will look through every row

until it finds it. If the data you are looking for is towards the very end, this query


would take a long time to run.

If the table was ordered alphabetically, searching for a name could happen a lot

faster because we could skip looking for the data in certain rows. If we wanted to

search for “Zack” and we know the data is in alphabetical order we could jump

down to halfway through the data to see if Zack comes before or after that row.

We could then half the remaining rows and make the same comparison.


An index is a structure that holds the field the index is sorting and a pointer from

each record to their corresponding record in the original table where the data is

actually stored. Indexes are used in things like a contact list where the data may

be physically stored in the order you add people’s contact information but it is

easier to find people when listed out in alphabetical order.

Let’s look at the index from the previous example and see how it maps back to

the original Friends table:

We can see here that the table has the data stored ordered by an incrementing id

based on the order in which the data was added. And the Index has the names

stored in alphabetical order.
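As a minimal sketch (the friends table and the index name here are hypothetical), the index described above could be created and used like this:

CREATE TABLE friends (
  id   INT PRIMARY KEY,       -- rows are stored in the order they were added
  name VARCHAR(50) NOT NULL
);

-- The index keeps the name values sorted, each with a pointer back to its row.
CREATE INDEX idx_friends_name ON friends (name);

-- A lookup by name can now use the index instead of scanning every row.
SELECT * FROM friends WHERE name = 'Zack';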

When to use Indexes?

Indexes are meant to speed up the performance of a database, so use indexing

whenever it significantly improves the performance of your database. As your



database becomes larger and larger, the more likely you are to see benefits from

indexing.

When not to use Indexes?

When data is written to the database, the original table (the clustered index) is

updated first and then all of the indexes off of that table are updated. Every time

a write is made to the database, the indexes are unusable until they have

been updated. If the database is constantly receiving writes, then the indexes will

never be usable. This is why indexes are typically applied to databases in data

warehouses that get new data updated on a scheduled basis (off-peak hours)

and not production databases which might be receiving new writes all the time.


CHAPTER 6:

INTRODUCTION TO SQL AND


ADVANCED SQL

Researched and presented by:

Briones, Joshua
Ramos, Eden Marie C.
Reyes, Ana Marie


Introduction and History of SQL

SQL - the most common language for relational systems.

SQL stands for Structured Query Language

 Initially called SEQUEL (Structured English Query Language) and based on IBM's earlier language called SQUARE (Specifying Queries As Relational Expressions). SEQUEL was later renamed to SQL by dropping the vowels, because SEQUEL was a trademark registered by the Hawker Siddeley aircraft company.

 A TABLE is also called a relation: a data set organized into rows and columns.

Pronounced “S-Q-L” by some and “sequel” by others

 SQL stands for Structured Query Language

 SQL lets you access and manipulate databases

 SQL became a standard of the American National Standards Institute

(ANSI) in 1986, and of the International Organization for Standardization

(ISO) in 1987


 SQL is a domain-specific language used in programming and designed for

managing data held in a relational database management system

(RDBMS), or for stream processing in a relational data stream

management system (RDSMS). It is particularly useful in handling

structured data, i.e. data incorporating relations among entities and

variables.

The first commercial DBMS that supported SQL was Oracle in 1979. Oracle is

now available in mainframe, client/server, and PC-based platforms for many

operating systems, including various UNIX, Linux, and Microsoft Windows

operating systems. IBM’s DB2, Informix, and Microsoft SQL Server are available

for this range of operating systems also.

The concepts of relational database technology were first articulated in 1970. Early IBM relational prototypes used a language called Sequel, developed at the San Jose IBM Research Laboratory.

Purposes of SQL

The following were the original purposes of the SQL standard:

1. To specify the syntax and semantics of SQL data definition and manipulation languages

2. To define the data structures and basic operations for designing, accessing, maintaining, controlling, and protecting an SQL database

3. To provide a vehicle for portability of database definition and application modules between conforming DBMSs

4. To specify both minimal (Level 1) and complete (Level 2) standards, which permit different degrees of adoption in products

5. To provide an initial standard, although incomplete, that will be enhanced later to include specifications for handling such topics as referential integrity, transaction management, user-defined functions, join operators beyond the equi-join, and national character sets

Advantages of SQL 

 Reduced training costs. Training in an organization can concentrate on one language. A large labor pool of IS professionals trained in a common language reduces retraining for newly hired employees.

 Productivity. IS professionals can learn SQL thoroughly and become proficient with it from continued use. An organization can afford to invest in tools to help IS professionals become more productive. Because they are familiar with the language in which programs are written, programmers can more quickly maintain existing programs.

 Application portability. Applications can be moved from one context to another when each environment uses SQL. Further, it is economical for the computer software industry to develop off-the-shelf application software when there is a standard language.

 Application longevity. A standard language tends to remain so for a long time; hence there will be little pressure to rewrite old applications. Rather, applications will simply be updated as the standard language is enhanced or new versions of DBMSs are introduced.

 Reduced dependence on a single vendor. When a nonproprietary language is used, it is easier to use different vendors for the DBMS, training and educational services, application software, and consulting assistance; further, the market for such vendors will be more competitive, which may lower prices and improve service.

 Cross-system communication. Different DBMSs and application programs can more easily communicate and cooperate in managing data and processing user programs.

Disadvantages of SQL

 A standard may be difficult to change (because so many vendors have a

vested interest in it), so fixing deficiencies may take considerable effort. 

 When a standard is extended with proprietary features, using special features added to SQL by a particular vendor may result in the loss of some advantages, such as application portability.

Data Table 

Writing single-table queries using SQL Commands. 

Name        Surname    Subject   Age   PassMark
Joshua      Briones    PCE007    16    1.00
Ana Marie   Reyes      PCE007    21    1.25
Eden        Ramos      PCE007    21    1.25
Mike        Antolino   PCE007    21    1.50

SELECT – used to get data from tables in a database. It is also one of the most

important commands in SQL. 
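For example, assuming the sample table above is stored as a table named student, a simple single-table query might look like this:

SELECT Name, Surname, PassMark
FROM student
WHERE Age = 21          -- keep only the 21-year-old students
ORDER BY PassMark;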

Elements of the SELECT Statement

The purpose of a SELECT statement is to query tables, apply some logical

manipulation, and return a result. In this section, I talk about the phases involved

in logical query processing. I describe the logical order in which the different

query clauses are processed, and what happens in each phase.

Note that by “logical query processing,” I’m referring to the conceptual way in

which standard SQL defines how a query should be processed and the final

result achieved. Don’t be alarmed if some logical processing phases that I

describe here seem inefficient. The Microsoft SQL Server engine doesn’t have to


follow logical query processing to the letter; rather, it is free to physically process

a query differently by rearranging processing phases, as long as the final result

would be the same as that dictated by logical query processing. SQL Server can

—and in fact, often does—make many shortcuts in the physical processing of a

query.

The FROM Clause

The FROM clause is the very first query clause that is logically processed. In this

clause, you specify the names of the tables that you want to query and table

operators that operate on those tables.

The WHERE Clause

In the WHERE clause, you specify a predicate or logical expression to filter the

rows returned by the FROM phase. Only rows for which the logical expression

evaluates to TRUE are returned by the WHERE phase to the subsequent logical

query processing phase. In the sample query in Listing 2-1, the WHERE phase

filters only orders placed by customer 71.
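The referenced listing is not reproduced here, but a minimal sketch in the same spirit, assuming an Orders table with a custid column, would be:

SELECT orderid, custid, orderdate
FROM Orders
WHERE custid = 71;   -- only rows for which the predicate evaluates to TRUE are returned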

Referential integrity means that a value in the matching column on the many

side must correspond to a value in the primary key for some row in the table on

the one side or be NULL.

Referential integrity is a property of data stating that all its references are valid.

In the context of relational databases, it requires that if a value of one attribute


(column) of a relation (table) references a value of another attribute (either in the

same or a different relation), then the referenced value must exist.[1]

For referential integrity to hold in a relational database, any column in a base

table that is declared a foreign key can only contain either null values or values

from a parent table's primary key or a candidate key.[2] In other words, when a

foreign key value is used it must reference a valid, existing primary key in the

parent table. For instance, deleting a record that contains a value referred to by a

foreign key in another table would break referential integrity. Some relational

database management systems (RDBMS) can enforce referential integrity,

normally either by deleting the foreign key rows as well to maintain integrity, or by

returning an error and not performing the delete. Which method is used may be

determined by a referential integrity constraint defined in a data dictionary.

The adjective 'referential' describes the action that a foreign key performs,

'referring' to a linked column in another table. In simple terms, 'referential

integrity' guarantees that the target 'referred' to will be found. A lack of referential

integrity in a database can lead relational databases to return incomplete data,

usually with no indication of an error.
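As a minimal sketch of how an RDBMS is told to enforce referential integrity (the table and column names are illustrative only):

CREATE TABLE customer (
  customer_id INT PRIMARY KEY,
  name        VARCHAR(50) NOT NULL
);

CREATE TABLE account (
  account_id  INT PRIMARY KEY,
  customer_id INT,
  -- every non-NULL customer_id in account must match an existing customer
  FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);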

DISCUSS SQL:1999 and SQL:2016 STANDARDS

SQL:1999 (also called SQL 3) was the fourth revision of the SQL standard.

Starting with this version, the standard name used a colon instead of a hyphen to


be consistent with the names of other ISO standards. This standard was

published in multiple installments between 1999 and 2002.

The first installment of SQL:1999 had five parts:

 Part 1: SQL/Framework (100 pages) defined the fundamental concepts of

SQL.

 Part 2: SQL/Foundation (1050 pages) defined the fundamental syntax and

operations of SQL: types, schemas, tables, views, query and update

statements, expressions, and so forth. This part is the most important for

regular SQL users.

 Part 3: SQL/CLI (Call Level Interface) (514 pages) defined an application

programming interface for SQL.

 Part 4: SQL/PSM (Persistent Stored Modules) (193 pages) defined

extensions that make SQL procedural.

 Part 5: SQL/Bindings (270 pages) defined methods for embedding SQL

statements in application programs written in a standard programming

language.

Three more parts, also considered part of SQL:1999, were published later.

SQL:1999 introduced many important features that are part of modern SQL.

Among the most important were common table expressions (CTEs). This is a very useful feature that lets you organize long and complex SQL queries and make them more readable. When the WITH [RECURSIVE] syntax is used, CTEs can also recursively process hierarchical data.

SQL:1999 also introduced OLAP (Online Analytical Processing) capabilities,

which includes features that are helpful when preparing business reports.

The GROUP BY extensions ROLLUP, CUBE, and GROUPING SETS entered

the standard at this time. 

Some minor additions in SQL:1999 standard include using expressions in

ORDER BY, the inclusion of data types for large binary objects (LOB and CLOB),

and the introduction of triggers.
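A minimal sketch of a recursive CTE, assuming a hypothetical employees table with employee_id and manager_id columns (per the standard syntax; some systems omit the RECURSIVE keyword):

WITH RECURSIVE subordinates AS (
  SELECT employee_id, manager_id
  FROM employees
  WHERE manager_id = 1            -- direct reports of employee 1
  UNION ALL
  SELECT e.employee_id, e.manager_id
  FROM employees e
  JOIN subordinates s ON e.manager_id = s.employee_id
)
SELECT * FROM subordinates;       -- the whole reporting hierarchy under employee 1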

The size of the SQL standard grew significantly between 1992 and 1999. The

SQL-92 standard had almost 600 pages, but it was still accessible to regular SQL

users. Books like A Guide to the SQL Standard by Christopher Date and Hugh

Darwen discussed and explained the SQL-92 standard.

Starting with SQL:1999 the standard – now over 2,000 pages – was no longer

accessible to regular SQL users. It has become a resource for database experts

and database vendors. The standard guides the development of SQL in major

databases; it shows which new language features are worth implementing to stay

current. It also standardizes the syntax of new SQL features, making sure that

major databases implement them in a similar way, using similar syntax and

semantics.


The change in the role of the SQL standard is emphasized by the fact that there

is no longer an official body that certifies compliance with the standard. Until

1996, the National Institute of Standards and Technology (NIST) data

management standards program certified SQL DBMS compliance with the SQL

standard. Now, vendors self-certify the compliance of their products.

SQL:2003 and beyond

In the 21st century, the SQL standard has been regularly updated.

The SQL:2003 standard was published on March 1, 2004. Its major addition

was window functions, a powerful analytical feature that allows you to compute summary statistics without collapsing rows. Window functions significantly increased the expressive power of SQL. They are extremely useful in preparing all kinds of business reports, analyzing time series data, and analyzing trends. The addition of window functions to the standard coincided with the popularity of OLAP and data warehouses. People started using databases to make data-driven business decisions. This trend is only gaining momentum, thanks to the growing amount of data that all businesses collect. SQL:2003 also introduced XML-related functions, sequence generators, and identity columns.
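A minimal sketch of a window function query, assuming a hypothetical sales table with region and amount columns:

SELECT region,
       amount,
       AVG(amount) OVER (PARTITION BY region)  AS avg_in_region,
       RANK()      OVER (ORDER BY amount DESC) AS overall_rank
FROM sales;   -- summary statistics are computed without collapsing the detail rows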


After 2004, there were no major ground-breaking additions to the language. The

changes in the SQL standard reflected the changes in technology at the time.

SQL:2003 introduced XML-related functions to allow for interoperability between

databases and XML technologies, which were the hot new thing in the early

2000s. SQL:2006 further specified how to use SQL with XML. It was not a

revision of the complete SQL standard, just Part 14, which deals with SQL-XML

interoperability.

The next revisions of the standard brought minor enhancements to the

language. SQL:2008 legalized the use of ORDER BY outside cursor

definitions(!), and added INSTEAD OF triggers, the TRUNCATE statement, and

the FETCH clause. SQL:2011 added temporal data and some enhancements to

window functions and the FETCH clause.

SQL:2016 added row pattern matching and polymorphic table functions as well

as long-awaited JSON support. In the 2010s, JSON replaced XML as the

common data exchange format; modern Internet applications use JSON instead

of XML as their data format. The emerging NoSQL movement also popularized

JSON; document databases store JSON files, and key-value stores are

compatible with the JSON format. The SQL standard added JSON support to

allow for interoperability with modern applications and new types of databases.

The current SQL standard is SQL:2019. It added Part 15, which defines

multidimensional array support in SQL.


SQL:2016 or ISO/IEC 9075:2016 (under the general title "Information technology

– Database languages – SQL") is the eighth revision of the ISO (1987)

and ANSI (1986) standard for the SQL database query language. It was formally

adopted in December 2016. The standard consists of nine parts.

SQL:2016 New features

SQL:2016 introduced 44 new optional features. 22 of them belong to the JSON

functionality, ten more are related to polymorphic table functions. The additions

to the standard include:

 JSON: Functions to create JSON documents, to access parts of JSON

documents and to check whether a string contains valid JSON data

 Row Pattern Recognition: Matching a sequence of rows against a regular

expression pattern

 Date and time formatting and parsing

 LISTAGG: A function to transform values from a group of rows into a

delimited string

 Polymorphic table functions: table functions without predefined return type

 New data type DECFLOAT

Advanced SQL

Processing Multiple Tables

Now that we have explored some of the possibilities for working with a single

table, it’s time to bring out the light sabers, jet packs, and tools for heavy lifting:


We will work with multiple tables simultaneously. The power of RDBMSs is

realized when working with multiple tables. When relationships exist among

tables, the tables can be linked together in queries. Remember from the previous

chapter that these relationships are established by including a common

column(s) in each table where a relationship is needed. In most cases this is

accomplished by setting up a primary key—foreign key relationship, where the

foreign key in one table references the primary key in another, and the values in

both come from a common domain. We can use these columns to establish a link

between two tables by finding common values in the columns. 

The linking of related tables varies among different types of relational systems. In

SQL, the WHERE clause of the SELECT command is also used for multiple-table

operations. In fact, SELECT can include references to two, three, or more tables

in the same command. As illustrated next, SQL has two ways to use SELECT for

combining data from related tables.

The most frequently used relational operation, which brings together data from

two or more related tables into one resultant table, is called a join. Originally,

SQL specified a join implicitly by referring in a WHERE clause to the matching of

common columns over which tables were joined. Since SQL-92, joins may also

be specified in the FROM clause. In either case, two tables may be joined when

each contains a column that shares a common domain with the other. As

mentioned previously, a primary key from one table and a foreign key that

references the table with the primary key will share a common domain and are


frequently used to establish a join. In special cases, joins will be established

using columns that share a common domain but not the primary-foreign key

relationship, and that also works (e.g., we might join customers and salespersons

based on common postal codes, for which there is no relationship in the data

model for the database). The result of a join operation is a single table. Selected

columns from all the tables are included. Each row returned contains data from

rows in the different input tables where values for the common columns match.

What Is an SQL JOIN?

In other guides, you have learned how to write basic SQL queries to retrieve data

from a table. In real-life applications, you would need to fetch data from multiple

tables to achieve your goals. To do so, you would need to use SQL joins. In this

guide, you will learn how to query data from multiple tables using joins.

A JOIN clause is used when you need to combine data from two or more tables

into one data set. Records from both tables are matched based on a condition

(also called a JOIN predicate) you specify in the JOIN clause. If the condition is

met, the records are included in the output. The following explanation of the SQL JOIN concept and the different JOIN types, with examples, is adapted from an article on learnsql.com. So, before we go any further, let's take a look at the tables that we are going to use.

Get to Know the Database


We are going to use tables from a fictional bank database. The first table

is called account and it contains data related to customer bank accounts:

account_id   overdraft_amount   customer_id   type_id   segment
2556889      12000              4             2         RET
1323598795   1550               1             1         RET
2225546      5000               5             2         RET
5516229      6000               4             5         RET
5356222      7500               5             5         RET
2221889      5400               1             2         RET
2455688      12500              50            2         CORP
1322488656   2500               51            1         CORP
1323598795   3100               52            1         CORP
1323111595   1220               53            1         CORP

account table

This table contains 10 records (10 accounts) and five columns:

 account_id – Uniquely identifies each account.

 overdraft_amount – The overdraft limit for each account.

 customer_id – Uniquely identifies each customer.

 type_id – Identifies the type of that account.

 segment – Contains the values ‘RET’ (for retail clients) and ‘CORP’ (for

corporate clients).

The second table is called customer and contains customer-related data:


customer_id   name    lastname   gender   marital_status
1             MARC    TESCO      M        Y
2             ANNA    MARTIN     F        N
3             EMMA    JOHNSON    F        Y
4             DARIO   PENTAL     M        N
5             ELENA   SIMSON     F        N
6             TIM     ROBITH     M        N
7             MILA    MORRIS     F        N
8             JENNY   DWARTH     F        Y

customer table

This table contains eight records and five columns:

 customer_id – Uniquely identifies each customer.

 name – The customer’s first name.

 lastname – The customer’s last name.

 gender– The customer’s gender (M or F).

 marital_status – If the customer is married (Y or N).

Now that we have these two tables, we can combine them to display additional

results related to customer or account data. JOIN can help us to get answers to

questions like:

1. Who owns each account in the account table?

2. How many accounts does Marc Tesco have?

3. How many accounts are owned by a female customer?

4. What is the total overdraft amount for all of Emma Johnson’s accounts?

To answer each of these questions, we need to combine two tables

(account and customer) using a column that appears in both tables (in this


case, customer_id). Once we merge the two tables, we will have account and

customer information in a single output.

Keep in mind that in the account table we have some customers that can’t be

found in the customer table. (Info about corporate clients is stored somewhere

else.) Also, keep in mind that some customer IDs are not present in

the account table; some customers don't have accounts.

There are several ways we can combine two tables. Or, put another way, we can

say that there are several different SQL JOIN types.

SQL’s 4 JOIN Types

SQL JOIN types include:

 INNER JOIN (also known as a ‘simple’ JOIN). This is the most common

type of JOIN.

 LEFT JOIN (or LEFT OUTER JOIN)

 RIGHT JOIN (or RIGHT OUTER JOIN)

 FULL JOIN (or FULL OUTER JOIN)

 Self joins and cross joins are also possible in SQL

Let's dive deeper into the first four SQL JOIN types. I will use an example to

explain the logic and the syntax of each type. Sometimes people use Venn

diagrams when explaining SQL JOIN types. I’m not going to use them here, but if

that’s your thing then check out the article HOW TO LEARN SQL JOINS.


INNER JOIN

INNER JOIN is used to display matching records from both tables. This is also

called a simple JOIN; if you omit the INNER keyword (or any other keyword,

like LEFT, RIGHT, or FULL) and just use JOIN, this is the type of join you’ll get

by default.

There are usually two (or more) tables in a join statement. We call them the left

and right tables. The left table is in the FROM clause – and thus to the left of

the JOIN keyword. The right table is between the JOIN and ON keywords, or to

the right of the JOIN keyword.If the JOIN condition is met in an INNER JOIN, that

record is included in the data set. It can be from either table. If the record does

not match the criteria, it’s not included. The image below shows what would

happen if the color blue was the join criteria for the left and right tables:

Let's take a look at how INNER JOIN works in our example. I'm going to do a

simple JOIN on account and customer to

display account and customer information in one output:

SELECT account.*,
       customer.name,
       customer.lastname,
       customer.gender,
       customer.marital_status
FROM account
JOIN customer
ON account.customer_id = customer.customer_id

Here is a short explanation of what’s going on:

 I’m using JOIN because we are merging

the account and customer tables.


 The JOIN predicate here is defined by equality: account.customer_id =

customer.customer_id

In other words, records are matched by values in the customer_id column:


 Records that share the same customer ID value are matched. (They are

shown in color in the above image.) Records that don’t have a match in

either table (shown in gray) are not included in the result set.

 For records that have a match, all attributes from the account table are

displayed in the result set. The name, last name, gender, and marital

status attributes from the customer table are also displayed.

After running this code, SQL returns the following:

INNER JOIN result

As we mentioned earlier, only colored (matching) records were returned; all

others are discarded. In business terms, we displayed all the retail accounts with

detailed information about their owners. Non-retail accounts were not displayed

because their customer information is not stored in the customer table.

LEFT JOIN

Sometimes you’ll need to keep all records from the left table – even if some don't

have a match in the right table. In the last example, the gray rows were not

displayed in the output. Those are corporate accounts. In some cases, you may

want to have them in the data set, even if their customer data is left empty. If we

would like to return unpaired records from the left table, then we should write

a LEFT JOIN. Below, you can see that the LEFT JOIN returns everything in the

left table and matching rows in the right table.

Here is how the previous query would look if we used LEFT JOIN instead

of INNER JOIN:

SELECT account.*,
       customer.name,
       customer.lastname,
       customer.gender,
       customer.marital_status
FROM account
LEFT JOIN customer
ON account.customer_id = customer.customer_id

The syntax is identical. The result, however, is not the same. Now we can see

the corporate accounts (gray records) in the results:

Left join - account with customer

Notice how attributes like name, last name, gender, and marital status in the last

four rows are populated with NULLs. This is because these gray rows don’t have

matches in the customer table (i.e. customer_id values of 50, 51 ,52 , and 53

are not present in the customer table). Thus, those attributes have been left

NULL in this result.

RIGHT JOIN


Similar to LEFT JOIN, RIGHT JOIN keeps all records from the right table (even if

there is no matching record in the left table). Here’s that familiar image to show

you how it works:


Once again, we use the same example. However, we’ve replaced LEFT

JOIN with RIGHT JOIN:

SELECT account.account_id,
       account.overdraft_amount,
       account.type_id,
       account.segment,
       account.customer_id,
       customer.customer_id,
       customer.name,
       customer.lastname,
       customer.gender,
       customer.marital_status
FROM account
RIGHT JOIN customer
ON account.customer_id = customer.customer_id

The syntax is mostly the same. I’ve made one more small change: In addition

to account.customer_id, I’ve also added customer.customer_id column to the

result set. I did this to show you what happens to records from

the customer table that don't have a match on the left (account) table.


Here is the result:

RIGHT JOIN result

As you can see, all records from the right table have been included in the result

set. Keep in mind:

 Unmatched customer IDs from the right table (numbers 2,3, 6,7, and 8,

shown in gray) have their account attributes set to NULL in this result set.

They are retail customers that don’t have a bank account – and thus no

records in the account table.

 You might expect that the resulting table will have eight records because

that is the total number of records in the customer table. However, this is

not the case. We have 11 records because customer IDs 1, 4, and 5 each

have two accounts in the account table. All possible matches are

displayed.


FULL (OUTER) JOIN

I’ve shown you how to keep all records from the left or right tables. But what if

you want to keep all records from both tables? In our case, you’d want to display

all matching records plus all corporate accounts plus all customers without

accounts. To do this, you can use FULL OUTER JOIN. This JOIN type will pair all matching records and will also display all unmatched records from both tables. Missing attributes will be populated with NULLs. Have a look at the

image below:

Here is the FULL OUTER JOIN syntax:

SELECT account.*,
       CASE WHEN customer.customer_id IS NULL
            THEN account.customer_id
            ELSE customer.customer_id
       END AS customer_id,
       customer.name,
       customer.lastname,
       customer.gender,
       customer.marital_status
FROM account
FULL JOIN customer
ON account.customer_id = customer.customer_id;

Now the result looks like this:

Full outer join result

Notice how the last five rows have account attributes populated with NULLs. This

is because these customers do not have records in the account table. Notice

also how customers 50, 51, 52, and 53 have first or last names and other

attributes from the customer table populated with NULLs. This is because they


don't exist in the customer table. Here, customer_id in the result table is never

NULL because we defined customer_id with a CASE WHEN statement:

CASE WHEN customer.customer_id IS NULL
     THEN account.customer_id
     ELSE customer.customer_id
END AS customer_id

This actually means that customer_id in the result table is a combination

of account.customer_id and customer.customer_id (i.e. when one is NULL, use

the other one). We could also display both columns in the output, but this CASE

WHEN statement is more convenient.

Most Common Questions asked about Joins

Question 1: What is a Natural Join and in which situations is a natural join

used?

Solution:

A Natural Join is a join operation that gives you an output based on the columns common to both of the tables between which the join is implemented. To understand the situations in which a natural join is used, you need to understand the difference between a Natural Join and an Inner Join.

The main difference between the Natural Join and the Inner Join lies in the number of columns returned. Refer below for an example.


Now, if you apply INNER JOIN on these 2 tables, you will see an output as

below:

If you apply NATURAL JOIN, on the above two tables, the output will be as

below:

From the above example, you can clearly see that the number of columns returned by the Inner Join is greater than the number of columns returned by the Natural Join. So, if you wish to get an output with fewer columns, you can use a Natural Join.
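Since the example tables referred to above are not reproduced here, here is a minimal sketch assuming two hypothetical tables, employees(emp_id, name, dept_id) and departments(dept_id, dept_name):

-- INNER JOIN: the join column appears once for each table it comes from
SELECT e.name, e.dept_id, d.dept_id, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

-- NATURAL JOIN: the shared dept_id column appears only once in the output
SELECT name, dept_id, dept_name
FROM employees
NATURAL JOIN departments;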

Question 2: How to map many-to-many relationships using joins?

Solution:

To map many to many relationships using joins, you need to use two JOIN

statements.

For example, if we have three tables(Employees, Projects and Technologies),

and let us assume that each employee is working on a single project. So, one


project cannot be assigned to more than one employee. So, this is basically, a

one-to-many relationship.

Now, similarly, if you consider that, a project can be based on multiple

technologies, and any technology can be used in multiple projects, then this kind

of relationship is a many-to-many relationship.

To use joins for such relationships, you need to structure your database with 2

foreign keys. So, to do that, you have to create the following 3 tables:

 Projects

 Technologies

 projects_to_technologies

The projects_to_technologies table holds the combinations of project-technology

in every row. This table maps the items on the projects table to the items on the

technologies table so that multiple projects can be assigned to one or more

technologies.

Once the tables are created, use the following two JOIN statements to link all the

above tables together:

 projects_to_technologies to projects

 projects_to_technologies to technologies
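A minimal sketch of those two JOIN statements (the column names are assumed for illustration):

SELECT p.project_name,
       t.technology_name
FROM projects p
JOIN projects_to_technologies pt ON pt.project_id   = p.project_id      -- first JOIN
JOIN technologies t              ON t.technology_id = pt.technology_id; -- second JOIN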

Question 3: What is a Hash Join?

Solution:


Hash joins are also a type of joins which are used to join large tables or in an

instance where the user wants most of the joined table rows.

The Hash Join algorithm is a two-step algorithm. Refer below for the steps:

 Build phase: Create an in-memory hash index on the left side input

 Probe phase: Go through the right side input, each row at a time to find

the matches using the index created in the above step.

Question 4: What is Self & Cross Join?

Solution:

Self Join

SELF JOIN in other words is a join of a table to itself. This implies that each row

in a table is joined with itself.

Cross Join

The CROSS JOIN is a type of join in which a join clause is applied to each row of

a table to every row of the other table. Also, when the WHERE condition is used,

this type of JOIN behaves as an INNER JOIN, and when the WHERE condition is

not present, it behaves like a CARTESIAN product.
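As a minimal sketch (the employees table and its manager_id column are hypothetical; account and customer are the sample tables used earlier):

-- Self join: each employee row is paired with the row of its manager
SELECT e.name AS employee, m.name AS manager
FROM employees e
JOIN employees m ON e.manager_id = m.emp_id;

-- Cross join: every account row combined with every customer row (Cartesian product)
SELECT a.account_id, c.customer_id
FROM account a
CROSS JOIN customer c;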

Question 5: Can you JOIN 3 tables in SQL?

Solution:

Yes. To perform a JOIN operation on 3 tables, you need to use 2 JOIN

statements. You can refer to the second question for an understanding of how to

join 3 tables with an example.

What is subquery in SQL?


A subquery is a SQL query nested inside a larger query.

 A subquery may occur in :

o - A SELECT clause

o - A FROM clause

o - A WHERE clause

 The subquery can be nested inside a SELECT, INSERT, UPDATE, or

DELETE statement or inside another subquery.

 A subquery is usually added within the WHERE Clause of another SQL

SELECT statement.

 You can use the comparison operators, such as >, <, or =. The

comparison operator can also be a multiple-row operator, such as IN,

ANY, or ALL.

 A subquery is also called an inner query or inner select, while the

statement containing a subquery is also called an outer query or outer

select.

 The inner query executes first before its parent query so that the results of

an inner query can be passed to the outer query.

You can use a subquery in a SELECT, INSERT, DELETE, or UPDATE statement

to perform the following tasks:

 Compare an expression to the result of the query.

 Determine if an expression is included in the results of the query.


 Check whether the query selects any rows.

Syntax :

 The subquery (inner query) executes once before the main query (outer

query) executes.

 The main query (outer query) uses the subquery result.

SQL Subqueries Example :

In this section, you will learn the requirements of using subqueries. We have the

following two tables 'student' and 'marks' with common field 'StudentID'.

Student Marks 

Now we want to write a query to identify all students who get better marks than

that of the student whose StudentID is 'V002', but we do not know the marks of

'V002'.


- To solve the problem, we require two queries. One query returns the marks

(stored in Total_marks field) of 'V002' and a second query identifies the students

who get better marks than the result of the first query.

First query:

SELECT *  

FROM `marks`  

WHERE studentid = 'V002';

Query result:

The result of the query is 80.

- Using the result of this query, here we have written another query to identify the

students who get better marks than 80. Here is the query :

Second query:

SELECT a.studentid, a.name, b.total_marks

FROM student a, marks b

WHERE a.studentid = b.studentid

AND b.total_marks >80;

Query result:


The above two queries identified students who get better marks than the student whose StudentID is 'V002' (Abhay).

You can combine the above two queries by placing one query inside the other.

The subquery (also called the 'inner query') is the query inside the parentheses.

See the following code and query result:

SQL Code:

SELECT a.studentid, a.name, b.total_marks

FROM student a, marks b

WHERE a.studentid = b.studentid AND b.total_marks >

(SELECT total_marks

FROM marks

WHERE studentid =  'V002');

Query result:

Pictorial Presentation of SQL Subquery:


Subqueries: General Rules

A subquery SELECT statement is very similar to the SELECT statement used to begin a regular or outer query. Here is the syntax of a subquery:

Syntax:

(SELECT [DISTINCT] subquery_select_argument
 FROM {table_name | view_name}
      [{table_name | view_name}] ...
 [WHERE search_conditions]
 [GROUP BY aggregate_expression [, aggregate_expression] ...]
 [HAVING search_conditions])

Subqueries: Guidelines

There are some guidelines to consider when using subqueries:

 A subquery must be enclosed in parentheses. 

 A subquery must be placed on the right side of the comparison operator. 

 Subqueries cannot manipulate their results internally, therefore ORDER

BY clause cannot be added into a subquery. You can use an ORDER BY

clause in the main SELECT statement (outer query) which will be the last

clause.

 Use single-row operators with single-row subqueries. 

 If a subquery (inner query) returns a null value to the outer query, the

outer query will not return any rows when using certain comparison

operators in a WHERE clause.

Type of Subqueries

 Single row subquery: Returns zero or one row.

 Multiple row subquery: Returns one or more rows.

 Multiple column subqueries: Returns one or more columns.


 Correlated subqueries: Reference one or more columns in the outer SQL

statement. The subquery is known as a correlated subquery because the

subquery is related to the outer SQL statement.

 Nested subqueries: Subqueries are placed within another subquery.

Understanding Correlated and Uncorrelated Sub-queries in SQL

Sub-queries are queries within another query.  The result of the inner sub-query

is fed to the outer query, which uses that to produce its outcome. If that outer

query is itself the inner query to a further query, then the query will continue until

the final outer query completes.

There are two types of sub-queries in SQL however, correlated sub-queries and

uncorrelated sub-queries. Let’s take a look at these.

Uncorrelated Sub-query

An uncorrelated sub-query is a type of sub-query where the inner query doesn't

depend upon the outer query for its execution. It can complete its execution as a

standalone query. Let us explain uncorrelated sub-queries with the help of an

example.

Suppose, you have database “schooldb” which has two tables: student and

department.  A department will have many students. This means that the student

table has a column “dep_id” which contains the id of the department to which that

student belongs. Now, suppose we want to retrieve records of all students from

the “Computer” department.


The sub-query used in this case will be uncorrelated sub-query since the inner

query will retrieve the id of the computer department from the department table;

the result of this inner query will be directly fed into the outer query which

retrieves records of students from the student table where “dep_id” column’s

value is equal to value retrieved by inner query.

The inner query which retrieves the id of the department using name can be

executed as standalone query as well.

Correlated Sub-query

A correlated sub-query is a type of query, where inner query depends upon the

outcome of the outer query in order to perform its execution.

Suppose we have a student and department table in “schooldb” as discussed

above. We want to retrieve the name, age and gender of all the students whose

age is greater than the average age of students within their department.

In this case, the outer query will retrieve records of all the students iteratively and

each record is passed to the inner query. For each record, the inner query will

retrieve average age of the department for the student record passed by the

outer query. If the age of the student is greater than average age, the record of

the student will be included in the result; if not, it will be excluded. Let's see this in action.

Preparing the Data

Let’s create a database named “schooldb”. Run the following SQL in your query

window:
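CREATE DATABASE schooldb;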


The above command will create a database named “schooldb” on your database

server.

Next, we need to create a “department” table within the “schooldb” database. The

department table shall have three columns: id, name and capacity. To create

department table, execute following query:

CREATE TABLE department
(
  id INT PRIMARY KEY,
  name VARCHAR(50) NOT NULL,
  capacity INT NOT NULL
)
Next, let's add some dummy data to the table so that we can execute our sub-

queries. Execute the following to create 5 departments: English, Computer, Civil,

Maths and History.

USE schooldb;

INSERT INTO department
VALUES (1, 'English', 300),
       (2, 'Computer', 450),
       (3, 'Civil', 400),
       (4, 'Maths', 400),
       (5, 'History', 300);


Next we need to create a “student” table within our database. The student table

will have five columns: id, name, age, gender, and dep_id.

The dep_id column will act as the foreign key column and will have values from

the id column of the department table. This will create a one to many relationship

between the department and student tables. Execute following query to create

student table.

USE schooldb;

CREATE TABLE student
(
  id INT PRIMARY KEY,
  name VARCHAR(50) NOT NULL,
  gender VARCHAR(50) NOT NULL,
  age INT NOT NULL,
  dep_id INT NOT NULL
)


USE schooldb;

INSERT INTO student
  VALUES (1, 'Jolly', 'Female', 20, 4),
         (2, 'Jon', 'Male', 22, 3),
         (3, 'Sara', 'Female', 25, 4),
         (4, 'Laura', 'Female', 18, 2),
         (5, 'Alan', 'Male', 20, 3),
         (6, 'Kate', 'Female', 22, 2),
         (7, 'Joseph', 'Male', 18, 2),
         (8, 'Mice', 'Male', 23, 1),
         (9, 'Wise', 'Male', 21, 5),
         (10, 'Elis', 'Female', 27, 2);
Notice that the values in the "dep_id" column of the student table exist in the id column of the department table.

Now, let us see examples of both correlated and uncorrelated sub-queries.

Uncorrelated Sub-query Example

Let us execute an uncorrelated sub-query which retrieves records of all the

students who belong to “Computer” department.

USE schooldb;

SELECT * FROM student
WHERE dep_id =
  (
    SELECT id FROM department WHERE name = 'Computer'
  );
The output of the above SQL will be:

id   name     gender   age   dep_id
4    Laura    Female   18    2
6    Kate     Female   22    2
7    Joseph   Male     18    2
10   Elis     Female   27    2

You can see that there are two queries. The inner query retrieves id of the

“Computer” department while the outer query retrieves student records with that

id value in the dep_id column.

We know that in the case of uncorrelated sub-queries the inner query can be

executed as a standalone query and it will still work. Let's check if this is true in this

case. Execute the following query on the server.

SELECT id FROM department WHERE name = 'Computer';


The above query will execute successfully and will return 2, i.e. the id of the "Computer" department. This is an uncorrelated sub-query.

Correlated Sub-query Example


We know that in case of correlated sub-queries, the inner query depends upon

the outer query and cannot be executed as a standalone query.

Let's execute a correlated sub-query that retrieves records of all the students with age greater than the average age within their department, as discussed above.

USE schooldb;

SELECT name, gender, age
  FROM student greater
  WHERE age >
  (SELECT AVG(age)
     FROM student average
     WHERE greater.dep_id = average.dep_id);
The output of the above query will be:

name   gender   age
Kate   Female   22
Elis   Female   27
Jon    Male     22
Sara   Female   25


We know that in the case of a correlated sub-query, the inner query cannot be

executed as a standalone query. You can verify this by executing the following inner query on its own:

SELECT AVG(age)
  FROM student average
  WHERE greater.dep_id = average.dep_id
The above query will throw an error.

Other small differences between correlated and uncorrelated sub-queries are:

1. The outer query executes before the inner query in the case of a

correlated sub-query. On the other hand, in the case of an uncorrelated sub-query, the inner query executes before the outer query.

2. Correlated sub-queries are slower. They take M x N steps to execute

a query, where M is the number of records retrieved by the outer query and N is the number of iterations of the inner query. Uncorrelated sub-queries

complete execution in M + N steps.

SubQuery vs Join in SQL

Any information which you retrieve from the database using subquery can be

retrieved by using different types of joins also. SQL is flexible and it provides

different ways of doing the same thing. Some people find SQL joins confusing and subqueries, especially noncorrelated ones, more intuitive; but in terms of performance, SQL joins are generally more efficient than subqueries.

Important points about SubQuery in DBMS

1. Almost anything you want to do with a subquery can also be done using a join; it is just a matter of choice, and subqueries seem more intuitive to many users.

2. A subquery normally returns a scalar value as a result, or a result from one column if used along with the IN clause.

3. You can use subqueries in four places: as a column in the SELECT clause, in the FROM clause, in the WHERE clause, and in the HAVING clause.

4. In the case of a correlated subquery, the outer query gets processed before the inner query.

That's all about subquery in SQL. It's an important concept to learn and

understand, as both correlated and non-correlated subqueries are essential to solving

SQL query-related problems. They are not just important from the SQL interview

point of view but also from the Data Analysis point of view.

4. Understand the use of SQL in procedural languages, both standard

(e.g., PHP) and proprietary (e.g., PL/SQL).

The transaction controls help manage transaction processing, ensuring that

transactions are either completed or rolled back if errors or problems occur. The


security statements are used to control database access as well as to create

user roles and permissions.

SQL syntax is the coding format used in writing statements

Commonly used SQL statements include SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER, and TRUNCATE.

The first thing to understand about SQL is that SQL isn’t a procedural language,

as are Python, C, C++, C#, and Java. To solve a problem in a procedural

language, you write a procedure — a sequence of commands that performs one

specific operation after another until the task is complete. The procedure may be

a straightforward linear sequence or may loop back on itself, but in either case,

the programmer specifies the order of execution

SQL, on the other hand, is nonprocedural. To solve a problem using SQL, simply

tell SQL what you want (as if you were talking to Aladdin’s genie) instead of

telling the system how to get you what you want. The database management

system (DBMS) decides the best way to get you what you request.

All right. You were just told that SQL is not a procedural language — and that’s

essentially true. However, millions of programmers out there (and you’re

probably one of them) are accustomed to solving problems in a procedural

manner. So, in recent years, there has been a lot of pressure to add some

procedural functionality to SQL — and SQL now incorporates features of a

procedural language: BEGIN blocks, IF statements, functions, and (yes)


procedures. With these facilities added, you can store programs at the server,

where multiple clients can use your programs repeatedly.

To illustrate what is meant by “tell the system what you want,” suppose you have

an EMPLOYEE table from which you want to retrieve the rows that correspond to

all your senior people. You want to define a senior person as anyone older than

age 40 or anyone earning more than $100,000 per year. You can make the

desired retrieval by using the following query:

SELECT * FROM EMPLOYEE WHERE Age > 40 OR Salary > 100000 ;

This statement retrieves all rows from the EMPLOYEE table where either the

value in the Age column is greater than 40 or the value in the Salary column is

greater than 100,000. In SQL, you don’t have to specify how the information is

retrieved. The database engine examines the database and decides for itself

how to fulfill your request. You need only specify what data you want to retrieve.

SQL-on-Hadoop is a class of analytical application tools that combine

established SQL-style querying with newer Hadoop data framework elements.

By supporting familiar SQL queries, SQL-on-Hadoop lets a wider group of

enterprise developers and business analysts work with Hadoop on commodity

computing clusters. Because SQL was originally developed for relational

databases, it has to be modified for the Hadoop 1 model, which uses the Hadoop Distributed File System (HDFS) and MapReduce, or for the Hadoop 2 model, which can work without either HDFS or MapReduce.


The different means for executing SQL in Hadoop environments can be divided

into (1) connectors that translate SQL into a MapReduce format; (2) "push down"

systems that forgo batch-oriented MapReduce and execute SQL within Hadoop

clusters; and (3) systems that apportion SQL work between MapReduce-HDFS

clusters or raw HDFS clusters, depending on the workload.

One of the earliest efforts to combine SQL and Hadoop resulted in the Hive data

warehouse, which featured HiveQL software for translating SQL-like queries into

MapReduce jobs. Other tools that help support SQL-on-Hadoop include BigSQL,

Drill, Hadapt, Hawq, H-SQL, Impala, JethroData, Polybase, Presto, Shark (Hive

on Spark), Spark, Splice Machine, Stinger, and Tez (Hive on Tez).

A (very) little SQL history

SQL originated in one of IBM’s research laboratories, as did relational database

theory. In the early 1970s, as IBM researchers developed early relational DBMS

(or RDBMS) systems, they created a data sublanguage to operate on these

systems. They named the pre-release version of this sublanguage SEQUEL

(Structured English QUEry Language). However, when it came time to formally

release their query language as a product, they found that another company had

already trademarked the product name “Sequel.” Therefore, the marketing

geniuses at IBM decided to give the released product a name that was different

from SEQUEL but still recognizable as a member of the same family. So they

named it SQL, pronounced ess-que-ell. Although the official pronunciation is ess-

que-ell, people had become accustomed to pronouncing it “Sequel” in the early


pre-release days and continued to do so. That practice has persisted to the

present day; some people will say “Sequel” and others will say “S-Q-L,” but they

are both talking about the same thing.

PL/SQL

PL/SQL is a combination of SQL along with the procedural features of

programming languages. It was developed by Oracle Corporation in the early

90's to enhance the capabilities of SQL. PL/SQL is one of three key programming

languages embedded in the Oracle Database, along with SQL itself and Java.

This tutorial will give you a great understanding of PL/SQL so you can proceed with Oracle database and other advanced RDBMS concepts.

The PL/SQL programming language was developed by Oracle Corporation in the late 1980s as a procedural extension language for SQL and the Oracle relational database. Following are certain notable facts about PL/SQL −

PL/SQL is a completely portable, high-performance transaction-processing

language.

PL/SQL provides a built-in, interpreted and OS independent programming

environment.

PL/SQL can also directly be called from the command-line SQL*Plus interface.

Direct calls can also be made from external programming languages to the database.


PL/SQL's general syntax is based on that of the Ada and Pascal programming languages.

Apart from Oracle, PL/SQL is available in the TimesTen in-memory database and in IBM DB2.

Features of PL/SQL

PL/SQL has the following features:

-PL/SQL is tightly integrated with SQL.

-It offers extensive error checking.

-It offers numerous data types.

-It offers a variety of programming structures.

-It supports structured programming through functions and procedures.

-It supports object-oriented programming.

-It supports the development of web applications and server pages.

Advantages of PL/SQL

PL/SQL has the following advantages:

-SQL is the standard database language and PL/SQL is strongly integrated with SQL. PL/SQL supports both static and dynamic SQL. Static SQL supports DML operations and transaction control from PL/SQL blocks, while dynamic SQL allows embedding DDL statements in PL/SQL blocks.


-PL/SQL allows sending an entire block of statements to the database at one

time. This reduces network traffic and provides high performance for the

applications.

-PL/SQL gives high productivity to programmers as it can query, transform, and

update data in a database.

-PL/SQL saves time on design and debugging through strong features, such as exception handling, encapsulation, data hiding, and object-oriented data types.

-Applications written in PL/SQL are fully portable.

-PL/SQL provides a high level of security.

-PL/SQL provides access to predefined SQL packages.

-PL/SQL provides support for object-oriented programming.

-PL/SQL provides support for developing web applications and server pages.

In this chapter, we will discuss the Environment Setup of PL/SQL. PL/SQL is not

a standalone programming language; it is a tool within the Oracle programming

environment. SQL*Plus is an interactive tool that allows you to type SQL and

PL/SQL statements at the command prompt. These commands are then sent to

the database for processing. Once the statements are processed, the results are

sent back and displayed on screen.

To run PL/SQL programs, you should have the Oracle RDBMS Server installed on your machine; this will take care of the execution of the SQL commands. This tutorial uses Oracle 11g, and you can download a trial version of Oracle 11g from the following link −

Download Oracle 11g Express Edition

You will have to download either the 32-bit or the 64-bit version of the installation, as per your operating system. Usually there are two files. We have downloaded the 64-bit version. You will follow similar steps on your operating system, no matter whether it is Linux or Solaris.

win64_11gR2_database_1of2.zip

win64_11gR2_database_2of2.zip

After downloading the above two files, you will need to unzip them into a single directory named database, and under that you will find the following sub-directories −

Oracle Sub Directories

Step 1

Let us now launch the Oracle Database Installer using the setup file. Following is

the first screen. You can provide your email ID and check the checkbox as

shown in the following screenshot. Click the Next button.

Oracle Install 1

Step 2

You will be directed to the following screen; uncheck the checkbox and click the

Continue button to proceed.


Oracle install error

Step 3

Just select the first option Create and Configure Database using the radio button

and click the Next button to proceed.

Oracle Install 2

Step 4

We assume you are installing Oracle for the basic purpose of learning and that

you are installing it on your PC or Laptop. Thus, select the Desktop Class option

and click the Next button to proceed.

Oracle Install 3

Step 5

Provide a location, where you will install the Oracle Server. Just modify the

Oracle Base and the other locations will set automatically. You will also have to

provide a password; this will be used by the system DBA. Once you provide the

required information, click the Next button to proceed.

Oracle Install 4

Step 6

Again, click the Next button to proceed.


Oracle Install 5

Step 7

Click the Finish button to proceed; this will start the actual server installation.

Oracle Install 6

Step 8

This will take a few moments, until Oracle starts performing the required

configuration.

Oracle Install 7

Step 9

Here, Oracle installation will copy the required configuration files. This should

take a moment −

Oracle Configuration

Step 10

Once the database files are copied, you will have the following dialogue box. Just

click the OK button and come out.

Oracle Configuration

Step 11

Upon installation, you will have the following final window.


Oracle Install 8

Final Step

It is now time to verify your installation. At the command prompt, use the

following command if you are using Windows −

sqlplus "/ as sysdba"

You should have the SQL prompt where you will write your PL/SQL commands

and scripts −

PL/SQL Command Prompt

Text Editor

Running large programs from the command prompt may result in inadvertently losing some of your work. It is always recommended to use command files. To use the command files −

Type your code in a text editor, like Notepad, Notepad++, or EditPlus.

Save the file with the .sql extension in the home directory.

Launch the SQL*Plus command prompt from the directory where you created

your PL/SQL file.

Type @file_name at the SQL*Plus command prompt to execute your program.

If you are not using a file to execute the PL/SQL scripts, then simply copy your

PL/SQL code and right-click on the black window that displays the SQL prompt;


use the paste option to paste the complete code at the command prompt. Finally,

just press Enter to execute the code, if it is not already executed.
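As a minimal sketch, a command file saved as hello.sql (a purely illustrative file name) could contain a small anonymous PL/SQL block:

-- hello.sql: a minimal anonymous PL/SQL block run from SQL*Plus
SET SERVEROUTPUT ON

BEGIN
   dbms_output.put_line('Hello from SQL*Plus');
END;
/

Typing @hello at the SQL*Plus prompt then executes the file; SET SERVEROUTPUT ON is needed so the DBMS_OUTPUT text is displayed on screen.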

In this chapter, we will discuss the Data Types in PL/SQL. The PL/SQL variables,

constants and parameters must have a valid data type, which specifies a storage

format, constraints, and a valid range of values. We will focus on the SCALAR

and the LOB data types in this chapter. The other two data types will be covered

in other chapters.

S.No Category & Description

1 Scalar

Single values with no internal components, such as a NUMBER, DATE, or

BOOLEAN.

2 Large Object (LOB)

Pointers to large objects that are stored separately from other data items, such

as text, graphic images, video clips, and sound waveforms.

3 Composite

Data items that have internal components that can be accessed individually. For

example, collections and records.

4 Reference

Pointers to other data items.


PL/SQL Scalar Data Types and Subtypes

PL/SQL Scalar Data Types and Subtypes come under the following categories −

S.No Data Type & Description

1 Numeric

Numeric values on which arithmetic operations are performed.

2 Character

Alphanumeric values that represent single characters or strings of characters.

3 Boolean

Logical values on which logical operations are performed.

4 Datetime

Dates and times.

PL/SQL provides subtypes of data types. For example, the data type NUMBER

has a subtype called INTEGER. You can use the subtypes in your PL/SQL

program to make the data types compatible with data types in other programs

while embedding the PL/SQL code in another program, such as a Java program.

PL/SQL Numeric Data Types and Subtypes

Following table lists out the PL/SQL pre-defined numeric data types and their

sub-types −

S.No Data Type & Description


1 PLS_INTEGER

Signed integer in range -2,147,483,648 through 2,147,483,647, represented in

32 bits

2 BINARY_INTEGER

Signed integer in range -2,147,483,648 through 2,147,483,647, represented in

32 bits

3 BINARY_FLOAT

Single-precision IEEE 754-format floating-point number

4 BINARY_DOUBLE

Double-precision IEEE 754-format floating-point number

5 NUMBER(prec, scale)

Fixed-point or floating-point number with absolute value in range 1E-130 to (but

not including) 1.0E126. A NUMBER variable can also represent 0

6 DEC(prec, scale)

ANSI specific fixed-point type with maximum precision of 38 decimal digits

7 DECIMAL(prec, scale)

IBM specific fixed-point type with maximum precision of 38 decimal digits

8 NUMERIC(prec, scale)

Floating type with maximum precision of 38 decimal digits

9 DOUBLE PRECISION

ANSI specific floating-point type with maximum precision of 126 binary digits

(approximately 38 decimal digits)

10 FLOAT

ANSI and IBM specific floating-point type with maximum precision of 126 binary

digits (approximately 38 decimal digits)

11 INT

ANSI specific integer type with maximum precision of 38 decimal digits

12 INTEGER

ANSI and IBM specific integer type with maximum precision of 38 decimal digits

13 SMALLINT

ANSI and IBM specific integer type with maximum precision of 38 decimal digits

14 REAL

Floating-point type with maximum precision of 63 binary digits (approximately 18

decimal digits)

Following is a valid declaration −



DECLARE

   num1 INTEGER;

   num2 REAL;

   num3 DOUBLE PRECISION;

BEGIN

   null;

END;

When the above code is compiled and executed, it produces the following result

PL/SQL procedure successfully completed

PL/SQL Character Data Types and Subtypes

Following is the detail of PL/SQL pre-defined character data types and their sub-

types −

S.No Data Type & Description

1 CHAR

Fixed-length character string with maximum size of 32,767 bytes


2 VARCHAR2

Variable-length character string with maximum size of 32,767 bytes

3 RAW

Variable-length binary or byte string with maximum size of 32,767 bytes, not

interpreted by PL/SQL

4 NCHAR

Fixed-length national character string with maximum size of 32,767 bytes

5 NVARCHAR2

Variable-length national character string with maximum size of 32,767 bytes

6 LONG

Variable-length character string with maximum size of 32,760 bytes

7 LONG RAW

Variable-length binary or byte string with maximum size of 32,760 bytes, not

interpreted by PL/SQL

8 ROWID

Physical row identifier, the address of a row in an ordinary table

9 UROWID


Universal row identifier (physical, logical, or foreign row identifier)

PL/SQL Boolean Data Types

The BOOLEAN data type stores logical values that are used in logical

operations. The logical values are the Boolean values TRUE and FALSE and the

value NULL.

However, SQL has no data type equivalent to BOOLEAN. Therefore (as the short sketch after this list illustrates), Boolean values cannot be used in −

SQL statements

Built-in SQL functions (such as TO_CHAR)

PL/SQL functions invoked from SQL statements
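A minimal sketch of this limitation (the variable name is illustrative only): a BOOLEAN can be declared and tested inside PL/SQL, but it must be converted to a SQL-compatible type, such as a character string, before its value can be handed to SQL.

DECLARE
   is_active BOOLEAN := TRUE;   -- legal inside PL/SQL
BEGIN
   IF is_active THEN
      dbms_output.put_line('The flag is TRUE');
   END IF;

   -- A BOOLEAN cannot appear in an SQL statement or in TO_CHAR,
   -- so convert it to a string first when the value must leave PL/SQL:
   dbms_output.put_line(CASE WHEN is_active THEN 'TRUE' ELSE 'FALSE' END);
END;
/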

PL/SQL Datetime and Interval Types

The DATE datatype is used to store fixed-length datetimes, which include the

time of day in seconds since midnight. Valid dates range from January 1, 4712

BC to December 31, 9999 AD.

The default date format is set by the Oracle initialization parameter

NLS_DATE_FORMAT. For example, the default might be 'DD-MON-YY', which

includes a two-digit number for the day of the month, an abbreviation of the

month name, and the last two digits of the year. For example, 01-OCT-12.
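As a small sketch, the effect of this format model can be seen by formatting the current date with TO_CHAR:

-- Display today's date using the 'DD-MON-YY' format model
SELECT TO_CHAR(SYSDATE, 'DD-MON-YY') AS today
FROM   dual;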


Each DATE includes the century, year, month, day, hour, minute, and second.

The following table shows the valid values for each field −

YEAR
   Valid datetime values: -4712 to 9999 (excluding year 0)
   Valid interval values: any nonzero integer

MONTH
   Valid datetime values: 01 to 12
   Valid interval values: 0 to 11

DAY
   Valid datetime values: 01 to 31 (limited by the values of MONTH and YEAR, according to the rules of the calendar for the locale)
   Valid interval values: any nonzero integer

HOUR
   Valid datetime values: 00 to 23
   Valid interval values: 0 to 23

MINUTE
   Valid datetime values: 00 to 59
   Valid interval values: 0 to 59

SECOND
   Valid datetime values: 00 to 59.9(n), where 9(n) is the precision of time fractional seconds
   Valid interval values: 0 to 59.9(n), where 9(n) is the precision of interval fractional seconds

TIMEZONE_HOUR
   Valid datetime values: -12 to 14 (the range accommodates daylight savings time changes)
   Valid interval values: not applicable

TIMEZONE_MINUTE
   Valid datetime values: 00 to 59
   Valid interval values: not applicable

TIMEZONE_REGION
   Valid datetime values: found in the dynamic performance view V$TIMEZONE_NAMES
   Valid interval values: not applicable

TIMEZONE_ABBR
   Valid datetime values: found in the dynamic performance view V$TIMEZONE_NAMES
   Valid interval values: not applicable


PL/SQL Large Object (LOB) Data Types

Large Object (LOB) data types refer to large data items such as text, graphic

images, video clips, and sound waveforms. LOB data types allow efficient,

random, piecewise access to this data. Following are the predefined PL/SQL

LOB data types −

BFILE: Used to store large binary objects in operating system files outside the database. Size: system-dependent; cannot exceed 4 gigabytes (GB).

BLOB: Used to store large binary objects in the database. Size: 8 to 128 terabytes (TB).

CLOB: Used to store large blocks of character data in the database. Size: 8 to 128 TB.

NCLOB: Used to store large blocks of NCHAR data in the database. Size: 8 to 128 TB.

PL/SQL User-Defined Subtypes

A subtype is a subset of another data type, which is called its base type. A

subtype has the same valid operations as its base type, but only a subset of its

valid values.

PL/SQL predefines several subtypes in package STANDARD. For example,

PL/SQL predefines the subtypes CHARACTER and INTEGER as follows −


SUBTYPE CHARACTER IS CHAR;

SUBTYPE INTEGER IS NUMBER(38,0);

You can define and use your own subtypes. The following program illustrates

defining and using a user-defined subtype −

DECLARE

   SUBTYPE name IS char(20);

   SUBTYPE message IS varchar2(100);

   salutation name;

   greetings message;

BEGIN

   salutation := 'Reader ';

   greetings := 'Welcome to the World of PL/SQL';

   dbms_output.put_line('Hello ' || salutation || greetings);

END;


When the above code is executed at the SQL prompt, it produces the following

result −

Hello Reader Welcome to the World of PL/SQL

PL/SQL procedure successfully completed.

NULLs in PL/SQL

PL/SQL NULL values represent missing or unknown data and they are not an

integer, a character, or any other specific data type. Note that NULL is not the

same as an empty data string or the null character value '\0'. A null can be

assigned but it cannot be equated with anything, including itself.
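A minimal sketch of this behaviour (the variable name is illustrative only):

DECLARE
   bonus NUMBER;   -- never assigned, so its value is NULL
BEGIN
   IF bonus = NULL THEN                  -- never TRUE: NULL cannot be equated with anything
      dbms_output.put_line('equal to NULL');
   ELSIF bonus IS NULL THEN              -- the correct test for a missing value
      dbms_output.put_line('bonus IS NULL');
   END IF;
END;
/

Only the second branch prints, because the comparison bonus = NULL evaluates to NULL rather than TRUE.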

PHP

The PHP Hypertext Preprocessor (PHP) is a programming language that allows

web developers to create dynamic content that interacts with databases. PHP is

basically used for developing web-based software applications. This tutorial helps you build your base with PHP.

Why to Learn PHP?


PHP started out as a small open source project that evolved as more and more

people found out how useful it was. Rasmus Lerdorf unleashed the first version

of PHP way back in 1994.

PHP is a must for students and working professionals who want to become great software engineers, especially when they are working in the web development domain. Some of the key advantages of learning PHP are listed below:

PHP is a recursive acronym for "PHP: Hypertext Preprocessor".

PHP is a server side scripting language that is embedded in HTML. It is used to

manage dynamic content, databases, session tracking, even build entire e-

commerce sites.

It is integrated with a number of popular databases, including MySQL,

PostgreSQL, Oracle, Sybase, Informix, and Microsoft SQL Server.

PHP is pleasingly zippy in its execution, especially when compiled as an Apache

module on the Unix side. The MySQL server, once started, executes even very

complex queries with huge result sets in record-setting time.

PHP supports a large number of major protocols such as POP3, IMAP, and

LDAP. PHP4 added support for Java and distributed object architectures (COM

and CORBA), making n-tier development a possibility for the first time.

PHP is forgiving: PHP language tries to be as forgiving as possible.


PHP Syntax is C-Like.

Characteristics of PHP

Five important characteristics make PHP's practical nature possible −

Simplicity

Efficiency

Security

Flexibility

Familiarity

PHP functions are similar to those in other programming languages. A function is a piece of code which takes one or more inputs in the form of parameters, does some processing, and returns a value.

You already have seen many functions like fopen() and fread() etc. They are

built-in functions but PHP gives you the option to create your own functions as

well.

There are two parts which should be clear to you −

Creating a PHP Function

Calling a PHP Function


In fact, you hardly need to create your own PHP functions because there are already more than 1,000 built-in library functions covering different areas; you just need to call them according to your requirements.

Please refer to the PHP Function Reference for a complete set of useful functions.

Creating PHP Function

It's very easy to create your own PHP function. Suppose you want to create a PHP function which will simply write a message to the browser when you call it. Such an example would create a function called writeMessage() and then call it just after creating it.

PHP Functions with Parameters

PHP gives you the option to pass parameters to a function. You can pass as many parameters as you like. These parameters work like variables inside your function. A typical example takes two integer parameters, adds them together, and then prints the result.

Passing Arguments by Reference

It is possible to pass arguments to functions by reference. This means that a

reference to the variable is manipulated by the function rather than a copy of the

variable's value.

Any changes made to an argument in these cases will change the value of the original variable. You can pass an argument by reference by adding an ampersand to the variable name in the function definition.

5. Understand common uses of database triggers and stored

procedures

DATABASE TRIGGERS

Because a trigger resides in the database and anyone who has the required privilege can use it, a trigger lets you write a set of SQL statements that multiple applications can use. It lets you avoid redundant code when multiple programs need to perform the same database operation.

You can use triggers to perform the following actions, as well as others that are

not found in this list:

Create an audit trail of activity in the database. For example, you can track

updates to the orders table by updating corroborating information to an audit

table.

Implement a business rule. For example, you can determine when an order

exceeds a customer's credit limit and display a message to that effect.

Derive additional data that is not available within a table or within the database.

For example, when an update occurs to the quantity column of the items table,

you can calculate the corresponding adjustment to the total_price column.


Enforce referential integrity. When you delete a customer, for example, you can

use a trigger to delete corresponding rows that have the same customer number

in the orders table.

Benefits of Triggers

Following are the benefits of triggers.

Generating some derived column values automatically

Enforcing referential integrity

Event logging and storing information on table access

Auditing

Synchronous replication of tables

Imposing security authorizations

Preventing invalid transactions

Types of Triggers in Oracle

Triggers can be classified based on the following parameters.

Classification based on the timing

BEFORE Trigger: It fires before the specified event has occurred.

AFTER Trigger: It fires after the specified event has occurred.

INSTEAD OF Trigger: A special type; you will learn more about it in the topics that follow. (only for DML)



Classification based on the level

STATEMENT level Trigger: It fires one time for the specified event statement.

ROW level Trigger: It fires for each record that got affected in the specified event.

(only for DML)

Classification based on the Event

DML Trigger: It fires when the DML event is specified

(INSERT/UPDATE/DELETE)

DDL Trigger: It fires when the DDL event is specified (CREATE/ALTER)

DATABASE Trigger: It fires when the database event is specified

(LOGON/LOGOFF/STARTUP/SHUTDOWN)
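The general shape of a trigger definition, reconstructed here as a sketch because the original syntax listing is not reproduced in this text, is roughly:

CREATE [OR REPLACE] TRIGGER trigger_name
{BEFORE | AFTER | INSTEAD OF}
{INSERT [OR] UPDATE [OR] DELETE}
ON table_or_view_name
[FOR EACH ROW]
[WHEN (condition)]
DECLARE
   -- optional declarations
BEGIN
   -- executable statements
EXCEPTION
   -- optional exception handlers
END;
/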

Syntax Explanation:

The above syntax shows the different optional statements that are present in

trigger creation.

BEFORE/ AFTER will specify the event timings.

INSERT/UPDATE/LOGON/CREATE/etc. will specify the event for which the

trigger needs to be fired.


ON clause will specify on which object the above-mentioned event is valid. For

example, this will be the table name on which the DML event may occur in the

case of DML Trigger.

Command “FOR EACH ROW” will specify the ROW level trigger.

WHEN clause will specify the additional condition in which the trigger needs to

fire.

The declaration part, execution part, and exception handling part are the same as those of other PL/SQL blocks. The declaration part and exception handling part are optional.

:NEW and :OLD Clause

In a row-level trigger, the trigger fires for each affected row, and sometimes it is required to know the value before and after the DML statement. Oracle provides two clauses in the row-level trigger to hold these values. We can use these clauses to refer to the old and new values inside the trigger body.

:NEW – It holds a new value for the columns of the base table/view during the

trigger execution

:OLD – It holds old value of the columns of the base table/view during the trigger

execution


These clauses should be used based on the DML event: for an INSERT only :NEW is valid, for a DELETE only :OLD is valid, and for an UPDATE both :NEW and :OLD are valid, as the example below shows.
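A minimal sketch of a row-level trigger that uses both clauses (the emp table and its emp_id and salary columns are assumed only for illustration):

CREATE OR REPLACE TRIGGER trg_emp_salary_audit
BEFORE UPDATE OF salary ON emp
FOR EACH ROW
WHEN (NEW.salary <> OLD.salary)          -- no colon is used inside the WHEN clause
BEGIN
   dbms_output.put_line('Employee ' || :OLD.emp_id ||
                        ': salary changed from ' || :OLD.salary ||
                        ' to ' || :NEW.salary);
END;
/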

INSTEAD OF Trigger

"INSTEAD OF trigger" is a special type of trigger. It is used only as a DML trigger, when a DML event is going to occur on a complex view. Consider an example in which a view is made from 3 base tables. When any DML event is issued over this view, it would normally fail because the data is taken from 3 different tables. This is where the INSTEAD OF trigger is used: it modifies the base tables directly instead of modifying the view for the given event.

Example 1: In this example, we are going to create a complex view from two base tables:

Table_1 is the emp table and

Table_2 is the department (dept) table.

Then we are going to see how the INSTEAD OF trigger is used to handle an UPDATE of the location detail issued against this complex view. We are also going to see how :NEW and :OLD are useful in triggers.

Step 1: Creating table ’emp’ and ‘dept’ with appropriate columns

Step 2: Populating the table with sample values

Step 3: Creating view for the above created table


Step 4: Update of view before the instead-of trigger

Step 5: Creation of the instead-of trigger (see the condensed sketch below)

Step 6: Update of view after instead-of trigger

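A condensed sketch of steps 1, 3, and 5, using simplified and purely illustrative table and column definitions, might look like this:

-- Step 1 (condensed): two illustrative base tables
CREATE TABLE dept (
   dept_id   NUMBER PRIMARY KEY,
   dept_name VARCHAR2(30),
   location  VARCHAR2(30)
);

CREATE TABLE emp (
   emp_id    NUMBER PRIMARY KEY,
   emp_name  VARCHAR2(30),
   dept_id   NUMBER REFERENCES dept (dept_id)
);

-- Step 3 (condensed): a complex view joining the two tables
CREATE OR REPLACE VIEW emp_dept_view AS
SELECT e.emp_id, e.emp_name, d.dept_name, d.location
FROM   emp e JOIN dept d ON e.dept_id = d.dept_id;

-- Step 5 (condensed): an INSTEAD OF trigger that redirects an UPDATE of the
-- view's location column to the underlying dept table
CREATE OR REPLACE TRIGGER trg_emp_dept_view_upd
INSTEAD OF UPDATE ON emp_dept_view
FOR EACH ROW
BEGIN
   UPDATE dept
   SET    location = :NEW.location
   WHERE  dept_name = :OLD.dept_name;
END;
/

Once the trigger exists, an UPDATE against emp_dept_view succeeds because the trigger modifies the dept base table directly, which is the behaviour described above; without it, the update on the join view would fail.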

Database Stored Procedure

Database-stored procedures are sets of pre-compiled SQL statements created in the server, called and executed by database applications. It is very simple, and the same result can be achieved by an SQL query.
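As a minimal sketch in Oracle PL/SQL (the procedure, table, and column names are assumed purely for illustration):

-- A pre-compiled procedure stored on the server
CREATE OR REPLACE PROCEDURE raise_salary (
   p_emp_id IN NUMBER,
   p_amount IN NUMBER
) AS
BEGIN
   UPDATE emp
   SET    salary = salary + p_amount
   WHERE  emp_id = p_emp_id;
END raise_salary;
/

-- Any application (or SQL*Plus session) can then execute it simply by name:
BEGIN
   raise_salary(101, 500);
END;
/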

Stored Procedures Advantages

Stored procedures increase the performance of an application. Once created, a stored procedure is compiled and stored in the database catalog. It runs faster than uncompiled SQL commands which are sent from the application.

Stored procedures reduce the traffic between the application and the database server because, instead of sending multiple long uncompiled SQL statements, the application only has to send the stored procedure's name and get the result back.

Stored procedures are reusable and transparent to any application which wants to use them. A stored procedure exposes the database interface to all applications, so developers don't have to program functions which are already supported by a stored procedure in all programs.


Stored procedures are secure. The database administrator can grant rights that control which applications can access which stored procedures in the database catalog, without granting any permission on the underlying database tables.

Stored Procedures Disadvantages

Stored procedures put a high load on the database server in both memory and processors. Instead of being focused on storing and retrieving data, you could be asking the database server to perform a number of logical operations or complex business logic, which is not its role.

Stored procedures mainly contain declarative SQL, so it is very difficult to write a procedure with complex business logic, as you could in application-layer languages such as Java, C#, or C++.

You cannot debug stored procedures in most RDBMSs, including MySQL. There are some workarounds for this problem, but they are still not good enough.

Writing and maintaining stored procedures usually requires a specialized skill set that not all developers possess. This introduces problems in both the application development and maintenance phases.


CHAPTER 7:

DATABASE APPLICATION

DEVELOPMENT

Researched and presented by:

Celino, Ralph Stephen


Rodriguez, Zyra Mae M.


Client-server System

A client-server system is a computing system that is composed of two logical parts: a server, which provides services, and a client, which requests them. The two parts can run on separate machines on a network, allowing users to access powerful server resources from their personal computers. McGraw-Hill (2003)

THREE COMPONENTS OF CLIENT/SERVER SYSTEMS

 Data presentation services

It is the input/output (I/O), or presentation logic, component. This

component is responsible for formatting and presenting data on the user’s screen

or other output device and for managing user input from a keyboard or other

input device. Presentation logic often resides on the client and is the mechanism

with which the user interacts with the system. 

 Input: Keyboard, Mouse

 Output: Monitor, Printer

 Graphical User Interface (GUI): is an interface through which a

user interacts with electronic devices such as computers and

smartphones through the use of icons, menus and other visual

indicators or representations (graphics).


 Processing services

This handles data processing logic, business rules logic, and data

management logic. Processing logic resides on both the client and servers.

 Input/Output processing: includes such activities as data

validation and identification of processing errors.

 Business rules logic: rules that have not been coded at the DBMS level may be coded in the processing component.

 Data management logic: identifies the data necessary for

processing the transaction or query.

 Storage services

The component responsible for data storage and retrieval from the

physical storage devices associated with the application. Storage logic usually

resides on the database server, close to the physical location of the data.

Activities of a DBMS occur in the storage logic component.

 Data storage: is the recording (storing) of information (data).

 Data retrieval: is the process of identifying and extracting data

from a database, based on a query provided by the user or

application.

 Database Management System (DBMS): are software systems

used to store, retrieve, and run queries on data. A DBMS serves


as an interface between an end-user and a database, allowing

users to create, read, update, and delete data in the database.

 DBMS activities include data dictionary management, data storage management, data security management, and data integrity management.

In the fat client, the application processing occurs entirely on the client, whereas in the thin client, this processing occurs primarily on the server. In the distributed example, or what we call the hybrid client, application processing is partitioned between the client and the server.

According to Spacey, J. (n.d.-b), a thin client is software that is primarily designed to communicate with a server. Its features are produced by servers such as a cloud platform. A thick client is software that implements its own features. It may connect to servers, but it remains mostly functional when it is disconnected.

TWO-TIER AND THREE-TIER ARCHITECTURE

 TWO TIER ARCHITECTURE

 It is a client-server application.

 It was built in the 1980s and can support up to 100 users.

 Two-tier architecture is generally divided into two parts: Client

application and Database.

 The client in a two-tier architecture application has the code written

for saving data in the database. 

 The client sends a request to the server, which then processes the request and sends back the data.

 The client handles both the presentation layer (application interface)

and application layer (logical operations), while the server system

handles the database. 


Characteristics of Two-tier Architecture 

These include the advantages and disadvantages of two-tier architecture:

 No intermediate application present: The client directly interacts with the server without the presence of any intermediate application.

 Use of Application Programming Interface (API): The client application communicates with the data layer through a database bridge Application Programming Interface (API). The most common APIs are Open Database Connectivity (ODBC) and ADO.NET for the Microsoft platform (VB.NET and C#), and Java Database Connectivity (JDBC) for use with Java programs.

 Installation of database driver: The database driver is installed on each computer that runs the client application. The driver must be reinstalled on every computer if the database changes, which increases deployment cost.

 Database connection: Each client establishes a separate database connection.

 High network traffic: caused by an increase in the number of data-transfer trips across the physical boundaries of the network. Bhuvana (2006, August 24)


Applications of two-tier architecture

 Software installed on a client machine

THREE-TIER ARCHITECTURE

 It is a web-based application.

 Introduced in the 1990s (proposed in 1995), it accommodates hundreds of users.

 Three-tier Architecture is generally divided into three parts:

Presentation layer (Client tier), Application layer (Business tier) and

Database layer (Data tier).


 In the three-tier architecture, the application logic or process resides in the middle tier; it is separated from the data and the user interface.

Advantages of three-tier architecture

 Maintainability: Because each tier is independent of the other tiers,

updates or changes can be carried out without affecting the

application as a whole.

 Scalability: Because tiers are based on the deployment of layers,

scaling out an application is reasonably straightforward.

 Flexibility: Because each tier can be managed or scaled

independently, flexibility is increased.

 Faster development: Because of the division of work, the web designer does the presentation, the software engineer does the logic, and the DB admin does the data model. Benitamayekar (n.d.)

 Better match of systems to business needs: New modules can be

built to support specific business needs rather than building more

general, complete applications.

 Improved customer service: Multiple interfaces on different clients

can access the same business process.

Disadvantages of three-tier architecture


 High installation cost.

 The structure is more complex compared to two-tier architecture.

Applications

 E-commerce Websites

 Database related Websites

CONNECT DATABASES IN A TWO-TIER APPLICATION

 VB.NET

The VB.NET code shown in figure 1 below uses the ADO.NET data

access framework and .NET data providers to connect to the database.

The .NET Framework has different data providers (or database drivers)

that allow you to connect a program written in a .NET programming

language to a database. Common data providers available in the

framework are for SQL Server and Oracle.

Figure 1-a shows the VB.NET code needed to create a simple form that allows the user to input a name, department number, and student ID.

Figure 1-a: Setup form for receiving user input.


Figure 1-b shows the detailed steps to connect to a database and

issue an INSERT query. By reading the explanations presented in the text

boxes in the figure, you can see how the generic steps for accessing a

database described in the previous section are implemented in the context

of a VB.NET program.

Figure 1-b: Connecting to a database and issuing an INSERT query.

Figure 1-c shows how you would access the database and process the results for a SELECT query. The main difference is that you use the ExecuteReader() method instead of the ExecuteNonQuery() method. The latter is used for INSERT, UPDATE, and DELETE queries. The table that results from running a SELECT query is


captured inside an OracleDataReader object. You can access each row in

the result by traversing the object, one row at a time. Each column in the

object can be accessed by a Get method and by referring to the column’s

position in the query result (or by name). ADO.NET provides two main

choices with respect to handling the result of the query: DataReader (e.g.,

OracleDataReader in Figure 1-c) and DataSet. The primary difference

between the two options is that the first limits us to looping through the

result of a query one row at a time. This can be very cumbersome if the

result has a large number of rows. The DataSet object provides a

disconnected snapshot of the database that we can then manipulate in our

program using the features available in the programming language. Later

in this chapter, we will see how .NET data controls (which use DataSet

objects) can provide a cleaner and easier way to manipulate data in a

program.

Figure 1-c : Sample code snippet for using a select query.

 JAVA

This Java application is actually connecting to the same database as the VB.NET application in Figure 1. Its purpose is to retrieve and print the names of all students in the Student table.


In this example, the Java program is using the JDBC API and an

Oracle thin driver to access the Oracle database. Notice that unlike the

INSERT query shown in the VB.NET example, running an SQL SELECT

query requires us to capture the data inside an object that can

appropriately handle the tabular data. JDBC provides two key

mechanisms for this: the ResultSet and RowSet objects. The difference

between these two is somewhat similar to the difference between the

DataReader and DataSet objects described in the VB.NET example. 

The ResultSet object has a mechanism, called the cursor, that

points to its current row of data. When the ResultSet object is first

initialized, the cursor is positioned before the first row. This is why we

need to first call the next() method before retrieving data. The ResultSet

object is used to loop through and process each row of data and retrieve

the column values that we want to access. In this case, we access the

value in the name column using the rec.getString method, which is a part

of the JDBC API. For each of the common database types, there is a

corresponding get and set method that allows for retrieval and storage of

data in the database. It is important to note that while the ResultSet object

maintains an active connection to the database, depending on the size of

the table, the entire table (i.e., the result of the query) may or may not

actually be in memory on the client machine. How and when data are

transferred between the database and client is handled by the Oracle

driver. By default, a ResultSet object is read-only and can be traversed


only in one direction (forward). However, advanced versions of the

ResultSet object allow scrolling in both directions and can be updateable

as well.

Figure 2-a: Database access from a Java Program.

KEY COMPONENTS OF A WEB APPLICATION

 Database Server

This server hosts the storage logic for the application and hosts the

DBMS. You have read about many of them, including Oracle, Microsoft SQL

Server, Informix, Sybase, DB2, Microsoft Access, and MySQL. The DBMS may

reside either on a separate machine or on the same machine as the Web server.

It can be configured to provide data access for authorized users only. This

type of server keeps the data in a central location that can be regularly backed

up. It also allows users and applications to centrally access the data across the

network. A large number of the databases in your organization can be kept on one server or a group of servers that are specifically configured to protect data and service client requests.

A database server is a machine running database software dedicated to

providing database services. It is a crucial component in the client-server

computing environment where it provides business-critical information requested

by the client systems.

A database server consists of hardware and software that run a database.

 The software part of a database server, or the database instance, is

the back-end database application. The application represents a set

of memory structures and background processes accessing a set of

database files.

 The hardware part of a database server is the server system used for

database storage and retrieval.

 Web Server

The Web server provides the basic functionality needed to receive and

respond to requests from browser clients. These requests use HTTP or HTTPS

as a protocol.

The main job of a web server is to display website content through storing,

processing and delivering webpages to users. Besides HTTP, web servers also

support SMTP (Simple Mail Transfer Protocol) and FTP (File Transfer Protocol),

used for email, file transfer and storage. Gillis, A. S. (2020)


 Application Server

This software provides the building blocks for creating dynamic Web sites

and Web-based applications. Examples include the .NET Framework from

Microsoft; Java Platform, Enterprise Edition (Java EE); and ColdFusion. Also,

while technically not considered an application server platform, software that

enables you to write applications in languages such as PHP, Python, and Perl

also belong to this category.

An application server is a program that resides on the server side, and it's a server program providing the business logic behind any application. This server can be a part of the network or of a distributed network.

process any requests by connecting to the Database and returning the

information back to web servers. Pedamkar, P. (2021)

A web browser is software that allows you to find and view websites on the Internet. Microsoft's Internet Explorer, Mozilla's Firefox, Apple's Safari, Google's Chrome, and Opera are examples of web browsers. Sciencedirect (n.d.)

Information flow:

The database server stores the Database Management System (DBMS)

and the database itself. Its main role is to receive requests from client machines,

search for the required data, and pass back the results. Clients access a

database server through a front-end application that displays the requested data

on the client machine, or through a back-end application that runs on the server


and manages the database. In a master-slave model, the database master

server is the primary data location. Database slave servers are replicas of the

master server that act as proxies. Marijan, B. (2021)

When a web browser, like Google Chrome or Firefox, needs a file that's

hosted on a web server, the browser will request the file by HTTP. When the

request is received by the web server, the HTTP server will accept the request,

find the content and send it back to the browser through HTTP. 

Application servers are basically used in a web-based application

that has 3-tier architecture. The position at which the application server fits

in is described below:

 Tier 1 – This is a GUI interface that resides at the client end and is

usually a thin client (e.g., browser)

 Tier 2 – This is called the middle tier, which consists of the Application

Server.

 Tier 3 – This is the 3rd tier, which consists of backend servers, e.g., a database server.

As we can see, application servers usually communicate with the web server to serve any request that is coming from clients. The client first makes a request, which goes to the web server. The web server then sends it to the middle tier, i.e., the application server, which in turn gets the information from the 3rd tier (e.g., the database server) and sends it back to the web server. The web server then sends the required information back to the client. Different approaches are used to process requests through web servers, some of them being JSP, PHP, and ASP.NET.

Web vs. Application server

Web Server

 Deliver static content

 Content is delivered using the

HTTP protocol only.  

 Serves only web-based

applications. 

Application Server

 Delivers dynamic content

 Provides business logic to application programs using several protocols

(including HTTP). 

 Can serve web and enterprise-based applications. Edpresso Team.

(2021)


General overview of the information flow in a Web application. A user

submitting a Web page request is unaware of whether the request being

submitted is returning a static Web page or a Web page whose content is a

mixture of static information and dynamic information retrieved from the

database. The data returned from the Web server is always in a format that can

be rendered by the browser (i.e., HTML or XML). As shown in Figure, if the Web

server determines that the request from the client can be satisfied without

passing the request on to the application server, it will process the request and

then return the appropriately formatted information to the client machine. 

CONNECT TO DATABASES IN A THREE-TIER WEB APPLICATION

 JSP - (Java Server Pages)

As with a normal page, your browser sends an HTTP request to the web

server.

 The web server recognizes that the HTTP request is for a JSP page

and forwards it to a JSP engine. This is done by using the URL or JSP

page which ends with .jsp instead of .html.

 A part of the web server called the servlet engine loads the Servlet

class and executes it. During execution, the servlet produces an

output in HTML format. The output is further passed on to the web

server by the servlet engine inside an HTTP response.

 The web server forwards the HTTP response to your browser in terms

of static HTML content.


 Finally, the web browser handles the dynamically-generated HTML

page inside the HTTP response exactly as if it were a static page. JSP

- Architecture. (n.d.)

In a PHP three-tier structure:

The biggest difference between a Java web server and PHP is that

PHP doesn't have its own built-in web server. PHP itself is basically one

executable which reads in a source code file of PHP code and

interprets/executes the commands written in that file.

PHP runs on a third-party web server which handles any incoming

requests and invokes the PHP interpreter with the given requested PHP

source code file as argument, then delivers any output of that process

back to the HTTP client. 

API is the acronym for Application Programming Interface, which is

a software intermediary that allows two applications to talk to each other.

When you use an application on your mobile phone, the application

connects to the Internet and sends data to a server. The server then

retrieves that data, interprets it, performs the necessary actions and sends

it back to your phone. The application then interprets that data and

presents you with the information you wanted in a readable way. MySQL is an example of a database that can be hosted in the cloud.


 ASP.NET (Active Server Pages .NET)

ASP.NET hides the complex processes of data access and

provides much higher level of classes and objects through which data is

accessed easily. These classes hide all complex coding for connection,

data retrieving, data querying, and data manipulation. ASP.NET -

Database Access. (n.d.). 

1. Create a web site and add a SqlDataSourceControl on the web

form.

2. Click on the Configure Data Source option.

3. Click on the New Connection button to establish connection with

a database.

4. Once the connection is set up, you may save it for further use.

At the next step, you are asked to configure the select

statement:

5. Select the columns and click next to complete the steps.

Observe the WHERE, ORDER BY, and the Advanced buttons.

These buttons allow you to provide the where clause, order by

clause, and specify the insert, update, and delete commands of

SQL respectively. This way, you can manipulate the data.

6. Add a GridView control on the form. Choose the data source

and format the control using AutoFormat option.

7. After this the formatted GridView control displays the column

headings, and the application is ready to execute.


8. Finally execute the application.

THE PURPOSE OF XML AND ITS USES

XML focuses on the transport of data without managing the appearance or

presentation of the output. XML addresses the issue of representing data in a

structure and format that can both be exchanged over the Internet and be

interpreted by different components (i.e., browsers, Web servers, application

servers). XML Introduction. (n.d.)

Most XML applications will work as expected even if new data is added (or removed). Imagine an application designed to display a version of an XML document with <to>, <from>, <heading>, and <body> elements. Then imagine a version with added <date> and <hour> elements, and a removed <heading>.

 The way XML is constructed, even older versions of the application can still work with the new documents:

Example (what the application displays):

__

Old Version

Note
To: Tove
From: Jani
Reminder
Don't forget me this weekend!

New Version

Note
To: Tove
From: Jani
Date: 2015-09-01 08:30
Don't forget me this weekend!

__
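The underlying XML documents implied by the tags described above would look roughly like this (a reconstruction for illustration, since the markup itself is not shown in the text):

<!-- Old version -->
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

<!-- New version: <date> and <hour> added, <heading> removed -->
<note>
  <date>2015-09-01</date>
  <hour>08:30</hour>
  <to>Tove</to>
  <from>Jani</from>
  <body>Don't forget me this weekend!</body>
</note>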

The XML standard is a flexible way to create information formats and

electronically share structured data via the public Internet, as well as via

corporate networks. 

The tags in the example above (like <to> and <from>) are not defined in

any XML standard. These tags are "invented" by the author of the XML

document. 

For example, XML has offshoots (subclasses) like XBRL (eXtensible Business Reporting Language). This type of XML acts as a standard for naming the accounts of a business; it helps businesses copy, transfer, or communicate data with their counterparts, such as suppliers.


With XML, being a standard of data exchange, data can be available to all

kinds of "reading machines" like people, computers, voice machines, news feeds,

etc., thus making exchange much simpler. 

 XML stands for eXtensible Markup Language 

 XML is a markup language much like HTML 

 XML was designed to store and transport data

 XML was designed to be self-descriptive

XQUERY USE TO QUERY XML DOCUMENTS

 What is XQuery?

XQuery is a technology from the World Wide Web Consortium (W3C)

that's designed to query collections of XML data -- not just XML files, but

anything that can appear as XML, including relational databases. The word "query" entered English in the 16th century as a noun, from the Latin quaere 'ask, seek'. XQuery: Specifications, Articles, Mailing List, and Vendors. (n.d.)

XQuery can be used to:

 Extract information to use in a Web Service 

 Generate summary reports 

 Transform XML data to XHTML 

 Search Web documents for relevant information


XQuery is compatible with several W3C standards, such as XML,

Namespaces, XSLT, XPath, and XML Schema. 

CHAPTER 8:


DATA WAREHOUSING

Researched and presented by:

Corto, Michelle T.
Manalo, Cklint Louisse M.
Romero, Agnes M.

DATA WAREHOUSING

A data warehouse is a database designed to enable business intelligence

activities: it exists to help users understand and enhance their organization's


performance. It is designed for query and analysis rather than for transaction

processing, and usually contains historical data derived from transaction data,

but can include data from other sources. Data warehouses separate analysis

workload from transaction workload and enable an organization to consolidate

data from several sources. This helps in:

 Maintaining historical records and Analyzing the data to gain a better

understanding of the business and to improve the business

A data warehouse is a relational database management system (RDBMS) constructed to meet the requirements of decision support rather than transaction processing. It can be loosely described as any centralized data repository which can be queried for business benefit. It is a database that stores information oriented to satisfying decision-making requests, and a group of decision support technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions. Data warehousing thus supports architectures and tools for business executives to systematically organize, understand, and use their information to make strategic decisions.

In addition to a relational database, a data warehouse environment can

include an extraction, transportation, transformation, and loading (ETL) solution,

statistical analysis, reporting, data mining capabilities, client analysis tools, and

other applications that manage the process of gathering data, transforming it into

useful, actionable information, and delivering it to business users.


To achieve the goal of enhanced business intelligence, the data

warehouse works with data collected from multiple sources. The source data

may come from internally developed systems, purchased applications, third-party

data syndicators and other sources. It may involve transactions, production,

marketing, human resources and more. In today's world of big data, the data may

be many billions of individual clicks on web sites or the massive data streams

from sensors built into complex machinery.

A data warehouse usually stores many months or years of data to support

historical analysis. The data in a data warehouse is typically loaded through an

extraction, transformation, and loading (ETL) process from multiple data sources.

Modern data warehouses are moving toward an extract, load, transform

(ELT) architecture in which all or most data transformation is performed on the

database that hosts the data warehouse. It is important to note that defining the

ETL process is a very large part of the design effort of a data warehouse.

Similarly, the speed and reliability of ETL operations are the foundation of the

data warehouse once it is up and running.

Users of the data warehouse perform data analyses that are often time-

related. Examples include consolidation of last year's sales figures, inventory

analysis, and profit by product and by customer. But time-focused or not, users

want to "slice and dice" their data however they see fit and a well-designed data

warehouse will be flexible enough to meet those demands. Users will sometimes


need highly aggregated data, and other times they will need to drill down to

details. More sophisticated analyses include trend analyses and data mining,

which use existing data to forecast trends or predict futures. The data warehouse

acts as the underlying engine used by middleware business intelligence

environments that serve reports, dashboards and other interfaces to end users.

BASIC CONCEPTS OF DATA WAREHOUSING 

A data warehouse is a subject-oriented, integrated, time-variant, non-

volatile collection of data used in support of management decision-making

processes and business intelligence (Inmon and Hackathorn, 1994). The

meaning of each of the key terms in this definition follows: 

 Subject-Oriented 

A data warehouse is subject oriented as it offers information regarding a

theme instead of companies’ ongoing operations. These subjects can be sales,

marketing, distributions, etc. 

A data warehouse never focuses on the ongoing operations. Instead, it puts

emphasis on modeling and analysis of data for decision making. It also provides

a simple and concise view around the specific subject by excluding data which is

not helpful to support the decision process.

 Integrated 

In Data Warehouse, integration means the establishment of a common

unit of measure for all similar data from dissimilar databases. The data also


needs to be stored in the Data Warehouse in a common and universally

acceptable manner. 

A data warehouse is developed by integrating data from varied sources

like a mainframe, relational databases, flat files, etc. Moreover, it must keep

consistent naming conventions, format, and coding. 

This integration helps in effective analysis of data. Consistency in naming

conventions, attribute measures, encoding structure etc. have to be ensured.

Consider the following example of three different applications, labeled A, B, and C. The information stored in these applications includes Gender, Date, and Balance; however, each application stores its data in a different way.

 In Application A, the gender field stores logical values like M or F.

 In Application B, the gender field is stored as a numerical value.

 In Application C, the gender field is stored in the form of a character value.

 The same is true of the date and balance fields.



However, after the transformation and cleaning process, all of this data is stored in a common format in the data warehouse.
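
As a small sketch of what that transformation step might look like (hypothetical SQL, with made-up table and column names, not from the text), the load routine for Application A could standardize its gender codes while copying rows into the warehouse:

    -- Hypothetical ETL step: convert Application A's M/F codes into one common format
    INSERT INTO dw_customer (customer_id, gender)
    SELECT customer_id,
           CASE gender_code
                WHEN 'M' THEN 'Male'
                WHEN 'F' THEN 'Female'
                ELSE 'Unknown'
           END
    FROM   application_a_customer;

Similar statements, each with its own CASE logic, would map the numeric codes of Application B and the character codes of Application C to the same common values.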

 Time-Variant 

The time horizon for data warehouses is quite extensive compared with

operational systems. The data collected in a data warehouse is recognized with a

particular period and offers information from the historical point of view. It

contains an element of time, explicitly or implicitly. 

One place where data warehouse data display time variance is in the structure of the record key. Every primary key contained within the DW should have, either implicitly or explicitly, an element of time, such as the day, week, or month.

Another aspect of time variance is that once data is inserted in the warehouse, it cannot be updated or changed. All of the historical data, along with the recent data in the data warehouse, play a crucial role in retrieving data for any period of time. If the business wants reports, graphs, and so on for comparison with previous years and for trend analysis, then all of the older data (6 months old, 1 year old, or even older) are required.

 Non-volatile 

The data residing in the data warehouse is permanent, as the name implies. Non-volatile also means that the data in the data warehouse cannot be erased or deleted when new data is inserted into it. In the data warehouse, data is read-only and can only be refreshed at a particular interval of time. Operations such as delete, update, and insert that are done over data in a software application are absent in the data warehouse environment. There are only two types of data operations that can be done in the data warehouse:

 Data Loading 

 Data Access

A data warehouse is not just a consolidation of all the operational

databases in an organization. Because of its focus on business intelligence,

external data, and time-variant data, a data warehouse is a unique kind of

database. Most data warehouses are relational databases designed in a way

optimized for decision support, not operational data processing.

Data warehousing is the process whereby organizations create and

maintain data warehouses and extract meaning from and help inform decision

making through the use of data in the data warehouses. Successful data

warehousing requires following proven data warehousing practices, sound


project management, strong organizational commitment, as well as making the

right technology decisions.

The process of creating data warehouses to store a large amount of data

is named Data Warehousing. Data Warehousing helps to improve the speed and

efficiency of accessing different data sets and makes it easier for company

decision-makers to obtain insights that will help the business and promote

marketing tactics that set them apart from their competitors. We can say that it is a blend of technologies and components which aids the strategic use of data and information. The main goal of data warehousing is to create a consolidated store of historical data that can be retrieved and analyzed to supply helpful insight into

the organization’s operations.

Types of Data Warehousing

There are mainly three types of data warehousing, which are as follows: 

 Enterprise Data Warehouse: Enterprise data warehouse is a centralized

warehouse that offers decision-making support to different departments

across an enterprise. It provides a unified approach for organizing as well as

representing data. With this warehouse at your end, you gain the ability to

classify the data as per the subject and grant the level of access to different

departments accordingly. 

 Operational Data Store: Popularly known as ODS, Operational Data

Store is used when an organization’s reporting needs are not satisfied by a

data warehouse or an OLTP system. In ODS, a data warehouse can be


refreshed in real-time, making it best for routine activities like storing

employees’ records. 

 Data Mart: As part of a data warehouse, Data Mart is particularly

designed for a specific business line like finance, accounts, sales, purchases,

or inventory. The warehouse allows you to collect data directly from the

sources.

HISTORY OF DATA WAREHOUSING

The key discovery that triggered the development of data warehousing

was the recognition of the fundamental differences between operational systems

(sometimes called systems of record because their role is to keep the official,

legal record of the organization) and informational systems. The need to

warehouse data evolved as computer systems became more complex and

needed to handle increasing amounts of information.

Here are some key events in evolution of Data Warehouse- 

 1960- Dartmouth and General Mills in a joint research project, develop the

terms dimensions and facts. 

 1970- AC Nielsen and IRI introduced dimensional data marts for retail sales.

 1983- Teradata Corporation introduces a database management system which is specifically designed for decision support.

 Data warehousing started in the late 1980s when IBM workers Paul Murphy and Barry Devlin developed the Business Data Warehouse.



 1988- Devlin and Murphy published the first article describing the

architecture of a data warehouse.

 1992- Inmon published the first book describing data warehousing, and he

has subsequently become one of the most prolific authors in this field.

 However, the real concept was given by Bill Inmon, who is considered the father of the data warehouse. He wrote about a variety of topics on building, using, and maintaining the warehouse and the Corporate Information Factory.

In essence, the data warehousing idea was planned to support an

architectural model for the flow of information from the operational system to

decisional support environments. The concept attempts to address the various

problems associated with the flow, mainly the high costs associated with it.

In the absence of data warehousing architecture, a vast amount of space

was required to support multiple decision support environments. In large

corporations, it was common for various decision support environments to

operate independently.

THE NEED FOR DATA WAREHOUSING 

Data Warehousing is an increasingly essential tool for business intelligence. It allows organizations to make quality business decisions. The data warehouse provides benefits by improving data analytics; it also helps the organization gain considerable revenue and the strength to compete more strategically in the market. By efficiently providing systematic, contextual data to an organization's business intelligence tools, the data warehouse can help uncover more practical business strategies.

Two major factors drive the need for data warehousing in most organizations

today: 

1. A business requires an integrated, company-wide view of high-quality

information. 

2. The information systems department must separate informational from

operational systems to improve performance dramatically in managing company

data.

Need for a Company-Wide View 

Data in operational systems are typically fragmented and inconsistent, so-

called silos, or islands, of data. They are also generally distributed on a variety of

incompatible hardware and software platforms. For example, one source of

customer data may be located on a UNIX-based server running an Oracle

DBMS, whereas another may be located on a SAP system. Yet, for decision-

making purposes, it is often necessary to provide a single, corporate view of that

information. 

To understand the difficulty of deriving a single corporate view, look at the

simple example shown in Figure 1. This figure shows three tables from three

separate systems of record, each containing similar student data. The STUDENT

DATA table is from the class registration system, the STUDENT EMPLOYEE

table is from the personnel system, and the STUDENT HEALTH table is from a

health center system. Each table contains some unique data concerning


students, but even common data (e.g., student names) are stored using different

formats.

Figure 1. Examples of heterogeneous data

STUDENT DATA
StudentNo LastName MI FirstName Telephone Status …
123-45-6789 Enright T Mark 483-1967 Soph

389-21-4062 Smith R Elaine 283-4195 Jr

STUDENT EMPLOYEE
StudentID Address Dept Hours …
123-45-6789 1218 Elk Drive, Phoenix, AZ 91304 Soc 8

389-21-4062 134 Mesa Road, Tempe, AZ 90142 Math 10

STUDENT HEALTH
StudentName Telephone Insurance ID …
Mark T. Enright 483-1967 Blue Cross 123-45-6789

Elaine R. Smith 555-7828 ? 389-21-4062

Suppose you want to develop a profile for each student, consolidating all

data into a single file format. Some of the issues that you must resolve are as

follows: 

 Inconsistent key structures - The primary key of the first two tables is

some version of the student Social Security number, whereas the primary key

of STUDENT HEALTH is StudentName. 

 Synonym - In STUDENT DATA, the primary key is named StudentNo,

whereas in STUDENT EMPLOYEE it is named StudentID.


 Free-form fields versus structured fields - In STUDENT HEALTH,

StudentName is a single field. In STUDENT DATA, StudentName (a

composite attribute) is broken into its component parts: LastName, MI, and

FirstName. 

 Inconsistent data values - Elaine Smith has one telephone number in

STUDENT DATA but a different number in STUDENT HEALTH. 

 Missing data - The value for Insurance is missing (or null) for Elaine

Smith in the STUDENT HEALTH table.
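
A rough sketch of how such a consolidation could begin is shown below (hypothetical SQL; the table and column names follow Figure 1, but the real effort would also need cleansing rules for the name formats, the synonym keys, and the conflicting telephone numbers):

    -- Hypothetical consolidation of the three systems of record into one student profile
    SELECT d.StudentNo,
           d.LastName,
           d.FirstName,
           d.MI,
           d.Telephone AS RegistrationPhone,
           e.Dept,
           e.Hours,
           h.Insurance
    FROM   STUDENT_DATA d
           LEFT JOIN STUDENT_EMPLOYEE e ON e.StudentID = d.StudentNo
           LEFT JOIN STUDENT_HEALTH   h ON h.ID        = d.StudentNo;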

This simple example illustrates the nature of the problem of developing a

single corporate view but fails to capture the complexity of that task. A real-life

scenario would likely have dozens (if not hundreds) of tables and thousands (or

millions) of records. 

Why do organizations need to bring data together from various systems of

record? Ultimately, of course, the reason is to be more profitable, to be more

competitive, or to grow by adding value for customers. This can be accomplished

by increasing the speed and flexibility of decision making, improving business

processes, or gaining a clearer understanding of customer behavior. For the

previous student example, university administrators may want to investigate if the

health or number of hours students work on campus is related to student

academic performance; if taking certain courses is related to the health of

students; or whether poor academic performers cost more to support, for

example, due to increased health care as well as other costs. In general, certain


trends in organizations encourage the need for data warehousing; these trends

include the following:

 No single system of record 

Almost no organization has only one database. Because of the

heterogeneous needs for data in different operational settings, because of

corporate mergers and acquisitions, and because of the sheer size of many

organizations, multiple operational databases exist. 

 Multiple systems are not synchronized 

It is difficult, if not impossible, to make separate databases consistent. Even if

the metadata are controlled and made the same by one data administrator,

the data values for the same attributes will not agree. This is because of

different update cycles and separate places where the same data are

captured for each system. Thus, to get one view of the organization, the data

from the separate systems must be periodically consolidated and

synchronized into one additional database. We will see that there can be

actually two such consolidated databases—an operational data store and an

enterprise data warehouse. 

 Organizations want to analyze the activities in a balanced way 

Many organizations have implemented some form of a balanced scorecard—

metrics that show organization results in financial, human, customer

satisfaction, product quality, and other terms simultaneously. To ensure that

this multidimensional view of the organization shows consistent results, a

data warehouse is necessary. When questions arise in the balanced


scorecard, analytical software working with the data warehouse can be used

to “drill down,” “slice and dice,” visualize, and in other ways mine business

intelligence. 

 Customer relationship management 

Organizations in all sectors are realizing that there is value in having a total

picture of their interactions with customers across all touch points. Different

touch points (e.g., for a bank, these touch points include ATMs, online

banking, tellers, electronic funds transfers, investment portfolio management,

and loans) are supported by separate operational systems. Thus, without a

data warehouse, a teller may not know to try to cross-sell a customer one of

the bank’s mutual funds if a large, atypical automatic deposit transaction

appears on the teller’s screen. Having a total picture of the activity with a

given customer requires a consolidation of data from various operational

systems. 

 Supplier relationship management 

Managing the supply chain has become a critical element in reducing costs

and raising product quality for many organizations. Organizations want to

create strategic supplier partnerships based on a total picture of their

activities with suppliers, from billing, to meeting delivery dates, to quality

control, to pricing, to support. Data about these different activities can be

locked inside separate operational systems (e.g., accounts payable, shipping

and receiving, production scheduling, and maintenance). ERP systems have

improved this situation by bringing many of these data into one database.


However, ERP systems tend to be designed to optimize operational, not

informational or analytical, processing.

Need to Separate Operational and Informational Systems 

An operational system is a system that is used to run a business in real

time, based on current data. Examples of operational systems are sales order

processing, reservation systems, and patient registration systems. Operational

systems must process large volumes of relatively simple read/write transactions

and provide fast response. Operational systems are also called systems of

record.

Table 1. Comparison of Operational and Informational Systems

Characteristic | Operational Systems | Informational Systems
Primary purpose | Run the business on a current basis | Support managerial decision making
Type of data | Current representation of the state of the business | Historical point-in-time (snapshots) and predictions
Primary users | Clerks, salespersons, administrators | Managers, business analysts, customers
Scope of usage | Narrow, planned, and simple updates and queries | Broad, ad hoc, complex queries and analysis
Design goal | Performance: throughput, availability | Ease of flexible access and use
Volume | Many constant updates and queries on one or a few table rows | Periodic batch updates and queries requiring many or all rows


Informational systems are designed to support decision making based on

historical point-in-time and prediction data. They are also designed for complex

queries or data-mining applications. Examples of informational systems are

systems for sales trend analysis, customer segmentation, and human resources

planning. 

The key differences between operational and informational systems are

shown in Table 1. These two types of processing have very different

characteristics in nearly every category of comparison. In particular, notice that

they have quite different communities of users. Operational systems are used by

clerks, administrators, salespersons, and others who must process business

transactions. Informational systems are used by managers, executives, business

analysts, and (increasingly) by customers who are searching for status

information or who are decision makers. The need to separate operational and

informational systems is based on three primary factors: 

1. A data warehouse centralizes data that are scattered throughout disparate

operational systems and makes them readily available for decision support

applications. 

2. A properly designed data warehouse adds value to data by improving their

quality and consistency. 

3. A separate data warehouse eliminates much of the contention for resources

that results when informational applications are confounded with operational

processing.

Data Warehousing Architectures


A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

The architecture for data warehouses has evolved, and organizations have considerable latitude in creating variations. The first is a three-level architecture that characterizes a bottom-up, incremental approach to evolving the data warehouse; the second is also a three-level data architecture that usually results from a more top-down approach emphasizing more coordination and an enterprise-wide perspective. Even with their differences, there are many common characteristics to these approaches.

Data Warehouse applications are designed to support users' ad hoc data requirements, an activity recently dubbed online analytical processing

(OLAP). These include applications such as forecasting, profiling, summary


reporting, and trend analysis. Data warehouses and their architectures vary

depending upon the elements of an organization's situation.

Independent Data Mart Data Warehousing Environment

The independent data mart architecture for a data warehouse is shown in the

figure below. Building this architecture requires four basic steps (moving left to

right in the figure below):

1. Data are extracted from the various internal and external source system

files and databases. In a large organization, there may be dozens or even

hundreds of such files and databases.

2. The data from the various source systems are transformed and integrated

before being loaded into the data marts. Transactions may be sent to the

source systems to correct errors discovered in data staging. The data

warehouse is considered to be the collection of data marts.

3. The data warehouse is a set of physically distinct databases organized

for decision support. It contains both detailed and summary data.

4. Users access the data warehouse by means of a variety of query

languages and analytical tools. Results (e.g., predictions, forecasts) may be

fed back to data warehouses and operational databases.

Extraction and loading happen periodically—sometimes daily, weekly, or

monthly. Thus, the data warehouse often does not have, nor does it need to

have, current data. Remember, the data warehouse is not (directly) supporting

operational transaction processing, although it may contain transactional data


(but more often summaries of transactions and snapshots of status variables,

such as account balances and inventory levels). For most data warehousing

applications, users are not looking for a reaction to an individual transaction but

rather for trends and patterns in the state of the organization across a large

subset of the data warehouse. At a minimum, five fiscal quarters of data are kept

in a data warehouse so that at least annual trends and patterns can be

discerned. Older data may be purged or archived. We will see later that one

advanced data warehousing architecture, real-time data warehousing, is based

on a different assumption about the need for current data.

Contrary to many of the principles discussed so far in this chapter, the

independent data marts approach does not create one data warehouse. Instead,

this approach creates many separate data marts, each based on data

warehousing, not transaction processing database technologies. A data mart is a

data warehouse that is limited in scope, customized for the decision-making

applications of a particular end-user group. Its contents are obtained either from

independent ETL processes, as shown in Figure 9-2 for an independent data

mart, or are derived from the data warehouse, which we will discuss in the next

two sections. A data mart is designed to optimize the performance forwell-

defined and predicable uses, sometimes as few as a single or a couple of

queries. For example, an organization may have a marketing data mart, a

finance data mart, a supply chain data mart, and so on, to support known

analytical processing. It is possible that each data mart is built using different

tools; for example, a financial data mart may be built using a proprietary


multidimensional tool like Hyperion’s Essbase, and a sales data mart may be

built on a more general-purpose data warehouse platform, such as Teradata,

using MicroStrategy and other tools for reporting, querying, and data

visualization.

Independent data marts are often created because an organization

focuses on a series of short-term, expedient business objectives. The limited

short-term objectives can be more compatible with the comparably lower cost

(money and organizational capital) to implement yet one more independent data

mart. However, designing the data warehousing environment around different

sets of short-term objectives means that you lose flexibility for the long term and

the ability to react to changing business conditions. And being able to react to

change is critical for decision support. It can be organizationally and politically

easier to have separate, small data warehouses than to get all organizational

parties to agree to one view of the organization in a central data warehouse.

Also, some data warehousing technologies have technical limitations for the size

of the data warehouse they can support—what we will call later a scalability

issue. Thus, technology, rather than the business, may dictate a data

warehousing architecture if you first lock yourself into a particular data

warehousing set of technologies before you understand your data warehousing

requirements. We discuss the pros and cons of the independent data mart

architecture compared with its prime competing architecture in the next section.


Dependent Data Mart and Operational Data Store Architecture: A Three-Level Approach

The independent data mart architecture described above has several important limitations:

1. A separate ETL process is developed for each data mart, which can yield costly redundant data and processing efforts.

2. Data marts may not be consistent with one another because they are

often developed with different technologies, and thus they may not provide a

clear enterprise wide view of data concerning important subjects such as

customers, suppliers, and products.

3. There is no capability to drill down into greater detail or into related facts in

other data marts or a shared data repository, so analysis is limited, or at best

very difficult (e.g., doing joins across separate platforms for different data

marts). Essentially, relating data across data marts is a task performed by

users outside the data warehouse. 

4. Scaling costs are excessive because every new application that creates a

separate data mart repeats all the extract and load steps. Usually, operational

systems have limited time windows for batch data extracting, so at some

point, the load on the operations systems may mean that new technology is

needed, with additional costs. 

5. If there is an attempt to make the separate data marts consistent, the cost to

do so is quite high.


One of the most popular approaches to addressing the independent data

mart limitations raised earlier is to use a three-level approach represented by the

dependent data mart and operational data store architecture. Here the new level

is the operational data store, and the data and metadata storage level is

reconfigured. The first and second limitations are addressed by loading the

dependent data marts from an enterprise data warehouse (EDW), which is a

central, integrated data warehouse that is the control point and single “version of

the truth” made available to end users for decision support applications.

Dependent data marts still have a purpose to provide a simplified and high-

performance environment that is tuned to the decision-making needs of user

groups. A data mart may be a separate physical database (and different data

marts may be on different platforms) or can be a logical (user view) data mart

instantiated on the fly when accessed.

A user group can access its data mart, and then when other data are

needed, users can access the EDW. Redundancy across dependent data marts

is planned, and redundant data are consistent because each data mart is loaded

in a synchronized way from one common source of data (or is a view of the data

warehouse). Integration of data is the responsibility of the IT staff managing the

enterprise data warehouse; it is not the end users’ responsibility to integrate data

across independent data marts for each query or application. The dependent

data mart and operational data store architecture is often called a “hub and

spoke” approach, in which the EDW is the hub and the source data systems and

the data marts are at the ends of input and output spokes.


The third limitation is addressed by providing an integrated source for all

the operational data in an operational data store. An operational data store

(ODS) is an integrated, subject-oriented, continuously update-able, current-

valued (with recent history), organization-wide, detailed database designed to

serve operational users as they do decision support processing (Imhoff, 1998;

Inmon, 1998). An ODS is typically a relational database and normalized like

databases in the systems of record, but it is tuned for decision-making

applications.

An ODS typically does not contain “deep” history, whereas an EDW holds

typically a multiyear history of snapshots of the state of the organization. An ODS

may be fed from the database of an ERP application, but because most

organizations do not have only one ERP database and do not run all operations

against one ERP, an ODS is usually different from an ERP database. The ODS

also serves as the staging area for loading data into the EDW. The ODS may


receive data immediately or with some delay from the systems of record,

whichever is practical and acceptable for the decision-making requirements that

it supports.

Different leaders in the field endorse different approaches to data

warehousing. Those that endorse the independent data mart approach argue

that this approach has two significant benefits: 

1. It allows for the concept of a data warehouse to be demonstrated by

working on a series of small projects. 

2. The length of time until there is some benefit from data warehousing is

reduced because the organization is not delayed until all data are centralized.

Logical Data Mart and Real-Time Data Warehouse Architecture

The logical data mart and real-time data warehouse architecture is

practical for only moderate-sized data warehouses or when using high-

performance data warehousing technology, such as the Teradata system.

1. Logical data marts are not physically separate databases but rather

different relational views of one physical, slightly denormalized relational data

warehouse.


2. Data are moved into the data warehouse rather than to a separate staging

area to utilize the high-performance computing power of the warehouse

technology to perform the cleansing and transformation steps.

3. New data marts can be created quickly because no physical database or database technology needs to be created or acquired and no loading routines need to be written (see the sketch after this list).

4. Data marts are always up to date because data in a view are created when the view is referenced; views can be materialized if a user has a series of queries and analyses that need to work off the same instantiation of the data mart.
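
A minimal sketch of this idea, assuming a hypothetical sales_fact table and store_dimension table already exist in the physical warehouse, is simply a relational view (optionally materialized) defined over the warehouse tables:

    -- Hypothetical logical data mart for the sales department, defined as a view
    CREATE VIEW sales_mart AS
    SELECT st.city,
           f.period_key,
           SUM(f.units_sold)   AS total_units,
           SUM(f.dollars_sold) AS total_dollars
    FROM   sales_fact f
           JOIN store_dimension st ON st.store_key = f.store_key
    GROUP BY st.city, f.period_key;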

Whether logical or physical, data marts and data warehouses play

different roles in a data warehousing environment. Although limited in scope, a

data mart may not be small. Thus, scalable technology is often critical. A

significant burden and cost is placed on users when they themselves need to

integrate the data across separate physical data marts (if this is even possible).

As data marts are added, a data warehouse can be built in phases; the easiest


way for this to happen is to follow the logical data mart and real-time data

warehouse architecture.

The real-time data warehouse aspect of the architecture means that the

source data systems, decision support services, and the data warehouse

exchange data and business rules at a near-real-time pace because there is a

need for rapid response (i.e., action) to a current, comprehensive picture of the

organization. The purpose of real-time data warehousing is to know what is

happening, when it is happening, and to make desirable things happen through

the operational systems. For example, a help desk professional answering

questions and logging problem tickets will have a total picture of the customer’s

most recent sales contacts, billing and payment transactions, maintenance

activities, and orders. With this information, the system supporting the help desk

can, based on operational decision rules created from a continuous analysis of

up-to-date warehouse data, automatically generate a script for the professional to

sell what the analysis has shown to be a likely and profitable maintenance

contract, an upgraded product, or another product bought by customers with a

similar profile. A critical event, such as entry of a new product order, can be

considered immediately so that the organization knows at least as much about

the relationship with its

customer as does the

customer.


In addition to the information given above, here are three common architectures in data warehousing:

Data Warehouse Architecture: Basic

Data Warehouse Architecture: With Staging Area

Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System

In data warehousing, an operational system refers to a system that is used to process the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and

every file in the system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in the data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance.

The summarized record is updated continuously as new information is loaded

into the warehouse.

End-User access Tools

The principal purpose of a data warehouse is to provide information to the

business managers for strategic decision-making. These customers interact with

the warehouse using end-client access tools.

The examples of some of the end-user access tools can be:

Reporting and Query Tools

Application Development Tools

Executive Information Systems Tools

Online Analytical Processing Tools

Data Mining Tools

Data Warehouse Architecture: With Staging Area


We must clean and process operational information before putting it into the warehouse.

We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation of operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.

The data warehouse staging area is a temporary location where records from the source systems are copied.


Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups

within our organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are

separated. In this example, a financial analyst wants to analyze historical data for

purchases and sales or mine historical information to make predictions about

customer behavior.


Properties of Data Warehouse Architectures

The following architecture properties are necessary for a data warehouse

system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.

2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume, which has to be managed and processed, and the number of users' requirements, which have to be met, progressively increase.

3. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning the whole system.


4. Security: Monitoring access is necessary because of the strategic data stored in the data warehouse.

5. Administerability: Data Warehouse management should not be complicated.

Characteristics of data warehouse data

To understand and model the data in each of the three layers of the data architecture for a data warehouse, you need to learn some basic characteristics of data as they are stored in data warehouse databases.

Status Versus Event Data

The difference between status data and event data is shown in the figure. The figure shows a typical log entry recorded by a DBMS when processing a business transaction for a banking application. This log entry contains both status and event data: the "before image" and "after image" represent the status of the bank account before and then after a withdrawal. Data representing the withdrawal (or update event) are shown in the middle of the figure.

Transactions are business activities that cause one or more business

events to occur at a database level. An event results in one or more database

actions (create, update, or delete). The withdrawal transaction in the above figure


leads to a single update, which is the reduction in the account balance from 750

to 700. On the other hand, the transfer of money from one account to another

would lead to two actions: two updates to handle a withdrawal and a deposit.

Sometimes non-transactions, such as an abandoned online shopping cart, busy

signal or dropped network connection, or an item put in a shopping cart and then

taken out before checkout, can also be important activities that need to be

recorded in the data warehouse.
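
As a tiny sketch of the withdrawal event described above (hypothetical table and column names), the transaction results in a single database action against the status data, while the DBMS log preserves the before and after images:

    -- Hypothetical withdrawal: one update event that changes the account's status data
    UPDATE account
    SET    balance = balance - 50
    WHERE  account_no = 'A-123';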

Both status data and event data can be stored in a database. However, in

practice, most of the data stored in databases (including data warehouses) are

status data. A data warehouse likely contains a history of snapshots of status

data or a summary (say, an hourly total) of transaction or event data. Event data,

which represent transactions, may be stored for a defined period but are then

deleted or archived to save storage space. Both status and event data are

typically stored in database logs (as represented in the figure) for backup and

recovery purposes.

Transient Versus Periodic Data 

In data warehouses, it is typical to maintain a record of when events

occurred in the past. This is necessary, for example, to compare sales or

inventory levels on a particular date or during a particular period with the

previous year’s sales on the same date or during the same period. Most

operational systems are based on the use of transient data. Transient data are

data in which changes to existing records are written over previous records, thus


destroying the previous data content. Records are deleted without preserving the

previous contents of those records. You can easily visualize transient data by

again referring to Figure 9-6. If the after image is written over the before image,

the before image (containing the previous balance) is lost. However, because

this is a database log, both images are normally preserved. Periodic data are

data that are never physically altered or deleted once added to the store. The

before and after images in Figure 9-6 represent periodic data. Notice that each

record contains a time stamp that indicates the date (and time, if needed) when

the most recent update event occurred.
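
A brief sketch of the difference (hypothetical tables and columns): in a transient store the new value overwrites the old one, while in a periodic store every image is kept with its time stamp:

    -- Transient data: the previous balance is written over and lost
    UPDATE account
    SET    balance = 700
    WHERE  account_no = 'A-123';

    -- Periodic data: each image is retained, stamped with the time of the change
    INSERT INTO account_history (account_no, balance, change_timestamp)
    VALUES ('A-123', 700, CURRENT_TIMESTAMP);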

OTHER DATA WAREHOUSE CHANGES

Besides the periodic changes to data values outlined previously, six other kinds

of changes to a warehouse data model must be accommodated by data

warehousing:

1. New descriptive attributes - For example, new characteristics of products or

customers that are important to store in the warehouse must be

accommodated. Later in the chapter we call these attributes of dimension

tables. This change is fairly easily accommodated by adding columns to

tables and allowing null values for existing rows (if historical data exist in

source systems, null values do not have to be stored).

2. New business activity attributes - For example, new characteristics of an

event already stored in the warehouse, such as a column C for the table in

Figure 9-8, must be accommodated. This can be handled as in item 1, but


is more difficult when the new facts are more refined, such as data

associated with days of the week, not just month and year.

3. New classes of descriptive attributes - This is equivalent to adding new

tables to the database.

4. Descriptive attributes become more refined - For example, data about

stores must be broken down by individual cash registers to understand

sales data. This change is in the grain of the data, an extremely important

topic, which we discuss later in the chapter. This can be a very difficult

change to accommodate.

5. Descriptive data are related to one another - For example, store data are

related to geography data. This causes new relationships, often hierarchical, to

be included in the data model.

6. New source of data - This is a very common change, in which some new

business need causes data feeds from an additional source system or

some new operational system is installed that must feed the warehouse.

This change can cause almost any of the previously mentioned changes,

as well as the need for new extract, transform, and load processes.

It is usually not possible to go back and reload a data warehouse to

accommodate all of these kinds of changes for the whole data history

maintained. But it is critical to accommodate such changes smoothly to enable

the data warehouse to meet new business conditions and information and


business intelligence needs. Thus, designing the warehouse for change is very

important.

In addition, according to the Oracle Help Center, these are the key characteristics of a

data warehouse:

 Some data is denormalized for simplification and to improve performance

 Large amounts of historical data are used

 Queries often retrieve large amounts of data

 Both planned and ad hoc queries are common

 The data load is controlled

In general, fast query performance with high data throughput is the key to a

successful data warehouse.

A data warehouse can be controlled when the user has a shared way of explaining the trends that are introduced as a specific subject. Below are the major characteristics of a data warehouse:

1. Subject-oriented –

A data warehouse is always subject oriented, as it delivers information about a theme instead of the organization's current operations. It is built around a specific theme, which means the data warehousing process is intended to handle a specific, well-defined theme. These themes can be sales, distribution, marketing, etc.

A data warehouse never puts emphasis only on current operations. Instead, it focuses on the modeling and analysis of data to make various decisions. It also delivers an easy and precise view of the particular theme by eliminating data which is not required to make the decisions.

2. Integrated –

It is somewhat the same as subject orientation, in that the data is kept in a reliable, consistent format. Integration means establishing a shared unit of measure for all similar data from the different databases. The data also needs to reside in the data warehouse in a shared and universally accepted manner.

A data warehouse is built by integrating data from various sources, such as a mainframe and a relational database. In addition, it must have reliable naming conventions, formats, and codes. Integration of the data warehouse benefits the effective analysis of data. Reliability in naming conventions, column scaling, encoding structure, etc. should be confirmed. An integrated data warehouse handles the various related subject areas.

3. Time-Variant –

Data is maintained over different intervals of time, such as weekly, monthly, or annually. It establishes various time limits which are structured between the large datasets and are held in the online transaction process (OLTP). The time horizon for the data warehouse is wider-ranging than that of operational systems. The data residing in the data warehouse is identified with a specific interval of time and delivers information from the historical perspective. It comprises elements of time, explicitly or implicitly. Another feature of time variance is that once data is stored in the data warehouse, it cannot be modified, altered, or updated.

4. Non-Volatile –


As the name implies, the data residing in the data warehouse is permanent. It also means that data is not erased or deleted when new data is inserted. The warehouse holds a mammoth quantity of data accumulated from the loads selected by the business.

In the data warehouse, data is read-only and refreshed at particular intervals. This is beneficial in analysing historical data and in comprehending how the business functions. It does not need transaction processing, recovery, or concurrency control mechanisms. Functionalities such as delete, update, and insert that are done in an operational application are absent in the data warehouse environment. Two types of data operations done in the data warehouse are:

 Data Loading

 Data Access

The Derived Data Layer

Derived data is generated from existing data using a mathematical

operation or a data transformation. OLAP Services uses SQL ROLLUP to

generate aggregate data in the data warehouse. Dimension tables, also

called lookup tables, are used to store the dimension members for all levels in

the hierarchy. This is the data layer associated with logical or physical data

marts. It is the layer with which users normally interact for their decision support

applications. Ideally, the reconciled data level is designed first and is the basis for

the derived layer, whether data marts are dependent, independent, or logical. In


order to derive any data mart we might need, it is necessary that the EDW

(Enterprise Data Warehouse) be a fully normalized relational database

accommodating transient and periodic data; this gives us the greatest flexibility to

combine data into the simplest form for all user needs, even those that are

unanticipated when the EDW is designed.

Derived data is generated from existing data using a mathematical

operation or a data transformation. It can be created as part of a database

maintenance operation or generated at run-time in response to a query.
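
For instance, a sketch of how aggregate (derived) data might be produced with SQL ROLLUP, assuming a hypothetical sales_fact table with store and month columns (names invented for illustration):

    -- Hypothetical aggregation: monthly totals per store, a subtotal per store,
    -- and a grand total, all produced in one pass by ROLLUP
    SELECT store_key,
           sales_month,
           SUM(dollars_sold) AS total_sales
    FROM   sales_fact
    GROUP BY ROLLUP (store_key, sales_month);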

The objectives that are sought with derived data are quite different from

the objectives of reconciled data. Typical objectives are the following: 

 Provide ease of use for decision support applications 

 Provide fast response for predefined user queries or requests for

information (information usually in the form of metrics used to gauge the

health of the organization in areas such as customer service, profitability,

process efficiency, or sales growth) 

 Customize data for particular target user groups 

 Support ad hoc queries and data mining and other analytical applications 

To satisfy these needs, we usually find the following characteristics in derived

data: 

 Both detailed data and aggregate data are present: 


a. Detailed data are often (but not always) periodic—that is, they provide a

historical record. 

b. Aggregate data are formatted to respond quickly to predetermined (or

common) queries.

 Data are distributed to separate data marts for different user groups. 

 The data model that is most commonly used for a data mart is a

dimensional model, usually in the form of a star schema, which is a relational-

like model (such models are used by relational online analytical processing

[ROLAP] tools). 

Star Schema

A star schema is a database organizational structure optimized for use in

a data warehouse or business intelligence that uses a single large fact table to

store transactional or measured data, and one or more smaller dimensional

tables that store attributes about the data. It is called a star schema because the

fact table sits at the center of the logical diagram, and the small dimensional

tables branch off to form the points of the star. 

A star schema is a simple database design (particularly suited to ad hoc

queries) in which dimensional

data (describing how data are

commonly aggregated for

reporting) are separated from

fact or event data (describing


business activity). A star schema is one version of a dimensional model (Kimball,

1996a).

A star schema consists of two types of tables: one fact table and one or

more dimension tables. Fact tables contain factual or quantitative data

(measurements that are numerical, continuously valued, and additive) about a

business, such as units sold, orders booked, and so on. Dimension tables hold

descriptive data (context) about the subjects of the business. The dimension

tables are usually the source of attributes used to qualify, categorize, or

summarize facts in queries, reports, or graphs; thus, dimension data are usually

textual and discrete (even if numeric). A data mart might contain several star

schemas with similar dimension tables but each with a different fact table. Typical

business dimensions (subjects) are Product, Customer, and Period.

Components of Star Schema

A Fact Table sits at the center of a star schema database, and each star

schema database only has a single fact table. The fact table contains the specific

measurable (or quantifiable) primary data to be analyzed, such as sales records, logged performance data, or financial data. It may be transactional -- in that rows are added as events happen -- or it may be a snapshot of historical data up to a point in time.

Dimension tables store supporting information to the fact table. Each star

schema database has at least one dimension table, but will often have many.

Each dimension table will relate to a column in the fact table with a dimension

value, and will store additional information about that value.
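
A minimal DDL sketch of a star schema along these lines (hypothetical column choices, loosely following the PRODUCT, PERIOD, STORE, and SALES example discussed next):

    -- Hypothetical dimension tables
    CREATE TABLE product_dimension (
        product_key   INTEGER PRIMARY KEY,   -- surrogate key
        product_name  VARCHAR(50),
        product_size  VARCHAR(10)
    );

    CREATE TABLE period_dimension (
        period_key    INTEGER PRIMARY KEY,
        year_no       INTEGER,
        quarter_no    INTEGER,
        month_no      INTEGER
    );

    CREATE TABLE store_dimension (
        store_key     INTEGER PRIMARY KEY,
        store_name    VARCHAR(50),
        city          VARCHAR(50)
    );

    -- Hypothetical fact table: one row per product, per period, per store
    CREATE TABLE sales_fact (
        product_key   INTEGER REFERENCES product_dimension (product_key),
        period_key    INTEGER REFERENCES period_dimension (period_key),
        store_key     INTEGER REFERENCES store_dimension (store_key),
        units_sold    INTEGER,
        dollars_sold  DECIMAL(12,2),
        dollars_cost  DECIMAL(12,2),
        PRIMARY KEY (product_key, period_key, store_key)
    );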

Star Schema Example

A star schema provides answers to a domain of business questions. For

example, consider the following questions: 

1. Which cities have the highest sales of large products? 

2. What are the average monthly sales for each store manager? 

3. In which stores are we losing money on which products? Does this vary by

quarter? 

A simple example of a star schema that could provide answers to such

questions is shown in Figure 9-10. This example has three dimension tables:

PRODUCT, PERIOD, and STORE, and one fact table, named SALES. The fact

table is used to record three business

facts: total units sold, total dollars sold, and

total dollars cost. These totals are

recorded for each day (the lowest level of

PERIOD) a product is sold in a store.

Could these three questions be

answered from a fully normalized data


model of transactional data? Sure, a fully normalized and detailed database is

the most flexible, able to support answering almost any question. However, more

tables and joins would be involved, data need to be aggregated in standard

ways, and data need to be sorted in an understandable sequence.
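To make this concrete, here is a minimal sketch of the SALES star schema built in SQLite from Python. It is an illustration only: the column lists are assumptions based on the discussion above rather than the exact design of Figure 9-10.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_key  INTEGER PRIMARY KEY,   -- surrogate key
    description  TEXT,
    size         TEXT
);
CREATE TABLE store (
    store_key    INTEGER PRIMARY KEY,
    city         TEXT,
    manager      TEXT
);
CREATE TABLE period (
    period_key   INTEGER PRIMARY KEY,
    year         INTEGER,
    quarter      INTEGER,
    month        INTEGER
);
-- Fact table: its primary key is the composite of the dimension surrogate keys.
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    dollars_cost REAL,
    PRIMARY KEY (product_key, period_key, store_key)
);
""")

# Question 1 ("Which cities have the highest sales?") becomes a join plus GROUP BY:
top_cities = conn.execute("""
    SELECT s.city, SUM(f.dollars_sold) AS total_sales
    FROM sales f JOIN store s ON f.store_key = s.store_key
    GROUP BY s.city
    ORDER BY total_sales DESC
""").fetchall()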

Star Schema Sample Data

Some sample data for this schema are shown in Figure 9-11. From the

fact table, we find (for example) the following facts for product number 110 during period 002:

1. Thirty units were sold in store S1. The total dollar sale was 1500, and total dollar cost was 1200.

2. Forty units were sold in store S3. The total dollar sale was 2000, and total dollar cost was 1200.

Additional detail concerning the dimensions for this example can be

obtained from the dimension tables. For example, in the PERIOD table, we find

that period 002 corresponds to year 2010, quarter 1, month 5. Try tracing the

other dimensions in a similar manner.

Surrogate Key

Surrogate keys are a widely used and accepted design standard in data warehouses. A surrogate key is a sequentially generated unique number attached to each record in a dimension table. It is used to join the fact and dimension tables and is necessary for handling changes in dimension table attributes.

Surrogate keys are typically meaningless integers used to connect the fact

to the dimension tables of a data warehouse.  There are various reasons why we

cannot simply reuse our existing natural or business keys.  Surrogate keys


essentially buffer the data warehouse from the operational environment by

making it immune to any operational changes.  They are used to relate the facts

in the fact table to the appropriate rows in the dimension tables, with the

business keys only occurring in the (much smaller) dimension tables to keep the

link with the identifiers in the operational systems.

 Business keys change, often slowly, over time, and we need to remember

old and new business key values for the same business object. As we will see

in a later section on slowly changing dimensions, a surrogate key allows us to

handle changing and unknown keys with ease.

 Using a surrogate key also allows us to keep track of different nonkey

attribute values for the same production key over time. Thus, if a product

package changes in size, we can associate the same product production key

with several surrogate keys, each for the different package sizes. 

 Surrogate keys are often simpler and shorter, especially when the

production key is a composite key. 

 Surrogate keys can be of the same length and format for all keys, no

matter what business dimensions are involved in the database, even dates. 

The primary key of each dimension table is its surrogate key. The primary

key of the fact table is the composite of all the surrogate keys for the related

dimension tables, and each of the composite key attributes is obviously a foreign

key to the associated dimension table.
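The sketch below illustrates the idea with a hypothetical product dimension: the surrogate key is the primary key, while the production (business) key is kept only as a descriptive attribute, so the same production key can appear on several rows as its nonkey attributes change.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key (meaningless integer)
    product_code TEXT,                               -- production/business key from the source system
    description  TEXT,
    package_size TEXT
)""")

# The same production key appears twice because the package size changed;
# facts recorded before the change keep pointing at the first surrogate key.
conn.execute("INSERT INTO product_dim (product_code, description, package_size) "
             "VALUES ('P-110', 'Paper towels', '2-pack')")
conn.execute("INSERT INTO product_dim (product_code, description, package_size) "
             "VALUES ('P-110', 'Paper towels', '3-pack')")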

Grain of the Fact Table


Fact tables provide the (usually) additive values that act as independent

variables by which dimensional attributes are analyzed. Fact tables are often

defined by their grain. The grain determines the fact table's compound primary key and defines the lowest level of detail that the fact table is divided into. 

The raw data of a star schema are kept in the fact table. All the data in a

fact table are determined by the same combination of composite key elements;

so, for example, if the most detailed data in a fact table are daily values, then all

measurement data must be daily in that fact table, and the lowest level of

characteristics for the period dimension must also be a day. Determining the

lowest level of detailed fact data stored is arguably the most important and

difficult data mart design step. The level of detail of this data is specified by the

intersection of all of the components of the primary key of the fact table. This

intersection of primary keys is called the grain of the fact table. Determining the

grain is critical and must be determined from business decision-making needs

(i.e., the questions to be answered from the data mart). There is always a way to

summarize fact data by aggregating using dimension attributes, but there is no

way in the data mart to understand business activity at a level of detail finer than

the fact table grain.

Duration of the Database


As in the case of the EDW or ODS, another important decision in the

design of a data mart is the amount of history to be kept; that is, the duration of

the database. The natural duration is about 13 months or 5 calendar quarters,

which is sufficient to see annual cycles in the data. Some businesses, such as

financial institutions, have a need for longer durations. Older data may be difficult

to source and cleanse if additional attributes are required from data sources.

Even if sources of old data are available, it may be most difficult to find old values

of dimension data, which are less likely than fact data to have been retained. Old

fact data without associated dimension data at the time of the fact may be

worthless.

Size of the Fact Table

As you would expect, the grain and duration of the fact table have a direct

impact on the size of that table. We can estimate the number of rows in the fact

table as follows: 

1. Estimate the number of possible values for each dimension associated

with the fact table (in other words, the number of possible values for each

foreign key in the fact table). 

2. Multiply the values obtained in the first step after making any necessary

adjustments.

Let’s apply this approach to the star schema shown in Figure 9-11.

Assume the following values for the dimensions: 

 Total number of stores = 1000 

 Total number of products = 10,000 


 Total number of periods = 24 (2 years’ worth of monthly data)

Although there are 10,000 total products, only a fraction of these products

are likely to record sales during a given month. Because item totals appear in the

fact table only for items that record sales during a given month, we need to adjust

this figure. Suppose that on average 50 percent (or 5000) items record sales

during a given month. Then an estimate of the number of rows in the fact table is

computed as follows: 

 Total rows = 1000 stores X 5000 active products X 24 months

= 120,000,000 rows (!)

Thus, in our relatively small example, the fact table that contains two

years’ worth of monthly totals can be expected to have well over 100 million

rows. This example clearly illustrates that the size of the fact table is many times

larger than the dimension tables. For example, the STORE table has 1000 rows,

the PRODUCT table 10,000 rows, and the PERIOD table 24 rows. If we know the

size of each field in the fact table, we can further estimate the size (in bytes) of

that table. The fact table (named SALES) in Figure 9-11 has six fields. If each of

these fields averages four bytes in length, we can estimate the total size of the

fact table as follows:

 Total size = 120,000,000 rows X 6 fields X 4 bytes/field 

= 2,880,000,000 bytes (or 2.88 gigabytes)

The size of the fact table depends on both the number of dimensions and

the grain of the fact table. Suppose that after using the database shown in Figure

9-11 for a short period of time, the marketing department requests that daily


totals be accumulated in the fact table. (This is a typical evolution of a data mart.)

With the grain of the table changed to daily item totals, the number of rows is

computed as follows: 

 Total rows = 1000 stores X 2000 active products X 720 days (2 years)

= 1,440,000,000 rows

In this calculation, we have assumed that 20 percent of all products record

sales on a given day. The database can now be expected to contain well over 1

billion rows. The database size is calculated as follows: 

 Total size = 1,440,000,000 rows X 6 fields X 4 bytes/field 

= 34,560,000,000 bytes (or 34.56 gigabytes)
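The arithmetic above is easy to script. The short Python sketch below simply repeats the same estimates so they can be re-run with different assumptions about stores, active products, and grain.

# Monthly grain: 50 percent of the 10,000 products record sales in a given month.
stores, active_products, months = 1000, 5000, 24
monthly_rows = stores * active_products * months      # 120,000,000 rows
monthly_bytes = monthly_rows * 6 * 4                  # 6 fields x 4 bytes per field
print(monthly_rows, monthly_bytes / 1e9, "GB")        # 2.88 GB

# Daily grain: 20 percent of products record sales on a given day, 720 days kept.
daily_rows = 1000 * 2000 * 720                        # 1,440,000,000 rows
print(daily_rows, daily_rows * 6 * 4 / 1e9, "GB")     # 34.56 GB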

Modeling Date and Time

Because data warehouses and data marts record facts about dimensions

over time, date and time (henceforth simply called date) is always a dimension

table, and a date surrogate key is always one of the components of the primary

key of any fact

table. Because a

user may want to

aggregate facts on

many different

aspects of date or

different kinds of

dates, a date


dimension may have many nonkey attributes. Also, because some

characteristics of dates are country or event specific (e.g., whether the date is a

holiday or there is some standard event on a given day, such as a festival or

football game), modeling the date dimension can be more complex than

illustrated so far. 

Modeling Dates

The figure above shows a typical design for the date dimension. As we have

seen before, a date surrogate key appears as part of the primary key of the fact

table and is the primary key of the date dimension table. The nonkey attributes of

the date dimension table include all of the characteristics of dates that users use

to categorize, summarize, and group facts that do not vary by country or event.
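As a rough sketch, a date dimension like this can be generated programmatically. The attribute list below is an assumption; real designs add fiscal periods, holiday flags, and other country- or event-specific columns.

from datetime import date, timedelta

def build_date_dimension(start, end):
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key":    int(d.strftime("%Y%m%d")),  # surrogate key, e.g. 20100105
            "full_date":   d.isoformat(),
            "year":        d.year,
            "quarter":     (d.month - 1) // 3 + 1,
            "month":       d.month,
            "day_of_week": d.strftime("%A"),
            "is_weekend":  d.weekday() >= 5,
        })
        d += timedelta(days=1)
    return rows

date_dim = build_date_dimension(date(2010, 1, 1), date(2011, 12, 31))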

VARIATIONS OF THE STAR SCHEMA

The simple star schema introduced earlier is adequate for many

applications. However, various extensions to this schema are often required to

cope with more complex modeling problems.

Multiple Fact Tables


Multiple-fact, multiple-grain queries in relational data sources occur when

a table containing dimensional

data is joined to multiple fact

tables on different key columns.

It is often desirable for

performance or other reasons to

define more than one fact table

in a given star schema. For

example, suppose that various users require different levels of aggregation (in

other words, a different table grain). Performance can be improved by defining a

different fact table for each level of aggregation. The obvious trade-off is that

storage requirements may increase dramatically with each new fact table. More

commonly, multiple fact tables are needed to store facts for different

combinations of dimensions, possibly for different user groups.

Conformed Dimension 

In data warehousing, a conformed dimension is a dimension that has the same

meaning to every fact with which it relates. Conformed dimensions allow facts

and measures to be categorized and described in the same way across multiple

facts and/or data marts, ensuring consistent reporting across the enterprise.

More formally, a conformed dimension is one or more dimension tables associated with two or more fact tables for which the dimension tables have the same business meaning and primary key with


each fact table. Conformed dimensions are dimensions that are shared by

multiple stars. They are used to compare the measures from each star schema.

Figure 9-13 illustrates a typical situation of multiple fact tables with two related

star schemas. In this example, there are two fact tables, one at the center of

each star: 

1. Sales—facts about the sale of a product to a customer in a store on a

date 

2. Receipts—facts about the receipt of a product from a vendor to a

warehouse on a date 

As is common, data about one or more business subjects (in this case, Product

and Date) need to be stored in dimension tables for each fact table, Sales and

Receipts. Two approaches have been adopted in this design to handle shared

dimension tables. In one case, because the description of the product is quite

different for sales and receipts, two separate product dimension tables have

been created. On the other hand, because users want the same descriptions of

dates, one date dimension table is used. In each case, we have created a

conformed dimension, meaning that the dimension means the same thing with

each fact table and, hence, uses the same surrogate primary keys. Even when

the two star schemas are stored in separate physical data marts, if dimensions

are conformed, there is a potential for asking questions across the data marts

(e.g., Do certain vendors recognize sales more quickly, and are they able to


supply replenishments with less lead time?). In general, conformed dimensions

allow users to do the following: 

 Share nonkey dimension data 

 Query across fact tables with consistency  

 Work on facts and business subjects for which all users have the same

meaning.

Factless Fact Table

A factless fact table is a fact table that does not have any measures.  It is

essentially an intersection of dimensions (it contains nothing but dimensional

keys). Factless fact tables are a simple collection of dimensional keys that define transactions or describe conditions for the time period of the fact. There are two types of factless fact tables: one captures an event, and one describes conditions.

The most common example of a factless fact table is student attendance in a

class. As you can see

from the dimensional

diagram below the

FACT_ATTENDANCE

is an amalgamation of

the DATE_KEY, the

STUDENT_KEY, and

the CLASS_KEY.


As you can see there is nothing we can measure about a student’s attendance at

a class. The student was there and the attendance was recorded or the student

was not there and no record is recorded. It is a fact, plain and simple. There is a

derivation of this fact where you can always load the full roster of individuals

registered for the class and add a flag stating the person was in attendance.
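A minimal sketch of that attendance example as a factless fact table: the table holds dimensional keys only, with no measure columns, so analysis amounts to counting rows.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_attendance (
    date_key    INTEGER,
    student_key INTEGER,
    class_key   INTEGER,
    PRIMARY KEY (date_key, student_key, class_key)   -- no measures at all
)""")

# Counting attendance is simply counting rows, for example per class:
per_class = conn.execute("""
    SELECT class_key, COUNT(*) AS attendances
    FROM fact_attendance
    GROUP BY class_key
""").fetchall()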

In conclusion, factless fact tables are important dimensional data structures used

to convey transactional information which contain no measures. These tables are

occasionally necessary for capturing important dimensional relationships which

are critical to meeting the defined business reporting requirements.

Normalizing Dimension Tables 

Fact tables are fully normalized because each fact depends on the whole

composite primary key and nothing but the composite key. However, dimension

tables may not be normalized. Most data warehouse experts find this acceptable

for a data mart optimized and simplified for a given user group, so that all the

dimension data are only one join away from associated facts. (Remember that

this can be done with logical data marts, so duplicate data do not need to be

stored.) Sometimes, as with any other relational database, the anomalies of a

denormalized dimension table cause add, update, and delete problems. In this

section, we address various situations in which it makes sense or is essential to

further normalize dimension tables.

Multivalued Dimensions

A multivalued dimension arises when the relationship between the dimension members and the facts is many-to-many, which means the dimension members are at a lower granularity than the facts. A fact table row should normally relate to exactly one row of each dimension, so we introduce a bridge table when we need to relate multiple dimension values to one fact record.

There are situations when your data needs to represent a many-to-many relationship such that your dimension members are at a lower grain than related

facts; aka multivalued dimension.  In these cases, a single fact record should

relate to multiple dimension values.  Here are a few examples from the Kimball

Group.

 Patients can have multiple diagnoses.

 Students can have multiple majors.

 Consumers can have multiple hobbies or interests.

 Commercial customers can have multiple industry classifications.

 Employees can have multiple skills or certifications.

 Products can have multiple optional features.


 Bank accounts can have multiple customers.

Multivalued dimension

There may be a need for facts to be qualified by a set of values for the same

business subject. For example, consider the hospital example in Figure 9-15. In

this situation, a particular hospital charge and payment for a patient on a date

(e.g., for all foreign keys in the Finances fact table) is associated with one or

more diagnoses. (We indicate this with a dashed M:N relationship line between

the Diagnosis and Finances tables.) We could pick the most important diagnosis

as a component key for the Finances table, but that would mean we lose


potentially important information about other diagnoses associated with a row.

Or, we could design the Finances table with a fixed number of diagnosis keys,

more than we think is ever possible to associate with one row of the Finances

table, but this would create null components of the primary key for many rows,

which violates a property of relational databases.

The best approach (the normalization approach) is to create a table for an

associative entity between Diagnosis and Finances, in this case the Diagnosis

group table. (Thus, the dashed relationship in the Figure is not needed.) In the

data warehouse database world, such an associative entity table is called a

“helper table,” and we will see more examples of helper tables as we progress

through subsequent sections. A helper table may have nonkey attributes (as can

any table for an associative entity); for example, the weight factor in the

Diagnosis group table of Figure above indicates the relative role each diagnosis

plays in each group, presumably normalized to a total of 100 percent for all the

diagnoses in a group. Also note that it is not possible for more than one Finances

row to be associated with the same Diagnosis group key; thus, the Diagnosis

group key is really a surrogate for the composite primary key of the Finances fact

table.
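The sketch below is one possible rendering of this normalization approach for the hospital example; the table and column names are assumptions based on the description of Figure 9-15.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diagnosis (
    diagnosis_key INTEGER PRIMARY KEY,
    description   TEXT
);
-- The "helper" (associative) table between Finances and Diagnosis.
CREATE TABLE diagnosis_group (
    diagnosis_group_key INTEGER,
    diagnosis_key       INTEGER REFERENCES diagnosis(diagnosis_key),
    weight_factor       REAL,   -- relative role of each diagnosis within its group
    PRIMARY KEY (diagnosis_group_key, diagnosis_key)
);
-- The fact table references one diagnosis group instead of many diagnoses.
CREATE TABLE finances (
    patient_key         INTEGER,
    date_key            INTEGER,
    diagnosis_group_key INTEGER,
    charge              REAL,
    payment             REAL,
    PRIMARY KEY (patient_key, date_key, diagnosis_group_key)
);
""")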

Hierarchies 

Many times a dimension in a star schema forms a natural, fixed depth hierarchy.

For example, there are geographical hierarchies (e.g., markets within a state,

states within a region, and regions within a country) and product hierarchies

(packages or sizes within a product, products within bundles, and bundles within


product groups). When a dimension participates in a hierarchy, a database

designer has two basic choices:

1. Include all the information for each level of the hierarchy in a single

denormalized dimension table for the most detailed level of the hierarchy,

thus creating considerable redundancy and update anomalies. Although it

is simple, this is usually not the recommended approach. 

2. Normalize the dimension into a nested set of a fixed number of tables with

1:M relationships between them. Associate only the lowest level of the

hierarchy with the fact table. It will still be possible to aggregate the fact

data at any level of the hierarchy, but now the user will have to perform

nested joins along the hierarchy or be given a view of the hierarchy that is

prejoined.

Fixed product hierarchy

When the depth of the hierarchy can be fixed, each level of the hierarchy is a

separate dimensional entity. Some hierarchies can more easily use this scheme

than can others. Consider the product hierarchy in this Figure. Here each product

is part of a product family (e.g., Crest with Tartar Control is part of Crest), and a

product family is part of a product category (e.g., toothpaste), and a category is


part of a product group (e.g., health and beauty). This works well if every product

follows this same hierarchy. Such hierarchies are very common in data

warehouses and data marts.
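A minimal sketch of the second choice, normalizing the fixed product hierarchy into a chain of tables with 1:M relationships (snowflaking); the names are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_group    (group_key    INTEGER PRIMARY KEY, group_name    TEXT);
CREATE TABLE product_category (category_key INTEGER PRIMARY KEY, category_name TEXT,
                               group_key    INTEGER REFERENCES product_group(group_key));
CREATE TABLE product_family   (family_key   INTEGER PRIMARY KEY, family_name   TEXT,
                               category_key INTEGER REFERENCES product_category(category_key));
-- Only the lowest level (product) is joined to the fact table; aggregating by
-- category or group requires nested joins up the chain, or a prejoined view.
CREATE TABLE product          (product_key  INTEGER PRIMARY KEY, product_name  TEXT,
                               family_key   INTEGER REFERENCES product_family(family_key));
""")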

Slowly Changing Dimensions

Slowly Changing Dimensions (SCD) are dimensions that change slowly over time, rather than on a regular, time-based schedule. In a data warehouse

there is a need to track changes in dimension attributes in order to report

historical data. In other words, implementing one of the SCD types should enable

users to assign the proper dimension's attribute value for a given date. Examples

of such dimensions could be: customer, geography, employee.

There are many approaches to deal with SCD. The most popular are:

 Type 0 - The passive method

 Type 1 - Overwriting the old value

 Type 2 - Creating a new additional record

 Type 3 - Adding a new column

Type 0 - The passive method. In this method no special action is performed upon

dimensional changes. Some dimension data can remain the same as it was first

time inserted, others may be overwritten.

Type 1 - Overwriting the old value. In this method no history of dimension

changes is kept in the database. The old dimension value is simply overwritten

by the new one. This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g., removal of special characters, correcting spelling errors).

Before the change: 

After the change: 

Type 2 - Creating a new additional record. In this methodology all history of

dimension changes is kept in the database. You capture attribute change by

adding a new row with a new surrogate key to the dimension table. Both the prior

and new rows contain as attributes the natural key (or other durable identifier).

Also 'effective date' and 'current indicator' columns are used in this method.

There could be only one record with current indicator set to 'Y'. For 'effective

date' columns, i.e. start_date and end_date, the end_date for the current record

usually is set to value 9999-12-31. Introducing changes to the dimensional model

in Type 2 can be a very expensive database operation, so it is not recommended

to use it in dimensions where a new attribute could be added in the future.

Before the change: 


After the change: 
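Since the before/after tables for Type 2 are shown only as figures, here is a rough sketch of the same change in SQL (SQLite syntax, with an invented customer dimension and city change): the current row is expired and a new row with a new surrogate key is inserted.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id  TEXT,                               -- durable natural key
    city         TEXT,
    start_date   TEXT,
    end_date     TEXT,
    current_flag TEXT
);
INSERT INTO customer_dim (customer_id, city, start_date, end_date, current_flag)
VALUES ('C-001', 'Caloocan', '2009-01-01', '9999-12-31', 'Y');
""")

# The customer moves: expire the current row, then add a new row with a new surrogate key.
conn.execute("""UPDATE customer_dim
                SET end_date = '2010-05-31', current_flag = 'N'
                WHERE customer_id = 'C-001' AND current_flag = 'Y'""")
conn.execute("""INSERT INTO customer_dim (customer_id, city, start_date, end_date, current_flag)
                VALUES ('C-001', 'Quezon City', '2010-06-01', '9999-12-31', 'Y')""")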

Type 3 - Adding a new column. In this type usually only the current and previous

value of dimension is kept in the database. The new value is loaded into the

'current/new' column and the old one into the 'old/previous' column. Generally

speaking, history is limited to the number of columns created for storing historical

data. This is the least commonly needed technique.

Before the change: 

After the change: 

Ten Essential Rules of Dimensional Modeling

1. Use atomic facts: Eventually, users want detailed data, even if their initial

requests are for summarized facts. 


2. Create single-process fact tables: Each fact table should address the

important measurements for one business process, such as taking a

customer order or placing a material purchase order. 

3. Include a date dimension for every fact table: A fact should be

described by the characteristics of the associated day (or finer) date/time to

which that fact is related. 

4. Enforce consistent grain: Each measurement in a fact table must be

atomic for the same combination of keys (the same grain). 

5. Disallow null keys in fact tables: Facts apply to the combination of key

values, and helper tables may be needed to represent some M:N

relationships.

6. Honor hierarchies: Understand the hierarchies of dimensions and

carefully choose to snowflake the hierarchy or denormalize into one

dimension. 

7. Decode dimension tables: Store descriptions of surrogate keys and

codes used in fact tables in associated dimension tables, which can then be

used to report labels and query filters. 

8. Use surrogate keys: All dimension table rows should be identified by a

surrogate key, with descriptive columns showing the associated production

and source system keys. 

9. Conform dimensions: Conformed dimensions should be used across

multiple fact tables. 


10. Balance requirements with actual data: Unfortunately, source data may

not precisely support all business requirements, so you must balance what is

technically possible with what users want and need.

Big Data and Columnar Databases

Big Data

Big Data is an ill-defined term applied to databases whose size strains the ability

of commonly used relational DBMSs to capture, manage, and process the data

within a tolerable elapsed time.

Big data basically refers to data that is large in volume and complex in structure. This data can be structured, semi-structured, or unstructured, and it cannot be processed by traditional data processing software and databases. Operations such as analysis and manipulation are performed on the data, which companies then use for intelligent decision making. Big data is a very powerful asset in today's world and can be used to tackle business problems by supporting intelligent decision making.

Big data is a combination of structured, semistructured and unstructured data

collected by organizations that can be mined for information and used in machine

learning projects, predictive modeling and other advanced analytics applications. 

Systems that process and store big data have become a common component

of data management architectures in organizations, combined with tools that

support big data analytics uses. Big data is often characterized by the three V's:


 the large volume of data in many environments;

 the wide variety of data types frequently stored in big data systems; and

 the velocity at which much of the data is generated, collected and

processed.

Concept of 5V’s

Big data refers to data that is so large, fast or complex that it’s difficult or

impossible to process using traditional methods. The act of accessing and storing

large amounts of information for analytics has been around for a long time.

Volume

Volume, the first of the 5 V's of big data, refers to the amount of data that exists.

Volume is like the base of big data, as it is the initial size and amount of data that

is collected. If the volume of data is large enough, it can be considered big data.

What is considered to be big data is relative, though, and will change depending

on the available computing power that's on the market.

Velocity

The next of the 5 V's of big data is velocity. It refers to how quickly data is

generated and how quickly that data moves. This is an important aspect for

companies that need their data to flow quickly, so it's available at the right times

to make the best business decisions possible.


An organization that uses big data will have a large and continuous flow of data

that is being created and sent to its end destination. Data could flow from

sources such as machines, networks, smartphones or social media. This data

needs to be digested and analyzed quickly, and sometimes in near real time.

As an example, in healthcare, there are many medical devices made today to

monitor patients and collect data. From in-hospital medical equipment to

wearable devices, collected data needs to be sent to its destination and analyzed

quickly.

In some cases, however, it may be better to have a limited set of collected data

than to collect more data than an organization can handle -- since this can lead

to slower data velocities.

Variety

The next V in the 5 V's of big data is variety. Variety refers to the diversity

of data types. An organization might obtain data from a number of different data

sources, which may vary in value. Data can come from sources in and outside an

enterprise as well. The challenge in variety concerns the standardization and

distribution of all data being collected.

Collected data can be unstructured, semi-structured or structured in nature.

Unstructured data is data that is unorganized and comes in different files or

formats. Typically, unstructured data is not a good fit for a mainstream relational


database because it doesn't fit into conventional data models. Semi-structured

data is data that has not been organized into a specialized repository but has

associated information, such as metadata. This makes it easier to process than

unstructured data. Structured data, meanwhile, is data that has been organized

into a formatted repository. This means the data is made more addressable for

effective data processing and analysis.

Veracity

Veracity is the fourth V in the 5 V's of big data. It refers to the quality and

accuracy of data. Gathered data could have missing pieces, may be inaccurate

or may not be able to provide real, valuable insight. Veracity, overall, refers to the

level of trust there is in the collected data.

Data can sometimes become messy and difficult to use. A large amount of data

can cause more confusion than insights if it's incomplete. For example,

concerning the medical field, if data about what drugs a patient is taking is

incomplete, then the patient's life may be endangered.

Both value and veracity help define the quality and insights gathered from data.

Value

The last V in the 5 V's of big data is value. This refers to the value that big data

can provide, and it relates directly to what organizations can do with that

collected data. Being able to pull value from big data is a requirement, as the


value of big data increases significantly depending on the insights that can be

gained from them.

Organizations can use the same big data tools to gather and analyze the data,

but how they derive value from that data should be unique to them.

Why Is Big Data Important?

The importance of big data doesn’t simply revolve around how much data you

have. The value lies in how you use it. By taking data from any source and

analyzing it, you can find answers that  1) streamline resource management, 2)

improve operational efficiencies, 3) optimize product development, 4) drive new

revenue and growth opportunities and 5) enable smart decision making. When

you combine big data with high-performance analytics, you can accomplish

business-related tasks such as:

 Determining root causes of failures, issues and defects in near-real time.

 Spotting anomalies faster and more accurately than the human eye.

 Improving patient outcomes by rapidly converting medical image data into

insights.

 Recalculating entire risk portfolios in minutes.

 Sharpening deep learning models' ability to accurately classify and react

to changing variables.

 Detecting fraudulent behavior before it affects your organization.


Companies use big data in their systems to improve operations, provide better

customer service, create personalized marketing campaigns and take other

actions that, ultimately, can increase revenue and profits. Businesses that use it

effectively hold a potential competitive advantage over those that don't because

they're able to make faster and more informed business decisions.

For example, big data provides valuable insights into customers that companies

can use to refine their marketing, advertising and promotions in order to increase

customer engagement and conversion rates. Both historical and real-time data

can be analyzed to assess the evolving preferences of consumers or corporate

buyers, enabling businesses to become more responsive to customer wants and

needs.

Big data is also used by medical researchers to identify disease signs and risk

factors and by doctors to help diagnose illnesses and medical conditions in

patients. In addition, a combination of data from electronic health records, social

media sites, the web and other sources gives healthcare organizations and

government agencies up-to-date information on infectious disease threats or

outbreaks.

Here are some more examples of how big data is used by organizations:

 In the energy industry, big data helps oil and gas companies identify

potential drilling locations and monitor pipeline operations; likewise, utilities

use it to track electrical grids.


 Financial services firms use big data systems for risk management

and real-time analysis of market data.

 Manufacturers and transportation companies rely on big data to manage

their supply chains and optimize delivery routes.

 Other government uses include emergency response, crime prevention

and smart city initiatives.

Columnar Databases

A column-oriented DBMS or columnar DBMS is a database management

system (DBMS) that stores data tables by column rather than by row. Practical

use of a column store versus a row store differs little in the relational

DBMS world. Both columnar and row databases can use traditional database

query languages like SQL to load data and perform queries. Both row and

columnar databases can become the backbone in a system to serve data for

common extract, transform, load (ETL) and data visualization tools. However, by

storing data in columns rather than rows, the database can more precisely

access the data it needs to answer a query rather than scanning and discarding

unwanted data in rows.

A columnar database stores data by columns rather than by rows, which makes it

suitable for analytical query processing, and thus for data warehouses.


A columnar database is optimized for fast retrieval of columns of data, typically in

analytical applications. Column-oriented storage for database tables is an

important factor in analytic query performance because it drastically reduces the

overall disk I/O requirements, and reduces the amount of data you need to load

from disk.

Like other NoSQL databases, column-oriented databases are designed to scale

“out” using distributed clusters of low-cost hardware to increase throughput,

making them ideal for data warehousing and Big Data processing.

A columnar database stores data of each column independently. This allows to

read data from disks only for those columns that are used in any given query.

The cost is that operations that affect whole rows become proportionally more

expensive. The synonym for a columnar database is a column-oriented database

management system. ClickHouse is a typical example of such a system.

Key columnar database advantages are:

 Queries that use only a few columns out of many.

 Aggregating queries against large volumes of data.

 Column-wise data compression.


Columnar Database example

In a columnar database, all the values in a column are physically grouped

together. For example, all the values in column 1 are grouped together; then all

values in column 2 are grouped together; etc. The data is stored in record order,

so the 100th entry for column 1 and the 100th entry for column 2 belong to the

same input record. This enables individual data elements, such as customer

name to be accessed in columns as a group, rather than individually row-by-row.

Here is an example of a simple database table with four columns and three rows.

In a columnar DBMS, the data would be stored like this:

0411,0412,0413;Moriarty,Richards,Diamond;Angela,Jason,Samantha;52.35,325.82,25.50.

In a row-oriented DBMS, the data would be stored like this:

0411,Moriarty,Angela,52.35;0412,Richards,Jason,325.82;0413,Diamond,Samantha,25.50.
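A minimal Python sketch of the same contrast, using the sample rows above: the column store groups all values of one column together, so a query that needs only one column can skip the rest.

rows = [
    ("0411", "Moriarty", "Angela",    52.35),
    ("0412", "Richards", "Jason",    325.82),
    ("0413", "Diamond",  "Samantha",  25.50),
]

# Row-oriented layout: each record is kept together.
row_store = list(rows)

# Column-oriented layout: all values of one column are grouped together, in
# record order, so the 2nd entry of every list belongs to the same input record.
column_store = {
    "account":    [r[0] for r in rows],
    "last_name":  [r[1] for r in rows],
    "first_name": [r[2] for r in rows],
    "amount":     [r[3] for r in rows],
}

total_amount = sum(column_store["amount"])   # touches only the "amount" column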

NoSQL


Short for “Not only SQL,” NoSQL is a class of database technology used to store

and access textual and other unstructured data, using more flexible structures

than the rows and columns format of relational databases. The major purpose of

using a NoSQL database is for distributed data stores with humongous data

storage needs. NoSQL is used for Big data and real-time web apps. For

example, companies like Twitter, Facebook and Google collect terabytes of user

data every single day. Carlo Strozzi introduced the NoSQL concept in 1998.

NoSQL databases (aka "not only SQL") are non-tabular databases and store

data differently than relational tables. NoSQL databases come in a variety of

types based on their data model. The main types are document, key-value, wide-

column, and graph. They provide flexible schemas and scale easily with large

amounts of data and high user loads.
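As a small illustration of the flexible, document-style data model, the sketch below uses plain Python dictionaries to stand in for documents; a real document database such as MongoDB stores and queries JSON-like documents of this shape natively.

# Two "documents" in the same collection need not share the same fields.
customers = [
    {"_id": 1, "name": "Angela Moriarty", "city": "Caloocan",
     "orders": [{"sku": "P-110", "qty": 2}]},
    {"_id": 2, "name": "Jason Richards",
     "loyalty_tier": "gold"},          # a different shape, with no schema change needed
]

# A simple query: customers that have at least one order.
with_orders = [c for c in customers if c.get("orders")]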

Why NoSQL?

The concept of NoSQL databases became popular with Internet giants like

Google, Facebook, Amazon, etc. who deal with huge volumes of data. The

system response time becomes slow when you use RDBMS for massive

volumes of data.

To resolve this problem, we could “scale up” our systems by upgrading our

existing hardware. This

process is expensive. 

The alternative for this issue

is to distribute database load


on multiple hosts whenever the load increases. This method is known as “scaling

out.” 

NoSQL databases are non-relational, so they scale out better than relational databases, as they are designed with web applications in mind.

Brief history of NoSQL databases

NoSQL databases emerged in the late 2000s as the cost of storage dramatically

decreased. Gone were the days of needing to create a complex, difficult-to-

manage data model in order to avoid data duplication. Developers (rather than

storage) were becoming the primary cost of software development, so NoSQL

databases optimized for developer productivity.

As storage costs rapidly decreased, the amount of data that applications needed

to store and query increased. This data came in all shapes and sizes

— structured, semi-structured, and polymorphic — and defining the schema in

advance became nearly impossible. NoSQL databases allow developers to store

huge amounts of unstructured data, giving them a lot of flexibility.

Additionally, the Agile Manifesto was rising in popularity, and software engineers

were rethinking the way they developed software. They were recognizing the

need to rapidly adapt to changing requirements. They needed the ability to iterate


quickly and make changes throughout their software stack — all the way down to

the database. NoSQL databases gave them this flexibility.

Cloud computing also rose in popularity, and developers began using public

clouds to host their applications and data. They wanted the ability to distribute

data across multiple servers and regions to make their applications resilient, to

scale out instead of scale up, and to intelligently geo-place their data. Some

NoSQL databases like MongoDB provide these capabilities.

NoSQL database features

Each NoSQL database has its own unique features. At a high level, many

NoSQL databases have the following features:

 Flexible schemas

 Horizontal scaling

 Fast queries due to the data model

 Ease of use for developers

The User-Interface

User Interface

 The means by which the user and a computer system interact, in

particular the use of input devices and software. 

 The purpose of a UI is to enable a user to effectively control a computer or

machine they are interacting with. 


 A successful user interface should be intuitive (not require training to

operate), efficient (not create additional or unnecessary friction) and user-

friendly (be enjoyable to use).

A variety of tools are available to query and analyze data stored in data

warehouses and data marts. These tools may be classified as follows: 

 Traditional query and reporting tools 

 OLAP, MOLAP, and ROLAP tools 

 Data visualization tools 

 Business performance management and dashboard tools 

 Data-mining tools 

Traditional query and reporting tools include spreadsheets, personal computer

databases, and report writers and generators.

Role of Metadata 

The first requirement for building a user-friendly interface is a set of metadata

that describes the data in the data mart in business terms that users can easily

understand. 

The metadata associated with data marts are often referred to as a “data

catalog,” “data directory,” or some similar term. Metadata serve as kind of a

“yellow pages” directory to the data in the data marts. The metadata should allow

users to easily answer questions such as the following: 


1. What subjects are described in the data mart? (Typical subjects are

customers, patients, students, products, courses, and so on.) 

2. What dimensions and facts are included in the data mart? What is the

grain of the fact table? 

3. How are the data in the data mart derived from the enterprise data

warehouse data? What rules are used in the derivation? 

4. How are the data in the enterprise data warehouse derived from

operational data? What rules are used in this derivation? 

5. What reports and predefined queries are available to view the data? 

6. What drill-down and other data analysis techniques are available? 

7. Who is responsible for the quality of data in the data marts, and to whom

are requests for changes made?

Online Analytical Processing (OLAP) Tools

A specialized class of tools has been developed to provide users with

multidimensional views of their data. Such tools also usually offer users a

graphical interface so that they can easily analyze their data. In the simplest

case, data is viewed as a simple three dimensional cube. 

Online analytical processing (OLAP) is the use of a set of query and reporting

tools that provides users with multidimensional views of their data and allows

them to analyze the data using simple windowing techniques. The term online

analytical processing is intended to contrast with the more traditional term online

transaction processing (OLTP). 


Online Analytical Processing Server (OLAP) is based on the multidimensional

data model. It allows managers and analysts to get insight into the information through fast, consistent, and interactive access. OLAP is actually

a general term for several categories of data warehouse and data mart access

tools (Dyché, 2000). 

Relational OLAP (ROLAP) tools use variations of SQL and view the database

as a traditional relational database, in either a star schema or another normalized

or denormalized set of tables. ROLAP tools access the data warehouse or data

mart directly. 

Multidimensional OLAP (MOLAP) tools load data into an intermediate

structure, usually a three- or higher-dimensional array (hypercube). We illustrate

MOLAP in the next few sections because of its popularity. It is important to note

with MOLAP that the data are not simply viewed as a multidimensional

hypercube, but rather a MOLAP data mart is created by extracting data from the

data warehouse or data mart and then storing the data in a specialized separate

data store through which data can be viewed only through a multidimensional

structure. Other, less-common categories of OLAP tools are database OLAP

(DOLAP), which includes OLAP functionality in the DBMS query language (there

are proprietary, non-ANSI standard SQL systems that do this), and hybrid OLAP

(HOLAP), which allows access via both multidimensional cubes or relational

query languages.


OLAP Operations

 Cube slicing – slicing the data cube to produce a simple two-dimensional table or view.

 Drill-down – analyzing a given set of data at a finer level of detail.

Slicing a data cube

In the Figure, this slice is for the product named shoes. The resulting table shows

the three measures (units,

revenues, and cost) for this

product by period (or month).

Other views can easily be

developed by the user by means

of simple “drag and drop”

operations. This type of operation

is often called slicing and dicing

the cube.


An example of drill-down is shown in Figure 9-22. Figure 9-22a shows a

summary report for the total sales of three package sizes for a given brand of

paper towels: 2-pack, 3-pack, and 6-pack. However, the towels come in different

colors, and the analyst wants a further breakdown of sales by color within each of

these package sizes. Using an OLAP tool, this breakdown can be easily obtained

using a “point-and-click” approach with a mouse device.

The result of the drill-down is shown in Figure 9-22b. Notice that a drill-down

presentation is equivalent to adding another column to the original report. (In this

case, a column was added for the attribute color.) 
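A rough sketch of both operations using pandas; because the shoe and paper-towel figures are not reproduced here, the numbers below are invented purely for illustration.

import pandas as pd

sales = pd.DataFrame({
    "product": ["Shoes", "Shoes", "Towels", "Towels", "Towels"],
    "month":   ["Jan",   "Feb",   "Jan",    "Jan",    "Feb"],
    "color":   ["Black", "Brown", "White",  "Blue",   "White"],
    "units":   [120,      95,      200,      150,      180],
    "revenue": [6000.0,  4750.0,  3000.0,   2250.0,   2700.0],
})

# Slice: hold the product dimension at "Shoes" to get a simple two-dimensional view.
shoe_slice = (sales[sales["product"] == "Shoes"]
              .groupby("month")[["units", "revenue"]].sum())

# Drill down: add the finer-grained "color" attribute to the grouping, which is
# equivalent to adding another column to the original report.
drill_down = sales.groupby(["product", "color"])[["units", "revenue"]].sum()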

Data Mining 

Knowledge discovery, using a sophisticated blend of techniques from traditional

statistics, artificial intelligence, and computer graphics.

It is the process of finding patterns and correlations within large data sets to

identify relationships between data. Data mining tools allow a business

organization to predict customer behavior. Data mining tools are used to build

risk models and detect fraud. Data mining is used in market analysis and

management, fraud detection, corporate analysis and risk management.

The goals of data mining are threefold: 

1. Explanatory To explain some observed event or condition, such as why

sales of pickup trucks have increased in Colorado


2. Confirmatory To confirm a hypothesis, such as whether two-income

families are more likely to buy family medical coverage than single-income

families 

3. Exploratory To analyze data for new or unexpected relationships, such as

what spending patterns are likely to accompany credit card fraud

Business Performance Management

Business Performance Management (BPM) refers to the mechanisms

companies put in place to measure performance and communicate results

internally and externally. The goal of BPM is to use current and historical

performance data to improve future performance and decision making.

A business performance management (BPM) system allows managers to

measure, monitor, and manage key activities and processes to achieve

organizational goals. Dashboards are often used to provide an information

system in support of BPM.Dashboards, just as those in a car or airplane cockpit,

include a variety of displays to show different aspects of the organization. Often

the top dashboard, an executive dashboard,

is based on a balanced scorecard, in which

different measures show metrics from

different processes and disciplines, such as

operations efficiency, financial status,

customer service, sales, and human


resources. Each display of a dashboard will address different areas in different

ways.

For example, Figure 9-25 is a simple dashboard for one financial measure,

revenue. The left panel shows dials about revenue over the past three years,

with needles indicating where these measures fall within a desirable range. Other

panels show more details to help a manager find the source of out-of-tolerance

measures.

Data Visualization 

Data visualization is the representation of data in graphical and multimedia

formats for human analysis. Benefits of data visualization include the ability to

better observe trends and patterns and to identify correlations and clusters. Data

visualization is often used in conjunction with data mining and other analytical

techniques.

In essence, data visualization is a way to show multidimensional data not as

numbers and text but as graphs. Thus, precise values are often not shown, but

rather the intent is to more readily show relationships between the data.


CHAPTER 9:

DATA WAREHOUSING- 

MODERN PRINCIPLES AND

METHODOLOGIES

Researched and presented by:

Donnabelle M. Durante
Shiela Mae E. Rosano


What Is a Decision Support System?


A decision support system (DSS) is a computerized program used to

support determinations, judgments, and courses of action in an organization or a

business. A DSS sifts (filters) through and analyzes massive amounts of data,

compiling comprehensive (complete) information that can be used to solve

problems in decision-making.

Typical information used by a DSS includes target or projected revenue,

sales figures or past ones from different time periods, and other inventory- or

operations-related data.

A decision support system gathers and analyzes data, synthesizing

(combining) it to produce comprehensive information reports. In this way, as an

informational application, a DSS differs from an ordinary operations application,

whose function is just to collect data. The DSS can either be completely

computerized or powered by humans. In some cases, it may combine both. The

ideal systems analyze information and actually make decisions for the user. At

the very least, they allow human users to make more informed decisions at a

quicker pace.

DSS Primary Purpose

The primary purpose of using a DSS is to present information to the

customer in an easy-to-understand way. A DSS system is beneficial because it

can be programmed to generate many types of reports, all based on user

specifications. For example, the DSS can generate information and output its


information graphically, as in a bar chart that represents projected revenue or as

a written report. As technology continues to advance, data analysis is no longer

limited to large, bulky mainframe computers. Since a DSS is essentially an

application, it can be loaded on most computer systems, whether on desktops or

laptops. Certain DSS applications are also available through mobile devices. The

flexibility of the DSS is extremely beneficial for users who travel frequently. This

gives them the opportunity to be well-informed at all times, providing the ability to

make the best decisions for their company and customers on the go or even on

the spot.

FIVE CATEGORIES OF DSS

1. Communication-driven

Its purpose is to help conduct a meeting or to let users

collaborate. The most common technology used to deploy the DSS is a

web or client server. 

Example:

 Chats and instant messaging software such as Messenger, online collaboration and net meeting systems using Google Meet or Zoom

2. Data-driven 

Most data-driven DSSs are targeted at managers, staff and also

product/service suppliers. It is used to query a database or data


warehouse to seek specific answers for specific purposes. It is deployed

via a main frame system, client/server link, or via the web.

 Example: Computer-based databases that have a query system, in particular a GIS

 A geographic information system (GIS) is a computer system for

capturing, storing, checking, and displaying data related to

positions on Earth’s surface. GIS can use any information that

includes location, data about people such as population, income,

education level and information about landscape, different kinds

of soil and so much more.

GIS is not limited just to geologists who

study earthquake faults. Many retail businesses use GIS to help

them determine where to locate a new store. Marketing companies

use GIS to decide to whom to market stores and restaurants, and

where that marketing should be.  

3. Document-driven 

Document-driven DSS is a computerized support system that

integrates a variety of storage and processing technologies to provide

document retrieval and analysis. It is intended to assist in decision

making. 


Examples:  Policies and procedures, product specifications, catalogs and

corporate historical documents, including minutes of meetings, corporate

records, and important correspondence. 

4. Knowledge-driven 

Knowledge-driven DSSs are a catch-all (covering a variety of things) category spanning a broad range of systems whose users are typically within the organization setting them up, but which may also include others interacting with the

organization. These systems contain specialized problem-solving

expertise wherein the “expertise” consists of knowledge about a particular

domain. 

Example: 

TaxAct is a system that supports online tax filing. It contains

information (tips) that can help one to improve his or her tax outcome and

financial wellness.

5. Model-driven

  In general, model-driven DSS use complex financial, simulation,

optimization or multi-criteria models to provide decision support. Model-

driven DSS use data and parameters provided by decision makers to aid

them in analyzing a particular situation. In other words, it is a system that

provides managers with models and analysis capabilities that can be used

during the process of making a decision.


Example: Optimization Spreadsheet DSS, wherein the decision variables

are the quantities of TVs, stereos and speakers to build. The objective

function is to maximize total profits. The constraints are from the parts

inventory. Managers should be able to determine the best way to use the

resources. Managers need to determine what “best” means, but usually it

implies maximizing profits or minimizing costs. Optimization may be

incorporated in a DSS used routinely in a firm or a management scientist

may build an optimization model for a special decision support study.
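A minimal sketch of such a product-mix model using scipy.optimize.linprog; the per-unit profits and parts-inventory figures below are hypothetical and are not taken from the text.

from scipy.optimize import linprog

# Decision variables: quantities of TVs, stereos, and speakers to build.
profit = [75, 50, 35]                 # hypothetical profit per unit
c = [-p for p in profit]              # linprog minimizes, so negate to maximize

# Hypothetical parts-inventory constraints (parts used per unit <= parts on hand).
A_ub = [
    [1, 1, 0],                        # chassis
    [1, 0, 0],                        # picture tubes (TVs only)
    [2, 2, 1],                        # speaker cones
    [2, 1, 1],                        # electronics boards
]
b_ub = [450, 250, 800, 600]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3, method="highs")
tvs, stereos, speakers = res.x
print(round(-res.fun, 2), "maximum total profit")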

The four (4) Modern Data Warehouse Architectures



What is modern data architecture?

Data architecture is the structure (how data is being organized) of your

data assets (anything valuable) developed with a vision of how those assets and

your information systems will inevitably interact with one another. This includes

planning how data in a system will be created, processed, stored, and

transmitted. 

Over time, data architecture has undergone several paradigm shifts

related to new technologies and business demands. Modern data architecture as

we know it has been significantly impacted by the concurrent evolution of big

data, machine learning or Artificial Intelligence, and cloud computing platforms. In

other words, modern data architecture is designed proactively with scalability

(system's ability to handle a growing amount of work) and flexibility in mind,

anticipating complex data needs.

Companies are increasingly moving towards cloud-based data

warehouses instead of traditional on-premise systems that involve the use of

physical servers (computers) located on-site and owned, managed and

maintained by your organization. This is largely because cloud-based data warehouses are quicker and cheaper to set up. There is no need to spend more

to purchase physical hardware, maintain and upgrade hardware, in addition to

running necessary systems such as power and cooling. And lastly, cloud-based

data warehouse architectures can typically perform complex analytical queries

much faster because they use massively parallel processing (MPP), a term that


means using a large number of computer processors to simultaneously perform a

set of coordinated computations in parallel.

FOUR (4) MODERN DATA WAREHOUSE ARCHITECTURES 

1. Multiple Parallel Processing (MPP) Architectures

MPP architecture enables vast scale and distributed computing, a model in which components of a software system are shared among multiple computers. MPP basically uses a "shared-nothing" design: there are numerous physical nodes, each running its own instance or task. This makes it much faster in terms of performance compared to traditional architectures.

Example: Amazon Redshift

Amazon Redshift uses MPP architecture, breaking up large data sets into chunks which are assigned to slices

within each node. Queries perform faster because the compute nodes process

queries in each slice simultaneously. The Leader Node aggregates the results

and returns them to the client application.

Client applications, such as analytics tools, can directly connect to

Redshift using open source PostgreSQL JDBC and ODBC drivers. Analysts can

thus perform their tasks directly on the Redshift data.

Amazon Redshift requires computing resources to be provisioned and set

up in the form of clusters, which contain a collection of one or more nodes. Each

node has its own CPU, storage, and RAM. A leader node compiles queries and

transfers them to compute nodes, which execute the queries.

On each node, data is stored in chunks, called slices. Redshift uses

a columnar storage, meaning each block of data contains values from a single

column across a number of rows, instead of a single row with values from

multiple columns.
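As a small illustration of how an analytics client can query Redshift directly, the sketch below uses the third-party psycopg2 PostgreSQL driver (Redshift speaks the PostgreSQL wire protocol). The cluster endpoint, credentials, and the sales table are placeholders, not real values.

# Minimal sketch: querying a Redshift cluster from an analytics client,
# assuming a placeholder endpoint, credentials, and a hypothetical "sales" table.
import psycopg2  # third-party PostgreSQL driver

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,            # Redshift's default port
    dbname="dev",
    user="analyst",
    password="***",
)

with conn.cursor() as cur:
    # The leader node compiles this query and distributes it to the
    # compute nodes, which scan their slices in parallel.
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
    for region, total in cur.fetchall():
        print(region, total)

conn.close()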

2. Multi-Structured Data 

Interprets Big Data (data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency) together with an Analytics Infrastructure (a concept that comprises many technologies and services supporting the essential process of extracting value from data) for multiple data stores


with a polyglot persistence strategy. A polyglot persistence database is

used when it is necessary to solve a complex problem by breaking that

problem into segments and applying different database models. 

Example: 

An e-commerce website which sells products online (Shopee, Lazada) will use a NoSQL store for storing the session state (a record of what users do while browsing the app) of users shopping on the website, while the payment system, which captures the credit card information, persists it to a relational database like Oracle. In a similar fashion you can

implement different services to use different data stores and avoid building

a monolith (single massive) application where one database failure can

lead to the entire business going down. The need for polyglot data stores

is not just for high availability but also for scalability demands of an

internet-scale application.
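The following minimal sketch illustrates the polyglot persistence idea in Python. It assumes a local Redis server for the session state and uses SQLite as a stand-in for the relational payment store; the keys, table, and values are illustrative only.

# Minimal sketch of polyglot persistence, assuming a Redis instance for
# session state and SQLite standing in for the relational payment store.
import json
import sqlite3
import redis  # third-party client

# NoSQL key-value store: fast, schema-less session state with an expiry.
sessions = redis.Redis(host="localhost", port=6379)
sessions.setex("session:42", 1800, json.dumps({"cart": ["sku-1", "sku-7"]}))

# Relational store: payments need ACID guarantees and a fixed schema.
payments = sqlite3.connect("payments.db")
payments.execute(
    "CREATE TABLE IF NOT EXISTS payment (id INTEGER PRIMARY KEY, "
    "order_id TEXT, amount REAL, card_token TEXT)"
)
payments.execute(
    "INSERT INTO payment (order_id, amount, card_token) VALUES (?, ?, ?)",
    ("ORD-42", 1299.00, "tok_abc"),
)
payments.commit()
payments.close()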

3. Lambda Architecture

Lambda architecture proposes a simpler, elegant paradigm that is designed to tame complexity while being able to store and effectively process large amounts of data. In the context of big data scenarios, Lambda architecture is a

frequently used form of architecture in IT system landscapes when it comes to

reconciling the requirements of two different user groups. On the one hand, there

are users who have always had to process and evaluate data of high quality.

These are usually enriched with additional, calculated key figures. The “classic”

users need the data for specific key dates in departments such as reporting,

accounting, risk or controlling. On the other hand, there are users with a short-

term need for information who have to react quickly to events. This can be the

defective ATM for the maintenance technician, but also the next boycott call for a

certain company in the social media for a stock trader.

Lambda architecture is used to solve the problem of computing

arbitrary functions. The lambda architecture itself is composed of 3 layers:

3.1. Batch Layer 

New data comes continuously, as a feed to the data system. It gets

fed to the batch layer and the speed layer simultaneously. It looks at all

the data at once and eventually corrects the data in the stream layer. 

Here we can find lots of ETL and a traditional data warehouse. This layer

is built using a predefined schedule, usually once or twice a day. The

batch layer has two very important functions:

 To manage the master dataset (data about the business

entities that provide context for business transactions)

 To pre-compute (initial computation) the batch views. 


3.2. Serving Layer

The outputs from the batch layer in the form of batch views and

those coming from the speed layer in the form of near real-time views

(users see data that is only a few seconds old) get forwarded to the serving layer. This layer indexes the batch views so that they can be queried with low latency on an ad-hoc basis.

3.3. Speed Layer 

This layer handles the data that are not already delivered in the

batch view due to the latency of the batch layer. In addition, it only deals

with recent data in order to provide a complete view of the data to the user

by creating real-time views.

The query application reads data from the text file where the batch

layer stored its results. It combines and then sorts the data. 

Example: 


This is a Lambda Architecture implementation that focuses on a few common tools, namely Hive, Spark, and Kafka. The pre-system is an SAP Bank Analyzer 9 on a

HANA database.

The program (1) for loading the market data receives JSON files from the

ECB Statistical Data Warehouse via a REST call. These files are then parsed

(analyzed) to extract and re-bundle the relevant data. Using the Kafka Java API, a Kafka producer is implemented, which writes the data, formatted as a JSON string, into a Kafka topic.

 (2) Since only the latest version of the market data is needed, such a

topic is an easy-to-use key-value store. Of course, this step can also be done

directly in Spark and you can also skip the caching of the data in Kafka Topics.

However, the focus was to test as many interfaces as possible with a simple use

case. In addition, the traceability of older calculations is ensured in this way.

The main program for loading cash flows (3) was developed using the

Spark-Java-API. Two versions of the program were created for this purpose, one

for stream processing and a second for batch processing. Thanks to the

possibility to use Spark-Streaming for batch processing via the trigger setting

“One-Time-Micro-Batch”, the implementation and maintenance effort is limited.

Most of the code can be used for both cases. The processing mode is simply

selected as needed via a configuration file. Such a single processing brings all

known advantages of the Spark streaming library, such as the automatic

recovery of the query, based on the created checkpoints, in case of an unintentional system shutdown or crash. In addition, however, all advantages of batch processing

are retained, such as the reduction of costs through targeted cluster startup and

shutdown.

Using the Spark-API, the HANA database (4) is accessed and the latest

record is retrieved. The recognition runs over a column with a continuous integer

of the datatype Long, which is generated from the timestamp of the data set. This

detour had to be taken because the SAP timestamp is not compatible with the

Spark timestamp in this case. The loading is then done in so-called microbatch

requests, which are sent to the HANA DB at certain time intervals and retrieve all

data since the last microbatch by querying the number just described. This

process is done and managed automatically by Spark. In the case of a

conventional Spark batch retrieval, all data from the last processed time stamp

would be retrieved, but would then have to be managed and stored by the user.

The Spark Streaming API does this automatically using the checkpoint files, as

explained above.

The latest market data is directly loaded from the aforementioned Kafka

Topic (5) via the Spark-Kafka implementation and is provided to the FTP library

for discounting cash flows. The library interpolates the grid points of the yield

curve to the due dates of the cash flow and discounts the cash flow accordingly.

The resulting dataframe is then checked with the help of a delta library for

changes of records already available in Hive and applies a filter if necessary. For

this purpose, the contents of the relevant fields are hashed and compared with


the values in the target table. If there is a match, the corresponding row is filtered

out of the dataframe. 

The result including the hash values is written to a partitioned hive table (6) by

Spark. The partitioning by month and year helps to keep the performance of

reading the data for the delta comparison as high as possible.
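A minimal sketch of the "One-Time-Micro-Batch" idea described above is shown below, using PySpark's Structured Streaming with the Kafka connector. The broker address, topic name, and file paths are placeholders, and this is only an outline of the pattern, not the actual project code.

# Minimal sketch of a streaming query that can also run as a single micro-batch,
# assuming PySpark with the Kafka connector; topic and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lambda-cashflow-sketch").getOrCreate()

# Read the market-data topic; each record value is a JSON string.
market = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribe", "market-data")                 # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    .select(col("value").cast("string").alias("json"))
)

# trigger(once=True) runs the same streaming query as a single micro-batch,
# and checkpointing lets the query recover automatically after a crash.
query = (
    market.writeStream.format("parquet")
    .option("path", "/data/market")                     # placeholder
    .option("checkpointLocation", "/chk/market")        # placeholder
    .trigger(once=True)
    .start()
)
query.awaitTermination()

Because the same streaming query can be run continuously or triggered once, one code path covers both the speed layer and the batch layer, which is the maintenance advantage described above.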

4. Hybrid Architecture

Utilize existing on-premises data structures. Hybrid architecture is a combination of on-premises sources and cloud sources. For most companies, a hybrid cloud is certainly an essential component of cloud adoption. Therefore, selecting the right cloud sources as part of a well-planned Hybrid Integration Platform strategy benefits your company, with business benefits as the end goal.

Use Cloud services for Advanced Analytics. For instance, the

architecture of a hybrid cloud typically includes an Infrastructure-as-a-

Service (IaaS) platform. IaaS is one of the three main categories of cloud

computing services, alongside software as a service (SaaS) and platform

as a service (PaaS), that provides virtualized computing resources over

the internet.

The main Infrastructure-as-a-Service platforms are Amazon Web

Services (AWS), Microsoft Azure and Google Cloud platform. A private

cloud is one in which resources are dedicated to a single organization; these can be hosted on premises or off premises. Lastly, hybrid cloud management requires a wide area

network (a telecommunications network that extends over a large

geographic area for the primary purpose of computer networking) to

connect the public and private clouds.

The Main Infrastructure-as-a-Service Platform:


The Data Staging and ETL

The data staging area sits between the data source(s) and the data target(s), which are often data warehouses, data marts, or other data repositories. In other words, it is a temporary storage area

between the data sources and a data warehouse. 


Data staging areas are often transient (temporary) in nature, with their contents being erased prior to running an ETL process or immediately following successful completion of an ETL process.

ETL is a process of data integration that encompasses three steps —

extraction, transformation, and loading. In a nutshell, an ETL system takes large

volumes of raw data from multiple sources, converts it for analysis, and loads

that data into your warehouse. 

THE ETL PROCESS

Extraction

In the first step, extracted data sets come from a source, say for example

from SQL server into a staging area. The staging area acts as a buffer between

the data warehouse and the source data. Since data may be coming from

multiple different sources, it is likely in various formats and directly transferring

the data to the warehouse may result in corrupted data. The staging area is used

for data cleansing and organization.


Transformation

The data cleaning and organization stage is the transformation stage. All of

that data from multiple source systems will be normalized and converted to a

single system format — improving data quality and compliance. ETL yields

transformed data through different methods such as cleaning, filtering, joining,

sorting, splitting, deduplication and summarization.

Loading

Finally, data that has been extracted to a staging area and transformed is

loaded into your data warehouse. Depending upon your business needs, data

can be loaded in batches or all at once. The exact nature of the loading will

depend upon the data source, ETL tools, and various other factors.
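The three ETL steps can be illustrated with a minimal Python sketch. It uses SQLite files as stand-ins for the source system and the warehouse; the table and column names are illustrative only.

# Minimal end-to-end ETL sketch, assuming a SQLite source database and a
# SQLite "warehouse" as stand-ins; table and column names are hypothetical.
import sqlite3

source = sqlite3.connect("orders_source.db")
warehouse = sqlite3.connect("warehouse.db")

# Extract: pull raw rows from the source system into a staging list.
staged = source.execute(
    "SELECT customer_name, order_date, amount FROM raw_orders"
).fetchall()

# Transform: cleanse and normalize in the staging area
# (trim names, standardize dates, drop rows with missing amounts).
cleaned = [
    (name.strip().title(), date[:10], float(amount))
    for name, date, amount in staged
    if amount is not None
]

# Load: write the transformed rows into the warehouse table.
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS fact_orders "
    "(customer_name TEXT, order_date TEXT, amount REAL)"
)
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
warehouse.commit()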

Multidimensional Model

A multidimensional model views data in the form of a data-cube. A data

cube enables data to be modeled and viewed in multiple dimensions. It is defined

by dimensions and facts. The dimensions are the perspectives or entities

concerning which an organization keeps records. 

For example, a shop may create a sales data warehouse to keep records

of its sales for the dimensions time, item, and location. These dimensions allow the shop to keep track of things such as monthly sales of items and the locations


at which the items were sold. Each dimension has a table related to it, called a

dimensional table.

Consider the data of a shop for items sold per quarter in the city of Delhi.

The data is shown in the table. In this 2D representation, the sales for Delhi are

shown for the time dimension (organized in quarters) and the item dimension

(classified according to the types of item sold). The fact or measure displayed is rupees_sold (in thousands).


Now, suppose we want to view the sales data with a third dimension: for example, the data according to time and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data

are shown in the table. The 3D data of the table are represented as a series of

2D tables.


Conceptually, it may also be represented by the same data in the form of

a 3D data cube, as shown in fig:
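A minimal sketch of the same idea using the third-party pandas library is shown below; the sales figures are made up. Each pivot table is one 2-D slice of the cube, and grouping by all three dimensions gives the cells of the 3-D cube.

# Minimal data-cube sketch with made-up sales figures (in thousands of rupees)
# for the time, item, and location dimensions.
import pandas as pd

sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["TV", "Phone", "TV", "Phone", "TV", "Phone"],
    "location": ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "rupees_sold": [605, 825, 680, 952, 818, 931],
})

# One 2-D slice of the cube: time x item for the city of Delhi.
delhi_slice = sales[sales["location"] == "Delhi"].pivot_table(
    index="quarter", columns="item", values="rupees_sold", aggfunc="sum")
print(delhi_slice)

# Grouping by all three dimensions gives the cells of the 3-D data cube.
print(sales.groupby(["location", "quarter", "item"])["rupees_sold"].sum())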

Benefits of Using Multidimensional Solutions

The primary reason for building an Analysis Services multidimensional

model is to achieve fast query performance against business data. A

multidimensional model is composed of cubes and dimensions that can be

annotated and extended to support complex query constructions. BI developers

create cubes to support fast response times, and to provide a single data source

for business reporting. Given the growing importance of business intelligence

across all levels of an organization, having a single source of analytical data

ensures that discrepancies are kept to a minimum, if not eliminated entirely.


Another important benefit to using Analysis Services multidimensional

databases is integration with commonly used BI reporting tools such as Excel,

Reporting Services, and PerformancePoint, as well as custom applications and

third-party solutions.

META-DATA

Metadata can be explained in a few ways:

 Data that provide information about other data.

 Metadata summarizes basic information about data, making finding &

working with particular instances of data easier.

 Metadata can be created manually to be more accurate, or automatically

and contain more basic information.

In short, metadata is important. I like to answer this "what is metadata"

question as such: metadata is a shorthand representation of the data to which

they refer. If we use analogies, we can think of metadata as references to data.

Think about the last time you searched Google. That search started with the

metadata you had in your mind about something you wanted to find. You may

have begun with a word, phrase, meme, place name, slang or something else.

The possibilities for describing things seem endless. Certainly metadata schema

can be simple or complex, but they all have some things in common.

EXAMPLE


A simple example of metadata for a document might include a collection of

information like the author, file size, the date the document was created, and

keywords to describe the document. Metadata for a music file might include the

artist's name, the album, and the year it was released.
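A minimal sketch of such metadata as simple key/value pairs is shown below; the field names and values are illustrative only.

# Minimal sketch of document and music-file metadata as key/value pairs;
# the field names and values are hypothetical.
document_metadata = {
    "author": "J. Dela Cruz",
    "file_size_kb": 482,
    "created": "2021-03-15",
    "keywords": ["budget", "quarterly", "draft"],
}

music_metadata = {
    "artist": "Sample Artist",
    "album": "Sample Album",
    "year_released": 1999,
}

# A search tool can match on the metadata without opening the files themselves.
query = "budget"
if query in document_metadata["keywords"]:
    print("Document matches the search term:", query)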

4 Stages of Data Warehouses

Stage 1: Offline Database

In their earliest stages, many companies have databases. The data is forwarded from the day-to-day operational systems to an external server for storage. Unless extrapolated and manually analyzed, this data sits where it is and does not impact ongoing business functions. Transactions such as loading or processing of data have no effect from an operational standpoint.

Offline Database, lets users search for numbers even without being

connected to the Internet. - 

https://glosbe.com 


 Offline Operational Database: This is the initial stage where data is simply copied to a server from an operational system. It is done so that data loading, processing, and reporting do not affect the performance of the operational system.

Stage 2: Offline Data Warehouse


While not entirely up-to-date, offline Data Warehouses regularly update their

content from existing operational systems. By emphasizing reporting-oriented

data structures, the organized data meets the particular objectives of the Data

Warehouse.

 Offline Data Warehouse: In this stage, all the data warehouses are

updated on a regular time cycle from the operational database to get

actionable business insights.

Stage 3: Real-time Data Warehouse

Real-time data warehouses gather information through event-based triggers in operational systems. Often, these come in the form of transactions such as

airline bookings or bank balances.

 Real-time Data Warehouse: In this stage, data warehouses are updated

based on transaction or event basis. Whenever a transaction takes place

in an operational database, it is updated in the data warehouse.

Stage 4: Integrated Data Warehouse

In the Integrated Data Warehouse, daily activities are passed back to the operational system continuously. Integrated Data Warehouses are the ideal data warehouse stage, with the data not just readily available but also updated and accurate.


 Integrated Data Warehouse: This is the final stage where all the

transactions which are used daily by the organization are passed back into

the operational system. Each transaction that takes place in the

operational database is updated in the warehouse simultaneously.

Accessing Data Warehouses

Storage is a fairly simple choice. You can host your data warehouse on-

premises, in the cloud, or use a hybrid approach. On-premises hosting

is, according to some, on its way out. Cloud hosting is much cheaper and more

flexible because you’re renting space on another’s server. You don’t need to run

maintenance, you can expand and cut back as needed, and there is an ever-

expanding set of features added each year. Bridging the gap between these two

approaches is hybrid hosting, which, as we mentioned before, is the preferred

choice for companies migrating from on-premises to cloud hosting.

To get data into your data warehouse, you need to use a type

of software commonly called ETL software. Extract, transform, load (ETL) is a

process where the data is extracted, made ready for use, then loaded into the

data warehouse.

Of course, data warehouses don’t run themselves. Labor is a significant part of

keeping a data warehouse running because it’s not just a system; it’s a “full-

fledged…architecture” that requires experts to set up and manage.

What is OLAP?


OLAP (Online Analytical Processing) was introduced into the business

intelligence (BI) space over 20 years ago, in a time where computer hardware

and software technology weren’t nearly as powerful as they are today. OLAP

introduced a groundbreaking way for business users (typically analysts) to easily

perform multidimensional analysis of large volumes of business data.

Aggregating, grouping, and joining data are the most difficult types of

queries for a relational database to process. The magic behind OLAP derives

from its ability to pre-calculate and pre-aggregate data. Otherwise, end users

would be spending most of their time waiting for query results to be returned by

the database.

Vendors offer a variety of OLAP products that can be grouped into three

categories: multidimensional OLAP (MOLAP), relational OLAP (ROLAP), and

hybrid OLAP (HOLAP). Here is a breakdown of the differences between them. 

What is ROLAP?

ROLAP stands for Relational Online Analytical Processing. ROLAP stores

data in columns and rows (also known as relational tables) and retrieves the

information on demand through user submitted queries. A ROLAP database can

be accessed through complex SQL queries to calculate information. ROLAP can

handle large data volumes, but the larger the data, the slower the processing

times. 
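As a small illustration of the ROLAP approach, the sketch below runs an on-demand aggregation over relational tables, with SQLite standing in for the relational store; the star-schema tables and figures are illustrative only.

# Minimal sketch of ROLAP-style, on-demand aggregation over relational tables,
# using SQLite as a stand-in; the schema and values are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE fact_sales (item_id INTEGER, quarter TEXT, amount REAL);
    INSERT INTO dim_item VALUES (1, 'TV'), (2, 'Phone');
    INSERT INTO fact_sales VALUES (1, 'Q1', 605), (2, 'Q1', 825),
                                  (1, 'Q2', 680), (2, 'Q2', 952);
""")

# ROLAP computes the aggregate at query time rather than pre-summarizing it.
rows = db.execute("""
    SELECT d.item_name, f.quarter, SUM(f.amount) AS total_sales
    FROM fact_sales f JOIN dim_item d ON d.item_id = f.item_id
    GROUP BY d.item_name, f.quarter
""").fetchall()
print(rows)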

What is MOLAP?

MOLAP stands for Multidimensional Online Analytical Processing. MOLAP

uses a multidimensional cube that accesses stored data through various


combinations. Data is pre-computed, pre-summarized, and stored (a difference

from ROLAP, where queries are served on-demand).

 Its speedy data retrieval makes it the best for “slicing and dicing” operations.

One major disadvantage of MOLAP is that it is less scalable than ROLAP, as it

can handle a limited amount of data.

What is HOLAP?

HOLAP stands for Hybrid Online Analytical Processing. As the name

suggests, the HOLAP storage mode connects attributes of both MOLAP and

ROLAP. Since HOLAP involves storing part of your data in a ROLAP store and

another part in a MOLAP store, developers get the benefits of both. 

With this use of the two OLAPs, the data is stored in both

multidimensional databases and relational databases. The decision to access

one of the databases depends on which is most appropriate for the requested

processing application or type. This setup allows much more flexibility for

handling data. For theoretical processing, the data is stored in a multidimensional

database. For heavy processing, the data is stored in a relational database. 

Problems of Data Warehousing

The problems associated with developing and managing a data warehousing are

as follows:

Underestimation of resources of data loading

Sometimes we underestimate the time required to extract, clean, and load

the data into the warehouse. It may take a significant proportion of the total


development time, although some tools exist that help reduce the time and effort spent on this process.

Required data not captured

In some cases, data that may be very important for the data warehouse's purpose is not captured by the source systems. For example, the date of registration for a property may not be used in the source system, but it may be very important for analysis purposes.

High maintenance

Data warehouses are high-maintenance systems. Any reorganization of the business processes and the source systems may affect the data warehouse, and this results in high maintenance costs.

Data ownership

Data warehousing may change the attitude of end users toward the ownership of data. Sensitive data owned by one department has to be loaded into the data warehouse for decision-making purposes, but sometimes this results in reluctance from that department, which may hesitate to share the data with others.


CHAPTER 10:

DATA QUALITY AND INTEGRATION

Researched and presented by:

Gabotero, Stephanie S.
Tiolo, Michelle Anne M.


What Is a Data Governance?

Data governance is a set of processes and procedures aimed at

managing the data within an organization with an eye toward high-level

objectives such as availability, integrity and compliance with regulations.

Data governance oversees data access policies by measuring risk and

security exposures (Leon, 2007). Data governance is a function that has to be

jointly owned by IT and the business. Successful data governance will require

support from upper management in the firm. A key role in enabling success of

data governance in an organization is that of a data steward.

Data steward 

A person assigned the responsibility of ensuring that organizational

applications properly support the organization’s enterprise goals for data quality.

A good data steward has:

1. a strong interest in managing information as a corporate resource,

2. an in-depth understanding of the business of the organization, and 

3. good negotiation skills.

The Sarbanes-Oxley Act of 2002

 The Sarbanes-Oxley Act of 2002 has made it imperative that

organizations undertake actions to ensure data accuracy, timeliness, and

consistency (Laurent, 2005).


 The Sarbanes-Oxley Act of 2002 is a federal law that established

sweeping auditing and financial regulations for public companies.

 Lawmakers created the legislation to help protect shareholders,

employees and the public from accounting errors and fraudulent financial

practices. Auditors, accountants and corporate officers became

accountable for the new set of rules.

Establishment of a business information advisory committee consisting of

representatives from each major business unit who have the authority to make

business policy decisions can contribute to the establishment of high data quality

(Carlson, 2002; Moriarty, 1996). 

A data governance program needs to include the following:

1. Sponsorship from both senior management and business units

2. A data steward manager to support, train, and coordinate the data

stewards.

3. Data stewards for different business units, data subjects, source systems,

or combinations of these elements.

4. A governance committee, headed by one person, but composed of data

steward managers, executives and senior vice presidents, IT leadership

(e.g., data administrators), and other business leaders, to set strategic

goals, coordinate activities, and provide guidelines and standards for all

enterprise data management activities.


The goals of data governance are:

1. Transparency

2. Increasing the value of data maintained by the organization

Managing data quality

 The importance of high-quality data cannot be overstated.

 The data that serves as the foundation of these systems must be good

data, and if the data are bad—the systems fail.

 High-quality data—that is, data that are accurate, consistent, and available

in a timely fashion—are essential to the management of organizations

today.

 According to a leading provider of technology for data quality and integration, data quality is important to:

o Minimize IT project risk

 Dirty data can cause delays and extra work on

information systems projects, especially those that

involve reusing data from existing systems.

 Make timely business decisions 

o The ability to make quick and informed business

decisions is compromised when managers do not

have high-quality data or when they lack confidence

in their data.


 Ensure regulatory compliance

o Not only is quality data essential for SOX and Basel II (Europe)

compliance, quality data can also help an organization in justice,

intelligence, and antifraud activities.

 Expand the customer base 

o Being able to accurately spell a customer’s name or to accurately

know all aspects of customer activity with your organization will help

in up-selling and cross-selling new business.

Redman (2004) summarizes data quality as “fit for their intended uses in

operations, decision making, and planning.” In other words, this means that data

are free of defects and possess desirable features (relevant, comprehensive,

proper level of detail, easy to read, and easy to interpret). 

Characteristics of Quality Data (Loshin and Russom, 2006):

1. Uniqueness

- Uniqueness means that each entity exists no more than once

within the database, and there is a key that can be used to uniquely

access each entity.

2. Accuracy

- Accuracy has to do with the degree to which any datum correctly

represents the real-life object it models.

3. Consistency


- Consistency means that values for data in one data set

(database) are in agreement with the values for related data in another

data set (database).

4. Completeness 

- Completeness refers to data having assigned values if they need

to have values.

5. Timeliness

- Timeliness means meeting the expectation for the time between

when data are expected and when they are readily available for use

6. Currency

- Currency is the degree to which data are recent enough to be

useful.

7. Conformance

 Conformance refers to whether data are stored, exchanged, or

presented in a format that is as specified by their metadata.

8. Referential integrity

 Data that refer to other data need to be unique and satisfy

requirements to exist


External Data Sources

 Much of an organization's data originates outside the organization, where there is less control over the data sources to comply with expectations of the receiving organization.

Redundant data storage and inconsistent metadata

 Many organizations have allowed the uncontrolled proliferation of

spreadsheets, desktop databases, legacy databases, data marts, data

warehouses, and other repositories of data.

Data Entry Problems

 User interfaces that do not take advantage of integrity controls—such as

automatically filling in data, providing drop-down selection boxes, and

other improvements in data entry control— are tied for the number-one

cause of poor data.


Lack of Organizational Commitment

 For a variety of reasons, many organizations simply have not made the

commitment or invested the resources to improve their data quality.

Data Quality Improvement

Implementing a successful quality improvement program will require the

active commitment and participation of all members of an organization.

Get the Business Buy-In

 Data quality initiatives need to be viewed as business imperatives rather

than as an IT project.

Conduct a Data Quality Audit

 An organization without an established data quality program should begin

with an audit of data to understand the extent and nature of data quality

problems.

Establish a Data Stewardship Program

 As pointed out in the section on data governance, stewards are held

accountable for the quality of the data for which they are responsible.

Improve Data Capture Processes

 As noted earlier, lax data entry is a major source of poor data quality, so

improving data capture processes is a fundamental step in a data quality

improvement program.

For simplicity, we summarize what Inmon recommends only for the original data

capture step:


i. Enter as much of the data as possible via automatic means, not human entry.

ii. Where data must be entered manually, ensure that it is selected from

preset options.

iii. Use trained operators when possible.

iv. Follow good user interface design principles that create consistent screen layouts, easy-to-follow navigation paths, clear data entry masks and formats, minimal use of obscure codes, and so on.

v. Immediately check entered data for quality against data in the database, so use triggers and user-defined procedures liberally to make sure that only high-quality data enter the database; when questionable data are entered, immediate and understandable feedback should be given to the operator, questioning the validity of the data (see the sketch after this list).
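A minimal sketch of point v is shown below, using SQLite as a stand-in for the database; the table, CHECK rule, and trigger are illustrative only.

# Minimal sketch of checking entered data against the database at entry time,
# assuming a SQLite stand-in; the table, CHECK rule, and trigger are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL,
        birth_year  INTEGER CHECK (birth_year BETWEEN 1900 AND 2025)
    );

    -- Trigger gives immediate, understandable feedback on questionable data.
    CREATE TRIGGER check_email BEFORE INSERT ON customer
    WHEN instr(NEW.email, '@') = 0
    BEGIN
        SELECT RAISE(ABORT, 'email address must contain @');
    END;
""")

try:
    db.execute("INSERT INTO customer (email, birth_year) VALUES ('bad-address', 1985)")
except sqlite3.DatabaseError as err:
    print("Rejected at entry time:", err)   # feedback to the operator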

Apply Modern Data Management Principles and Technology

 Powerful software is now available that can assist users with the technical

aspects of data quality improvement.

Apply TQM Principles and Practices

 Data quality improvements should be considered as an ongoing effort and

not treated as one-time projects

Summary of Data Quality

Ensuring the quality of data that enters databases and data warehouses is

essential if users are to have confidence in their systems.


Master Data Management 

If one were to examine the data used in applications across a large

organization, one would likely find that certain categories of data are referenced

more frequently than others across the enterprise in operational and analytical

system.

 Master data management (MDM) 

o refers to the disciplines, technologies, and methods to

ensure the currency, meaning, and quality of reference data

within and across various subject areas (Imhoff and White,

2006).

o Master data can be as simple as a list of acceptable city

names and abbreviations.


o MDM can also be realized in specialized forms.

3 popular architectures

1. Identity Registry Approach

o the master data remain in their source systems, and applications refer to the registry to determine where the agreed-upon source of particular data is located.

2. Integration Hub Approach


o data changes are broadcast (typically asynchronously) through a

central service to all subscribing databases.

3. Persistent Approach

o one consolidated record is maintained, and all applications draw on

that one “golden record” for the common data.

DATA INTEGRATION: OVERVIEW

Data Integration

 It is the process of combining data from different sources into a single,

unified view.

 In a typical data integration process, the client sends a request to the

master server for data. The master server then intakes the needed data

from internal and external sources. The data is extracted from the

sources, then consolidated into a single, cohesive data set. This is served

back to the client for use.

 The end location needs to be flexible enough to handle lots of different

kinds of data at potentially large volumes.

Other ways to consolidate data are as follows (White, 2000):

Application Integration

 It creates connectors between two or more applications so they can work

with one another.


 Each individual application has a particular way it emits and accepts data,

and this data moves in smaller volumes.

 You only need to enter your data into a system once; then your information will flow automatically into all your other connected systems, and they will take action automatically.

Business Process Integration

 Is a crucial technique for supporting inter-organizational business

interoperability.

 Achieved by tighter coordination of activities across business processes

(e.g., selling and billing) so that applications can be shared and more

application integration can occur.

 With the help of BPI, companies can digitally connect, communicate, and collaborate with customers, suppliers, partners, service vendors, and all other players in the supply chain.

User interaction integration

 Achieved by creating fewer user interfaces that feed different data

systems.

Three techniques form the building blocks of any data integration

approach:

1. Data Consolidation


 It is the classic data integration process leveraging ETL technology; the two terms are sometimes used interchangeably.

 It involves combining data from disparate sources, removing its

redundancies, cleaning up any errors, and aggregating it within a

single data store like a data warehouse.

 The main idea of data consolidation is to provide end users with all critical data in one place for the most detailed reporting and analysis possible.

2. Data Federation

 It is a software process that allows multiple databases to function

as one.

 This provides a single source of data for front-end applications without actually bringing the data all into one physical, centralized database.


 It vastly simplifies querying and analyzing information, and it

eliminates the need for users to directly access source systems,

which reduces the challenges involved with administering security

access to multiple systems.

 A main advantage of the federation approach is access to current data.

3. Data Propagation

 It is the use of the application to replicate the data from one location (source) to another location (destination).

 It is supported by Enterprise Application Integration (EAI) and

Enterprise Data Replication (EDR).

 This is commonly used for real-time business transactions. EDR

sends massive amounts of data between the databases, rather than between applications, using database triggers and logs.

 The major advantage of the data propagation approach to data

integration is the near-real-time cascading of data changes

throughout the organization.


Characteristics of Data After ETL:

Reconciled data:

 Detailed
 Historical
 Normalized
 Comprehensive
 Timely
 Quality controlled

Operational data:

 Transient
 Not normalized
 Generally restricted in scope to a particular application
 Often of poor quality

Data Reconciliation Process

 It is responsible for transforming operational data to reconciled data.

 It helps you extract accurate and reliable information about the state of the industry process from raw measurement data.

 It also helps you produce a single consistent set of data representing the most likely process operation.

Data reconciliation occurs in two stages during the process of filling an

enterprise data warehouse:

1. During an initial load, when the EDW is first created.

2. During subsequent updates to keep the EDW current and/or to expand it.


Data Reconciliation Process

1. Mapping and Metadata Management

 This mapping could be shown graphically or in a simple matrix

with rows as source data elements, columns as data warehouse

table columns, and the cells as explanations of any reformatting,

transformations, and cleansing actions to be done. 

2. Extract

 is the act or process of retrieving data out of data sources for

further data processing or data storage.

The two generic types of data extracts are:

 Static Extract


 all the data currently available in the source system is

extracted.

 Incremental Extract

 the data which have changed since the last data extraction took place are extracted (see the sketch after this list).
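A minimal sketch contrasting the two extract types is shown below, using SQLite as a stand-in for the source system; the table and the stored "last extract" timestamp are illustrative only.

# Minimal sketch contrasting a static extract with an incremental extract,
# assuming a SQLite source and a hypothetical "orders" table.
import sqlite3

source = sqlite3.connect("orders_source.db")

# Static extract: pull everything currently available in the source table.
full_snapshot = source.execute("SELECT * FROM orders").fetchall()

# Incremental extract: pull only rows changed since the previous extraction.
last_extracted_at = "2024-01-31 23:59:59"   # normally read from ETL metadata
changed_rows = source.execute(
    "SELECT * FROM orders WHERE updated_at > ?", (last_extracted_at,)
).fetchall()

print(len(full_snapshot), "rows in the static extract")
print(len(changed_rows), "rows in the incremental extract")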

3. Cleanse

 involves detecting such errors and repairing them and

preventing them from occurring in the future.

 uses pattern recognition and AI techniques to upgrade data

quality.

 Fixing errors like misspellings, erroneous dates, incorrect field

usage, mismatched addresses, missing and duplicate data.

 Also: decoding, reformatting, time stamping, conversion, key

generation, merging, error detection/logging, locating missing

data.

4. Load And Index

 is to load the selected data into the target data warehouse and

to create the necessary indexes.

The two basic modes for loading data to the target EDW:

 Refresh mode

 Bulk rewriting of target data at periodic intervals.


 Update mode

 only changes in source data are written to data

warehouse.

Data transformation

 is at the very center of the data reconciliation process.

 involves converting data from the format of the source operational

systems to the format of the enterprise data warehouse.

 the goal of data transformation is to convert the data format from the

source to the target system.

Data transformation functions

1. Record-level functions

 The most important record-level functions are selection, joining, normalization, and aggregation (see the sketch after this list).

 Selection

 The process of partitioning data according to predefined criteria.

 Joining

 The process of combining data from various sources into a single

table or view.

 Normalization

 Is the process of decomposing relations with anomalies to produce smaller, well-structured relations.


 Aggregation

 is the process of transforming data from a detailed level to a

summary level.
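A minimal sketch of selection, joining, and aggregation is shown below, using SQLite as a stand-in; the schema and figures are illustrative only (normalization is a design activity and is not shown here).

# Minimal sketch of record-level functions (selection, joining, aggregation),
# using SQLite as a stand-in; the schema and values are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'NCR'), (2, 'Visayas');
    INSERT INTO orders VALUES (10, 1, 500), (11, 1, 250), (12, 2, 900);
""")

# Selection: partition the data according to a predefined criterion.
ncr_customers = db.execute(
    "SELECT * FROM customer WHERE region = 'NCR'").fetchall()

# Joining: combine data from several sources into a single view.
joined = db.execute("""
    SELECT c.region, o.amount
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id
""").fetchall()

# Aggregation: transform detailed data to a summary level.
summary = db.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id
    GROUP BY c.region
""").fetchall()

print(ncr_customers, joined, summary, sep="\n")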

2. Field-level function

 converts data from a given format in a source record to a different format in the target record (see the sketch after this list).

Two types of field-level function

 Single-field transformation - converts data from a single source

field to a single target field.

a. Basic field transformation – in general, a transformation that translates data from its old form to a new form.

b. Algorithmic transformation – uses a formula or logical expression to transform the data.


c. Table lookup – it uses a separate table keyed by source record code.

 Multi-field transformation 

- converts data from one or more source fields to one or

more target fields.

- is very common in data warehouse applications.

- may involve more than one source record and/or more than

one target record.

Two types of multi-field transformation:

a. Many sources to one target


b. One source to many targets
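A minimal sketch of these field-level transformations in Python is shown below; the codes, formula, and field names are illustrative only.

# Minimal sketch of field-level transformations; all names and values are hypothetical.

# Basic single-field transformation: translate a value from old form to new.
def to_upper_name(name: str) -> str:
    return name.strip().upper()

# Algorithmic transformation: apply a formula to derive the target field.
def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32) * 5.0 / 9.0

# Table lookup: a separate table keyed by the source record code.
PRODUCT_CODE_LOOKUP = {"TV": "Television", "ST": "Stereo", "SP": "Speaker"}

# Multi-field transformation (many sources to one target): combine fields.
def full_address(street: str, city: str, postal_code: str) -> str:
    return f"{street}, {city} {postal_code}"

source_record = {"name": " juan dela cruz ", "temp_f": 98.6, "code": "TV",
                 "street": "Biglang Awa St.", "city": "Caloocan", "zip": "1400"}

target_record = {
    "name": to_upper_name(source_record["name"]),
    "temp_c": round(fahrenheit_to_celsius(source_record["temp_f"]), 1),
    "product": PRODUCT_CODE_LOOKUP[source_record["code"]],
    "address": full_address(source_record["street"], source_record["city"],
                            source_record["zip"]),
}
print(target_record)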


CHAPTER 11:

DATA AND DATABASE


ADMINISTRATION

Researched and presented by:

Garcia, Janah G.
Notario, Don
Trias, Angela B.


The roles of data and the database administrators 

A data administrator is the one responsible for managing the data that is relevant to be stored in the database. A data administrator is more of a business role with some technical responsibilities (also called a data analyst); it is a high-level function responsible for the overall management of data resources in an organization, including maintaining corporate-wide definitions and standards. The head of data administration is a senior-level person who is required to have a high level of both managerial and technical skill. A data administrator is focused on the business but should also understand database technology.

Responsibilities:

 Filters out relevant data

 Monitors the data flow throughout the organization

 Designs concept-based data models

 Analyzes and breaks down the data so it can be understood by non-technical people

A database administrator (DBA) is a person who has knowledge of database technology and controls the design and use of the organization's databases. The DBA provides the necessary technical support for implementing the database, covering the design, development, testing, and operational phases. A DBA does not need to be a business person, but must understand the business well enough to administer the database effectively.


Responsibilities:

1. Deciding the hardware device- the DBA is responsible for deciding which hardware is suitable for the company, considering its cost, performance, and efficiency.

2. Managing data integrity- the DBA needs to protect the data from unauthorized use.

3. Deciding data recovery and backup methods- the DBA needs to back up the entire database in case of a breach, and also to recover the data in case of loss.

4. Tuning database performance- upgrading the performance of the database to make it faster and more convenient for all authorized users.

5. Capacity issues- the DBA needs to know the maximum limit for storing data.

6. Database design- the DBA is responsible for physical design, external model design, and integrity control.

7. Database accessibility- the DBA writes subschemas to secure database accessibility, so that only authorized users can access the data they are entitled to.

8. Deciding validation checks on data- the DBA needs to validate and check the data to keep it accurate and consistent.

9. Monitoring performance- the DBA monitors CPU and memory usage to make sure the system works well.

10. Deciding the content of the database- the DBA decides the structure of the database files.

11. Providing help and support to users- the DBA is also responsible for helping users who do not know how to operate the system.

12. Database implementation- the DBA implements the database system before anyone can use it.

13. Improving query processing performance- queries made by users need to be performed speedily, so the DBA improves query processing by tuning performance.

The open-source movement and database management 

The open source movement is a term that refers to open source software. Open source software is code that people can modify and share because its design is publicly accessible to anyone. Source code is the part of software that most computer users never see. Programmers who can access a computer program's source code can improve the program by adding features or fixing broken parts. Examples of open source software are LibreOffice and the GNU Image Manipulation Program.

LibreOffice is free and open source; it contains applications for word processing, spreadsheets, presentations, database management, and graphic editing. It is compatible with other office productivity suites such as Microsoft Office, and it runs on Microsoft Windows, macOS, and Linux.


Open source software is often cheaper, more flexible, and has more longevity because it is developed by a community rather than a single author or company.

What is the value of open source? 

The most common reasons why people choose open source:

1. Peer review- the open source code is free and accessible to all, which is why it is actively checked and improved by peer programmers.

2. Transparency- open source helps to track and check whether there are any changes in the code.

3. Reliability- the open source code is constantly updated through an active open source community.

4. Flexibility- it helps to solve problems in your business with the help of the open source community and peer programmers.

5. Lower cost- free and accessible.

6. No vendor lock-in- because it is free to use, you can take your open source code anytime and anywhere.

7. Open collaboration- active open source communities can help you find new solutions to a problem.

A database management system is a software package that generally manipulates the data itself, as well as the data format, field names, and record and file structures. 

Components of a DBMS


1. Storage engine- it is used to store data; it can use additional components to store the data. 

2. Metadata catalog- sometimes called a system catalog or database

dictionary, the DBMS uses this to verify the user who request for the data.

The metadata catalog can include information about database objects,

schemas, programs, security, performance, communication and other

environmental details about the databases it manages.

3. Database access language- the DBMS must provide an API to access the data.

*An API (Application Programming Interface) is software that allows two applications to communicate (a middleman). For example, when you sign into your Facebook account using your phone, the mobile application tells the API to retrieve your Facebook account, and Facebook then returns your account information to the mobile application.

4. Lock manager- locks are required to make sure that multiple users can't access and change the same data simultaneously.

5. Log manager- records all data changes to make sure that the records are accurate and efficient. The DBMS uses the log manager during shutdown and startup to ensure data integrity.

6. Data utilities- include reorganization, run stats, backup and copy, recover, integrity check, load data, unload data, and repair database.

Benefits of using a DBMS


Central storage and management of data within the DBMS provides the

following:

 data abstraction and independence;

 data security;

 a locking mechanism for concurrent access;

 an efficient handler to balance the needs of multiple applications using the

same data;

 the ability to swiftly recover from crashes and errors;

 strong data integrity capabilities;

 logging and auditing of activity;

 simple access using a standard API; and

 uniform administration procedures for data.

An example of this is commercial airlines, which rely on a DBMS for data-intensive applications such as scheduling flight plans and managing customer flight reservations.

Managing data security 

Data security is sometimes called computer security, system security, or information security. Data security comprises the measures that need to be taken to prevent any unauthorized access to the information in computers, databases, or on the web. It also prevents corruption or modification of that information.

Data Security Protecting Against


 Security hackers: people who intend to steal, protest, or gather information in a computer system.

 Malware: a shortened name for "malicious software"; it is used to gain access to files even when they can be opened only by an authorized user, and it can also cause damage to a computer or computer system.

 Computer viruses: a form of malware that uses written code so the virus can spread from one computer or computer system to another; this can damage the computer and the data stored on it.

The 2017 WannaCry Ransomware Attack Was One Of The Most Widespread

Computer Infections Ever, And WannaCry Attacks Continue Today.

 The WannaCry ransomware epidemic of 2017 disrupted hospitals, banks

and communications companies worldwide.

 Four years later, cybercriminals renewed efforts to deploy WannaCry

ransomware during the COVID-19 pandemic.

 Companies can take steps to prevent infection, with software updates

being most important.

WannaCry ransomware is an example of crypto ransomware, a type of malicious software used by cybercriminals to extort money. WannaCry takes your data hostage: it either locks you out of your computer so you can't use it (locker ransomware) or encrypts your valuable files so you can't open or read them (crypto ransomware). WannaCry targets computers whose operating system is Microsoft Windows. 


Data security management is used to ensure that the organization's data is not accessed or corrupted by unauthorized users. A data security management plan includes planning, implementation of the plan, and verification and updating of the plan's components.

Here are some basics of data security that are often included in any data security

management plan:

1. Backups- ensure that you have another copy of all the file to easily

recovery in case that there might be happen like breach, computer viruses

or damage in the computer.

2. Data masking- which some sensitive data or information is obscured

3. Data Erasure- a method when all the data in the computer is wiped clean

or overwritten when the equipment is sold or discarded

4. Encrypted-  the process which the data is scrambled and encoded, only

the another entity can decode the data using encryption key

5. Authentication- using username and password of every user to identify

who access the computer system

6. One time password-  the password that only work in one network session

or transaction

7. Electronic security token- need to have a physical device that serve as

electronic key and a password to access the data or information

8. Two factor authentication-  requires a two method authentication

401 | P a g e
UNIVERSITY OF CALOOCAN CITY
Biglang Awa St. Grace Park East, Caloocan City

9. Transparent data encryption (TDE) - a method that encrypts the actual database files, so an intruder who accesses the data from a different server cannot read or use it.

10. Cloud access security broker - software that sits between the users of a cloud service and the cloud applications; it monitors activity and ensures that the user's security policies are followed.

11. Big data security - securing extremely large amounts of data adds another level of security through dedicated security tools; Hadoop, for example, can be used to store and process extremely large data sets.

12. Payment security, mobile app security, web browser security, email security - using special security features that work to prevent unauthorized access.
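To make the encryption idea in item 4 concrete, here is a minimal sketch in Python. It assumes the third-party cryptography package is installed; the key handling, the sample record, and all variable names are illustrative only, not part of any particular product.

    from cryptography.fernet import Fernet

    # Generate a symmetric key. In practice the key must be stored securely,
    # for example in a key management service, never alongside the data.
    key = Fernet.generate_key()
    cipher = Fernet(key)

    record = b"card_number=4111111111111111;owner=Juan Dela Cruz"

    token = cipher.encrypt(record)   # scrambled, unreadable ciphertext
    print(token)

    # Only a holder of the same key can decode the data again.
    print(cipher.decrypt(token))

Without the key, the stored token is meaningless to an attacker, which is the property that items 4 and 9 rely on.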

Database software and data security features 

Database software is used to create, edit, and maintain database files and

records, enabling easier file and record creation, data entry, data editing,

updating, and reporting. The software also handles data storage, backup and

reporting, multi-access control, and security. Strong database security is

especially important today, as data theft becomes more frequent. Database

software is sometimes also referred to as a “database management system”

(DBMS). It’s primarily used for storing, modifying, extracting and searching for

information within a database. Database software is also used to


implement cybersecurity measures to protect against malware, viruses and other

security threats.

Database software makes data management simpler by enabling users to store

data in a structured form and then access it. It typically has a graphical interface

to help create and manage the data and, in some cases, users can construct

their own databases by using database software.

A database typically requires a comprehensive database software program

known as a database management system (DBMS). A DBMS serves as an

interface between the database and its end users or programs, allowing users to

retrieve, update, and manage how the information is organized and optimized. A

DBMS also facilitates oversight and control of databases, enabling a variety of

administrative operations such as performance monitoring, tuning, and backup

and recovery.

Most database software includes a graphical user interface (GUI) consisting of

structured fields and tabular forms that give users a centralized view of the data

present in a database and the tools to manipulate and query it. Structured Query

Language (SQL) commands are also typically used to interact with databases

through the software. Administrators input SQL queries to prompt the system to

perform an action, such as retrieving a specific set of data. However, there are

also databases that use other means for retrieving information in addition to SQL.
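As a small illustration of issuing SQL through database software, the sketch below uses Python's built-in sqlite3 module as the DBMS; the table and column names are invented for the example.

    import sqlite3

    # An in-memory SQLite database stands in for a full database server.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, course TEXT)")
    conn.execute("INSERT INTO student (name, course) VALUES (?, ?)", ("Ana", "BSIT"))
    conn.commit()

    # An SQL query prompts the system to retrieve a specific set of data.
    for row in conn.execute("SELECT id, name FROM student WHERE course = ?", ("BSIT",)):
        print(row)
    conn.close()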

The most widely-used databases consist of a basic set of columns and rows that

display information retrieved using SQL. However, more complex software has


been developed in recent years to accommodate the massive amounts of unique

data collected by organizations, especially enterprises. These tools are multi-

layered, use a variety of query languages and support more storage formats,

such as XML.

Database software is available both as a commercial product and open

source software. Commercial options often have the advantage of vendor

support. While open-source software may lack this support, it makes up for that with greater customization and free downloads.

Database software exists to protect the information in the database and ensure

that it’s both accurate and consistent. Its functions include storage, backup and

recovery, and presentation and reporting. It can also help your team with multi-

user access control, security management, and database communication.

Database Challenges

 Absorbing significant increases in data volume

The explosion of data coming in from sensors, connected machines, and

dozens of other sources keeps database administrators scrambling to

manage and organize their companies’ data efficiently.

 Ensuring data security

Data breaches are happening everywhere these days, and hackers are

getting more inventive. It’s more important than ever to ensure that data is

secure but also easily accessible to users.

 Keeping up with demand


In today’s fast-moving business environment, companies need real-time

access to their data to support timely decision-making and to take advantage

of new opportunities.

 Managing and maintaining the database and infrastructure

Database administrators must continually watch the database for problems

and perform preventative maintenance, as well as apply software upgrades

and patches. As databases become more complex and data volumes grow,

companies are faced with the expense of hiring additional talent to monitor

and tune their databases.

 Removing limits on scalability

A business needs to grow if it’s going to survive, and its data management

must grow along with it. But it’s very difficult for database administrators to

predict how much capacity the company will need, particularly with on-

premises databases.

 Ensuring data residency, data sovereignty, or latency requirements

Some organizations have use cases that are better suited to run on-premises.

In those cases, engineered systems that are pre-configured and pre-

optimized for running the database are ideal. Customers achieve higher

availability, greater performance and up to 40% lower cost with Oracle

Exadata, according to Wikibon's recent analysis.

BENEFITS OF DATABASE SOFTWARE


 Data availability: Traversing through large stores of data in a single

database can be time-consuming and labor-intensive. Database

software makes this information readily available by providing the ability

to input queries to direct you to the exact data you’re searching for.

 Minimized redundancy: Users commonly work on the same projects

within multiple locations in a database. This can end up creating

multiple copies of the same file, leading to data redundancy. This was

particularly an issue with file-based data management systems, the

predecessor to database software. This can cause confusion when

searching for and organizing data and consumes valuable storage

space. Database software reduces redundancy by controlling

information stored in a variety of locations.

 Improved data security: Security should always be a top concern

when it comes to stored data. Database software can authorize or

block user access to views of protected data within an application, also called subschemas. It can also give access to specific

functions of a database depending on assigned roles. For example,

only system administrators and others with high-level access are able

to modify the database or alter user access. Authorizing access

typically involves using unique passwords for each user.

 Backup and Recovery: Database software has the ability to

regularly backup the data from a database and store it in a safe

location in the event of an outage or data breach. It can then use these


backups to automatically recover and restore the database to its

previous state.

 Analytics: Database software can collect valuable analytics, such as

what information users access, the frequency at which they access it,

potential security threats and other hiccups in the system. This

information is then visualized through the GUI so administrators can

easily gain insights and make data-driven decisions to improve

efficiency.

USER ROLES

Part of what allows database software to improve efficiency and maintain security

is the ability to assign roles to users that authorize or restrict access to certain

portions of a network. This ensures that users only have access to the assets

they need to do their job. The primary roles include the following:

 Administrators: This role has the highest level of access to the

database. They are able to view and manage the most sensitive

information, modify other users’ access, alter security protocols and

more.

 Programmers: In order to build and

modify applications, programmers require special permissions. They

can install new applications, modify application functionality and in

some cases remove them altogether.


 End users: These users typically have the most restricted access. At most, they can retrieve, update, share, and delete information only in the applications that are essential to their jobs. In some cases, they are confined to read-only access, which allows them to view information but not to manipulate or delete it.

 Applications and programs: Aside from human users, programs also

need to access databases to retrieve and transmit information. Setting

permissions for how these programs access data is also an important

aspect of network security. The level of permissions for programs can

mirror those of different users stated above.

USER INTERACTION

 Building tables and forms: In order to add and organize files in a

database, database software is used to create fields and data entry

forms. When new files are added, they are indexed according to

programmer-defined parameters, such as name, type and length. Data

entry forms are created to input this information for each file. This

information is used by the software to determine where files are stored

and how they can be accessed.

 Updating and editing data: After data is stored, it will likely need to be

regularly updated or edited with new information. Database software

offers an ‘Edit’ mode to make these changes. However, each file will


have restrictions on who can edit data according to assigned user

permissions.

 View and query data: Besides storing data, one of the primary uses of

database software is to quickly and easily find relevant information.

Queries are used to search through a database and retrieve data.

 Reporting: Most database software has the capability to track

database activity. It also has features that allow users to pull this

information into reports that can be used to make data-driven business

decisions.

TYPES OF DATABASE SOFTWARE

There are multiple different types of database software that are typically broken

down into six categories:

 Analytical database software: This tool is used to gather and

compare data to assess the performance of different assets, such as

website traffic, employee productivity or business goals.

 Data warehouse software: This software acts as a large repository

that can pull and store data from a variety of databases. Data sets from

these different databases can then be compared to find inconsistencies

to improve data integrity.

 Distributed database software: Administrators can use this tool to

manage information from multiple databases from a centralized system.


 End user database software: Designed for the smaller scale, end

user database software stores information used by single users.

 External database software: This software acts as a central location

for multiple users to access the same information, typically over

the internet.

 Operational database software: Users can use this tool to manage or

modify data in real time.

TYPES OF DATABASE SOFTWARE TECHNOLOGY

 Relational database management system (RDBMS): this traditional

database technology can be applied to most use cases, and as a

result, is a very popular option. Information is presented in rows and

columns and allows for easy querying using SQL. RDBMS are mostly

used to store relatively simple information, such as contact information

and user identities. This technology is also highly scalable making it a

good option for large organizations. It can be hosted on-premises, in

the cloud and on hybrid-cloud systems.

 NoSQL: This is the second most common database technology next to

RDBMS. The name of this technology stands for “not only SQL.”

Standard SQL language can be used but it also supports a variety of

data models, such as key-value, document, columnar and graph


formats, as opposed to just rows and columns. The purpose of this

design is to allow it to handle evolving data structures.

 In-memory database management system (IMDBMS): Rather than

focusing on a variety of use cases or data structures, the main goal

of in-memory database tools is to provide fast response times and

improved performance.

 Columnar database management system (CDBMS): This technology

was mainly designed for data warehouses. These systems typically

store large amounts of very similar data. So a data structure composed

of mostly columns is a more straightforward solution to maintaining a

database.

 Cloud-based database management system: Cloud database

technology is gaining popularity as many organizations are shifting to a

cloud-based or hybrid cloud infrastructure. They are highly scalable

and maintenance is often provided by the cloud service.

ON-PREMISE VS. HOSTED DATABASE SOFTWARE

Database software can be delivered in two ways depending on an organization’s

infrastructure. On-premise software is deployed at an organization’s physical

location on hardware-based servers. It’s typically managed by the company’s

internal IT department. On-premise database software generally allows for more

customization.


The other option is cloud-hosting delivered as SaaS. One large benefit

depending on an organization’s resources is that the software is typically

maintained by the service provider, freeing up IT teams to focus on other efforts.

It is also more scalable than on-premise software, as it’s not limited by hardware.

TOP DATABASE SOFTWARE VENDORS

Database software is used for a number of reasons across many industries.

Because they have so many uses, there are dozens of database software

programs available. Here are a few of the most popular:

Microsoft SQL Server: Microsoft’s SQL server is one of the oldest players in the

game, first released in 1989. It’s mainly used for Windows-based systems but

also supports Linux operating systems (OS).

Oracle RDBMS: This tool is one of the most popular database software options

for enterprise organizations as it can support large databases but maintains good

performance. It can support Windows, Linux and UNIX systems

IBM DB2: IBM DB2 was also an early contender in the database software space,

introduced in 1983. It’s praised for its simple deployment, installation and

operation. It also supports Windows, Linux and UNIX systems.

Altibase: This is an open-source database software solution but is also a high

performing, enterprise-grade tool. It uses an in-memory database to offer high

speeds and is one of the few solutions that provides scale-out technology and

sharding.

MySQL: MySQL is an open-source relational database tool. It’s common for web

hosting providers to bundle MySQL with their offerings making it a popular tool


for web developers. It can handle robust sets of data but its relatively simple

deployment and management make it a good option for smaller organizations

and independent web developers as well.

AmazonRDS: As an offering from Amazon Web Services (AWS), Amazon

Relational Database Service (AmazonRDS) is a cloud-based database-as-a-

service (DBaaS). It offers high scalability, dedicated secure connections and it

creates and stores backups automatically.

SQL Developer: This tool was built with flexibility in mind. It can integrate with a

number of other database tools and supports queries in a variety of formats,

including XML, HTML, PDF, or Excel.

Knack: Released in 2010, Knack is a relatively new database software tool. It’s

another DBaaS offering that is easy to use. It allows users to structure, connect

and extend data without the need for any coding. It’s already gained a notable

portfolio of clients, such as Spotify, Capital One and Intel.

Using databases to improve business performance and decision-making

Using databases and other computing and business intelligence tools, organizations can now leverage the data they collect to run more efficiently,

enable better decision-making, and become more agile and scalable. Optimizing

access and throughput to data is critical to businesses today because there is

more data volume to track. It’s critical to have a platform that can deliver the

performance, scale, and agility that businesses need as they grow over time.

Databases provide a significant boost to business capabilities: by automating expensive, time-consuming manual processes, they free up business users to become more proactive with their data. By having direct control over the ability to

create and use databases, users gain control and autonomy while still

maintaining important security standards.

How autonomous technology is improving database management

Self-driving databases use cloud-based technology and machine learning to

automate many of the routine tasks required to manage databases, such as

tuning, security, backups, updates, and other routine management tasks. With

these tedious tasks automated, database administrators are freed up to do more

strategic work. The self-driving, self-securing, and self-repairing capabilities of

self-driving databases are poised to revolutionize how companies manage and

secure their data, enabling performance advantages, lower costs, and improved

security.

What is database security?

Database security refers to the range of tools, controls, and measures designed

to establish and preserve database confidentiality, integrity, and availability. 

Database security is a complex and challenging endeavor that involves all

aspects of information security technologies and practices. It’s also naturally at

odds with database usability. The more accessible and usable the database, the


more vulnerable it is to security threats; the more invulnerable the database is to

threats, the more difficult it is to access and use.

Database security must address and protect the following:

 The data in the database

 The database management system (DBMS)

 Any associated applications

 The physical database server and/or the virtual database server and the

underlying hardware

 The computing and/or network infrastructure used to access the database

Encryption

When data is encrypted, it is transformed using an algorithm to make it

unreadable to anyone without the decryption key. The general idea is to make

the effort of decrypting so difficult as to outweigh the advantage to a hacker of

accessing the unauthorized data. There are two situations where data encryption

can be deployed: data in transit and data at rest. In a database context, data “at

rest” encryption protects data stored in the database, whereas data “in transit”

encryption is used for data being transferred over a network.

Encrypting data at rest is undertaken to prohibit “behind the scenes” snooping for

information. When the data at rest is encrypted, even if a hacker surreptitiously

gains access to the data behind the scenes, without the decryption key the data

is meaningless. Data at rest encryption most commonly is supported by using


built-in functions, a DBMS feature such as Oracle Transparent Data Encryption,

or through an add-on encryption product.

Encrypting data in transit protects against network packet sniffing. If the data is

encrypted before it is sent over the network and decrypted upon receipt at its

destination, it is protected along its journey. Anyone nefariously attempting to

access the data while en route will receive only encrypted data. And again,

without the decryption key, the data cannot be deciphered. Data in transit

encryption most commonly is supported using DBMS system parameters and

commands or through an add-on encryption product.

Label-Based Access Control

A growing number of DBMSs offer label-based access control (LBAC), which

delivers more fine-grained control over authorization to specific data in the

database. With LBAC, it is possible to support applications that need a more

granular security scheme. LBAC can be set up to specify who can read and

modify data in individual rows and/or columns.

LBAC is not for every application; it is geared more for top-secret, governmental,

and similar types of data. Setting up such a security scheme is virtually

impossible without LBAC. 

Any attempted access to a protected column when the LBAC credentials do not

permit that access will fail. If users try to read protected rows not allowed by their

LBAC credentials, the DBMS simply acts as if those rows do not exist. This is


important because sometimes even the knowledge that the data exists (without

being able to access it) must be protected.

Data Masking

Data masking is the process of protecting sensitive information in databases from

inappropriate visibility by replacing it with gibberish or realistic but not real data.

Protecting sensitive data using data masking can prevent fraud, identity theft,

and other types of criminal activities. 

A good data masking solution should offer the ability to mask using multiple

techniques. Common techniques include substitution, shuffling, number and data

variance, nulling out, encryption, and table-to-table synchronization. Data

masking is supported by many DBMS offerings as well as by third-party

products. 
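For illustration, the sketch below applies one common masking technique, substitution, to a card number; the field format and function name are assumptions made for the example, not features of any particular DBMS or masking product.

    import re

    def mask_card_number(value: str) -> str:
        # Substitution masking: keep only the last four digits visible.
        digits = re.sub(r"\D", "", value)
        return "X" * (len(digits) - 4) + digits[-4:]

    print(mask_card_number("4111-1111-1111-1111"))   # XXXXXXXXXXXX1111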

Staying Up-to-Date

Be sure to keep up-to-date on the latest security requirements and capabilities of

your DBMS. Understand what is available to you and what you may need to

augment with additional tools. 

5. Database back-up and recovery 

Database Backup 

A database backup is stored data, that is, a copy of the data. It is a safeguard against unexpected data loss and application errors and protects the database against data loss. If the original data is lost, it can be reconstructed from the backup.

The backups are divided into two types, Physical Backup and Logical Backup 

1. Physical backups 

Physical backups are backups of the physical files used in storing and recovering your database. A physical backup is a copy of the files that store database information, kept in some other location such as disk or offline storage like magnetic tape. Physical backups are the foundation of the recovery mechanism in the database and provide the minute details of transactions and modifications to the database.

2. Logical backup 

Logical Backup contains logical data which is extracted from a database. It

includes backup of logical data like views, procedures, functions, tables, etc. It is

a useful supplement to physical backups in many circumstances but not a

sufficient protection against data loss without physical backups, because logical

backup provides only structural information. 

Importance Of Backups

Planning and testing backups helps protect against failure of media, the operating system, software, and any other kind of failure that causes a serious data crash. It also determines the speed and success of the recovery.

Methods of Backup 

The different methods of backup in a database are: 


Full Backup - This method takes a lot of time as the full copy of the database is

made including the data and the transaction records. 

Transaction Log - Only the transaction logs are saved as the backup in this

method. To keep the backup file as small as possible, the previous transaction

log details are deleted once a new backup record is made. 

Differential Backup - This is similar to a full backup in that it stores both the data and the transaction records. However, only the information that has changed since the last full backup is saved. Because of this, differential backups lead to smaller files.
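As a small, concrete illustration of a full backup, the sketch below uses the online backup API of Python's built-in sqlite3 module (available in Python 3.7 and later); the file names and table are placeholders.

    import sqlite3

    # Source database; normally this would already exist on disk.
    src = sqlite3.connect("inventory.db")
    src.execute("CREATE TABLE IF NOT EXISTS item (id INTEGER PRIMARY KEY, name TEXT)")
    src.commit()

    # Full backup: copy every page of the source database into the backup file.
    dst = sqlite3.connect("inventory_backup.db")
    with dst:
        src.backup(dst)
    dst.close()
    src.close()

Transaction-log and differential backups depend on features of the specific DBMS and are not shown here.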

Common causes of Failures in a Database:

1. System Crash 

A system crash occurs when there is a hardware or software failure or an external factor such as a power failure. Data in secondary memory is generally not affected when the system crashes, and checkpoints limit how much committed work has to be redone after such a failure.

2. Transaction Failure 

A transaction failure affects only a few tables or processes and is caused by logical errors in the code, or by system errors such as deadlock or the unavailability of the system resources needed to execute the transaction.

3. Network Failure 


A network failure occurs when the communication network connecting a client-server configuration or a distributed database system fails.

4. Disk Failure 

Disk Failure occurs when there are issues with hard disks like formation of

bad sectors, disk head crash, unavailability of disk etc. 

5. Media Failure - Catastrophic Event 

Media failure is the most dangerous failure because it takes more time to recover from than any other kind of failure. A disk controller or disk head crash is a typical example of media failure, as are natural disasters like floods, earthquakes, and power failures that damage the data.

6. User Error 

Normally, user error is the biggest cause of data destruction or corruption

in a database. To rectify the error, the database needs to be restored to the point

in time before the error occurred. 

Redundancy 

Data redundancy is a condition created within a database or data storage

technology in which the same piece of data is held in two separate places. This

can mean two different fields within a single database, or two different spots in

multiple software environments or platforms. Whenever data is repeated, this

basically constitutes data redundancy. This can occur by accident, but is also

done deliberately for backup and recovery purposes. 

Hardware redundancy 


Hardware redundancy is achieved by providing two or more physical copies of a

hardware component. When other techniques, such as use of more reliable

components, manufacturing quality control, test, design simplification, etc., have

been exhausted, hardware redundancy may be the only way to improve the

dependability of a system. 

What Is Recovery? 

Recovery is the process of restoring a database to the correct state in the

event of a failure. It ensures that the database is reliable and remains in

consistent state in case of a failure. 

Database recovery can be classified into two parts; 

1. Rolling Forward applies redo records to the corresponding data blocks. 

2. Rolling Back applies rollback segments to the datafiles. It is stored in

transaction tables. 

There are two methods that are primarily used for database recovery. These are:

 Log based recovery - In log-based recovery, logs of all database transactions are stored in a secure area so that, in case of a system failure, the database can recover the data. All log information, such as the time of the transaction and its data, should be stored before the transaction is executed (a simplified sketch follows after this list).

 Shadow paging - In shadow paging, after the transaction is completed, its

data is automatically stored for safekeeping. So, if the system crashes in


the middle of a transaction, changes made by it will not be reflected in the

database.
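The sketch below is a deliberately simplified illustration of the log-based idea, not a real DBMS algorithm: every change is appended to a log before it is applied, so after a failure the log can be replayed to rebuild the data. All names are invented for the example.

    log = []    # in a real system this is durable storage, written before the data
    data = {}

    def write(txn_id, key, value):
        # Write-ahead: record the change in the log first, then apply it.
        log.append({"txn": txn_id, "key": key, "value": value})
        data[key] = value

    def recover(log_records):
        # After a failure, replay the log to reconstruct the committed state.
        rebuilt = {}
        for record in log_records:
            rebuilt[record["key"]] = record["value"]
        return rebuilt

    write("T1", "balance", 500)
    write("T2", "balance", 450)
    print(recover(log))   # {'balance': 450}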

6. Controlling concurrent access

Concurrency Control

Concurrency control is a database management system (DBMS) concept that is used to address the conflicts that occur in a multi-user system. Concurrency control, when applied to a DBMS, is meant to coordinate simultaneous transactions while preserving data integrity.

Concurrent access is quite easy if all users are just reading data, since there is no way they can interfere with one another. However, any practical database has a mix of READ and WRITE operations, and hence concurrency is a challenge.

Potential problems of Concurrency

Lost updates - occur when multiple transactions select the same row and update the row based on the value selected (see the sketch after this list).

Uncommitted dependency issues - occur when a second transaction selects a row that is being updated by another transaction (dirty read).

Non-repeatable read - occurs when a second transaction tries to access the same row several times and reads different data each time.

Incorrect summary issue - occurs when one transaction takes a summary over the values of all the instances of a repeated data item while a second transaction updates a few instances of that specific data item. In that situation, the resulting summary does not reflect a correct result.
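The lost update problem above is easy to reproduce with two concurrent writers; the sketch below (plain Python with invented names, standing in for two database transactions) shows the unsafe read-modify-write pattern and a locked version that prevents the anomaly.

    import threading

    balance = 0
    lock = threading.Lock()

    def deposit_unsafe(amount):
        global balance
        current = balance            # both writers may read the same value...
        balance = current + amount   # ...so one update can overwrite the other (lost update)

    def deposit_safe(amount):
        global balance
        with lock:                   # mutual exclusion: one writer at a time
            balance = balance + amount

    threads = [threading.Thread(target=deposit_safe, args=(1,)) for _ in range(1000)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(balance)   # always 1000 with the lock; deposit_unsafe can lose updates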


Reasons for using concurrency control methods in a DBMS:

1. To apply isolation through mutual exclusion between conflicting transactions

2. To resolve read-write and write-write conflict issues

3. To preserve database consistency by constraining how transactions may execute

4. To help ensure serializability

Concurrency Control Protocols

Different concurrency control protocols offer different benefits between the

amount of concurrency they allow and the amount of overhead that they impose. 

Following are the Concurrency Control techniques in DBMS:

Lock Based Protocols in DBMS is a mechanism in which a transaction cannot

Read or Write the data until it acquires an appropriate lock. Lock based protocols

help to eliminate the concurrency problem in DBMS for simultaneous

transactions by locking or isolating a particular transaction to a single user.

Binary Locks: A binary lock on a data item can be in either a locked or an unlocked state.

1. Shared Lock (S):

A shared lock is also called a Read-only lock. With the shared lock, the data item

can be shared between transactions. This is because you will never have

permission to update data on the data item.

2. Exclusive Lock (X):

With the Exclusive Lock, a data item can be read as well as written. This is

exclusive and can’t be held concurrently on the same data item. X-lock is


requested using lock-x instruction. Transactions may unlock the data item after

finishing the ‘write’ operation.

3. Simplistic Lock Protocol

This type of lock-based protocol allows transactions to obtain a lock on every

object before beginning operation. Transactions may unlock the data item after

finishing the ‘write’ operation.

4. Pre-claiming Locking

Pre-claiming lock protocol helps to evaluate operations and create a list of

required data items which are needed to initiate an execution process. In the

situation when all locks are granted, the transaction executes. After that, all locks

release when all of its operations are over.

Starvation

Starvation is the situation when a transaction needs to wait for an indefinite

period to acquire a lock.

Following are the reasons for Starvation:

When the waiting scheme for locked items is not properly managed

In the case of resource leak

The same transaction is selected as a victim repeatedly

Deadlock

Deadlock refers to a specific situation where two or more processes are waiting

for each other to release a resource or more than two processes are waiting for

the resource in a circular chain.


Two Phase Locking Protocol also known as 2PL protocol is a method of

concurrency control in DBMS that ensures serializability by applying a lock to the

transaction data which blocks other transactions to access the same data

simultaneously. 

Two Phase Locking protocol helps to eliminate the concurrency problem in

DBMS.

This locking protocol divides the execution phase of a transaction into three parts. In the first phase, when the transaction begins to execute, it requests permission for the locks it needs. In the second part, the transaction acquires all the locks. The third phase starts when the transaction releases its first lock; in this phase, the transaction cannot demand any new locks and only releases the locks it has already acquired.

The Two-Phase Locking protocol allows each transaction to make a lock or

unlock request in two steps:

Growing Phase: In this phase transaction may obtain locks but may not release

any locks.

Shrinking Phase: In this phase, a transaction may release locks but not obtain

any new lock
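A toy sketch of the two-phase rule (grow, then shrink) is shown below; the class and method names are invented, and a real lock manager would also handle lock modes, wait queues, and deadlock detection.

    class TwoPhaseTransaction:
        def __init__(self):
            self.held = set()
            self.shrinking = False   # once any lock is released, no more may be acquired

        def lock(self, item):
            if self.shrinking:
                raise RuntimeError("2PL violated: cannot lock after the first unlock")
            self.held.add(item)      # growing phase

        def unlock(self, item):
            self.shrinking = True    # the shrinking phase begins with the first release
            self.held.discard(item)

    t = TwoPhaseTransaction()
    t.lock("A")
    t.lock("B")
    t.unlock("A")
    try:
        t.lock("C")                  # violates two-phase locking
    except RuntimeError as err:
        print(err)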

Strict Two-Phase Locking Method


Strict-Two phase locking system is almost similar to 2PL. The only difference is

that Strict-2PL never releases a lock after using it. It holds all the locks until the

commit point and releases all the locks at one go when the process is over.

Centralized 2PL

In Centralized 2 PL, a single site is responsible for lock management process. It

has only one lock manager for the entire DBMS.

Primary copy 2PL

In the primary copy 2PL mechanism, many lock managers are distributed to different

sites. After that, a particular lock manager is responsible for managing the lock

for a set of data items. When the primary copy has been updated, the change is

propagated to the slaves.

Distributed 2PL

In this kind of two-phase locking mechanism, Lock managers are distributed to all

sites. They are responsible for managing locks for data at that site. If no data is

replicated, it is equivalent to primary copy 2PL. Communication costs of

Distributed 2PL are quite higher than primary copy 2PL.

Timestamp based Protocol in DBMS is an algorithm which uses the System

Time or Logical Counter as a timestamp to serialize the execution of concurrent

transactions. The timestamp-based protocol ensures that conflicting read and write operations are executed in timestamp order.

The older transaction is always given priority in this method. It uses system time

to determine the time stamp of the transaction. This is the most commonly used

concurrency protocol.


Lock-based protocols help you to manage the order between the conflicting

transactions when they will execute. Timestamp-based protocols manage

conflicts as soon as an operation is created.

Validation based Protocol in DBMS, also known as the Optimistic Concurrency Control Technique, is a method of avoiding concurrency conflicts between transactions. In this protocol, local copies of the transaction data are updated rather than the data itself, which results in less interference during execution of the transaction.

The Validation based Protocol is performed in the following three phases:

 Read Phase

 Validation Phase

 Write Phase

Read Phase: In the Read Phase, the data values from the database can be read

by a transaction but the write operation or updates are only applied to the local

data copies, not the actual database.

Validation Phase

In Validation Phase, the data is checked to ensure that there is no violation of

serializability while applying the transaction updates to the database.

Write Phase

In the Write Phase, the updates are applied to the database if the validation is

successful, else; the updates are not applied, and the transaction is rolled back.

Characteristics of Good Concurrency Protocol

An ideal concurrency control DBMS mechanism has the following objectives:

Must be resilient to site and communication failures.


It allows the parallel execution of transactions to achieve maximum concurrency.

Its storage mechanisms and computational methods should be modest to

minimize overhead.

It must enforce some constraints on the structure of atomic actions of

transactions.

DATA DICTIONARIES AND REPOSITORIES

A data dictionary (also called an information repository) is a mini database management system that manages metadata. It is a repository of information about a database that documents the data elements of that database. It describes the

meanings and purposes of data elements within the context of a project, and

provides guidance on interpretation, accepted meanings and representation. A

Data Dictionary also provides metadata about data elements. The metadata

included in a Data Dictionary can assist in defining the scope and characteristics

of data elements, as well as the rules for their usage and application. A data

dictionary is a collection of descriptions of the data objects or items in a data

model for the benefit of programmers and others who need to refer to them.

Often a data dictionary is a centralized metadata repository. A first step in

analyzing a system of interactive objects is to identify each one and its

relationship to other objects. This process is called data modeling and results in a

picture of object relationships. After each data object or item is given a

descriptive name, its relationship is described, or it becomes part of some

structure that implicitly describes relationship. The type of data, such as text or

image or binary value, is described, possible predefined default values are listed


and a brief textual description is provided. This data collection can be organized

for reference into a book called a data dictionary.
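As a small illustration of metadata about data, the sketch below asks SQLite to describe the columns of a table; the table itself is invented, and other DBMSs expose similar information through system catalogs such as INFORMATION_SCHEMA.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        hired_on DATE)""")

    # PRAGMA table_info returns one row of metadata per column: its name,
    # declared data type, nullability, default value and primary-key flag,
    # much like an entry in a data dictionary.
    for column in conn.execute("PRAGMA table_info(employee)"):
        print(column)
    conn.close()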

Types of data dictionaries

There are two types of data dictionaries. Active and passive data dictionaries

differ in level of automatic synchronization.

• Active data dictionaries. These are data dictionaries created within the databases they describe; they automatically reflect any updates or changes in their host databases. This avoids any discrepancies between the data dictionaries and

their database structures.

• Passive data dictionaries. These are data dictionaries created as new

databases -- separate from the databases they describe -- for the purpose of

storing data dictionary information. Passive data dictionaries require an additional

step to stay in sync with the databases they describe and must be handled with

care to ensure there are no discrepancies.

Data dictionary components

Specific contents in a data dictionary can vary. In general, these components are

various types of metadata, providing information about data.

• Data object listings (names and definitions)

• Data element properties (such as data type, unique identifiers, size,

nullability, indexes and optionality)


• Entity-relationship diagrams (ERD)

• System-level diagrams

• Reference data

• Missing data and quality-indicator codes

• Business rules (such as for validation of data quality and schema objects)

Pros and cons of data dictionaries

Data dictionaries can be a valuable tool for the organization and management of

large data listings. Other pros include:

• Provides organized, comprehensive list of data

• Easily searchable

• Can provide reporting and documentation for data across multiple

programs

• Simplifies the structure for system data requirements

• No data redundancy

• Maintains data integrity across multiple databases

• Provides relationship information between different database tables

• Useful in the software design process and test cases

Though they provide thorough listings of data attributes, data dictionaries may be

difficult to use for some users. Other cons include:



• Functional details not provided

• Not visually appealing

• Difficult to understand for non-technical users

Why Use a Data Dictionary?

Data Dictionaries are useful for a number of reasons. In short, they:

• Assist in avoiding data inconsistencies across a project

• Help define conventions that are to be used across a project

• Provide consistency in the collection and use of data across multiple

members of a research team

• Make data easier to analyze

• Enforce the use of Data Standards

What Are Data Standards and Why Should I Use Them?

Data Standards are rules that govern the way data are collected, recorded, and

represented. Standards provide a commonly understood reference for the

interpretation and use of data sets.

By using standards, researchers in the same disciplines will know that the way

their data are being collected and described will be the same across different

projects. Using Data Standards as part of a well-crafted Data Dictionary can help


increase the usability of your research data, and will ensure that data will be

recognizable and usable beyond the immediate research team.

TUNING THE DATABASE FOR PERFORMANCE

Databases are the guts of an application; without them, you're left with just skins

and skeletons, which aren't as useful on their own. Therefore, the overall

performance of any app is largely dependent on database performance. There

are dozens of factors that affect performance including how indexes are used,

how queries are structured and how data is modeled. Consequently, making

minor adjustments to any of these elements can have a large impact. Database

performance tuning refers to the various ways database administrators can

ensure databases are running as efficiently as possible. Typically, this refers to

tuning SQL Server or Oracle queries for enhanced performance. The goal of

database tuning is to reconfigure the operating systems according to how they’re

best used, including deploying clusters, and working toward optimal database

performance to support system function and end-user experience. Poor database

performance bogs down operations, and as the lifeblood of a business,

companies can’t afford barriers to data access. One of the best ways to navigate

past performance issues is by getting a regular database performance audit. Just

like a car needs standard tuning and maintenance, database engines and the

environments they reside in need to be assessed and serviced to ensure things

are working as they should and performing optimally. Database tuning can be an

incredibly difficult task, particularly when working with large-scale data where


even the most minor change can have a dramatic (positive or negative) impact

on performance. In mid-sized and large companies, most database tuning will be

handled by a Database Administrator (DBA). But there are plenty of developers

who have to perform DBA-like tasks; meanwhile, DBAs often struggle to work

well with developers.

Why should you perform database performance tuning?

Tuning the databases enhances the performance but it is only the first step in

keeping applications running smoothly. The purpose of database tuning is to

organize data in a way that makes retrieving information much easier. Without

database performance tuning, we could face problems every time we run

queries, even the response is incorrect or the query takes too long to perform.

10 Database Performance Tuning Best Practices

1. Keep statistics up to date

Table statistics are used to generate optimal execution plans. If the performance

tuning tool is using out-of-date statistics, the plan won’t be optimized for the

current situation.

2. Don’t use leading wildcards

Leading wildcards in parameters force a full table scan, even if there is an

indexed field inside the table. If the database engine must scan all the rows in a

table to find what it’s looking for, the delivery speed of your query results suffers.

Other queries may suffer as well, since scanning all of that data into memory will


cause the CPU utilization to spike and not allow other queries any time in

memory.

3. Avoid SELECT *

This tip is particularly important if you have a large table (think hundreds of

columns and millions of rows). If an application only needs a few columns,

include them individually instead of wasting time querying for all the data. Again,

reading extra data will cause CPU utilization to spike and memory to be

thrashed. You should check the Page Life Expectancy (PLE) to make sure you

are not having this issue.
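To make tips 2 and 3 concrete, the sketch below compares an anti-pattern query with a friendlier one, using Python's built-in sqlite3 module; the schema is invented, and the exact plan text varies between database engines and versions.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA case_sensitive_like = ON")   # lets SQLite consider the index for prefix LIKE
    conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT, email TEXT)")
    conn.execute("CREATE INDEX idx_last_name ON customer (last_name)")

    # Leading wildcard plus SELECT *: the index cannot help, so the whole table is scanned.
    print(conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM customer WHERE last_name LIKE '%son'").fetchall())

    # No leading wildcard and an explicit column list: the optimizer can use the index instead.
    print(conn.execute(
        "EXPLAIN QUERY PLAN SELECT customer_id, last_name FROM customer "
        "WHERE last_name LIKE 'son%'").fetchall())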

4. Use constraints

Constraints are an effective way to speed up queries and help the SQL

optimizer come up with a better execution plan, but the improved performance

comes at the cost of the data requiring more memory. The increased query

speed may be worth it depending on the business objective, but it’s important to

be aware of the price.

5. Look at the actual execution plan, not the estimated plan

The estimated execution plan is helpful when you are writing queries because it

gives you a preview of how the plan will run, but it is blind to parameter data

types which could be wrong. To get the best results when performance tuning,

it’s often better to review the actual execution plan because it uses the latest,

most accurate statistics.


6. Adjust queries by making one small change at a time

Making too many changes at once tends to muddy the waters. A better, more

efficient approach to query tuning is to make changes with the most expensive

operations first and work from there.

7. Adjust indexes to reduce I/O

Before you dive into troubleshooting I/O directly, first try adjusting indexes and

query tuning. Consider using a covering index that includes all the columns in the

query; this reduces the need to go back to the table, since all the columns can be obtained

from the index. Adjusting indexes and query tuning have a high impact on almost

all areas of performance, so when they are optimized, many other performance

issues resolve as well.
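As a sketch of the covering-index idea in tip 7 (invented names, SQLite syntax; the plan text differs between engines):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

    # The index contains every column the query needs, so the table itself
    # never has to be visited; the index "covers" the query.
    conn.execute("CREATE INDEX idx_orders_cover ON orders (customer_id, total)")

    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT customer_id, total FROM orders WHERE customer_id = 7"
    ).fetchall()
    print(plan)   # SQLite typically reports a SEARCH ... USING COVERING INDEX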

8. Analyze query plans

Utilizing artificial intelligence to analyze your execution plan and determine how

to change it helps databases execute operations more efficiently.

9. Compare optimized and original SQL

When optimizing SQL queries, be sure to highlight changes in the SQL statement

so you can compare the original statement with the optimized version. Gather a

baseline metric such as logical I/O to compare against as you tune. Don’t make

any changes until you are sure the optimized version is accurate (i.e., includes

current statistics) and really does improve performance.

10. Automate SQL optimization



Automated SQL optimization tools not only analyze your SQL statement but can

also automatically rewrite it or optimize indexes until it finds the variation that

creates the most improvement in the execution time of the query.

DATA AVAILABILITY

Data availability is a measure of how often your data is available to be used,

whether by your own organization, or by one of your partners. It is desirable to

have your data available 24x7x365, which will permit your business to run

uninterrupted. Unexpected issues and interruptions are inevitable when dealing

with data management, so designing a system that can work around those

issues while still delivering data is essential. Data availability is primarily used to

create service level agreements (SLA) and similar service contracts, which define

and guarantee the service provided by third-party IT service providers.

Availability has to do with the accessibility and continuity of information.

Information with low availability concerns may be considered supplementary

rather than necessary. 

Information with high availability concerns is considered critical and must be

accessible in order to prevent negative impact on University activities. It is the

ability to guarantee reliable access to data. Organizations must keep crucial data

available and shorten data outage times as much as possible. To achieve data

availability, organizations must be able to quickly repair all hardware failures and

maintain backups.  Typically, data availability calls for implementing products,

services, policies and procedures that ensure that data is available in normal and


even in disaster recovery operations. This is usually done by implementing

data/storage redundancy, data security, network optimization, and

more. Storage area networks (SAN), network attached storage and RAID-based

storage systems are popular storage management technologies for ensuring

data availability.

Data Availability Challenges

There are several issues that can affect the availability of your data:

Host server failures—if the server that stores your data fails, your data will

become unavailable.

Storage failures—if your physical storage device fails, you can no longer access

the data it stores.

Network crash—if the network crashes, the host server becomes inaccessible

along with the data stored on it.

Poor data quality—low-quality datasets may contain incomplete, inconsistent, or

redundant data, which could be useless for your IT operations.

Data compatibility issues—data that is usable and working on a specific platform

or environment might not be on another.

Legacy data—data that is too outdated can become unusable. You can use data

transformation tools to make older data readily accessible, but these do not

always work.


Best practices to follow to combat data availability challenges include:

• Redundancy and backups. Backing up data is an essential aspect of data

availability. Data backups should be stored in separate locations or in a

distributed network. This way, if data is lost or corrupted, it can be restored

quickly. Storage devices are often set up in a redundant array of independent

disks (RAID) configuration.

• The use of data loss prevention tools. DLP tools can help mitigate data

breaches and damage to data centers.

• Erasure coding. This data protection method breaks data into fragments, expands them, and then encodes them with redundant data pieces (a simple sketch follows this list). The data is then

stored across a set of different locations or storage devices. If a drive fails or data

becomes corrupted, the data can be reconstructed from the segments stored on

the other drives.

• Following retention policies and procedures. If data or devices are no

longer needed, they should be either archived or securely disposed of.

• Automatically switching to backups. Flexibility can be added by

automatically switching to a backup or failover environment if a drive fails or data

is lost.
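The sketch below illustrates the simplest form of the erasure-coding idea from the list above: single-parity XOR, as used in RAID-style redundancy. Real erasure codes (for example Reed-Solomon) are more sophisticated, and the fragments here are invented.

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    # Two data fragments of equal length plus one redundant parity fragment.
    fragment_1 = b"CUSTOMER"
    fragment_2 = b"RECORDS!"
    parity = xor_bytes(fragment_1, fragment_2)

    # If one fragment is lost, it can be rebuilt from the survivor and the parity.
    rebuilt_fragment_2 = xor_bytes(fragment_1, parity)
    print(rebuilt_fragment_2)   # b'RECORDS!'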

