
UNIVERSITY OF CALOOCAN CITY

Biglang Awa St. Grace Park East, Caloocan City

CHAPTER 1:

THE DATABASE ENVIRONMENT AND

DEVELOPMENT PROCESS

Researched and presented by:

Acogido, Neil Angeli


Gregorio, Juvelyn


Definitions

 Data- A given fact: a number, a statement, or a picture. Stored
representations of meaningful objects and events. Meaningful facts, text,
graphics, images, sound, and video segments. A collection of individual
responses from marketing research.

(1) Structured: numbers, text, dates

(2) Unstructured: images, video, documents

 Database- organized collection of logically related data.

 Information- data that have meaning within a context. Data processed to

increase knowledge in the person using the data.

 Metadata- data that describes the properties and context of user data.

Data that describes data.

 Database System- collection of electronic data.  Central repository of

shared data. Stored in a standardized, convenient form. Requires a

Database Management System (DBMS)


CONVENTIONAL FILE PROCESSING

Limitation of File Processing

 Program- Data Dependence- All programs maintain metadata for each file

they use.

 Duplication of Data- Different systems/ programs have separate copies of

the same data.

 Limited Data Sharing – No centralized control of data.

 Lengthy Development times- Programmers must design their own file

formats.

 Excessive Program Maintenance- 80% of information systems budget.

Problems with Data Dependency

 Non-standard file formats, lack of coordination and central control.

 Each application programmer must maintain his/ her own data

 Each application program needs to include code for the metadata of each
file.

 Each application program must have its own processing routines for

reading, inserting, updating, and deleting data.

Problems with Data Redundancy

 Duplicate data; changes to data in one file could cause inconsistencies.


 Waste of space to have duplicate data.

 Causes more maintenance headaches.

 Compromises in data integrity

THE DATABASE APPROACH

Requires a Database Management System (DBMS), which is used to create, maintain,
and provide controlled access to user databases. The database is a central repository of shared
data. Data are managed by a controlling agent and stored in a standardized,
convenient form.

Database Management System

A database management system manages

data resources like an operating system manages

hardware resources.

Elements of Database Approach

 Data Models- Graphical diagram capturing

the nature and relationship of data.

 Relational Databases- database technology involving tables representing
entities and primary/foreign keys representing relationships.

 Entities- Noun form describing a person, place, object, event, or concept.

 Relationships- one-to-many, many-to-many, one-to-one.


Advantages of the Database Approach

 Program-data independence
 Planned data redundancy
 Improved data consistency
 Improved data sharing
 Increased application development productivity
 Enforcement of standards
 Improved data quality
 Improved data accessibility and responsiveness
 Reduced program maintenance
 Improved decision support


Database Approach vs. Traditional File System

Costs and Risk of the Database Approach

 New, specialized personnel

Frequently, organizations that adopt the database approach need to

hire or train individuals to design and implement databases. This personnel

increase seems to be expensive, but an organization should not minimize the


need for these specialized skills. Installing such a system may also require

upgrades to the hardware and data communications systems in the

organization.  

 Installation and management cost and complexity

A multi-user database management system is large and complex

software that has a high initial cost. It requires trained personnel to install and

operate, and also has annual maintenance costs.

 Conversion costs

The term “legacy systems” is used to refer to older applications in an

organization that are based on file processing. The cost of converting these

older systems to modern database technology may seem prohibitive to an

organization.

 Need for explicit backup and recovery

A shared database must be accurate and available at all times. This

raises the need to have backup copies of data for restoring a database when

damage occurs.   A modern database management system normally

automates recovery tasks. 

 Organizational conflict

A database requires an agreement on data definitions and ownership

as well as responsibilities for accurate data maintenance. 


Components of Database Management System

A DBMS has several components, each performing very significant tasks in

the database management system environment. Below is a list of components

within the database and its environment.

 Software

This is the set of programs used to control and manage the overall

database. This includes the DBMS software itself, the Operating System,

the network software being used to share the data among users, and the

application programs used to access data in the DBMS.

 Hardware

Consists of a set of physical electronic devices such as computers, I/O

devices, storage devices, etc., this provides the interface between

computers and the real-world systems.

 Data

The DBMS exists to collect, store, process, and access data, which is the most
important component. The database contains both the actual or

operational data and the metadata.

 Procedures

These are the instructions and rules that explain how to use the DBMS
and how to design and run the database, using documented
procedures to guide the users who operate and manage it.


 Database Access Language

This is used to access the data to and from the database, to enter new

data, update existing data, or retrieve required data from databases. The

user writes a set of appropriate commands in a database access

language, submits these to the DBMS, which then processes the data and

generates and displays a set of results in a user-readable form. (A short SQL sketch illustrating this component and the data dictionary appears after this list of components.)

 Query Processor

This transforms the user queries into a series of low level instructions.

This reads the online user’s query and translates it into an efficient series

of operations in a form capable of being sent to the run time data manager

for execution.

 Data Manager

Also called the cache manager, this is responsible for the handling of data in

the database, providing a recovery to the system that allows it to recover

the data after a failure.

 Database Engine

The core service for storing, processing, and securing data, this provides

controlled access and rapid transaction processing to address the

requirements of the most demanding data consuming applications. It is

often used to create relational databases for online transaction processing

or online analytical processing data.

 Data Dictionary

This is a reserved space within a database used to store information about


the database itself. A data dictionary is a set of read-only tables and views,

containing the different information about the data used in the enterprise

to ensure that database representation of the data follow one standard as

defined in the dictionary.

 Report Writer

Also referred to as the report generator, it is a program that extracts

information from one or more files and presents the information in a

specified format. Most report writers allow the user to select records that

meet certain conditions and to display selected fields in rows and

columns, or also format the data into different charts.
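To make two of the components above concrete: a user of a relational DBMS issues commands in a database access language, most commonly SQL, and the DBMS describes its own tables in the data dictionary. A minimal sketch follows; the table and column names are hypothetical, and the exact data dictionary views (here the SQL-standard INFORMATION_SCHEMA) vary by product.

    -- database access language: enter, update, and retrieve data
    INSERT INTO employee (employee_id, employee_name, department)
    VALUES (101, 'Juan Dela Cruz', 'Accounting');

    UPDATE employee
    SET department = 'Finance'
    WHERE employee_id = 101;

    SELECT employee_id, employee_name
    FROM employee
    WHERE department = 'Finance';

    -- data dictionary: read-only views holding data about the data
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'employee';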

Four Types of Database Management Systems

 Relational Database Management System

A relational database (RDB) is a collective set of multiple data sets

organized by tables, records and columns. RDBs establish a well-defined

relationship between database tables. Tables communicate and share

information, which facilitates data searchability, organization and

reporting. RDBs use Structured Query Language (SQL), which is a

standard user application that provides an easy programming interface for

database interaction. RDB is derived from the mathematical function

concept of mapping data sets and was developed by Edgar F. Codd.

RDBs organize data in different ways. Each table is known as a

relation, which contains one or more data category columns. Each table


record (or row) contains a unique data instance defined for a

corresponding column category. One or more data or record

characteristics relate to one or many records to form functional

dependencies. These are classified as follows:

 One to One: One table record relates to another record in another

table.

 One to Many: One table record relates to many records in another

table.

 Many to One: More than one table record relates to another table

record.

 Many to Many: More than one table record relates to more than one

record in another table.

RDB performs "select", "project" and "join" database operations,

where select is used for data retrieval, project identifies data attributes,

and join combines relations. RDBs have many other advantages,

including:

 Easy extendability, as new data may be added without

modifying existing records. This is also known as scalability.

 New technology performance, power and flexibility with

multiple data requirement capabilities.


 Data security, which is critical when data sharing is based on

privacy. For example, management may share certain data privileges

and access and block employees from other data, such as

confidential salary or benefit information.

These relations form functional dependencies within the database.

Some common examples of relational databases include MySQL,

Microsoft SQL Server, and Oracle.
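A minimal SQL sketch of the select, project, and join operations described above (table and column names are illustrative only, not taken from the text):

    -- "select" (restrict rows) and "project" (choose columns)
    SELECT student_name, student_address      -- project: only two attributes
    FROM student
    WHERE year_level = 1;                     -- select: only matching rows

    -- "join" combines two relations on a common key
    SELECT s.student_name, c.course_title
    FROM student s
    JOIN enrollment e ON e.student_id = s.student_id
    JOIN course c     ON c.course_id  = e.course_id;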

 Hierarchical Database Systems

Hierarchical database model resembles a tree structure, similar to a

folder architecture in your computer system. The relationships between

records are pre-defined in a one-to-many manner, between 'parent' and

'child' nodes. They require the user to pass through a hierarchy in order to access

needed data. Due to limitations, such databases may be confined to

specific uses.

 Network Database Systems

Network database models also have a hierarchical structure.

However, instead of using a single-parent tree hierarchy, this model

supports many to many relationships, as child tables can have more than

one parent.

 Object-Oriented Database Systems


In object-oriented databases, the information is represented as

objects, with different types of relationships possible between two or more

objects. Such databases use an object-oriented programming language

for development.

Systems Development Life Cycle

The SDLC is a complete set of steps that a team of information systems professionals,

including database designers and programmers, follow in an organization

to specify, develop, maintain, and replace information systems. According

to Gillis (2019), the systems development life cycle (SDLC) is a

conceptual model used in project management that describes the stages


involved in an information system development project, from an initial

feasibility study through maintenance of the completed application. Gillis

(2019) added that the SDLC can be applied to technical and non-technical
systems, and that in most use cases a system is an IT technology such as

hardware and software. Project and program managers typically take part

in SDLC, along with system and software engineers, development teams

and end users.

 PLANNING—ENTERPRISE MODELING

The database development process begins with a review of

the enterprise modeling components that were developed during

the information systems planning process. During this step,

analysts review current databases and information systems;

analyze the nature of the business area that is the subject of the

development project; and describe, in general terms, the data

needed for each information system under consideration for

development. They determine what data are already available in

existing databases and what new data will need to be added to

support the proposed new project. Only selected projects move into

the next phase based on the projected value of each project to the

organization.

 PLANNING—CONCEPTUAL DATA MODELING


For an information systems project that is initiated, the

overall data requirements of the proposed information system must

be analyzed. This is done in two stages. First, during the Planning

phase, the analyst develops a diagram similar to Figure 1-3a, as

well as other documentation, to outline the scope of data involved

in this particular development project without consideration of what

databases already exist. Only high-level categories of data

(entities) and major relationships are included at this point. This

step in the SDLC is critical for improving the chances of a

successful development process. The better the definition of the

specific needs of the organization, the closer the conceptual model

should come to meeting the needs of the organization, and the less

recycling back through the SDLC should be needed.

 ANALYSIS—CONCEPTUAL DATA MODELING

During the Analysis phase of the SDLC, the analyst

produces a detailed data model that identifies all the organizational

data that must be managed for this information system. Every data

attribute is defined, all categories of data are listed, every business

relationship between data entities is represented, and every rule

that dictates the integrity of the data is specified. It is also during

the Analysis phase that the conceptual data model is checked for

consistency with other types of models developed to explain other


dimensions of the target information system, such as processing

steps, rules for handling data, and the timing of events.

 DESIGN—LOGICAL DATABASE DESIGN

Logical database design approaches database development

from two perspectives. First, the conceptual schema must be

transformed into a logical schema, which describes the data in

terms of the data management technology that will be used to

implement the database. For example, if relational technology will

be used, the conceptual data model is transformed and

represented using elements of the relational model, which include

tables, columns, rows, primary keys, foreign keys, and constraints.
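A minimal sketch of this transformation, assuming a conceptual model with CUSTOMER and ORDER entities in a one-to-many relationship (all names here are illustrative, not taken from the text):

    CREATE TABLE customer (
        customer_id   INTEGER     PRIMARY KEY,   -- entity identifier becomes the primary key
        customer_name VARCHAR(60) NOT NULL       -- NOT NULL is a simple constraint
    );

    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        order_date  DATE    NOT NULL,
        customer_id INTEGER NOT NULL,
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)  -- the 1:M relationship
    );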

 DESIGN—PHYSICAL DATABASE DESIGN AND DEFINITION

A physical schema is a set of specifications that describe

how data from a logical schema are stored in a computer’s

secondary memory by a specific database management system.

There is one physical schema for each logical schema. Physical

database design requires knowledge of the specific DBMS that will

be used to implement the database. In physical database design

and definition, an analyst decides on the organization of physical

records, the choice of file organizations, the use of indexes, and so

on.
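For example, one common physical design decision is adding an index to speed retrieval on a frequently searched column. A hedged sketch, continuing the illustrative customer table above (available index options and syntax details differ by DBMS):

    -- secondary index to support frequent lookups by customer name
    CREATE INDEX idx_customer_name ON customer (customer_name);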

 IMPLEMENTATION—DATABASE IMPLEMENTATION


In database implementation, a designer writes, tests, and

installs the programs/scripts that access, create, or modify the

database. The designer might do this using standard programming
languages, special database processing languages, or special-purpose
nonprocedural languages to produce stylized

reports and displays, possibly including graphs. Also, during

implementation, the designer will finalize all database

documentation, train users, and put procedures into place for the

ongoing support of the information system (and database) users.

The last step is to load data from existing information sources (files

and databases from legacy applications plus new data now

needed). Loading is often done by first unloading data from existing

files and databases into a neutral format (such as binary or text

files) and then loading these data into the new database. Finally,

the database and its associated applications are put into production

for data maintenance and retrieval by the actual users. During

production, the database should be periodically backed up and

recovered in case of contamination or destruction.
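A minimal sketch of the loading step, assuming the legacy data have already been unloaded and staged in a table named legacy_customer (a hypothetical name):

    -- copy and reshape legacy rows into the new database table
    INSERT INTO customer (customer_id, customer_name)
    SELECT old_cust_no, old_cust_name
    FROM legacy_customer;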

 MAINTENANCE—DATABASE MAINTENANCE

The database evolves during database maintenance. In this step, the

designer adds, deletes, or changes characteristics of the structure of a

database in order to meet changing business conditions, to correct errors in

database design, or to improve the processing speed of database



applications. The designer might also need to rebuild a database if it

becomes contaminated or destroyed due to a program or computer system

malfunction. This is typically the longest step of database development,

because it lasts throughout the life of the database and its associated

applications. Each time the database evolves, view it as an abbreviated

database development process in which conceptual data modeling, logical

and physical database design, and database implementation occur to deal

with proposed changes.

Prototyping and Agile-development approaches

 Prototyping

i. It is an information-gathering technique useful for supplementing

the traditional SDLC; however, both agile methods and human–

computer interaction share roots in prototyping. When systems

analysts use prototyping, they are seeking user reactions,

suggestions, innovations, and revision plans to make improvements

to the prototype, and thereby modify system plans with a minimum

of expense and disruption. The four major guidelines for developing

a prototype are to (1) work in manageable modules, (2) build the

prototype rapidly, (3) modify the prototype, and (4) stress the user

interface.

ii. Although prototyping is not always necessary or desirable, it should

be noted that there are three main, interrelated advantages to using


it: (1) the potential for changing the system early in its development,

(2) the opportunity to stop development on a system that is not

working, and (3) the possibility of developing a system that more

closely addresses users’ needs and expectations. Users have a

distinct role to play in the prototyping process and systems analysts

must work systematically to elicit and evaluate users’ reactions to

the prototype.

iii. One particular use of prototyping is rapid application development

(RAD). It is an object-oriented approach with three phases:

requirements planning, the RAD design workshop, and

implementation.

 Agile modeling

i. It is a software development approach that defines an overall

plan quickly, develops and releases software quickly, and then

continuously revises software to add additional features. The

values of the agile approach that are shared by the customer as

well as the development team are communication, simplicity,

feedback, and courage. Agile activities include coding, testing,

listening, and designing. Resources available include time, cost,

quality, and scope.

ii. Agile core practices distinguish agile methods, including a type

of agile method called extreme programming (XP), from other

systems development processes. The four core practices of the


agile approach are (1) short releases, (2) 40-hour workweek, (3)

onsite customer, and (4) pair programming. The agile

development process includes choosing a task that is directly

related to a customer-desired feature based on user stories,

choosing a programming partner, selecting and writing

appropriate test cases, writing the code, running the test cases,

debugging it until all test cases run, implementing it with the

existing design, and integrating it into what currently exists.

Roles of an individual in Databases

 Data Administrators

The database and the DBMS are corporate resources that must be

managed like any other resource. The Data Administrator (DA) is

responsible for defining data elements, data names and their relationship

with the database. They are also known as Data Analyst.

 Database Administrators (DBA)

A Database Administrator (DBA) is an IT professional who works

on creating, maintaining, querying, and tuning the database of the

organization. They are also responsible for maintaining data security and

integrity. A DBA has many responsibilities. A good performing database is

in the hands of DBA.

DBA Responsibilities


 The life cycle of a database starts from designing and implementing it and
proceeds to its administration. A database for any kind of requirement needs
to be designed properly so that it works without any issues.

 Once all the design is complete, it needs to be installed. Once this

step is complete, users start using the database. The database

grows as the data grows in the database. When the database

becomes huge, its performance comes down.

 Also, accessing data from the database becomes a challenge.
This administration and maintenance of the database is taken care
of by the Database Administrator (DBA).

 Installing and upgrading the DBMS Servers

The DBA is responsible for installing a new DBMS server for
new projects. The DBA is also responsible for upgrading these servers as
new versions come to market or as requirements change.

 Design and implementation

The DBA should be able to decide on proper memory management,
file organization, error handling, and log maintenance for the database.

 Performance Tuning

Since the database is huge and will have lots of tables, data,
constraints, and indices, there will be variations in performance
from time to time. It is the responsibility of the DBA to tune the
database performance.


 Backup & Recovery

Proper backup and recovery programs need to be developed
and maintained by the DBA. This is one of the main

responsibilities of DBA. Data should be backed up regularly so that

if there is any crash, it should be recovered without much effort and

data loss.

 Documentation

The DBA should document all installation, backup,
recovery, and security procedures, and should keep various reports about
database performance.

 Security

DBA is responsible for creating various database users and

roles, and giving them different levels of access rights.
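As an illustration (role, user, and table names are hypothetical, and exact syntax varies slightly by DBMS), a DBA might create a role and grant it limited access rights like this:

    CREATE ROLE sales_clerk;                                -- a role groups privileges
    GRANT SELECT, INSERT ON customer_order TO sales_clerk;  -- limited rights only
    GRANT sales_clerk TO jdoe;                              -- assign the role to a user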

 Database Designers

 Logical Database Designers

The logical database designer is concerned with identifying

the data (that is, the entities and attributes), the relationships

between the data, and the constraints on the data that is to be

stored in the database.

The logical database designer must have a thorough and

complete understanding of the organization’s data and any

constraints on this data.


 Physical Database Designers

The physical database designer decides how the logical

database design is to be physically realized.

 mapping the logical database design into a set of

tables and integrity constraints.

 selecting specific storage structures and access

methods for the data to achieve good performance.

 Application Developers

Once the database has been implemented, the application programs that
provide the required functionality for the end users must be implemented.
This is the responsibility of the application developers. They are the
developers who interact with the database by means of DML queries. These
DML queries are written in application programs in languages such as C,
C++, Java, or Pascal.

 End Users

The end-users are the ‘clients’ for the database, which has been

designed and implemented, and is being maintained to serve their

information needs.


 Sophisticated Users: The sophisticated end-user is familiar with the
structure of the database and the facilities offered by the DBMS.

 Naive Users: These are the users who use an existing application
to interact with the database. For example, online library systems,
ticket booking systems, ATMs, etc.

The three schemas

 Internal Level/Schema

The internal schema defines the physical storage structure of the

database. The internal schema is a very low-level representation of the

entire database. It contains multiple occurrences of multiple types of

internal record. In ANSI terms, it is also called a “stored record.”

Facts about Internal schema:

 The internal schema is the lowest level of data abstraction

 It helps you to keep information about the actual representation of

the entire database. Like the actual storage of the data on the disk

in the form of records

 The internal view tells us what data is stored in the database and

how

 It never deals with the physical devices. Instead, internal schema

views a physical device as a collection of physical pages

 Conceptual Schema/Level


The conceptual schema describes the Database structure of the

whole database for the community of users. This schema hides

information about the physical storage structures and focuses on

describing data types, entities, relationships, etc.

This logical level comes between the user level and physical

storage view. However, there is only a single conceptual view of a single

database.

Facts about Conceptual schema:

 Defines all database entities, their attributes, and their

relationships

 Security and integrity information

 In the conceptual level, the data available to a user must be

contained in or derivable from the physical level

 External Schema/Level

An external schema describes the part of the database which

a specific user is interested in. It hides the unrelated details of the database

from the user. There may be “n” number of external views for each

database. Each external view is defined using an external schema, which

consists of definitions of various types of external record of that specific

view. An external view is just the content of the database as it is seen by

some particular user. For example, a user from the sales

department will see only sales related data.



Facts about external schema:

 An external level is only related to the data which is viewed

by specific end users.

 This level includes some external schemas.

 External schema level is nearest to the user

 The external schema describes the segment of the database

which is needed for a certain user group and hides the remaining

details of the database from that user group.
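An external schema is often implemented as a view. A minimal sketch, assuming a base table named customer_order with an amount column (illustrative names only):

    -- the sales department sees only sales-related data
    CREATE VIEW sales_view AS
    SELECT order_id, order_date, amount
    FROM customer_order;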

 Goal of 3 level/schema of Database

Objectives of using Three Schema Architecture:

 Every user should be able to access the same data but able to see

a customized view of the data.

 The user need not deal directly with physical database storage

detail.

 The DBA should be able to change the database storage structure

without disturbing the user’s views

 The internal structure of the database should remain unaffected

when changes are made to the physical aspects of storage.

 Advantages of the Database Schema

 You can manage data independently of the physical storage, and
migration to new graphical environments is faster


 DBMS Architecture allows you to make changes on the

presentation level without affecting the other two layers

 As each tier is separate, it is possible to use different sets of

developers

 It is more secure as the client doesn’t have direct access to the

database business logic

 In case of failure of one tier, there is no data loss, because you are always
secure by accessing another tier

 Disadvantages of the Database Schema

 A complete DB schema is a complex structure which is difficult for
everyone to understand

 Difficult to set up and maintain; also, the physical separation of the

tiers can affect the performance of the Database

UNIVERSITY OF CALOOCAN CITY
Biglang Awa St. Grace Park East, Caloocan City

CHAPTER 2:

MODELING DATA IN THE

ORGANIZATION


Researched and presented by:

Almadin, Catherine M.
Laxamana, Marlon F.

DATA MODELING 

What is data modeling?

Data modeling is the process of creating a simple diagram of a complex software

system, using text and symbols to represent the way data will flow. The diagram

can be used to ensure efficient use of data as a blueprint for the construction of

new software or for reengineering a legacy application.

Data modeling is an important skill for data scientists and others involved with

data analysis. Traditionally, data models were built during the analysis and

design phases of a project to ensure that the requirements for a new application

are understood. A data model can become the basis for building a more detailed

data schema. 


Data modeling is the process of creating a visual representation of either a whole

information system or parts of it to communicate connections between data

points and structures. The goal is to illustrate the types of data used and stored

within the system, the relationships among these data types, the ways the data

can be grouped and organized and its formats and attributes.

Data models are built around business needs. Rules and requirements are

defined upfront through feedback from business stakeholders so they can be

incorporated into the design of a new system or adapted in the iteration of an

existing one.

Data can be modeled at various levels of abstraction. The process begins by

collecting information about business requirements from stakeholders and end

users. These business rules are then translated into data structures to formulate

a concrete database design. A data model can be compared to a roadmap, an

architect’s blueprint or any formal diagram that facilitates a deeper understanding

of what is being designed.

Ideally, data models are living documents that evolve along with changing

business needs. They play an important role in supporting business processes

and planning IT architecture and strategy. Data models can be shared with

vendors, partners, and/or industry peers.

Why use a Data Model?

The primary goals of using a data model are:


 Ensures that all data objects required by the database are accurately

represented. Omission of data will lead to creation of faulty reports and

produce incorrect results.

 A data model helps design the database at the conceptual, physical and

logical levels.

 Data Model structure helps to define the relational tables, primary and

foreign keys and stored procedures.

 It provides a clear picture of the base data and can be used by database

developers to create a physical database.

 It is also helpful to identify missing and redundant data.

 Though the initial creation of a data model is labor- and time-consuming, in
the long run it makes upgrading and maintaining your IT infrastructure
cheaper and faster.

Data modeling is an essential step in the process of creating any complex

software. It helps developers understand the domain and organize their work

accordingly.

Higher Quality

Just as architects consider blueprints before constructing a building, you should

consider data before building an app. On average, about 70 percent of software

development efforts fail, and a major source of failure is premature coding. A

data model helps define the problem, enabling you to consider different

approaches and choose the best one.


Reduced cost

You can build applications at lower cost via data models. Data modeling typically

consumes less than 5-10 percent of a project budget, and can reduce the 65-75

percent of budget that is typically devoted to programming. Data modeling

catches errors and oversights early, when they are easy to fix. This is better than

fixing errors once the software has been written or – worse yet – is in customer

hands.

Clearer scope

A data model provides a focus for determining scope. It provides something

tangible to help business sponsors and developers agree over precisely what is

included with the software and what is omitted. Business users can see what the

developers are building and compare it with their understanding. Models promote

consensus among developers, customers and other stakeholders.

A data model also promotes agreement on vocabulary and jargon. The model

highlights the chosen terms so that they can be driven forward into software

artifacts. The resulting software becomes easier to maintain and extend.

Faster performance

A sound model simplifies database tuning. A well-constructed database typically

runs fast, often quicker than expected. To achieve optimal performance, the

concepts in a data model must be crisp and coherent. Then the proper rules

must be used for translating the model into a database design.


When performance is poor, it is seldom a problem of the database software (Oracle, SQL Server, MySQL,

etc.) – but, rather, that the database is being used improperly. Once that problem

is fixed, the performance is just fine. Modeling provides a means to understand a

database so that you are able to tune it for fast performance.

Better documentation

Models document important concepts and jargon, providing a basis for long-term

maintenance. The documentation will serve you well through staff turnover.

Today, most application vendors can provide a data model of their application

upon request. That is because the IT industry recognizes that models are

effective at conveying important abstractions and ideas in a concise and

understandable manner.

Fewer application errors

A data model causes participants to crisply define concepts and resolve

confusion. As a result, application development starts with a clear vision.

Developers can still make detailed errors as they write application code, but they

are less likely to make deep errors that are difficult to resolve.

Fewer data errors

Data errors are worse than application errors. It is one thing to have an

application crash, necessitating a restart. It is another thing to corrupt data in a

large database.


A data model not only improves the conceptual quality of an application, it also

lets you leverage database features that improve data quality. Developers can

weave constraints into the fabric of a model and the resulting database. For

example, every table should normally have a primary key. The database can

enforce other unique combinations of fields. Referential integrity can ensure that

foreign keys are bona fide and not dangling.

Managed risk

You can use a data model to estimate the complexity of software, and gain

insight into the level of development effort and project risk. You should consider

the size of a model, as well as the intensity of inter-table connections.

Robert Hillard wrote an excellent book, “Information-Driven Business” in which he

equates a data model to a mathematical graph. He uses the graph as a basis for

assessing software complexity. An application database with heavily

interconnected tables is more complex and therefore prone to more risk of

development failure.

A good start for data mining

The documentation inherent in a model serves as a starting point for analytical

data mining. You can take day-to-day business data and load it into a dedicated

database, known as a “data warehouse.” Data warehouses are constructed

specifically for the purpose of data analysis, leveraging that data from routine

operations.


Why should you consider data modeling in your business?

The better your data modeling, the more business benefits you receive in terms
of productivity, efficiency, customer satisfaction, profitability, and a
better understanding of your core business needs. However, you have to
carefully consider the discovered data types to avoid over-modeling, which adds
cost and slows development.

BUSINESS RULE

A business rule is a statement that describes a business policy or procedure.

Business rules are usually expressed at the atomic level -- that is, they cannot be

broken down any further. It imposes some form of constraint on a specific aspect

of the database, such as the elements within a field specification for a particular

field or the characteristics of a given relationship. You base a business rule on

the way the organization perceives and uses its data, which you determine from

the way the organization functions or conducts its business.

Business rules, the foundation of data models, are derived from policies,

procedures, events, functions, and other business objects, and they state

constraints on the organization. Business rules represent the language and

fundamental structure of an organization (Hay, 2003). Business rules formalize

the understanding of the organization-by-organization owners, managers, and

leaders with that of information systems architects.


Business rules are important in data modeling because they govern how data are

handled and stored. Examples of basic business rules are data names and

definitions.
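Many atomic business rules of this kind can be enforced declaratively in the database itself. A hedged sketch using hypothetical names and rules:

    CREATE TABLE course_section (
        section_id   INTEGER PRIMARY KEY,
        course_code  CHAR(8) NOT NULL,                       -- rule: every section belongs to a course
        term         CHAR(6) NOT NULL,
        max_students INTEGER NOT NULL
                     CHECK (max_students BETWEEN 1 AND 60),  -- rule: class size limit
        UNIQUE (course_code, term)                           -- rule: one section per course per term
    );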

SCOPE OF BUSINESS RULE 

  We are concerned with business rules that impact only an organization’s

databases. Most organizations have a host of rules and/or policies that fall

outside this definition. For example, the rule “Friday is business casual dress

day” may be an important policy statement, but it has no immediate impact on

databases. In contrast, the rule “A student may register for a section of a course

only if he or she has successfully completed the prerequisites for that course” is

within our scope because it constrains the transactions that may be processed

against the database. It causes any transaction that attempts to register a

student who does not have the necessary prerequisites to be rejected. Some

business rules cannot be represented in common data modeling notation; those

rules that cannot be represented in a variation of an entity-relationship diagram

are stated in natural language, and some can be represented in the relational

data model.

Business rules can be applied to computing systems and are designed to help an

organization achieve its goals. Software is used to automate business rules using

business logic.

Business rules can also be generated by internal or external necessity. For

example, a business can come up with business rules that are self-imposed to


meet leadership’s own goals, or in the pursuit of compliance with external

standards. Experts also point out that while there is a system of strategic

processes governing business rules, the business rules themselves are not

strategic, but simply directive in nature.

ENTITY RELATIONSHIP MODEL

The ER model defines entity sets, not individual entities; entity sets are described in
terms of their attributes.

An entity-relationship model (e-r model) is a detailed, logical representation of the

data for an organization or for a business area. The E-R model is expressed in

terms of entities in the business environment, the relationships (or associations)

among those entities, and the attributes (or properties) of both the entities and

their relationships. An E-R model is normally expressed as an entity-relationship


diagram (e-r diagram, or erD), which is a graphical representation of an E-R

model.

Entity-Relationship Model is the diagrammatical representation of a database

structure which is called an ER diagram. The ER diagram is considered a

blueprint of a database which has mainly two components i.e., relationship set,

and entity set. The ER diagram is used to represent the relationship exists

among the entity set. The entity set is considered as a group of entities of similar

type which contains attributes. According to the database system management

system the entity is considered as a table and attributes are columns of a table.

So, the ER diagram shows the relationship among tables in the database. The

entity is considered a real-world object which is stored physically in the database.

The entities have attributes that help to uniquely identify the entity. The entity set

can be considered as a collection of similar types of entities.

Why do we use the Entity diagram?

The entity diagram is used to represent the database in the diagram form. It

helps to properly understand the database. All the necessary details of the

database can be represented in the form of the ER diagram. The entities

represent all the tables of the database, attributes are the columns of tables and

the relationship represented the association among the tables of a database.


The figure represents the ER diagram of the college student database. The
student, college, mechanical, electronics, and computer science are entities, and
"enrolls in" and "specialized in" are the relationships. The attributes are name,
age, gender, DOB, affiliation, and address.

Components of Entity-Relationship Model

The ER model is used as a conceptual view of the database. The ER model

consists of real-world entities and the related associations that exist between them.

The ER model gives the complete idea of a database used for any application

and it is very easy to understand. The below section contains information about

the components of the ER diagram.

1. Entity

An entity is a person, a place, an object, an event, or a concept in the user

environment about which the organization wishes to maintain data. Thus, an

entity has a noun name. Some examples of each of these kinds of entities follow:

Person: Employee, Student, Patient; Place: Store, Warehouse, State; Object:

Machine, Building, Automobile; Event: Sale, Registration, Renewal; Concept:


Account, Course, Work Center. All types of entities have some attributes or
properties which help give a proper idea of the entity. The entity set can
be considered as a collection of similar types of entities. An entity set may
contain entities whose attributes hold similar types of values. For example,
the employee set will contain information from all employees. The entity sets do
not need to be disjoint.

An entity is an object or event in our environment that we want to keep track of. A

person is an entity. So is a building, a piece of inventory sitting on a shelf, a

finished product ready for sale, and a sales meeting (an event). An attribute is a

property or characteristic of an entity.

 Weak entity: A weak entity is an entity that cannot be uniquely identified
by its own attributes and which requires a relationship with some
other entity. This type of entity is known as a weak entity. In the ER
diagram, a double rectangle is used for representing a weak entity. For
example, a bank account on its own is considered a weak entity because
it cannot be determined which bank the account belongs to.

An entity type whose existence depends on some other entity type.

(Some data modeling software, in fact, use the term dependent entity.) A

weak entity type has no business meaning in an E-R diagram without the

entity on which it depends. The entity type on which the weak entity type


depends is called the identifying owner (or simply owner for short). A weak

entity type does not typically have its own identifier. Generally, on an E-R

diagram, a weak entity type has an attribute that serves as a partial

identifier.

 Strong Entity- A strong entity type is one that exists independently of

other entity types. (Some data modeling software, in fact, use the term

independent entity.) Examples include Student, Employee, Automobile,

and Course. Instances of a strong entity type always have a unique

characteristic (called an identifier)—that is, an attribute or a combination of

attributes that uniquely distinguish each occurrence of that entity.
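As a minimal sketch (the table and column names are assumptions, following the bank account example above), a strong entity and a weak entity that depends on it might be realized as tables like this, with the weak entity's primary key combining the owner's key and its own partial identifier:

    CREATE TABLE bank (                          -- strong entity: identified on its own
        bank_id   INTEGER PRIMARY KEY,
        bank_name VARCHAR(60) NOT NULL
    );

    CREATE TABLE bank_account (                  -- weak entity: depends on its identifying owner
        bank_id    INTEGER  NOT NULL REFERENCES bank (bank_id),
        account_no CHAR(12) NOT NULL,            -- partial identifier
        balance    DECIMAL(12,2) DEFAULT 0,
        PRIMARY KEY (bank_id, account_no)        -- owner key + partial identifier
    );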

ENTITY TYPE VS. ENTITY INSTANCE

There is an important distinction between entity types and entity instances.

An entity type is a collection of entities that share common properties or

characteristics. Each entity type in an E-R model is given a name. Because the

name represents a collection (or set) of items, it is always singular. We use

capital letters for names of entity type(s). In an E-R diagram, the entity name is

placed inside the box representing the entity type. 

It is the fundamental building block for describing the structure of data with the

Entity Data Model. In a conceptual model, entity types are constructed from

properties and describe the structure of top-level concepts, such as customers


and orders in a business application. In the same way that a class definition in a

computer program is a template for instances of the class, an entity type is a

template for entities.

An entity instance is a single occurrence of an entity type. An entity type is

described just once (using metadata) in a database, whereas many instances of

that entity type may be represented by data stored in the database. For example,

there is one EMPLOYEE entity type in most organizations, but there may be

hundreds (or even thousands) of instances of this entity type stored in the

database. We often use the single term entity rather than entity instance when

the meaning is clear from the context of our discussion.

It is a manifestation of an entity within that category. For example, Cell could be

the entity type, but Cell_1 , Cell_2 , and Cell_3 would represent the actual items

within the network.

In simple words:

ENTITY- A person, a place, an object, an event, or a concept in the user

environment about which the organization wishes to maintain data

ENTITY TYPE- A collection of entities that share common properties or

characteristics

ENTITY INSTANCE- A single occurrence of an entity type.
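In relational terms, the distinction can be sketched like this (hypothetical names): the CREATE TABLE statement describes the entity type once, as metadata, while each inserted row is one entity instance.

    -- entity type: described once using metadata
    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        employee_name VARCHAR(60) NOT NULL
    );

    -- entity instances: many occurrences stored as rows
    INSERT INTO employee VALUES (101, 'Reyes, Ana');
    INSERT INTO employee VALUES (102, 'Santos, Ben');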



2. Attributes

The entities are represented using some properties and these properties are

known as attributes. All the attributes have some value. For example- the

employee entity can have the following attributes – employee name, employee

age, and employee contact details. Each attribute has a
domain of values that can be allocated to it. For example, the
employee’s name cannot be assigned a numeric value; the employee’s
name should always be alphabetic. The employee’s age cannot be a negative
number; it should always be positive.

Attributes are facts or description of entities. They are also often nouns and

become the columns of the table. For example, for entity student, the attributes

can be first name, last name, email, address, and phone numbers.

Types of Attribute

The types of attributes are given below:

1. Simple attribute: The simple attribute can be considered as an atomic value
that can’t be further segregated. For example, the employee phone
number cannot be further segregated into some other attribute. A simple
attribute is an attribute that cannot be broken down into smaller components that are meaningful


for the organization. For example, all the attributes associated with

AUTOMOBILE are simple: Vehicle ID, Color, Weight, and Horsepower

2. Composite attribute: The composite attribute contains more than one

attribute in the group. For example, the employee’s name attribute can be

considered as a composite attribute as the employee’s name can be

further segregated into a first name and last name.

Composite attributes provide considerable flexibility to users, who can

either refer to the composite attribute as a single unit or else refer to

individual components of that attribute. Thus, for example, a user can

either refer to Address or refer to one of its components, such as Street

Address. The decision about whether to subdivide an attribute into its

component parts depends on whether users will need to refer to those

individual components, and hence, they have organizational meaning. Of

course, the designer must always attempt to anticipate possible future

usage patterns for the database.

3. Derived attribute: The derived attribute is the type of attribute which does
not exist in the database physically; rather, its values are derived from
other data that are present in the database physically. For example, the
average salary of employees is a derived attribute, as it is not directly
stored in the database. The value can be derived from other attributes
present in the database physically. (A short SQL sketch of derived and
multivalued attributes appears after this list of attribute types.)


an attribute whose values can be calculated from related attribute values

(plus possibly data not in the database, such as today’s date, the current

time, or a security code provided by a system user). We indicate a derived

attribute in an E-R diagram by using square brackets around the attribute

name, as shown in Figure 2-8 for the Years Employed attribute. Some E-R

diagramming tools use a notation of a forward slash (/) in front of the

attribute name to indicate that it is derived. (This notation is borrowed from

UML for a virtual attribute.)

4. Single-valued attribute: A single-valued attribute contains only one value
for a given entity. For example, a Social Security number.

5. Multi-valued attribute: A multi-valued attribute is an attribute which
contains more than one value. For example, an employee can have more than
one email id and phone number. A multivalued attribute is an attribute that

may take on more than one value for a given entity (or relationship)

instance. In this text, we indicate a multivalued attribute with curly brackets

around the attribute name, as shown for the Skill attribute in the

EMPLOYEE. In Microsoft Visio, once an attribute is placed in an entity,

you can edit that attribute (column), select the Collection tab and choose

one of the options. (Typically, Multiset will be your choice, but one of the

other options may be more appropriate for a given situation.) Other E-R

diagramming tools may use an asterisk (*) after the attribute name, or you


may have to use supplemental documentation to specify a multivalued

attribute.
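Two of the attribute types above map naturally onto relational constructs. A minimal sketch with hypothetical names: a derived attribute is typically computed in a view rather than stored, and a multivalued attribute such as Skill is typically moved into its own table keyed by the owning entity (an EMPLOYEE table keyed by employee_id is assumed to exist).

    -- derived attribute: line_total is computed from stored attributes, not stored itself
    CREATE VIEW order_line_totals AS
    SELECT order_id,
           product_id,
           quantity * unit_price AS line_total
    FROM order_line;

    -- multivalued attribute: one row per employee/skill pair
    CREATE TABLE employee_skill (
        employee_id INTEGER     NOT NULL REFERENCES employee (employee_id),
        skill       VARCHAR(40) NOT NULL,
        PRIMARY KEY (employee_id, skill)
    );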

Primary Key

A primary key, or identifier, is an attribute or a set of attributes that uniquely

identifies an instance of the entity. For example, for a student entity, student

number is the primary key since no two students have the same student number.

We can have only one primary key in a table. It identifies uniquely every row and

it cannot be null.

Foreign key

A foreign key (sometimes called a referencing key) is a key used to link two
tables together. Typically, you take the primary key field from one table and insert
it into the other table, where it
becomes a foreign key (it remains a primary key in the original table). We can

have more than one foreign key in a table.


How many entities are there in this diagram and what are they?

There are seven entities: STUDENT, COURSE, INSTRUCTOR, SEAT, CLASS,

SECTION and PROFESSOR.

What are the attributes for entity STUDENT?

The attributes for Entity STUDENT are: student_id, student_name and

student_address

What is the primary key for STUDENT?

The primary key for STUDENT is: student_id

What is the primary key for COURSE?

Not a trick question! There is only one primary key, but it is made up of two

attributes. This is called a compound key.

What foreign keys do STUDENT and COURSE contain?

STUDENT and COURSE contain no foreign keys in this diagram. This might

suggest that there are problems with the design... among them are the many-to-many
relationships here. This usually requires that we create a separate table to

describe the relationship. This type of table usually connects foreign ids to each

other.

In this case, let's add an entity called REGISTRATION in the middle of the "takes"
relationship. Since students probably sit in different seats for each course they are
registered in, let's relate SEAT to REGISTRATION instead of STUDENT:
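A sketch of how the REGISTRATION entity resolves the many-to-many relationship between STUDENT and COURSE (the column names and COURSE's compound key are assumptions for illustration):

    CREATE TABLE registration (
        student_id    INTEGER NOT NULL REFERENCES student (student_id),
        course_number CHAR(8) NOT NULL,
        course_term   CHAR(6) NOT NULL,
        seat_no       CHAR(4),                              -- SEAT now relates to REGISTRATION
        PRIMARY KEY (student_id, course_number, course_term),
        FOREIGN KEY (course_number, course_term)
            REFERENCES course (course_number, course_term)  -- COURSE's compound primary key
    );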

3. Relationship

The relationship is another type of component of the ER diagram which is used

to show the dependency among the entities of the database. In the ER diagram,

the relationship is represented by a diamond-shaped box. All the relationship

which exist between the entities is connected by a line which shows in the ER

diagram.

There are different type of relationship which are discussed below:

One-to-one: In this relationship, one instance of an entity is related to exactly one
instance of another entity. For example, an individual has a passport and the
passport is allocated to one individual.


Many-to-one: In this relationship, many instances of an entity are linked to
one instance of another entity. For example, many students study
in one college.

One-to-many: When one instance of an entity is linked to more than one instance of
another entity, it is a one-to-many relationship. For example, one customer places
multiple orders.

Many-to-many: When many instances of one entity are linked to many instances of
another entity, it is known as a many-to-many relationship. For example, students
can have multiple projects, and a project can be
allocated to multiple students.

DEGREE OF A RELATIONSHIP


The degree of a relationship is the number of entity types that participate in that relationship. Thus, the relationship Completes in Figure 2-11 is of degree 2, because there are two entity types: EMPLOYEE and COURSE. The three most common relationship degrees in E-R models are unary (degree 1), binary (degree 2), and ternary (degree 3). Higher-degree relationships are possible, but they are rarely encountered in practice, so we restrict our discussion to these three cases. Examples of unary, binary, and ternary relationships appear in Figure 2-12. (Attributes are not shown in some figures for simplicity.) As you look at Figure 2-12, understand that any particular data model represents a specific situation, not a generalization. For example, consider the Manages relationship in Figure 2-12a. In some organizations, it may be possible for one employee to be managed by many other employees (e.g., in a matrix organization). It is important when you develop an E-R model that you understand the business rules of the particular organization you are modeling.


UNARY RELATIONSHIP

A unary relationship is a relationship between the instances of a single entity

type. (Unary relationships are also called recursive relationships.) Three

examples are shown in Figure 2-12a. In the first example, Is Married To is shown

as a one-to-one relationship between instances of the PERSON entity type.

Because this is a one-to-one relationship, this notation indicates that only the

current marriage, if one exists, needs to be kept about a person. What would

change if we needed to retain the history of marriages for each person? See

Review Question 2-20 and Problem and Exercise 2-34 for other business rules

and their effect on the Is Married To relationship representation. In the second

example, Manages is shown as a one-to-many relationship between instances of

the EMPLOYEE entity type. Using this relationship, we could identify, for

example, the employees who report to a particular manager. The third example is

one case of using a unary relationship to represent a sequence, cycle, or priority

list. In this example, sports teams are related by their standing in their league

(the Stands After relationship). (Note:  In these examples, we ignore whether

these are mandatory- or optional-cardinality relationships or whether the same

entity instance can repeat in the same relationship instance; we will introduce

mandatory and optional cardinality in a later section of this chapter.)

 Figure 2-13 shows an example of another unary relationship, called a bill-

of-materials structure. Many manufactured products are made of assemblies, which in turn are composed of subassemblies and parts, and so on. As shown in


Figure 2-13a, we can represent this structure as a many-to-many unary

relationship. In this figure, the entity type ITEM is used to represent all types of

components, and we use Has Components for the name of the relationship type

that associates lower-level items with higher-level items. 

Two occurrences of this bill-of-materials structure are shown in Figure 2-13b.

Each of these diagrams shows the immediate components of each item as well

as the quantities of that component. For example, item TX100 consists of item

BR450 (quantity 2) and item DX500 (quantity 1). You can easily verify that the

associations are in fact many-to-many. Several of the items have more than one

component type (e.g., item MX300 has three immediate component types:

HX100, TX100, and WX240). Also, some of the components are used in several

higher-level assemblies. For example, item WX240 is used in both item MX300

and item WX340, even at different levels of the bill-of-materials. The many-to-

many relationship guarantees that, for example, the same subassembly structure

of WX240 (not shown) is used each time item WX240 goes into making some

other item. 


The presence of the attribute Quantity on the relationship suggests that the analyst consider converting the relationship Has Components to an associative entity. Figure 2-13c shows the entity type BOM STRUCTURE, which forms an association between instances of the ITEM entity type. A second attribute (named Effective Date) has been added to BOM STRUCTURE to record the date when this component was first used in the related assembly. Effective dates are often needed when a history of values is required. Other data model structures can be used for unary relationships involving such hierarchies.
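As a hedged sketch of how this bill-of-materials structure might be stored (the table and column names are assumptions, and the effective dates are made up for illustration), one ITEM table is paired with a self-referencing BOM STRUCTURE table whose rows carry the Quantity and Effective Date intersection data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE item (item_no TEXT PRIMARY KEY, description TEXT);

    -- Each row links a higher-level item to one of its immediate components,
    -- so the same ITEM entity type appears on both sides (a unary relationship).
    CREATE TABLE bom_structure (
        parent_item    TEXT REFERENCES item(item_no),
        component_item TEXT REFERENCES item(item_no),
        quantity       INTEGER NOT NULL,
        effective_date TEXT,
        PRIMARY KEY (parent_item, component_item)
    );
""")

for item_no in ("TX100", "BR450", "DX500"):
    conn.execute("INSERT INTO item (item_no) VALUES (?)", (item_no,))

# Item TX100 consists of item BR450 (quantity 2) and item DX500 (quantity 1),
# as in the Figure 2-13b example; the dates are purely hypothetical.
conn.executemany("INSERT INTO bom_structure VALUES (?, ?, ?, ?)",
                 [("TX100", "BR450", 2, "2024-01-01"),
                  ("TX100", "DX500", 1, "2024-01-01")])
```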

BINARY RELATIONSHIP

A binary relationship is a relationship between the instances of two entity types

and is the most common type of relationship encountered in data modeling.


Figure 2-12b shows three examples. The first (one-to-one) indicates that an

employee is assigned one parking place, and that each parking place is assigned

to one employee. The second (one-to-many) indicates that a product line may

contain several products, and that each product belongs to only one product line.

The third (many-to-many) shows that a student may register for more than one

course, and that each course may have many student registrants.

 CONCEPTS IN ACTION

2-A THE WALT DISNEY COMPANY

The Walt Disney Company is world-famous for its many entertainment ventures

but it is especially identified with its theme parks. First there was Disneyland in

Los Angeles, then the mammoth Walt Disney World in Orlando. These were

followed by parks in Paris and Tokyo, and one now under development in Hong

Kong. The Disney theme parks are so well run that they create a wonderful

feeling of natural harmony with everyone and everything being in the right place

at the right time. When you're there, it's too much fun to stop to think about how

all this is organized and carried off with such precision. But, is it any wonder to

learn that databases play a major part?

One of the Disney theme parks' interesting database applications keeps track of

all of the costumes worn by the workers or “cast members” in the parks. The

system is called the Garment Utilization System or GUS (which was also the

name of one of the mice that helped Cinderella sew her dress!). Managing these

costumes is no small task. Virtually all of the cast members, from the actors and


dancers to the ride operators, wear some kind of costume. Disneyland in Los

Angeles has 684,000 costume parts (each costume is typically made up of

several garments), each of which is uniquely bar-coded, for its 46,000 cast

members. The numbers in Orlando are three million garments and 90,000 cast

members. Using bar-code scanning, GUS tracks the life cycle of every garment.

This includes the points in time when a garment is in the storage facility, is

checked out to a cast member, is in the laundry, or is being repaired (in house or

at a vendor). In addition to managing the day-to-day movements of the

costumes, the system also provides a rich data analysis capability. The industrial

engineers in Disney's business planning group use the accumulated data to

decide how many garments to keep in stock and how many people to have

staffing the garment checkout windows based on the expected wait times. They

also use the data to determine whether certain fabrics or the garments made by

specific manufacturers are not holding up well through a reasonable number of

uses or of launderings. 

GUS, which was inaugurated at Disneyland in Los Angeles in 1998 and then

again at Walt Disney World in Orlando in 2002, replaced a manual system in

which the costume data was written on index cards. It is implemented in

Microsoft's SQL Server DBMS and runs on a Compaq server. It is also linked to

an SAP personnel database to help maintain the status of the cast members. If

GUS is ever down, the process shifts to a Palm Pilot-based backup system that

can later update the database. In order to keep track of the costume parts and

cast members, not surprisingly, there is a relational table for costume parts with


one record for each garment and there is a table for cast members with one

record for each cast member. The costume parts records include the type of

garment, its size, color, and even such details as whether its use is restricted to a

particular cast member and whether it requires a special laundry detergent.

Correspondingly, the cast member records include the person's clothing sizes

and other specific garment requirements.

Ultimately, GUS's database precision serves several purposes in addition to its

fundamental managerial value. The Walt Disney Company feels that consistency

in how its visitors or “guests” look at a given ride gives them an important comfort

level. Clearly, GUS provides that consistency in the costuming aspect. In

addition, GUS takes the worry out of an important part of each cast member's

workday. One of Disney's creeds is to strive to take good care of its cast

members so that they will take good care of Disney's guests. Database

management is a crucial tool in making this work so well.

FIGURE 2.2 A binary relationship

Cardinality

One-to-One Binary Relationship Figure 2.3 shows three binary relationships of

different cardinalities, representing the maximum number of entities that can be

involved in a particular relationship. Figure 2.3a shows a one-to-one (1-1) binary


relationship, which means that a single occurrence of one entity type can be

associated with a single occurrence of the other entity type and vice versa. A

particular salesperson is assigned to one office. Conversely, a particular office (in

this case they are all private offices!) has just one salesperson assigned to it.

Note the “bar” or “one” symbol on either end of the relationship in the diagram

indicating the maximum one cardinality. The way to read these diagrams is to

start at one entity, read the relationship on the connecting line, pick up the

cardinality on the other side of the line near the second entity, and then finally

reach the other entity. Thus, Figure 2.3a, reading from left to right, says, “A

salesperson works in one (really at most one, since it is a maximum) office.” The

bar or one symbol involved in this statement is the one just to the left of the office

entity box. Conversely, reading from right to left, “An office is occupied by one

salesperson.”

FIGURE 2.3 Binary relationships with cardinalities

One-to-Many Binary Relationship Associations can also be multiple in nature. Figure 2.3b shows a


one-to-many (1-M) binary relationship between salespersons and customers.

The “crow's foot” device attached to the customer entity box represents the

multiple association. Reading from left to right, the diagram indicates that a

salesperson sells to many customers. (Note that “many,” as the maximum

number of occurrences that can be involved, means a number that can be 1, 2,

3, …n. It also means that the number is not restricted to being exactly one, which

would require the “one” or “bar” symbol instead of the crow's foot.) Reading from

right to left, Figure 2.3b says that a customer buys from only one salesperson.

This is reasonable, indicating that in this company each salesperson has an

exclusive territory and thus each customer can be sold to by only one

salesperson from the company.

Many-to-Many Binary Relationship Figure 2.3c shows a many-to-many (M-M)

binary relationship among salespersons and products. A salesperson is

authorized to sell many products; a product can be sold by many salespersons.

By the way, in some circumstances, in either the 1-M or M-M case, “many” can

be either an exact number or have a known maximum value. For example, a

company rule may set a limit of a maximum of ten customers in a sales territory.

Then the “many” in the 1-M relationship of Figure 2.3b can never be more than

10 (a salesperson can have many customers but not more than 10). Sometimes

people include this exact number or maximum next to or even instead of the

crow's foot in the E-R diagram.

Modality


Figure 2.4 shows the addition of the modality, the minimum number of entity

occurrences that can be involved in a relationship. In our particular salesperson

environment, every salesperson must be assigned to an office. On the other

hand, a given office might be empty or it might be in use by exactly one

salesperson. This situation is recorded in Figure 2.4a, where the “inner” symbol,

which can be a zero or a one, represents the modality—the minimum—and the

“outer” symbol, which can be a one or a crow's foot, represents the cardinality—

the maximum. Reading Figure 2.4a from left to right tells us that a salesperson

works in a minimum of one and a maximum of one office, which is another way of

saying exactly one office. Reading from right to left, an office may be occupied by

or assigned to a minimum of no salespersons (i.e. the office is empty) or a

maximum of one salesperson.

Similarly, Figure 2.4b indicates that a salesperson may have no customers or

many customers. How could a salesperson have no customers? (What are we

paying her for?!?) Actually, this allows for the case in which we have just hired a

new salesperson and have not as yet assigned her a territory or any customers.

On the other hand, a customer is always assigned to exactly one salesperson.

We never want customers to be without a salesperson—how would they buy

anything from us when they need to? We never want to be in a position of losing

sales! If a salesperson leaves the company, the company's procedures require

that another salesperson or, temporarily, a sales manager be immediately

assigned the departing salesperson's customers. Figure 2.4c says that each

salesperson is authorized to sell at least one or many of our products and each


product can be sold by at least one or many of our salespersons. This includes

the extreme, but not surprising, case in which each salesperson is authorized to

sell all the products and each product can be sold by all the salespersons.

FIGURE 2.4 Binary relationships with cardinalities (maximums) and modalities (minimums)

More About Many-to-Many Relationships

Intersection Data Generally, we think of attributes as facts about entities. Each

salesperson has a salesperson number, a name, a commission percentage, and

a year of hire. At the entity occurrence level, for example, one of the

salespersons has salesperson number 528, the name Jane Adams, a

commission percentage of 15 %, and the year of hire of 2003. In an E-R diagram,

these attributes are written or drawn together with the entity, as in Figure 2.1 and

the succeeding figures. This certainly appears to be very natural and obvious.

Are there ever any circumstances in which an attribute can describe something

other than an entity?


Consider the many-to-many relationship between salespersons and products

in Figure 2.4c. As usual, salespersons are described by their salesperson

number, name, commission percentage, and year of hire. Products are described

by their product number, name, and unit price. But, what if there is a requirement

to keep track of the number of units (call it “quantity”) of a particular product that

a particular salesperson has sold? Can we add the quantity attribute to the

product entity box? No, because for a particular product, while there is a single

product number, product name, and unit price, there would be lots of “quantities,”

one for each salesperson selling the product. Can we add the quantity attribute to

the salesperson entity box? No, because for a particular salesperson, while there

is a single salesperson number, salesperson name, commission percentage, and

year of hire, there will be lots of “quantities,” one for each product that the

salesperson sells. It makes no sense to try to put the quantity attribute in either

the salesperson entity box or the product entity box. While each salesperson has

a single salesperson number, name, commission percentage, and year of hire,

each salesperson has many “quantities,” one for each product he sells. Similarly,

while each product has a single product number, product name, and unit price,

each product has many “quantities,” one for each salesperson who sells that

product. But an entity box in an E-R diagram is designed to list the attributes that

simply and directly describe the entity, with no complications involving other

entities. Putting quantity in either the salesperson entity box or the product entity

box just will not work.


The quantity attribute doesn't describe either the salesperson alone or the

product alone. It describes the combination of a particular salesperson and a

particular product. In general, we can say that it describes the combination of a

particular occurrence of one entity type and a particular occurrence of the other

entity type. Let's say that since salesperson number 137 joined the company, she

has sold 170 units of product number 24013. The quantity 170 doesn't make

sense as a description or characteristic of salesperson number 137 alone. She

has sold many different kinds of products. To which one does the quantity 170

refer? Similarly, the quantity 170 doesn't make sense as a description or

characteristic of product number 24013 alone. It has been sold by many different

salespersons.

In fact, the quantity 170 falls at the intersection of salesperson number 137 and

product number 24013. It describes the combination of or the association

between that particular salesperson and that particular product and it is known

as intersection data. Figure 2.5 shows the many-to-many relationship between

salespersons and products with the intersection data, quantity, represented in a

separate box attached to the relationship line. That is the natural place to draw it.

Pictorially, it looks as if it is at the intersection between the two entities, but there

is more to it than that. The intersection data describes the relationship between

the two entities. We know that an occurrence of the Sells relationship specifies

that salesperson 137 has sold some of product 24013. The quantity 170 is an

attribute of this occurrence of that relationship, further describing this occurrence

of the relationship. Not only do we know that salesperson 137 sold some of


product 24013 but we know how many units of that product that salesperson

sold.

FIGURE 2.5 Many-to-many binary relationship with intersection data

 
The Unique Identifier in Many-to-Many Relationships Since, as we have just seen, a many-to-many relationship can appear to be a kind of an entity,

complete with attributes, it also follows that it should have a unique identifier, like

other entities. (If this seems a little strange or even unnecessary here, it will

become essential later in the book when we actually design databases based on

these E-R diagrams.) In its most basic form, the unique identifier of the many-to-

many relationship or the associative entity is the combination of the unique

identifiers of the two entities in the many-to-many relationship. So, the unique

identifier of the many-to-many relationship of Figure 2.5 or, as shown in Figure

2.6, of the associative entity, is the combination of the Salesperson Number and

Product Number attributes.

Sometimes, an additional attribute or attributes must be added to this

combination to produce uniqueness. This often involves a time element. As

currently constructed, the E-R diagram in Figure 2.6 indicates the quantity of a

particular product sold by a particular salesperson since the salesperson joined


the company. Thus, there can be only one occurrence of SALES combining a

particular salesperson with a particular product. But if, for example, we wanted to

keep track of the sales on an annual basis, we would have to include a year

attribute and the unique identifier would be Salesperson Number, Product

Number, and Year. Clearly, if we want to know how many units of each product

were sold by each salesperson each year, the combination of Salesperson

Number and Product Number would not be unique because for a particular

salesperson and a particular product, the combination of those two values would

be the same each year! Year must be added to produce uniqueness, not to

mention to make it clear in which year a particular value of the Quantity attribute

applies to a particular salesperson-product combination.

The third and last possibility occurs when the nature of the associative entity is

such that it has its own unique identifier. For example, a company might specify a

unique serial number for each sales record. Another example would be the

many-to-many relationship between motorists and police officers who give traffic

tickets for moving violations. (Hopefully it's not too many for each motorist!) The

unique identifier could be the combination of police officer number and motorist

driver's license number plus perhaps date and time. But, typically, each traffic

ticket has a unique serial number and this would serve as the unique identifier.
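A brief, hedged sketch of the associative entity discussed above (the names are assumptions): Quantity is stored as intersection data, and Year is included in the composite primary key so that annual sales figures remain unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE salesperson (salesperson_number INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product     (product_number     INTEGER PRIMARY KEY, name TEXT);

    -- Quantity describes one salesperson/product combination, not either entity
    -- alone; adding year keeps one row per salesperson, product, and year.
    CREATE TABLE sales (
        salesperson_number INTEGER REFERENCES salesperson(salesperson_number),
        product_number     INTEGER REFERENCES product(product_number),
        year               INTEGER,
        quantity           INTEGER NOT NULL,
        PRIMARY KEY (salesperson_number, product_number, year)
    );
""")
```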

 TERNARY RELATIONSHIP

A ternary relationship is a simultaneous relationship among the instances of

three entity types. A typical business situation that leads to a ternary relationship

is shown in Figure 2-12c. In this example, vendors can supply various parts to


warehouses. The relationship Supplies is used to record the specific parts that

are supplied by a given vendor to a particular warehouse. Thus, there are three

entity types: VENDOR, PART, and WAREHOUSE. There are two attributes on

the relationship Supplies: Shipping Mode and Unit Cost. For example, one

instance of Supplies might record the fact that vendor X can ship part C to

warehouse Y, that the shipping mode is next-day air, and that the cost is $5 per

unit.

Don’t be confused: A ternary relationship is not the same as three binary

relationships. For example, Unit Cost is an attribute of the Supplies relationship

in Figure 2-12c. Unit Cost cannot be properly associated with any one of the

three possible binary relationships among the three entity types, such as that

between PART and WAREHOUSE. 

Thus, for example, if we were told that vendor X can ship part C for a unit cost of

$8, those data would be incomplete because they would not indicate to which

warehouse the parts would be shipped. As usual, the presence of an attribute on

the relationship Supplies in Figure 2-12c suggests converting the relationship to

an associative entity type. Figure 2-14 shows an alternative (and preferable)

representation of the ternary relationship shown in Figure 2-12c. In Figure 2-14,

the (associative) entity type SUPPLY SCHEDULE is used to replace the Supplies

relationship from Figure 2-12c. Clearly, the entity type SUPPLY SCHEDULE is of

independent interest to users. However, notice that an identifier has not yet been

assigned to SUPPLY SCHEDULE. This is acceptable. If no identifier is assigned

to an associative entity during E-R modeling, an identifier (or key) will be


assigned during logical modeling (discussed in Chapter 4). This will be a

composite identifier whose components will consist of the identifier for each of

the participating entity types (in this example, PART, VENDOR, and

WAREHOUSE). Can you think of other attributes that might be associated with

SUPPLY SCHEDULE?

As noted earlier, we do not label the lines from SUPPLY SCHEDULE to the three

entities. This is because these lines do not represent binary relationships. To

keep the same meaning as the ternary relationship of Figure 2-12c, we cannot

break the Supplies relationship into three binary relationships, as we have

already mentioned. So, here is a guideline to follow: Convert all ternary (or

higher) relationships to associative entities, as in this example. Song et al. (1995)

shows that participation constraints (described in a following section on


cardinality constraints) cannot be accurately represented for a ternary

relationship, given the notation with attributes on the relationship line. However,

by converting to an associative entity, the constraints can be accurately

represented. Also, many E-R diagram drawing tools, including most CASE tools,

cannot represent ternary relationships. So, although not semantically accurate,

you must use these tools to represent the ternary or higher order relationship

with an associative entity and three binary relationships, which have a mandatory

association with each of the three related entity types.
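Following the guideline above, a minimal sketch (with assumed names) of the SUPPLY SCHEDULE associative entity replaces the ternary Supplies relationship with one foreign key per participating entity type plus the relationship's own attributes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE vendor    (vendor_id    INTEGER PRIMARY KEY, vendor_name    TEXT);
    CREATE TABLE part      (part_id      INTEGER PRIMARY KEY, part_name      TEXT);
    CREATE TABLE warehouse (warehouse_id INTEGER PRIMARY KEY, warehouse_name TEXT);

    -- One row records that a given vendor supplies a given part to a given
    -- warehouse; Shipping Mode and Unit Cost describe that three-way combination.
    CREATE TABLE supply_schedule (
        vendor_id     INTEGER REFERENCES vendor(vendor_id),
        part_id       INTEGER REFERENCES part(part_id),
        warehouse_id  INTEGER REFERENCES warehouse(warehouse_id),
        shipping_mode TEXT,
        unit_cost     REAL,
        PRIMARY KEY (vendor_id, part_id, warehouse_id)  -- composite identifier assigned at logical design
    );
""")
```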

Convert many-to-many Relationships into one-to-many Relationships

Entities in a many-to-many relationship must be linked in a special way, that is

through a third entity, called a composite entity, also known as an associative entity. A composite entity has only one function: to provide an indirect link between two entities in an M:N relationship.


In the language of tables, a composite entity is termed a linking table. A composite entity has no key attribute of its own; rather, it receives the key attributes from each of the two entities it links and combines them to form a composite key attribute. In the language of tables, a composite key attribute is termed a composite primary key.

The following graphic illustrates a composite entity that now indirectly links the

STUDENT and CLASS entities:

Create a composite entity called STUDENT CLASSES from a STUDENT entity

and CLASS entity. 

The M:N relationship between STUDENT and CLASS has been dissolved into

two one-to-many relations:

1. The 1:N relationship between STUDENT and STUDENT CLASSES reads

this way: for one instance of STUDENT, there exists zero, one, or many

instances of STUDENT CLASSES; but for one instance of STUDENT

CLASSES, there exists zero or one instance of STUDENT.

2. The 1:N relationship between CLASS and STUDENT CLASSES reads

this way: For one instance of CLASS, there exists zero, one, or many


instances of STUDENT CLASSES; but for one instance of STUDENT

CLASSES, there exists zero or one instance of CLASS.

Sometimes, but by no means always, the composite entity will “swipe”

attributes from one or both entities it links, because those attributes would be

more logically placed in the composite entity. In the case of STUDENT

CLASSES, however, none of the non-key attributes from STUDENT or

CLASS should be removed to the composite entity. The designer makes this

decision on a case-by-case basis. The next lesson describes types of

participation in relationships.


CHAPTER 3:


THE ENHANCED E-R MODEL

Researched and presented by:

Antolino Jr, Mike F.


Mendador, Jonnabelle

Definitions

 Entity–relationship model (or ER model) - describes interrelated things of

interest in a specific domain of knowledge. A basic ER model is composed

of entity types (which classify the things of interest) and specifies


relationships that can exist between entities (instances of those entity

types).

 Supertype - an entity type that has a relationship (a parent-to-child relationship) with one or more subtypes and contains the attributes that are common to its subtypes.

 Subtypes - subgroups of the supertype entity; each subtype has its own unique attributes, which differ from one subtype to another.

 Generalization - works on the principle of a bottom-up approach.

 Specialization - a top-down approach where a higher-level entity is specialized into two or more lower-level entities.

 Disjointness constraints - You will need to decide whether a supertype

instance may simultaneously be a member of two or more subtypes.

 Disjoint rule - an instance of a supertype may not simultaneously be a

member of two (or more) subtypes.

 Overlapping Rule - an instance of a supertype may simultaneously be a

member of two (or more) subtypes.

 Completeness constraints - decide whether a supertype instance must

also be a member of at least one subtype.

 Total Specialization Rule -   Each entity instance of a supertype must also

be a member of some subtype.

 Partial Specialization Rule - An entity instance of a supertype may or may

not belong to any subtype.


 Supertype/Subtype Hierarchy - a structure that comprises a combination of supertype/subtype relationships.

 Subtype Discriminator - is an attribute of a supertype whose values

determine the target subtype or subtypes.

 Universal data model - is a generic or template data model that can be

reused as

a starting point for a data modeling project.

ENHANCED E-R MODEL

The basic E-R model described in the previous chapter was first introduced

during the mid-1970s. It has been suitable for modeling most common business

problems and has enjoyed widespread use. However, the business environment

has changed dramatically since that time. Business relationships are more

complex, and as a result, business data are much more complex as well. For


example, organizations must be prepared to segment their markets and to

customize their products, which places much greater demands on organizational

databases. To cope better with these changes, researchers and consultants

have continued to enhance the E-R model so that it can more accurately

represent the complex data encountered in today’s business environment. The

term enhanced entity-relationship (EER) model is used to identify the model that

has resulted from extending the original E-R model with these new modeling

constructs. These extensions make the EER model semantically similar to

object-oriented data modeling

SUPERTYPE AND SUBTYPE

Recognize when to use supertype / subtype relationship in data modelling

At times, a few entities in a data model may share some common properties (attributes) while also having one or more distinct attributes of their own. Based on these attributes, such entities are categorized as supertype and subtype entities.

A supertype is an entity type that has a relationship (a parent-to-child relationship) with one or more subtypes and contains the attributes that are common to its subtypes.

Subtypes are subgroups of the supertype entity; each subtype has its own unique attributes, which differ from one subtype to another.


Supertypes and Subtypes are parent and child entities respectively and the

primary keys of supertype and subtype are always identical.

When designing a data model for PEOPLE, you can have a supertype entity of PEOPLE, and its subtype entities can be vendor, customer, and employee. The PEOPLE entity will have attributes like Name, Address, and Telephone Number, which are common to its subtypes, and you can design the employee, vendor, and customer entities with their own unique attributes. Based on this scenario, the employee entity can be further classified into different subtype entities such as HR employee and IT employee. Here, employee is the supertype for the entities HR employee and IT employee, but it is in turn a subtype of the PEOPLE entity.

Let us illustrate supertype/subtype relationships with a simple yet common

example. Suppose that an organization has three basic types of employees:

hourly employees, salaried employees, and contract consultants. The following

are some of the important attributes for each of these types of employees:

 Hourly employees: Employee Number, Employee Name, Address, Date Hired, Hourly Rate

 Salaried employees: Employee Number, Employee Name, Address, Date Hired, Annual Salary, Stock Option

 Contract consultants: Employee Number, Employee Name, Address, Date Hired, Contract Number, Billing Rate

Notice that all of the employee types have several attributes in common:

Employee Number, Employee Name, Address, and Date Hired. In addition, each

type has one or more attributes distinct from the attributes of other types (e.g.,

Hourly Rate is unique to hourly employees). If you were developing a conceptual

data model in this situation, you might consider three choices:

1. Define a single entity type called EMPLOYEE. Although conceptually simple,

this approach has the disadvantage that EMPLOYEE would have to contain all of

the attributes for the three types of employees. For an instance of an hourly

employee (for example), attributes such as Annual Salary and Contract Number

would not apply (optional attributes) and would be null or not used. When taken

to a development environment, programs that use this entity type would

necessarily need to be quite complex to deal with the many variations.

2. Define a separate entity type for each of the three entities. This approach

would fail to exploit the common properties of employees, and users would have

to be careful to select the correct entity type when using the system.

3. Define a supertype called EMPLOYEE with subtypes HOURLY EMPLOYEE, SALARIED EMPLOYEE, and CONSULTANT. This approach exploits the common properties of all employees, yet it recognizes the distinct properties of each type.

The below figure shows a representation of the EMPLOYEE supertype with its three subtypes, using enhanced E-R notation. Attributes shared by all employees are associated with the EMPLOYEE entity type. Attributes that are peculiar to each subtype are included with that subtype only.
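One common way to carry this choice into tables is sketched below (a hedged illustration with assumed names, not the only possible design): the supertype table holds the shared attributes, and each subtype table reuses the same primary key and adds only its distinct attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Attributes common to all employees live once, in the supertype.
    CREATE TABLE employee (
        employee_number INTEGER PRIMARY KEY,
        employee_name   TEXT,
        address         TEXT,
        date_hired      TEXT
    );
    -- Each subtype keeps only its distinct attributes and shares the same key.
    CREATE TABLE hourly_employee (
        employee_number INTEGER PRIMARY KEY REFERENCES employee(employee_number),
        hourly_rate     REAL
    );
    CREATE TABLE salaried_employee (
        employee_number INTEGER PRIMARY KEY REFERENCES employee(employee_number),
        annual_salary   REAL,
        stock_option    TEXT
    );
    CREATE TABLE consultant (
        employee_number INTEGER PRIMARY KEY REFERENCES employee(employee_number),
        contract_number TEXT,
        billing_rate    REAL
    );
""")
```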

Purpose of the Supertypes and Subtypes

- Supertypes and subtypes occur frequently in the real world:

 food order types (eat in, to go)

 grocery bag types (paper, plastic)

 payment types (check, cash, credit)

- You can typically associate ‘choices’ of something with supertypes and

subtypes.

- For example, what will be the method of payment – cash, check or credit card?

- Understanding real world examples helps us understand how and when to

model them.

Subdivide an Entity

 Sometimes it makes sense to subdivide an entity into subtypes.

 This may be the case when a group of instances has special properties,

such as attributes or relationships that exist only for that group.

 In this case, the entity is called a “supertype” and each group is called a

“subtype”.

Subtype Characteristics


A subtype:

 Inherits all attributes of the supertype

 Inherits all relationships of the supertype

 Usually has its own attributes or relationships

 Is drawn within the supertype

 Never exists alone

 May have subtypes of its own

Always More Than One Subtype

 When an ER model is complete, subtypes never stand alone. In other

words, if an entity has a subtype, a second subtype must also exist.

 A single subtype is exactly the same as the supertype.

 This idea leads to the two subtype rules:

 Exhaustive: Every instance of the supertype is also an instance of one of

the subtypes. All subtypes are listed without omission.

 Mutually Exclusive: Each instance of a supertype is an instance of only

one possible subtype.

At the conceptual modeling stage, it is good practice to include an OTHER

subtype to make sure that your subtypes are exhaustive — that you are handling

every instance of the supertype.

Subtypes Always Exist


Any entity can be subtyped by making up a rule that subdivides the

instances into groups.

- But being able to subtype is not the issue—having a reason to subtype is the

issue.

- When a need exists within the business to show similarities and differences

between instances, then subtype.

Correctly Identifying Subtypes

When modeling supertypes and subtypes, you can use three questions to

see if the subtype is correctly identified:

1. Is this subtype a kind of supertype?

2. Have I covered all possible cases? (exhaustive)

3. Does each instance fit into one and only one subtype? (mutually

exclusive)

SPECIALIZATION AND GENERALIZATION

Specialization and generalization as techniques for defining supertype /

subtype relationships.

Generalization

Generalization works on the principle of a bottom-up approach: lower-level entities are combined to form a higher-level entity. This process can be repeated further to create even more general entities.

In the generalization process, common properties are drawn from particular entities, and thus we can create a generalized entity. We can summarize the generalization process as follows: it combines subclasses to form a superclass.

An example of generalization is shown in below figure. In the upper figure, three

entity types have been defined: CAR, TRUCK, and MOTORCYCLE. At this

stage, the data modeler intends to represent these separately on an E-R

diagram. However, on closer examination, we see that the three entity types

have a number of attributes in common: Vehicle ID (identifier), Vehicle Name

(with components Make and Model), Price, and Engine Displacement. This fact

(reinforced by the presence of a common identifier) suggests that each of the three entity types is really a version of a more general entity type. This more general entity type (named VEHICLE), together with the resulting supertype/subtype relationships, is shown in Figure b. The entity CAR has the specific attribute No Of Passengers, whereas TRUCK has two specific attributes: Capacity and Cab Type. Thus, generalization has allowed us to group entity types along with their common attributes and at the same time preserve specific attributes that are peculiar to each subtype.


Notice that the entity type MOTORCYCLE is not included in the relationship. Is

this simply an omission? No. Instead, it is deliberately not included because it

does not satisfy the conditions for a subtype discussed earlier. Comparing the

two figures  you will notice that the only attributes of MOTORCYCLE are those

that are common to all vehicles; there are no attributes specific to motorcycles.

Furthermore, MOTORCYCLE does not have a relationship to another entity type.

Thus, there is no need to create a MOTORCYCLE subtype.

    The fact that there is no MOTORCYCLE subtype suggests that it must be

possible to have an instance of VEHICLE that is not a member of any of its

subtypes.

Specialization

We can say that specialization is the opposite of generalization. In specialization, an entity is broken down into smaller parts to simplify it further. In other words, in specialization a particular entity gets divided into sub-entities on the basis of its characteristics. Inheritance also takes place in specialization.

An example of specialization is shown in Figure 3-5. Figure 3-5a shows an

entity type named PART, together with several of its attributes. The identifier is

Part No, and other attributes are Description, Unit Price, Location, Qty On Hand,

Routing Number,and Supplier. (The last attribute is multivalued and composite

because there may be more than one supplier with an associated unit price for a


part.)

In discussions with users, we discover that there are two possible sources for parts: Some are manufactured internally, whereas others are purchased from outside suppliers. Further, we discover that some parts are obtained from both sources. In this case, the choice depends on factors such as manufacturing capacity, unit price of the parts, and so on.

    Some of the attributes in Figure 3-5a apply to all parts, regardless of source.

However,others depend on the source. Thus, Routing Number applies only to

manufactured parts, whereas Supplier ID and Unit Price apply only to purchased

parts. These factors suggest that PART should be specialized by defining the

subtypes MANUFACTURED PART and PURCHASED PART (Figure 3-5b).

    In Figure 3-5b, Routing Number is associated with MANUFACTURED PART.

The data modeler initially planned to associate Supplier ID and Unit Price with

PURCHASED PART. However, in further discussions with users, the data

modeler suggested instead that they create a SUPPLIER entity type and an

associative entity linking PURCHASED PART with SUPPLIER. This associative

entity (named SUPPLIES in Figure 3-5b) allows users to more easily associate

purchased parts with their suppliers. Notice that the attribute Unit Price is now


associated with the associative entity so that the unit price for a part may vary

from one supplier to another. In this example, specialization has permitted a

preferred representation of the problem domain.
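A hedged sketch of the specialization just described (table and column names are assumptions): MANUFACTURED PART and PURCHASED PART share the PART key, and the SUPPLIES associative entity carries Unit Price so that the price of the same purchased part can vary by supplier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE part (
        part_no     TEXT PRIMARY KEY,
        description TEXT,
        location    TEXT,
        qty_on_hand INTEGER
    );
    CREATE TABLE manufactured_part (
        part_no        TEXT PRIMARY KEY REFERENCES part(part_no),
        routing_number TEXT
    );
    CREATE TABLE purchased_part (
        part_no TEXT PRIMARY KEY REFERENCES part(part_no)
    );
    CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY, supplier_name TEXT);

    -- SUPPLIES links purchased parts to suppliers; Unit Price lives here.
    CREATE TABLE supplies (
        part_no     TEXT    REFERENCES purchased_part(part_no),
        supplier_id INTEGER REFERENCES supplier(supplier_id),
        unit_price  REAL,
        PRIMARY KEY (part_no, supplier_id)
    );
""")
```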

Figure 3-5a

 Figure 3-5b

COMPLETENESS AND DISJOINT CONSTRAINTS

Completeness constraints and disjointness constraints in modelling

supertype / subtype relationships.

Disjointness constraints - You will need to decide whether a supertype

instance may simultaneously be a member of two or more subtypes. It has two

rules. The disjoint rule forces subclasses to have disjoint sets of entities. The

overlap rule forces a subclass (also known as a supertype instance) to have

overlapping sets of entities.

DISJOINT RULE

            


An instance of a supertype may not simultaneously be a member of two (or more) subtypes.

OVERLAP RULE

            

An instance of a supertype may simultaneously be a member of two (or more) subtypes.

Completeness constraints - decide whether a supertype instance must also be

a member of at least one subtype. The total specialization rule demands that

every entity in the superclass belong to some subclass. Just as with a regular

ERD, total specialization is symbolized with a double line connection between

entities. The partial specialization rule allows an entity to not belong to any of the

subclasses. It is represented with a single line connection.

TOTAL SPECIALIZATION RULE

       

Each entity instance of a supertype must also be a member of some subtype.

PARTIAL SPECIALIZATION RULE


An entity instance of a supertype may or may not belong to any subtype.

SUPERTYPE AND SUBTYPE HIERARCHY

A supertype entity in one relationship may be a subtype entity in another relationship. When a structure comprises a combination of supertype/subtype relationships, that structure is called a supertype/subtype hierarchy, or generalization hierarchy.

Generalization can also be described in terms of inheritance, which specifies that

all the attributes of a supertype are propagated down the hierarchy to entities of a

lower type. Generalization may occur when a generic entity, which we call the

supertype entity, is partitioned by different values of a common attribute.

SUBTYPE DISCRIMINATOR

A subtype discriminator is an attribute of the supertype that indicates an entity's

subtype. The attribute's values are what determine the target subtype.

Disjoint subtypes - simple attributes that must have alternative values to indicate

any possible subtypes.


Overlapping subtypes - composite attributes whose subparts pertain to various

subtypes. Each subpart has a Boolean value that indicates whether or not the

instance belongs to the associated subtype.

SUBTYPE DISCRIMINATION: DISJOINT SUBTYPES

 Specialization and Disjoint

 Employee: Hourly, Salaried, Consultant

 Employee Type = the discriminator

 Code: “H” = Hourly

 Code: “S” = Salaried

 Code: “C” = Consultant

SUBTYPE DISCRIMINATION: OVERLAPPING SUBTYPES

- More than one subtype may apply

- The discriminator components are Manufactured? and Purchased?

- Where are these values to be stored?


The code will be:
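The figure that originally followed this line is not reproduced here. As a hedged sketch of the two discriminator styles just described (the column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Disjoint subtypes: one simple discriminator attribute with alternative values.
    CREATE TABLE employee (
        employee_number INTEGER PRIMARY KEY,
        employee_type   TEXT CHECK (employee_type IN ('H', 'S', 'C'))  -- Hourly / Salaried / Consultant
    );

    -- Overlapping subtypes: a composite discriminator, one yes/no flag per subtype.
    CREATE TABLE part (
        part_no         TEXT PRIMARY KEY,
        is_manufactured INTEGER NOT NULL DEFAULT 0,  -- 1 = also a MANUFACTURED PART
        is_purchased    INTEGER NOT NULL DEFAULT 0   -- 1 = also a PURCHASED PART
    );
""")
```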

ENTITY CLUSTER

 EER diagrams are difficult to read when there are too many entities and relationships.

 Solution: Group entities and relationships into entity clusters.

 Entity Cluster: A set of one or more entity types and associated relationships grouped into a single abstract entity type.


After clustering, the diagram collapses into a single abstract entity, the Manufacturing Cluster.


PACKAGED DATA MODEL

a. The age of the data modeler as engineer is dawning.

b. A key strategy for long-term success and a game changer for data modeling.

c. Involves acquiring a packaged or predefined data model.

d. The model is NOT fixed; it can be customized to fit the business rules.

e. Provides a best-practices data model for the chosen industry or functional area.

f. Not inexpensive, although simpler data model patterns can be found in publications.

g. Data model patterns are to data models what code patterns are to programs (just a good starting point for success).

ADVANTAGES OF A PACKAGED DATA MODEL

 Use proven model components

 Save time and cost

 Less likelihood of data model errors

 Easier to evolve and modify over time

 Aid in requirements determination

 Easier to read

 Supertype/subtype hierarchies promote reuse

 Many-to-many relationships enhance model flexibility


 Vendor-supplied data model fosters integration with vendor’s applications

 Universal models support inter-organizational systems

CHAPTER 4:

LOGICAL DATABASE DESIGN AND THE


RELATIONAL MODEL

Researched and presented by:

Baccol, Jonalyn G.
Mequin, Mary Joyce M.


1. List five properties of relations.

PROPERTIES OF RELATIONS 

We have defined relations as two-dimensional tables of data. However, not all

tables are relations. Relations have several properties that distinguish them from

non-relational tables. We summarize these properties next:

 1. Each relation (or table) in a database has a unique name.

 2. An entry at the intersection of each row and column is atomic (or single

valued).

There can be only one value associated with each attribute on a specific row of a

table; no multivalued attributes are allowed in a relation.

 3. Each row is unique; no two rows in a relation can be identical.

 4. Each attribute (or column) within a table has a unique name.

 5. The sequence of columns (left to right) is insignificant. The order of the

columns in a relation can be changed without changing the meaning or use of the

relation; the sequence of rows (top to bottom) is insignificant. As with columns,

the order of the rows of a relation may be changed or stored in any sequence.

REMOVING MULTIVALUED ATTRIBUTES FROM TABLES 

The second property of relations listed in the preceding segment states that no

multivalued attributes are allowed in a relation. Thus, a table that contains one or


more multivalued attributes is not a relation. For example, Figure 1(a) shows the

employee data from the EMPLOYEE1 relation extended to include courses that

may have been taken by those employees. Because a given employee may have

taken more than one course, Course Title and Date Completed are multivalued

attributes. For example, the employee with EmpID 100 has taken two courses. If

an employee has not taken any courses, the Course Title and Date Completed

attribute values are null. (See the employee with EmpID 190 for an example.)

We show how to eliminate the multivalued attributes in Figure 1(b) by filling the

relevant data values into the previously vacant cells of Figure 1(a). As a result,

the table in

Figure 1(b) has only single-valued attributes and now satisfies the atomic

property of relations. The name EMPLOYEE2 is given to this relation to

distinguish it from EMPLOYEE1. However, as you will see, this new relation does

have some undesirable properties.


Figure 1 Eliminating multivalued attributes 
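A small sketch of the same flattening in code (the employee names and course titles below are placeholders, not the values from the textbook figure): the repeating course information is pushed down into one row per course so that every cell holds a single value.

```python
# Data shaped like Figure 1(a): for EmpID 100 the Course Title and Date Completed
# attributes are multivalued, so this structure is not yet a relation.
employee1_extended = [
    {"EmpID": 100, "Name": "Employee A", "courses": [("Course X", "2023-06-19"),
                                                     ("Course Y", "2023-10-07")]},
    {"EmpID": 190, "Name": "Employee B", "courses": []},   # no courses taken
]

# Flatten into EMPLOYEE2-style rows: every cell is atomic (single valued);
# employees with no courses get nulls, as described for EmpID 190.
employee2 = []
for emp in employee1_extended:
    for title, completed in (emp["courses"] or [(None, None)]):
        employee2.append((emp["EmpID"], emp["Name"], title, completed))

for row in employee2:
    print(row)
```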

2. State two essential properties of a candidate key.

CANDIDATE KEYS 

A candidate key is an attribute, or combination of attributes, that uniquely

identifies a row in a relation. A candidate key must satisfy the following

properties (Dutka and Hanson, 1989), which are a subset of the six properties

of a relation previously listed:

 1. Unique identification: For every row, the value of the key must uniquely

identify that row. This property implies that each non key attribute is

functionally dependent on that key.

 2. Non redundancy: No attribute in the key can be deleted without

destroying the property of unique identification.

Figure 2 Representing functional dependencies


We represent the functional dependencies for a relation using the notation shown

in Figure 2. Figure 2(a) shows the representation for EMPLOYEE1. The

horizontal line in the figure portrays the functional dependencies. A vertical line

drops from the primary key (EmpID) and connects to this line. Vertical arrows

then point to each of the nonkey attributes that are functionally dependent on the

primary key.

For the relation EMPLOYEE2 (Figure 1(b)), notice that (unlike EMPLOYEE1)

EmpID does not uniquely identify a row in the relation. For example, there are

two rows in the table for EmpID number 100. There are two types of functional

dependencies in this relation:

 1. EmpID → Name, Dept Name, Salary

 2. EmpID, Course Title → Date Completed

The functional dependencies indicate that the combination of EmpID and Course

Title is the only candidate key (and therefore the primary key) for EMPLOYEE2.

In other words, the primary key of EMPLOYEE2 is a composite key. Neither

EmpID nor Course Title uniquely identifies a row in this relation and therefore

(according to property 1) cannot by itself be a candidate key. Examine the data in

Figure 1(b) to verify that the combination of EmpID and Course Title does

uniquely identify each row of EMPLOYEE2. We represent the functional

dependencies in this relation in Figure 2(b). Notice that Date Completed is the


only attribute that is functionally dependent on the full primary key consisting of

the attributes EmpID and Course Title.

We can summarize the relationship between determinants and candidate keys as

follows: A candidate key is always a determinant, whereas a determinant may or

may not be a candidate key. For example, in EMPLOYEE2, EmpID is a

determinant but not a candidate key. A candidate key is a determinant that

uniquely identifies the remaining (nonkey) attributes in a relation. A determinant

may be a candidate key (such as EmpID in EMPLOYEE1), part of a composite

candidate key (such as EmpID in EMPLOYEE2), or a nonkey attribute. We will

describe examples of this shortly.

As a preview to the following illustration of what normalization accomplishes,

normalized relations have as their primary key the determinant for each of the

nonkeys, and within that relation there are no other functional dependencies.

(Determinant: The attribute on the left side of the arrow in a functional

dependency)
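As a minimal SQL sketch of these keys (data types and lengths are assumptions, not part of the original relations), EMPLOYEE1 has EmpID alone as its primary key, while EMPLOYEE2 needs the composite key:

CREATE TABLE Employee1 (
    EmpID    INTEGER PRIMARY KEY,   -- EmpID alone is the candidate (and primary) key
    Name     VARCHAR(50),
    DeptName VARCHAR(30),
    Salary   DECIMAL(10,2)
);

CREATE TABLE Employee2 (
    EmpID         INTEGER,
    Name          VARCHAR(50),
    DeptName      VARCHAR(30),
    Salary        DECIMAL(10,2),
    CourseTitle   VARCHAR(60),
    DateCompleted DATE,
    PRIMARY KEY (EmpID, CourseTitle)  -- composite key: EmpID alone does not identify a row
);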

3. Give a concise definition of each of the following: First normal form,

second normal form and third normal form.

A normal form is a state of a relation that requires that certain rules regarding

relationships between attributes (or functional dependencies) are satisfied. We

describe these rules briefly in this section and illustrate them in detail in the

following sections:


 1. First normal form. Any multivalued attributes (also called repeating groups)

have been removed, so there is a single value (possibly null) at the intersection

of each row and column of the table (as in Figure 1(b)).

 2. Second normal form. Any partial functional dependencies have been

removed

(i.e., nonkey attributes are identified by the whole primary key).

 3. Third normal form. Any transitive dependencies have been removed (i.e.,

nonkey attributes are identified by only the primary key).

 4. Boyce-Codd normal form. Any remaining anomalies that result from

functional dependencies have been removed (because there was more than one

possible primary key for the same nonkeys).

 5. Fourth normal form. Any multivalued dependencies have been removed.

 6. Fifth normal form. Any remaining anomalies have been removed.

Up to the Boyce-Codd normal form, normalization is based on the analysis of functional dependencies.


A functional dependency is a constraint between two attributes or two sets of


attributes.
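To make the definitions concrete, here is a hedged sketch (table names and data types are assumptions) of how the EMPLOYEE2 relation from the previous section could be brought into second and third normal form by removing the partial dependency of Name, Dept Name, and Salary on EmpID alone:

-- 2NF/3NF decomposition of EMPLOYEE2: every nonkey attribute now depends on the whole key.
CREATE TABLE Employee (
    EmpID    INTEGER PRIMARY KEY,
    Name     VARCHAR(50),
    DeptName VARCHAR(30),
    Salary   DECIMAL(10,2)
);

CREATE TABLE CourseCompletion (
    EmpID         INTEGER REFERENCES Employee(EmpID),
    CourseTitle   VARCHAR(60),
    DateCompleted DATE,
    PRIMARY KEY (EmpID, CourseTitle)
);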
4. Briefly describe four problems that may arise when merging relations.

1. Synonyms

   Two or more attributes with different names but same meaning

   Is an alias or alternate name for a table, view, sequence, or other schema

object

 They are used mainly to make it easy for users to access database

objects owned by other users

 Provides an alternative name for another database object, referred to as

the base object, that can exist on a local or remote server

   Choose either of the two attribute names and eliminate the other synonym

or use a new attribute name to replace both synonyms

For example: 

ITEM (Item No, Color, Supplier Code)

SUPPLIER (Supplier ID, Supplier Name)

Here, Supplier Code and Supplier ID name the same attribute, so one of the two names should be chosen (or a new name used for both) when the relations are merged.

2. Homonyms

 Attributes with same name but different meanings

 A single attribute may have more than one meaning

 Homonyms are those fields of data that have different values but have

similar names.


 The name of the attribute will be the same but the attribute refers to

different things.

For example: Consider STUDENT and CUSTOMER tables in the same database. In STUDENT, F Name might stand for the first name of the student's father, while in CUSTOMER, F Name can be the first name of the customer. The attribute name is the same, but the meanings differ.

                                   

  STUDENT          CUSTOMER

3. Transitive dependencies

 Even if relations are in 3rd Normal Form prior to merging, they may not be

after merging

 An indirect relationship between values in the same table that causes a

functional dependency. 

 To achieve the normalization standard of Third Normal Form (3NF), you

must eliminate any transitive dependency.

 Remove transitive dependencies by creating 3 NF relations


For example:
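Consider a hypothetical relation EMPLOYEE(EmpID, Name, DeptName, DeptLocation), where EmpID → DeptName and DeptName → DeptLocation, so DeptLocation depends on the key only indirectly. A hedged SQL sketch (names and types are assumptions) of the two 3NF relations that remove the transitive dependency:

CREATE TABLE Department (
    DeptName     VARCHAR(30) PRIMARY KEY,
    DeptLocation VARCHAR(40)
);

CREATE TABLE Employee (
    EmpID    INTEGER PRIMARY KEY,
    Name     VARCHAR(50),
    DeptName VARCHAR(30) REFERENCES Department(DeptName)  -- no transitive dependency remains
);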

4. Supertype/subtype relationships

 May be hidden prior to merging

 Is a generic entity type that has a relationship with one or more subtypes

 Is meaningful to the organization and that shares common attributes or

relationships distinct from other subgroups.

 If there are two or more different types of a relation but they contain some

characteristics common to all

For example:

Patient 1 (Patient No., Name, Address)

Patient 2 (Patient No., Room No.)

These can be merged under a PATIENT supertype with two subtypes:

 INPATIENT (Date Admitted)

 OUTPATIENT (Date Treated)
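One hedged SQL sketch of this structure (assuming Room No. applies to inpatients; names and data types are illustrative) places the common attributes in the supertype table and subtype-specific attributes in their own tables that share the supertype's key:

CREATE TABLE Patient (
    PatientNo INTEGER PRIMARY KEY,
    Name      VARCHAR(50),
    Address   VARCHAR(80)
);

CREATE TABLE Inpatient (
    PatientNo    INTEGER PRIMARY KEY REFERENCES Patient(PatientNo),
    RoomNo       VARCHAR(10),
    DateAdmitted DATE
);

CREATE TABLE Outpatient (
    PatientNo   INTEGER PRIMARY KEY REFERENCES Patient(PatientNo),
    DateTreated DATE
);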

5. Transform an ER (EER) diagram into a logically equivalent set of

relations (tables)

           An Entity–relationship model (ER model) describes the structure of a

database with the help of a diagram, which is known as Entity Relationship

Diagram (ER Diagram). An ER model is a design or blueprint of a database that

can later be implemented as a database. 



           Entity Relationship (ER) Model, when conceptualized into diagrams, gives

a good overview of entity-relationship, which is easier to understand. ER

diagrams can be mapped to relational schema, that is, it is possible to create

relational schema using ER diagram. We cannot import all the ER constraints

into a relational model, but an approximate schema can be generated.

           An ER diagram shows the relationship among entity sets. An entity set is

a group of similar entities and these entities can have attributes. In terms of

DBMS, an entity is a table or attribute of a table in a database, so by showing

relationship among tables and their attributes, ER diagram shows the complete

logical structure of a database. 

Facts about ER Diagram Model 

 ER model allows you to draw Database Design 

 It is an easy-to-use graphical tool for modeling data 

 Widely used in Database Design 

 It is a GUI representation of the logical structure of a Database 

 It helps you to identify the entities which exist in a system and the

relationships between those entities 

Why use ER Diagrams? 

Here are the prime reasons for using an ER diagram:

 Helps you to define terms related to entity relationship modeling 


 Provide a preview of how all your tables should connect, what fields are

going to be on each table 

 Helps to describe entities, attributes, relationships 

 ER diagrams are translatable into relational tables which allows you to

build databases quickly 

 ER diagrams can be used by database designers as a blueprint for

implementing data in specific software applications

           For us to understand the transformation of ER diagrams, let us first define logical design and the relational model.

Logical design

 Logical design is an entity design without regard to a relational database

management system. 

 Logical design is the same, regardless of the DBMS

 Limitations or features of a particular DBMS should not be considered

 A logical design is a conceptual, abstract design. You do not deal with the

physical implementation details yet; you deal only with defining the types

of information that you need.

 The process of logical design involves arranging data into a series of

logical relationships called entities and attributes. 


Relational Database Model

 Data represented as a set of related tables or relations 

 Relations: 

 A named, two-dimensional table of data. Each relation consists of a set of

named columns and an arbitrary number of unnamed rows 

 Properties 

 Entries in cells are simple 

 Entries in columns are from the same set of values 

 Each row is unique 

 The sequence of columns can be interchanged without changing the

meaning or use of the relation 

 The rows may be interchanged or stored in any sequence

 Well-Structured Relation

 A relation that contains a minimum amount of redundancy and allows

users to insert, modify and delete the rows without errors or

inconsistencies

A simple ER Diagram

          In the following diagram we have two entities, Student and College, and their relationship. The relationship between Student and College is many to one, as a college can have many students; however, a student cannot study in multiple colleges at the same time. Student entities have attributes such as Stu_Id, Stu_Name & Stu_Addr, and College entities have attributes such as Col_ID & Col_Name.

          Here are the geometric shapes and their meaning in an E-R Diagram. We

will discuss these terms in detail in the next section (Components of an ER

Diagram) of this guide so don’t worry too much about these terms now, just go

through them once.

 Rectangle: Represents Entity sets. 

 Ellipses: Attributes 

 Diamonds: Relationship Set 

 Lines: They link attributes to Entity Sets and Entity sets to Relationship

Set 

 Double Ellipses: Multivalued Attributes 

 Dashed Ellipses: Derived Attributes 

 Double Rectangles: Weak Entity Sets 

 Double Lines: Total participation of an entity in a relationship set


As  shown in the above diagram, an ER diagram has three main


components: 

A. Entity

B. Attribute 

C. Relationship

Conversion of ER Diagram to Relational model

A.  Entity 

 An entity is an object or component of data. 

 An entity is represented as a rectangle in an ER diagram. 

 Is an object that can exist ( a single thing, person, object, place)

 Set is a group of similar entities and these entities can have attributes

For example:  In the following ER diagram we have two entities Student and

College, and these two entities have a many-to-one relationship, as many students

study in a single college. We will read more about relationships later, for now

focus on entities.


Mapping strong entity (2 cases)

           For each strong entity set, create a new independent relational table that includes all of its attributes as columns. For composite attributes, include only the component attributes. There are two cases:

1. Case: For Strong Entity Set with Only Simple Attributes

 A strong entity set with only simple attributes will require only one table in

the relational model. 

 Attributes of the table will be the attributes of the entity set. The primary

key of the table will be the key attribute of the entity set.

2. Case: For Strong Entity Set With Composite Attributes


 A strong entity set with any number of composite attributes will require

only one table in relational model.

 During conversion, the simple (component) attributes of each composite attribute are taken into account, not the composite attribute itself.
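A brief SQL sketch of both cases, using a hypothetical STUDENT entity (names and data types are assumptions): the key and simple attributes become columns, and a composite Address attribute contributes only its components.

CREATE TABLE Student (
    StudentID INTEGER PRIMARY KEY,  -- key attribute of the entity set
    Name      VARCHAR(50),          -- simple attribute
    Street    VARCHAR(60),          -- components of the composite attribute Address
    City      VARCHAR(40),
    ZipCode   CHAR(10)
);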


Mapping weak entity

 Convert every weak entity set into a table: take the discriminator attribute of the weak entity set, take the primary key of the owning strong entity set as a foreign key, and then declare the combination of the discriminator attribute and the foreign key as the primary key.

 Weak entity set always appears in association with identifying

relationships with total participation constraint.

 Weak entities are represented with double rectangular box in the ER

Diagram and the identifying relationships are represented with double

diamond. Partial Key attributes are represented with dotted lines. 

 Weak entities cannot be identified by the values of their own attributes alone

 There is no primary key made from its own attributes 

 An entity can be identified by a combination of their attributes

(“discriminator”) and the relationship they have with another entity set

(“identifying relationship”)


 A weak entity is a type of entity which doesn't have its key attribute. It can

be identified uniquely by considering the primary key of another entity. For

that, weak entity sets need to have total participation in an identifying relationship.
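As a hedged illustration (the EMPLOYEE and DEPENDENT names are hypothetical), the weak entity's table takes the owner's primary key as a foreign key and combines it with the discriminator to form its own primary key:

CREATE TABLE Employee (
    EmpID INTEGER PRIMARY KEY,
    Name  VARCHAR(50)
);

CREATE TABLE Dependent (
    EmpID         INTEGER REFERENCES Employee(EmpID),  -- owner's key taken as a foreign key
    DependentName VARCHAR(50),                         -- discriminator (partial key)
    BirthDate     DATE,
    PRIMARY KEY (EmpID, DependentName)
);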

B. Attribute

 An attribute describes the property of an entity. 

 The information about the entity that needs to be stored

 An attribute is represented as Oval in an ER diagram. 

There are four types of attributes:

1. Key attribute

2. Composite attribute

3. Multivalued attribute

4. Derived attribute

1. Key attributes

 A key attribute can uniquely identify an entity from an entity set.

 Used to establish relationships between the different tables and columns

of a relational database. 

 a set of attributes that help to uniquely identify a tuple (or row) in a

relation (or table). 

For example: Student roll numbers can uniquely identify a student from a set of students. A key attribute is represented by an oval, the same as other attributes; however, the text of a key attribute is underlined.

2. Composite attribute 

 An attribute that is a combination of other attributes is known as a

composite attribute. 

 is an attribute where the values of that attribute can be further subdivided

into meaningful sub-parts

 There are values that are to be stored in an attribute that can be further

divided into meaningful values (sub-values).


For example: In student entities, the student address is a composite attribute, as an address is composed of other attributes such as pin code, state, and country.

3. Multivalued attribute

 An attribute that can hold multiple values is known as a multivalued

attribute. It is represented with double ovals in an ER Diagram. 

 For every multi-valued attribute, we will make a new table where we will

take the primary key of the main table as a foreign key and multi-valued

attribute as a primary key.

 A strong entity set with any number of multi valued attributes will require

two tables in relational model.

 One table will contain all the simple attributes with the primary key.

 Other table will contain the primary key and all the multi valued attributes.

For example: A person can have more than one phone number, so the phone number attribute is multivalued.
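A minimal sketch of the two resulting tables (the PERSON names and data types are assumptions):

CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    Name     VARCHAR(50)
);

CREATE TABLE PersonPhone (
    PersonID    INTEGER REFERENCES Person(PersonID),  -- primary key of the main table as a foreign key
    PhoneNumber VARCHAR(20),                          -- the multivalued attribute
    PRIMARY KEY (PersonID, PhoneNumber)
);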

4. Derived attribute

 A derived attribute is one whose value is dynamic and derived from

another attribute. It is represented by a dashed oval in an ER Diagram. 


  an attribute whose values are calculated from other attributes.

 are the attributes that do not exist in the physical database,

For example: A person's age is a derived attribute, as it changes over time and can be derived from another attribute (date of birth).

C. Relationship

Cardinality: Defines the numerical attributes of the relationship between two

entities or entity sets. 

           A relationship is represented by a diamond shape in ER diagram; it shows

the relationship among entities. There are four types of cardinal relationships: 

1. One to One

2. One to Many

3. Many to One

4. Many to Many

1. One-to-one relationships

 In a one-to-one relationship, one record in a table is associated with one

and only one record in another table. 

 When a single instance of an entity is associated with a single instance of

another entity then it is called one to one relationship.


For example 1

 In a school database, each student has only one student ID, and each

student ID is assigned to only one person.

 In this example, the key field in each table, Student ID, is designed to

contain unique values. In the Students table, the Student ID field is the

primary key; in the Contact Info table, the Student ID field is a foreign key.

 This relationship returns related records when the value in the Student ID

field in the Contact Info table is the same as the Student ID field in the

Students table.

Example 2: An employee can work in at most one department, and a department

can have at most one employee.


For example 3: a person has only one passport and a passport is given to one

person.

2. One-to-many relationship

 When a single instance of an entity is associated with more than one instance of another entity, it is called a one-to-many relationship.

 In a one-to-many relationship, one record in a table can be associated

with one or more records in another table. 

For example: each customer can have many sales orders. A customer can

place many orders but an order cannot be placed by many customers.

3. Many-to-one relationship

 When more than one instance of an entity is associated with a single instance of another entity, it is called a many-to-one relationship.

For example: many students can study in a single college but a student cannot

study in many colleges at the same time.

4. Many-to-many relationship

 When more than one instance of an entity is associated with more than one instance of another entity, it is called a many-to-many relationship.

 A many-to-many relationship occurs when multiple records in a table are

associated with multiple records in another table. 

 A many-to-many relationship exists between customers and products:

customers can purchase various products, and products can be

purchased by many

customers.

For example 1: a student can be assigned to many projects and a project can be assigned to many students.
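A hedged SQL sketch of this M:N relationship (table and column names are assumptions): the relationship itself becomes an associative (junction) table whose primary key combines the keys of the two entities.

CREATE TABLE Student (
    StudentID INTEGER PRIMARY KEY,
    Name      VARCHAR(50)
);

CREATE TABLE Project (
    ProjectID INTEGER PRIMARY KEY,
    Title     VARCHAR(60)
);

CREATE TABLE StudentProject (
    StudentID INTEGER REFERENCES Student(StudentID),
    ProjectID INTEGER REFERENCES Project(ProjectID),
    PRIMARY KEY (StudentID, ProjectID)   -- each student/project pairing recorded once
);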

6. Create relational tables that incorporate entity integrity and referential

integrity constraints.


Constraints

   Are the rules enforced on the data columns of a table

   Here is used to limit the type of data that can go into a table. 

   May apply to each attribute or they may apply to relationships between

tables

 This ensures the accuracy and reliability of the data in the database.

 Constraints could be either on a column level or a table level. The column

level constraints are applied only to one column, whereas the table level

constraints are applied to the whole table.

Integrity constraints

 A set of rules that the database is not permitted to violate.

 Ensure that changes (update, deletion, insertion) made to the database by

authorized users do not result in a loss of data consistency.

 Integrity constraints guard against accidental damage to the database.

 An important functionality of DBMS

Example: A blood type group must be A, B, AB or O only cannot have any other

values.

Types of integrity constraints:

1. Entity integrity 

 Focuses on Primary keys.


 Each table should have a primary key and each record must be unique

and not null.

 This makes sure that records in a table are not duplicated and remain

intact during insert, update and retrieval.

 Describes a condition in which all tuples within a table are uniquely

identified by their primary key. The unique value requirement prohibits a null

primary key value, because nulls are not unique.

 To ensure entity integrity, it is required that every table has a primary key.

Neither the PK nor any part of it can contain null values. This is because

null values for the primary key mean we cannot identify some rows. 


2. Referential integrity 

 Focuses on foreign keys.

 Specified between two tables

 Null  


 Is the total absence of a value in a certain field and means that the

field value is unknown

 Null is not the same as a zero value for a numerical field or space

value

 Implies that a database field value has not been stored

 Foreign keys are designed to keep relationships between records of a

table to records of another table.

 Referential integrity requires that a foreign key must have a matching

primary key or it must be null. This constraint is specified between two tables

(parent and child); it maintains the correspondence between rows in these

tables.  It means the reference from a row in one table to another table must

be valid.

 Referential integrity can be enforced by working with primary and foreign

keys. Each foreign key must have a matching primary key so that reference

from one table to another must always be valid.

Example 1 

Rule 1: You can’t delete from a primary table if matching records exist in a

related table.

Rule 2: You can’t change a primary key value in the primary table if that record has related records.


Example 2

Rule 3: You can’t insert a value in the foreign key field of the related table that doesn’t exist in the primary key of the primary table.
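A hedged sketch of tables that enforce these rules (CUSTOMER and ORDERS are hypothetical names; the exact ON DELETE/ON UPDATE syntax varies slightly by DBMS):

CREATE TABLE Customer (
    CustomerID INTEGER NOT NULL PRIMARY KEY,   -- entity integrity: non-null, unique key
    Name       VARCHAR(50)
);

CREATE TABLE Orders (
    OrderID    INTEGER NOT NULL PRIMARY KEY,
    OrderDate  DATE,
    CustomerID INTEGER,                        -- must match a Customer row or be null
    FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID)
        ON DELETE RESTRICT                     -- Rule 1: a referenced customer cannot be deleted
        ON UPDATE RESTRICT                     -- Rule 2: a referenced key value cannot be changed
);

An INSERT into Orders with a CustomerID that does not exist in Customer would be rejected by the foreign key constraint itself, which corresponds to Rule 3.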


Key Terms


Alternate key: all candidate keys not chosen as the primary key

Candidate key: a simple or composite key that is unique (no two rows in a table may have the same value) and minimal (every column is necessary)

Characteristic entities: entities that provide more information about another

table

Composite attributes: attributes that consist of a hierarchy of attributes

Composite key: composed of two or more attributes, but it must be minimal

Dependent entities: these entities depend on other tables for their meaning

Derived attributes: attributes that contain values calculated from other attributes

Derived entities: see dependent entities

EID: employee identification (ID)

Entity: a thing or object in the real world with an independent existence that can

be differentiated from other objects

Entity relationship (ER) data model: also called an ER schema, are

represented by ER diagrams. These are well suited to data modeling for use with

databases.

Entity relationship schema: see entity relationship data model

Entity set: a collection of entities of an entity type at a point of time

Entity type: a collection of similar entities

Foreign key (FK): an attribute in a table that references the primary key in

another table OR it can be null

Independent entity: as the building blocks of a database, these entities are what

other tables are based on


Kernel: see independent entity

Key: an attribute or group of attributes whose values can be used to uniquely

identify an individual entity in an entity set

Multivalued attributes: attributes that have a set of values for each entity

N-ary: multiple tables in a relationship

Null: a special symbol, independent of data type, which means either unknown

or inapplicable; it does not mean zero or blank

Recursive relationship: see unary relationship

Relationships: the associations or interactions between entities; used to

connect related information between tables

Relationship strength:  based on how the primary key of a related entity is

defined

Secondary key: an attribute used strictly for retrieval purposes

Simple attributes: drawn from the atomic value domains

SIN: social insurance number

Single-valued attributes: see simple attributes

Stored attribute: saved physically to the database

Ternary relationship: a relationship type that involves many to many

relationships between three tables.


CHAPTER 5:

PHYSICAL DATABASE DESIGN AND


PERFORMANCE


Researched and presented by:

Ramada, Julie Mae 


Beñeras, Jhasper M.

PHYSICAL DATABASE DESIGN

The physical design of your database optimizes performance while ensuring data

integrity by avoiding unnecessary data redundancies. The task of building the

physical design is a job that truly never ends. You need to continually monitor the

performance and data integrity as time passes. Many factors necessitate periodic

refinements to the physical design.


Physical database design does not include implementing files and databases

(i.e., creating them and loading data into them). Physical database design

produces the technical specifications that programmers, database administrators,

and others involved in information systems construction will use during the

implementation phase.

Purpose––translate the logical description of data into the technical specifications for storing and retrieving data.

Goal––create a design for storing data that will provide adequate performance and ensure database integrity, security and recoverability.

Because physical design is related to how data are physically stored, we need to

consider a few underlying concepts about physical storage. One goal of physical

design is optimal performance and storage space utilization. Physical design

includes data structures and 

file organization, keeping in mind that the database software will communicate

with your computer’s operating system.

PHYSICAL DESIGN PROCESS

Designing physical files and databases requires certain information that

should have been collected and produced during prior systems development

phases. The information needed for physical file and database design includes

these requirements:


• Normalized relations, including estimates for the range of the

number of rows in each table

• Definitions of each attribute, along with physical specifications such

as maximum possible length

• Descriptions of where and when data are used in various ways

(entered, retrieved, deleted, and updated, including typical frequencies of

these events)

• Expectations or requirements for response time and data security,

backup, recovery, retention, and integrity

• Descriptions of the technologies (database management systems)

used for implementing the database.

Physical database design requires several critical decisions that will affect

the integrity and performance of the application system. These key decisions

include the following:

• Choosing the storage format (called data type) for each attribute

from the logical data model. The format and associated parameters are chosen

to maximize data 

integrity and to minimize storage space.

• Giving the database management system guidance regarding how

to group attributes from the logical data model into physical records. You will

discover that 

although the columns of a relational table as specified in the logical design are a 

natural definition for the contents of a physical record, this does not always form 

the foundation for the most desirable grouping of attributes in the physical

design.

• Giving the database management system guidance regarding how

to arrange similarly structured records in secondary memory (primarily hard

disks), using 

a structure (called a file organization) so that individual and groups of records

can 

be stored, retrieved, and updated rapidly. Consideration must also be given to 

protecting data and recovering data if errors are found.

• Selecting structures (including indexes and the overall database

architecture) for storing and connecting files to make retrieving related data more

efficient.

• Preparing strategies for handling queries against the database that

will optimize performance and take advantage of the file organizations and

indexes that you 

have specified. Efficient database structures will be beneficial only if queries 

and the database management systems that handle those queries are tuned to 

intelligently use those structures.


DATA VOLUME AND USAGE ANALYSIS

Data volume and frequency-of-use statistics are important inputs to the physical

database design process, particularly in the case of very largescale database

implementations. Thus, it is beneficial to maintain a good understanding of the

size and usage patterns of the database throughout its life cycle. 

Estimates of database size are used to select physical storage devices and

storage costs estimation and estimates of usage paths or pattern are used to

select file organization and access methods. Plans for the use of indexes, and

plan a strategy for database distribution.

Why do we need to estimate?

Data volume and usage estimation is crucial for the proper administration of

databases. As you all know, we need a storage space to store and maintain our

database. In order to make the proper storage size decision for our database we

need to estimate the data volume and usage.

What happens if we don't estimate?

The consequences of NOT estimating data volume and usage frequency is

severe. Think about an e-tailer (web-based retailer). Let's assume that the e-

tailer's management  chose a database storage space using the cost as the sole

criterion. Since the e-tailer wants to save bucks from the initial set-up costs, they

chose the smallest storage space available by the vendor. After a serious

advertising campaign using web and other media , they started their online


operations. Everything was going fine, until one day they found out that their web

site crashed due to data overload and a high level of usage frequency. Now the

company ended up having:

 upset customers, who are waiting for their orders (most probably the

customer would switch to another provider)

 a bill from the vendor in order to fix the issue (the bill of course includes

the additional storage space. Because, right now the company deems it

necessary to have the proper amount of database storage space)

 lost business because the web site is down

An easy way to show the statistics about data volumes and usage is by adding

notation to the EER diagram that represents the final set of normalized relations

from logical database design. 


Figure 5-1 shows the EER diagram (without attributes) for  a simple inventory

database for Pine Valley Furniture Company. This EER diagram represents the

normalized relations constructed during logical database design for the original

conceptual data model of this situation depicted in Figure 3-5b.

Both data volume and access frequencies are shown in Figure 5-1. For

example, 

there are 3,000 PARTs in this database. The supertype PART has two subtypes, 

MANUFACTURED (40 percent of all PARTs are manufactured) and

PURCHASED (70  percent are purchased; because some PARTs are of both

subtypes, the percentages sum to more than 100 percent). The analysts at Pine

Valley estimate that there are typically 150 SUPPLIERs, and Pine Valley


receives, on average, 40 SUPPLIES instances from each SUPPLIER, yielding a

total of 6,000 SUPPLIES. The dashed arrows represent access frequencies. So,

for example, across all applications that use this database, there are on average

20,000 accesses per hour of PART data, and these yield, based on subtype

percentages, 14,000 accesses per hour to PURCHASED PART data.

There are an additional 6,000 direct accesses to PURCHASED PART data. Of

this total of 20,000 accesses to PURCHASED PART, 8,000 accesses then also

require SUPPLIES data and of these 8,000 accesses to SUPPLIES, there are

7,000 subsequent accesses to SUPPLIER data. For online and Web-based

applications, usage maps should show the accesses per second. Several usage

maps may be needed to show vastly different usage patterns for different times

of day. Performance will also be affected by network specifications. The volume

and frequency statistics are generated during the systems analysis phase of the

systems development process when systems analysts are studying current and

proposed data processing and business activities. The data volume statistics

represent the size of the business and should be calculated assuming business

growth over a period of at least several years. The access frequencies are

estimated from the 

timing of events, transaction volumes, the number of concurrent users, and

reporting and querying activities. Because many databases support ad hoc

accesses, and such accesses may change significantly over time, and known

database access can peak and dip over a day, week, or month, the access


frequencies tend to be less certain and less even than the volume statistics.

Fortunately, precise numbers are not necessary. What is crucial is the relative

size of the numbers, which will suggest where the greatest attention needs to be

given during physical database design in order to achieve the best possible

performance. For example, in Figure 5-1, notice the following: 

• There are 3,000 PART instances, so if PART has many attributes and

some, like description, are quite long, then the efficient storage of PART might be

important.

  • For each of the 4,000 times per hour that SUPPLIES is accessed via

SUPPLIER, PURCHASED PART is also accessed; thus, the diagram would

suggest possibly combining these two co-accessed entities into a database table

(or file). This act of combining normalized tables is an example of

denormalization, which we discuss later in this chapter.

 • There is only a 10 percent overlap between MANUFACTURED and

PURCHASED parts, so it might make sense to have two separate tables for

these entities and redundantly store data for those parts that are both

manufactured and purchased; such planned redundancy is acceptable if

purposeful. Further, there are a total of 20,000 accesses an hour of

PURCHASED PART data (14,000 from access to 

PART and 6,000 independent access of PURCHASED PART) and only 8,000

accesses of MANUFACTURED PART per hour. Thus, it might make sense to

organize tables for MANUFACTURED and PURCHASED PART data differently


due to the significantly different access volumes. Such volume and frequency estimates can also be helpful input to subsequent physical database design steps.

DESIGNING FIELDS

A field is the smallest unit of application data recognized by system software,

such as 

a programming language or database management system. A field corresponds

to a 

simple attribute in the logical data model, and so in the case of a composite

attribute, a 

field represents a single component.

Basic Decisions in specifying a Field:

  Specification of the type of data used to represent values of the field

 Data integrity controls built into the database

 Describe the mechanisms that the DBMS should use to handle missing

values for the field. 

  Specify the Display Format

CHOOSING DATA TYPES


As a typical company’s amount of data has grown exponentially it’s become even

more critical to optimize data storage. The size of your data doesn’t just impact

storage size 

and costs, it also affects query performance. A key factor in determining the size

of your data is the data type you select.

Selecting a data type involves four objectives that will have different relative levels of importance for different applications:

1. Represent all possible values.

2. Improve data integrity.

3. Support all data manipulations.

4. Minimize storage space.

 If the data is numeric, favor SMALLINT, INTEGER, BIGINT, or DECIMAL

data types. DECFLOAT and FLOAT are also options for very large

numbers.

 If the data is character, use CHAR or VARCHAR data types.

 If the data is date and time, use DATE, TIME, and TIMESTAMP data

types.

 If the data is multimedia, use GRAPHIC, VARGRAPHIC, BLOB, CLOB, or

DBCLOB data types.
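A short sketch applying these guidelines to a hypothetical order table (column names, lengths, and precision are assumptions; the available types vary by DBMS):

CREATE TABLE CustomerOrder (
    OrderID     INTEGER       NOT NULL PRIMARY KEY,  -- whole numbers: INTEGER
    OrderDate   DATE,                                -- calendar date: DATE
    PlacedAt    TIMESTAMP,                           -- date and time: TIMESTAMP
    TotalAmount DECIMAL(10,2),                       -- exact money values: DECIMAL
    Comments    VARCHAR(255)                         -- variable-length text: VARCHAR
);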


CODING TECHNIQUES

Some attributes have a sparse set of values or are so large that, given data
volumes, considerable storage space will be consumed. A field with a limited
number of possible values can be translated into a code that requires less space.
Consider the example of the ProductFinish field illustrated in Figure 5-2. Products
at Pine Valley Furniture come in only a limited number of woods: Birch, Maple,
and Oak. By creating a code or translation table, each ProductFinish field value
can be replaced by a code, a cross-reference to the lookup table, similar to a
foreign key. This will decrease the amount of space for the ProductFinish field
and hence for the PRODUCT file. There will be additional space for the
PRODUCT FINISH lookup table, and when the ProductFinish field value is
needed, 


an extra access (called a join) to this lookup table will be required. If the

ProductFinish field is infrequently used or if the number of distinct ProductFinish

values is very large, the relative advantages of coding may outweigh the costs.

Note that the code table would not appear in the conceptual or logical model. The

code table is a physical construct to achieve data processing performance

improvements, not a set of data with business value.
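A hedged sketch of the coding idea (table and column names follow the ProductFinish discussion but are otherwise assumptions):

CREATE TABLE ProductFinish (
    FinishCode CHAR(1) PRIMARY KEY,     -- the compact code stored in PRODUCT
    FinishName VARCHAR(20)              -- e.g., Birch, Maple, Oak
);

CREATE TABLE Product (
    ProductID   INTEGER PRIMARY KEY,
    Description VARCHAR(60),
    FinishCode  CHAR(1) REFERENCES ProductFinish(FinishCode)
);

-- Retrieving the finish name requires the extra access (join) mentioned above:
SELECT p.ProductID, f.FinishName
FROM Product p
JOIN ProductFinish f ON p.FinishCode = f.FinishCode;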

CONTROLLING DATA INTEGRITY

Default Value - A default value is the value a field will assume unless a user

enters an explicit value for an instance of that field. Assigning a default value to a

field can reduce data entry time because entry of a value can be skipped. It can

also help to reduce data entry errors for the most common value.

Range control - A range control limits the set of permissible values a field may

assume. The range may be a numeric lower-to-upper bound or a set of specific

values. Range controls must be used with caution because the limits of the range

may change over time. A combination of range controls and coding led to the

year 2000 problem that many organizations faced, in which a field for year was

represented by only the numbers 00 to 99. It is better to implement any range

controls through a DBMS because range controls in applications may be

inconsistently enforced. It is also more difficult to find and change them in

applications than in a DBMS. 


Null value control - A null value was defined in Chapter 4 as an empty value.

Each primary key must have an integrity control that prohibits a null value. Any

other required field may 

also have a null value control placed on it if that is the policy of the organization. 

Referential integrity - The term referential integrity was defined in Chapter 4.

Referential integrity on a field is a form of range control in which the value of that

field must exist as the value in some field in another row of the same or (most

commonly) a different table. That is, the range of legitimate values comes from

the dynamic contents of a field in a database table, not from some pre-specified

set of values. Note that referential integrity only guarantees that some existing

cross-referencing value is used, not that it is the correct one. A coded field will

have referential integrity with the primary key of the associated lookup table.
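A hedged sketch combining the four controls on a hypothetical PRODUCT table, independent of the earlier example (names, sizes, and limits are assumptions):

CREATE TABLE ProductLine (
    ProductLineID INTEGER PRIMARY KEY,
    LineName      VARCHAR(30)
);

CREATE TABLE Product (
    ProductID     INTEGER NOT NULL PRIMARY KEY,       -- null value control on the key
    ProductName   VARCHAR(60) NOT NULL,               -- null value control on a required field
    StandardPrice DECIMAL(10,2) DEFAULT 0.00,         -- default value
    OnHand        INTEGER CHECK (OnHand >= 0),        -- range control
    ProductLineID INTEGER REFERENCES ProductLine(ProductLineID)  -- referential integrity
);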

HANDLING MISSING DATA

• Substitute an estimate of the missing value. For example, for a missing sales

value when computing monthly product sales, use a formula involving the mean

of the existing monthly sales values for that product indexed by total sales for

that month across all products. Such estimates must be marked so that users

know that these are not actual values (see the SQL sketch after this list).

• Track missing data so that special reports and other system elements cause

people to resolve unknown values quickly. This can be done by setting up a

trigger in the database definition. A trigger is a routine that will automatically


execute when some event occurs or time period passes. One trigger could log

the missing entry to a file when a null or other missing value is stored, and

another trigger could run periodically to create a report of the contents of this log

file. 

• Perform sensitivity testing so that missing data are ignored unless knowing a

value might significantly change results (e.g., if total monthly sales for a particular

salesperson are almost over a threshold that would make a difference in that

person’s compensation). This is the most complex of the methods mentioned and

hence requires the most sophisticated programming. Such routines for handling

missing data may be written in application programs. All relevant modern DBMSs

now have more sophisticated programming capabilities, such as case

expressions, user-defined functions, and triggers, so that such logic can be

available in the database for all users without application-specific programming.
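For the first strategy, a hedged example of substituting an estimate at query time (the ProductSales table and its columns are hypothetical): a missing monthly value is replaced by that product's average, and the row is marked so users know the value is estimated.

SELECT p.ProductID,
       p.SalesMonth,
       COALESCE(p.SalesAmount,
                (SELECT AVG(p2.SalesAmount)
                 FROM ProductSales p2
                 WHERE p2.ProductID = p.ProductID)) AS SalesAmountUsed,
       CASE WHEN p.SalesAmount IS NULL THEN 'estimated' ELSE 'actual' END AS ValueStatus
FROM ProductSales p;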

FILE ORGANIZATION

File organization refers to the way data is stored in a file. File organization is very

important because it determines the methods of access, efficiency, flexibility and storage

devices to use. 

Some factors to consider the file organization: 

a) Fast data retrieval 

b) High throughput for processing data input and maintenance transactions 

c) Efficient use of storage space 


d) Protection from failures or data loss 

e) Minimizing need for reorganization 

f) Accommodating growth 

g) Security from unauthorized use 

Types of file organization

1. Sequential 2) Indexed 3) Hashed

Sequential file organizations - In a sequential file organization, the records in the

file are stored in sequence according to a primary key value (see Figure 5-7a).

To locate a particular record, a program must normally scan the file from the

beginning until the desired record is located. A common example of a sequential

file is the alphabetical list of persons in the white pages of a telephone directory

(ignoring any index that may be included with the directory). 

Indexed file organizations - contains records ordered by a record key. A record

key uniquely identifies a record and determines the sequence in which it is

accessed with respect to other records.

Each record contains a field that contains the record key. A record key for a

record might be, for example, an employee number or an invoice number.

An indexed file can also use alternate indexes, that is, record keys that let you

access the file using a different logical arrangement of the records. For example,


you could access a file through employee department rather than through

employee number.

The possible record transmission (access) modes for indexed files are

sequential, random, or dynamic. When indexed files are read or written

sequentially, the sequence is that of the key values.


Hashed file organization - Hash file organization uses the computation of a hash function on some field of the records. The hash function's output determines the location of the disk block where the record is to be placed.

For example, suppose that an organization has a set of approximately 1,000

employee records to be stored on magnetic disk. A suitable prime number would

be 997, because it is close to 1,000. Now consider the record for employee

12,396. When we divide this number by 997, the remainder is 432. Thus, this

record is stored at location 432 in the file.

CLUSTER FILE ORGANIZATION

In this method, two or more tables that are frequently joined to get results are stored in the same file, called a cluster. These files will have two or more tables in the same data block, and the key columns which map these tables are stored only once. This method hence reduces the cost of searching for various records in different files. All the records are found in one place, hence making searches efficient.

DENORMALIZING AND PARTITIONING DATA

      Modern database management systems have an increasingly important role

in determining how the data are actually stored on the storage media. The

efficiency of database processing is, however, significantly affected by how the

logical relations are structured as database tables. The purpose of this section is

to discuss denormalization as a mechanism that is often used to improve efficient


processing of data and quick access to stored data. It first describes the best-

known denormalization approach: combining several logical tables into one

physical table to avoid the need to bring related data back together when they

are retrieved from the database. Then the section will discuss another form of

denormalization called partitioning, which also leads to differences between the

logical data model and the physical tables, but in this case one relation is

implemented as multiple tables.

    Denormalization is the process of transforming normalized relations into

nonnormalized physical record specifications. We will review various forms of,

reasons for, and cautions about denormalization in this section. In general,

denormalization may partition a relation into several physical records, may

combine attributes from several relations together into one physical record, or

may do a combination of both.


Denormalization is the process of adding precomputed redundant data to an

otherwise normalized relational database to improve read performance of the

database. Normalizing a database involves removing redundancy so only a

single copy exists of each piece of information. Denormalizing a database

requires that the data has first been

normalized. With denormalization, the database administrator selectively adds

back specific instances of redundant data after the data structure has been

normalized. A denormalized database should not be confused with a database

that has never been normalized. Using normalization in SQL, a database will

store different but related types of data in separate logical tables, called relations.


When a query combines data from multiple tables into a single result table, it is

called a join. The performance of such a join in the face of complex queries is

often the occasion for the administrator to explore the denormalization

alternative.

Another approach is to denormalize the logical data design. With care this can

achieve a similar improvement in query response, but at a cost—it is now the

database designer's responsibility to ensure that the denormalized database

does not become inconsistent. This is done by creating rules in the database

called constraints, that specify how the redundant copies of information must be

kept synchronized, which may easily make the de-normalization procedure

pointless. It is the increase in logical complexity of the database design and the

added complexity of the additional constraints that make this approach

hazardous. Moreover, constraints introduce a trade-off, speeding up reads

(SELECT in SQL) while slowing down writes 

(INSERT, UPDATE, and DELETE). This means a denormalized database under

heavy write load may offer worse performance than its functionally equivalent

normalized counterpart. In a traditional normalized database, we store data in

separate logical tables and attempt to minimize redundant data. We may strive to

have only one 

copy of each piece of data in database. For example, in a normalized database,

we might have a Courses table and a Teachers table. Each entry in Courses

would store the teacherID for a Course but not the teacherName. When we need


to retrieve a list of all Courses with the Teacher’s name, we would do a join

between these two tables. In some ways, this is great; if a teacher changes his or

her name, we only have to update the name in one place. The drawback is that if

tables are large, we may spend an unnecessarily long time doing joins on tables.

Denormalization, then, strikes a different compromise. Under denormalization,

we decide that we’re okay with some redundancy and some extra effort to update

the database in order to get the efficiency advantages of fewer joins.
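A hedged sketch of the two designs, assuming Courses(CourseID, CourseName, TeacherID) and Teachers(TeacherID, TeacherName) as described above:

-- Normalized design: the teacher's name is found with a join.
SELECT c.CourseName, t.TeacherName
FROM Courses c
JOIN Teachers t ON c.TeacherID = t.TeacherID;

-- Denormalized alternative: the name is copied into the course table,
-- so reads avoid the join but every name change must be applied here too.
CREATE TABLE CoursesDenormalized (
    CourseID    INTEGER PRIMARY KEY,
    CourseName  VARCHAR(60),
    TeacherID   INTEGER,
    TeacherName VARCHAR(50)   -- redundant copy of Teachers.TeacherName
);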

It is into this world of normalization with its order and useful arrangement of data

that the issue of denormalization is raised. Denormalization is the evaluated

introduction of instability into the stabilized (normalized) data structure.

If one went to such great lengths to arrange the data in normal form, why would

one change it? In order to improve performance is almost always the answer. In

the relational database environment, denormalization can mean fewer objects,

fewer joins, and faster access paths. These are all very valid reasons for

considering it. It is an 

evaluative decision however and should be based on the knowledge that the

normalized model shows no bias to either update or retrieval but gives advantage

to neither. Overall, denormalization should be justified and documented so future

additions to the database or increased data sharing can address the

denormalization issues. If necessary, the database might have to be


renormalized and then denormalized with new information.

The Reason for Denormalization

Only one valid reason exists for denormalizing a relational design – to enhance

performance. However, there are several indicators which will help to identify

systems and tables which are potential denormalization candidates. These are:

 ●Many critical queries and reports exist which rely upon data from more

than one table. Oftentimes these requests need to be processed in an online environment.

●Repeating groups exist which need to be processed in a group instead of

individually.

●Many calculations need to be applied to one or many columns before

queries can be successfully answered.

●Tables need to be accessed in different ways by different users during

the same timeframe.

●Many large primary keys exist which are clumsy to query and consume a

large amount of DASD when carried as foreign key columns in related

tables.

●Certain columns are queried a large percentage of the time. Consider

60% or greater to be a cautionary number flagging denormalization as an

option.

Advantages of Database denormalization:


 Increased query execution speed. As there is no need to use joins

between tables, it is possible to extract the necessary information from

one table, which automatically increases the speed of query execution.

Additionally, this solution saves memory.

 Writing queries is much easier. If the table is properly reorganized for the

most common needs, you can extract data from only one table and not

waste time looking for join keys. However, one should remember about

data redundancy and update the query accordingly.

 No need to obtain data from dictionary tables where the values are

constant over time. Tables with country dictionaries are good examples. If

a company operates in a fixed number of world markets, it seems

unnecessary to make continuous joins with the dictionary table with

countries. In this case, it is worth adding a column with the name of the

country to, for example, a sales table.

 Ability to add aggregate data, which can be used for more efficient

reporting. Certain statistics, such as the number of sales actions, average

sales, etc., are very necessary to analyze various areas of the company’s

operation. Therefore, it may be easier to define key statistics and include

them in one table than to retrieve them by joining multiple tables.

 Reduction of the number of tables in a relational database. In case of a

complex relational database architecture, obtaining data from the multiple

tables can be tricky. If the database is properly denormalized, the number


of these tables can be effectively reduced and, consequently, the

database architecture can be simplified.

Disadvantages of Database denormalization:

 Increased processing size. Due to data redundancy and possible data

duplication, the size of query processing increases.

 Increased table sizes. As a result of the denormalization of the database,

the table may significantly increase its size, which may be associated with

the load on the storage space.

 Increased costs of updating tables and inserts. In a table where data has

undergone redundancy due to the database denormalization, data update

may be a problem. For example, let’s assume that an additional column

that contains data about customer’s address has been added. Updating

this data can be burdensome and costly if the customer changes the

address. If the database is normalized, updating can only be done in the

dictionary table at a much lower cost. It is similar with inserts. Due to the

redundancy of data as a result of joining multiple tables, obtaining many

data for one table may be burdensome.

 Data may be inconsistent. Before executing the query, it is necessary to

get to know the table thoroughly and to take into account data duplication.

The query that will extract the necessary data without a risk of data

inconsistency should be comprehensively prepared.


Partitioning

A reserved part of a storage drive (hard disk, SSD) that is treated as a separate

drive. Even a single drive that takes all the storage space is assigned a partition.

For example, early Windows PCs came with the entire disk partitioned as drive

C:. New Windows PCs often come with the storage drive partitioned into C: and

D:. The main drive is C:and D: contains a recovery system in the event Windows

has to be re-installed. In addition, users may wish to have several drives for

organizational purposes, and utility programs come with every computer for

adding and modifying partitions. See primary partition, extended partition, basic

disk and dynamic disk.

On Microsoft operating systems, a hard disk is divided into drives. The first partition, called the primary partition, generally holds drive "C:", which is the active partition that boots the OS. Extended partitions can be added, such as "D:" and "E:"; these can contain more than one logical drive and are used for other storage such as programs, data files, CD-ROM, or USB drives.

A Unix-like OS such as Linux, and some older versions of Mac OS X, uses multiple partitions on a disk, including a secondary-storage area used for swapping or paging. This type of partition scheme allows directories defined by the Filesystem Hierarchy Standard (FHS), or home directories, to be assigned their own file systems. A typical Linux system has at least two partitions: one holding the file system that is mounted at "/" (the root directory) and a swap partition. Generally, an unlimited number of partitions can be created in a Linux OS. A Mac OS X system uses one partition for the whole file system; it uses a swap file within the file system instead of a swap partition.

The partitioning can be done by either building separate smaller databases (each

with its own tables, indices, and transaction logs), or by splitting selected

elements, for example just one table.

 Horizontal partitioning involves putting different rows into different


tables. For example, customers with ZIP codes less than 50000 are stored in
CustomersEast, while customers with ZIP codes greater than or equal to
50000 are stored in CustomersWest. The two partition tables are then
CustomersEast and CustomersWest, while a view with a union might be
created over both of them to provide a complete view of all customers (a sketch follows this list).
 Vertical partitioning involves creating tables with fewer columns and
using additional tables to store the remaining columns. Generally, this practice is known as normalization.[1] However, vertical partitioning extends


further and partitions columns even when already normalized. This type of
partitioning is also called "row splitting", since rows get split by their columns,
and might be performed explicitly or implicitly. Distinct physical machines
might be used to realize vertical partitioning: Storing infrequently used or very
wide columns, taking up a significant amount of memory, on a different
machine, for example, is a method of vertical partitioning. A common form of
vertical partitioning is to split static data from dynamic data, since the former
is faster to access than the latter, particularly for a table where the dynamic
data is not used as often as the static. Creating a view across the two newly
created tables restores the original table with a performance penalty, but
accessing the static data alone will show higher performance. A columnar
database can be regarded as a database that has been vertically partitioned
until each column is stored in its own table.
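As a rough sketch of the horizontal partitioning example above (the table and column definitions are assumed for illustration), the two partition tables can be presented as one logical table through a view:

CREATE TABLE CustomersEast (customer_id INT PRIMARY KEY, name VARCHAR(50), zip_code INT);
CREATE TABLE CustomersWest (customer_id INT PRIMARY KEY, name VARCHAR(50), zip_code INT);

-- A view with a union provides a complete view of all customers.
CREATE VIEW Customers AS
  SELECT * FROM CustomersEast   -- rows with zip_code < 50000
  UNION ALL
  SELECT * FROM CustomersWest;  -- rows with zip_code >= 50000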

Indexing makes columns faster to query by creating pointers to where data is

stored within a database.

Imagine you want to find a piece of information that is within a large database. To

get this information out of the database the computer will look through every row

until it finds it. If the data you are looking for is towards the very end, this query


would take a long time to run.

If the table was ordered alphabetically, searching for a name could happen a lot

faster because we could skip looking for the data in certain rows. If we wanted to

search for “Zack” and we know the data is in alphabetical order we could jump

down to halfway through the data to see if Zack comes before or after that row.

We could then half the remaining rows and make the same comparison.


An index is a structure that holds the field the index is sorting and a pointer from

each record to their corresponding record in the original table where the data is

actually stored. Indexes are used in things like a contact list where the data may

be physically stored in the order you add people’s contact information but it is

easier to find people when listed out in alphabetical order.

Let’s look at the index from the previous example and see how it maps back to

the original Friends table:

We can see here that the table has the data stored ordered by an incrementing id

based on the order in which the data was added. And the Index has the names

stored in alphabetical order.
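As a minimal sketch (the friends table and the index name here are hypothetical), the index described above could be created and used like this:

CREATE TABLE friends (
  id   INT PRIMARY KEY,       -- rows are stored in the order they were added
  name VARCHAR(50) NOT NULL
);

-- The index keeps the name values sorted, each with a pointer back to its row.
CREATE INDEX idx_friends_name ON friends (name);

-- A lookup by name can now use the index instead of scanning every row.
SELECT * FROM friends WHERE name = 'Zack';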

When to use Indexes?

Indexes are meant to speed up the performance of a database, so use indexing

whenever it significantly improves the performance of your database. As your



database becomes larger and larger, the more likely you are to see benefits from

indexing.

When not to use Indexes?

When data is written to the database, the original table (the clustered index) is

updated first and then all of the indexes off of that table are updated. Every time

a write is made to the database, the indexes are unusable until they have

been updated. If the database is constantly receiving writes, then the indexes will

never be usable. This is why indexes are typically applied to databases in data

warehouses that get new data updated on a scheduled basis (off-peak hours)

and not production databases which might be receiving new writes all the time.


CHAPTER 6:

INTRODUCTION TO SQL AND


ADVANCED SQL

Researched and presented by:

Briones, Joshua
Ramos, Eden Marie C.
Reyes, Ana Marie


Introduction and History of SQL

SQL - the most common language for relational systems.

SQL stands for Structured Query Language

 Initially called SEQUEL (Structured English Query Language) and based on IBM's earlier language called SQUARE (Specifying Queries As Relational Expressions). SEQUEL was later renamed to SQL by dropping the vowels, because SEQUEL was a trademark registered by the Hawker Siddeley aircraft company.

 A TABLE is also called a relation: a data set organized into rows and columns.

Pronounced “S-Q-L” by some and “sequel” by others

 SQL stands for Structured Query Language

 SQL lets you access and manipulate databases

 SQL became a standard of the American National Standards Institute

(ANSI) in 1986, and of the International Organization for Standardization

(ISO) in 1987


 SQL is a domain-specific language used in programming and designed for

managing data held in a relational database management system

(RDBMS), or for stream processing in a relational data stream

management system (RDSMS). It is particularly useful in handling

structured data, i.e. data incorporating relations among entities and

variables.

The first commercial DBMS that supported SQL was Oracle in 1979. Oracle is

now available in mainframe, client/server, and PC-based platforms for many

operating systems, including various UNIX, Linux, and Microsoft Windows

operating systems. IBM’s DB2, Informix, and Microsoft SQL Server are available

for this range of operating systems also.

The concepts of relational database technology were first articulated in 1970. Early IBM relational prototypes used a language called Sequel, developed at the San Jose IBM Research Laboratory.

Purposes of SQL

The following were the original purposes of the SQL standard:

1. To specify the syntax and semantics of SQL data definition and manipulation languages

2. To define the data structures and basic operations for designing, accessing, maintaining, controlling, and protecting an SQL database

3. To provide a vehicle for portability of database definition and application modules between conforming DBMSs

4. To specify both minimal (Level 1) and complete (Level 2) standards, which permit different degrees of adoption in products

5. To provide an initial standard, although incomplete, that will be enhanced later to include specifications for handling such topics as referential integrity, transaction management, user-defined functions, join operators beyond the equi-join, and national character sets

Advantages of SQL 

 Reduced training costs. Training in an organization can concentrate on one language. A large labor pool of IS professionals trained in a common language reduces retraining for newly hired employees.

 Productivity. IS professionals can learn SQL thoroughly and become proficient with it from continued use. An organization can afford to invest in tools to help IS professionals become more productive. Because they are familiar with the language in which programs are written, programmers can more quickly maintain existing programs.

 Application portability. Applications can be moved from one context to another when each environment uses SQL. Further, it is economical for the computer software industry to develop off-the-shelf application software when there is a standard language.

 Application longevity. A standard language tends to remain so for a long time; hence there will be little pressure to rewrite old applications. Rather, applications will simply be updated as the standard language is enhanced or new versions of DBMSs are introduced.

 Reduced dependence on a single vendor. When a nonproprietary language is used, it is easier to use different vendors for the DBMS, training and educational services, application software, and consulting assistance; further, the market for such vendors will be more competitive, which may lower prices and improve service.

 Cross-system communication. Different DBMSs and application programs can more easily communicate and cooperate in managing data and processing user programs.

Disadvantages of SQL

 A standard may be difficult to change (because so many vendors have a

vested interest in it), so fixing deficiencies may take considerable effort. 

 When a standard is extended with proprietary features, using special features added to SQL by a particular vendor may result in the loss of some advantages, such as application portability.

Data Table 

Writing single-table queries using SQL Commands. 

Name        Surname    Subject   Age   PassMark
Joshua      Briones    PCE007    16    1.00
Ana Marie   Reyes      PCE007    21    1.25
Eden        Ramos      PCE007    21    1.25
Mike        Antolino   PCE007    21    1.50

SELECT – used to get data from tables in a database. It is also one of the most

important commands in SQL. 
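For example, assuming the sample table above is stored as a table named student, a simple single-table query might look like this:

SELECT Name, Surname, PassMark
FROM student
WHERE Age = 21          -- keep only the 21-year-old students
ORDER BY PassMark;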

Elements of the SELECT Statement

The purpose of a SELECT statement is to query tables, apply some logical

manipulation, and return a result. In this section, I talk about the phases involved

in logical query processing. I describe the logical order in which the different

query clauses are processed, and what happens in each phase.

Note that by “logical query processing,” I’m referring to the conceptual way in

which standard SQL defines how a query should be processed and the final

result achieved. Don’t be alarmed if some logical processing phases that I

describe here seem inefficient. The Microsoft SQL Server engine doesn’t have to


follow logical query processing to the letter; rather, it is free to physically process

a query differently by rearranging processing phases, as long as the final result

would be the same as that dictated by logical query processing. SQL Server can

—and in fact, often does—make many shortcuts in the physical processing of a

query.

The FROM Clause

The FROM clause is the very first query clause that is logically processed. In this

clause, you specify the names of the tables that you want to query and table

operators that operate on those tables.

The WHERE Clause

In the WHERE clause, you specify a predicate or logical expression to filter the

rows returned by the FROM phase. Only rows for which the logical expression

evaluates to TRUE are returned by the WHERE phase to the subsequent logical

query processing phase. In the sample query in Listing 2-1, the WHERE phase

filters only orders placed by customer 71.
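The referenced listing is not reproduced here, but a minimal sketch in the same spirit, assuming an Orders table with a custid column, would be:

SELECT orderid, custid, orderdate
FROM Orders
WHERE custid = 71;   -- only rows for which the predicate evaluates to TRUE are returned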

Referential integrity means that a value in the matching column on the many

side must correspond to a value in the primary key for some row in the table on

the one side or be NULL.

Referential integrity is a property of data stating that all its references are valid.

In the context of relational databases, it requires that if a value of one attribute


(column) of a relation (table) references a value of another attribute (either in the

same or a different relation), then the referenced value must exist.[1]

For referential integrity to hold in a relational database, any column in a base

table that is declared a foreign key can only contain either null values or values

from a parent table's primary key or a candidate key.[2] In other words, when a

foreign key value is used it must reference a valid, existing primary key in the

parent table. For instance, deleting a record that contains a value referred to by a

foreign key in another table would break referential integrity. Some relational

database management systems (RDBMS) can enforce referential integrity,

normally either by deleting the foreign key rows as well to maintain integrity, or by

returning an error and not performing the delete. Which method is used may be

determined by a referential integrity constraint defined in a data dictionary.

The adjective 'referential' describes the action that a foreign key performs,

'referring' to a linked column in another table. In simple terms, 'referential

integrity' guarantees that the target 'referred' to will be found. A lack of referential

integrity in a database can lead relational databases to return incomplete data,

usually with no indication of an error.
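As a minimal sketch of how an RDBMS is told to enforce referential integrity (the table and column names are illustrative only):

CREATE TABLE customer (
  customer_id INT PRIMARY KEY,
  name        VARCHAR(50) NOT NULL
);

CREATE TABLE account (
  account_id  INT PRIMARY KEY,
  customer_id INT,
  -- every non-NULL customer_id in account must match an existing customer
  FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);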

DISCUSS SQL:1999 and SQL:2016 STANDARDS

SQL:1999 (also called SQL 3) was the fourth revision of the SQL standard.

Starting with this version, the standard name used a colon instead of a hyphen to


be consistent with the names of other ISO standards. This standard was

published in multiple installments between 1999 and 2002.

The first installment of SQL:1999 had five parts:

 Part 1: SQL/Framework (100 pages) defined the fundamental concepts of

SQL.

 Part 2: SQL/Foundation (1050 pages) defined the fundamental syntax and

operations of SQL: types, schemas, tables, views, query and update

statements, expressions, and so forth. This part is the most important for

regular SQL users.

 Part 3: SQL/CLI (Call Level Interface) (514 pages) defined an application

programming interface for SQL.

 Part 4: SQL/PSM (Persistent Stored Modules) (193 pages) defined

extensions that make SQL procedural.

 Part 5: SQL/Bindings (270 pages) defined methods for embedding SQL

statements in application programs written in a standard programming

language.

Three more parts, also considered part of SQL:1999, were published later.

SQL:1999 introduced many important features that are part of modern SQL.

Among the most important were common table expressions (CTEs). This is a very useful feature that lets you organize long and complex SQL queries and make them more readable. When the WITH [RECURSIVE] syntax is used, CTEs can also recursively process hierarchical data.

SQL:1999 also introduced OLAP (Online Analytical Processing) capabilities,

which includes features that are helpful when preparing business reports.

The GROUP BY extensions ROLLUP, CUBE, and GROUPING SETS entered

the standard at this time. 

Some minor additions in SQL:1999 standard include using expressions in

ORDER BY, the inclusion of data types for large binary objects (LOB and CLOB),

and the introduction of triggers.
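A minimal sketch of a recursive CTE, assuming a hypothetical employees table with employee_id and manager_id columns (per the standard syntax; some systems omit the RECURSIVE keyword):

WITH RECURSIVE subordinates AS (
  SELECT employee_id, manager_id
  FROM employees
  WHERE manager_id = 1            -- direct reports of employee 1
  UNION ALL
  SELECT e.employee_id, e.manager_id
  FROM employees e
  JOIN subordinates s ON e.manager_id = s.employee_id
)
SELECT * FROM subordinates;       -- the whole reporting hierarchy under employee 1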

The size of the SQL standard grew significantly between 1992 and 1999. The

SQL-92 standard had almost 600 pages, but it was still accessible to regular SQL

users. Books like A Guide to the SQL Standard by Christopher Date and Hugh

Darwen discussed and explained the SQL-92 standard.

Starting with SQL:1999 the standard – now over 2,000 pages – was no longer

accessible to regular SQL users. It has become a resource for database experts

and database vendors. The standard guides the development of SQL in major

databases; it shows which new language features are worth implementing to stay

current. It also standardizes the syntax of new SQL features, making sure that

major databases implement them in a similar way, using similar syntax and

semantics.


The change in the role of the SQL standard is emphasized by the fact that there

is no longer an official body that certifies compliance with the standard. Until

1996, the National Institute of Standards and Technology (NIST) data

management standards program certified SQL DBMS compliance with the SQL

standard. Now, vendors self-certify the compliance of their products.

SQL:2003 and beyond

In the 21st century, the SQL standard has been regularly updated.

The SQL:2003 standard was published on March 1, 2004. Its major addition

was window functions, a powerful analytical feature that allows you to compute summary statistics without collapsing rows. Window functions significantly increased the expressive power of SQL. They are extremely useful in preparing all kinds of business reports, analyzing time series data, and analyzing trends. The addition of window functions to the standard coincided with the popularity of OLAP and data warehouses. People started using databases to make data-driven business decisions. This trend is only gaining momentum, thanks to the growing amount of data that all businesses collect. SQL:2003 also introduced XML-related functions, sequence generators, and identity columns.
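A minimal sketch of a window function query, assuming a hypothetical sales table with region and amount columns:

SELECT region,
       amount,
       AVG(amount) OVER (PARTITION BY region)  AS avg_in_region,
       RANK()      OVER (ORDER BY amount DESC) AS overall_rank
FROM sales;   -- summary statistics are computed without collapsing the detail rows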


After 2004, there were no major ground-breaking additions to the language. The

changes in the SQL standard reflected the changes in technology at the time.

SQL:2003 introduced XML-related functions to allow for interoperability between

databases and XML technologies, which were the hot new thing in the early

2000s. SQL:2006 further specified how to use SQL with XML. It was not a

revision of the complete SQL standard, just Part 14, which deals with SQL-XML

interoperability.

The next revisions of the standard brought minor enhancements to the

language. SQL:2008 legalized the use of ORDER BY outside cursor

definitions(!), and added INSTEAD OF triggers, the TRUNCATE statement, and

the FETCH clause. SQL:2011 added temporal data and some enhancements to

window functions and the FETCH clause.

SQL:2016 added row pattern matching and polymorphic table functions as well

as long-awaited JSON support. In the 2010s, JSON replaced XML as the

common data exchange format; modern Internet applications use JSON instead

of XML as their data format. The emerging NoSQL movement also popularized

JSON; document databases store JSON files, and key-value stores are

compatible with the JSON format. The SQL standard added JSON support to

allow for interoperability with modern applications and new types of databases.

The current SQL standard is SQL:2019. It added Part 15, which defines

multidimensional array support in SQL.


SQL:2016 or ISO/IEC 9075:2016 (under the general title "Information technology

– Database languages – SQL") is the eighth revision of the ISO (1987)

and ANSI (1986) standard for the SQL database query language. It was formally

adopted in December 2016. The standard consists of nine parts.

SQL:2016 New features

SQL:2016 introduced 44 new optional features. 22 of them belong to the JSON

functionality, ten more are related to polymorphic table functions. The additions

to the standard include:

 JSON: Functions to create JSON documents, to access parts of JSON

documents and to check whether a string contains valid JSON data

 Row Pattern Recognition: Matching a sequence of rows against a regular

expression pattern

 Date and time formatting and parsing

 LISTAGG: A function to transform values from a group of rows into a

delimited string

 Polymorphic table functions: table functions without predefined return type

 New data type DECFLOAT

Advanced SQL

Processing Multiple Tables

Now that we have explored some of the possibilities for working with a single

table, it’s time to bring out the light sabers, jet packs, and tools for heavy lifting:


We will work with multiple tables simultaneously. The power of RDBMSs is

realized when working with multiple tables. When relationships exist among

tables, the tables can be linked together in queries. Remember from the previous

chapter that these relationships are established by including a common

column(s) in each table where a relationship is needed. In most cases this is

accomplished by setting up a primary key—foreign key relationship, where the

foreign key in one table references the primary key in another, and the values in

both come from a common domain. We can use these columns to establish a link

between two tables by finding common values in the columns. 

The linking of related tables varies among different types of relational systems. In

SQL, the WHERE clause of the SELECT command is also used for multiple-table

operations. In fact, SELECT can include references to two, three, or more tables

in the same command. As illustrated next, SQL has two ways to use SELECT for

combining data from related tables.

The most frequently used relational operation, which brings together data from

two or more related tables into one resultant table, is called a join. Originally,

SQL specified a join implicitly by referring in a WHERE clause to the matching of

common columns over which tables were joined. Since SQL-92, joins may also

be specified in the FROM clause. In either case, two tables may be joined when

each contains a column that shares a common domain with the other. As

mentioned previously, a primary key from one table and a foreign key that

references the table with the primary key will share a common domain and are


frequently used to establish a join. In special cases, joins will be established

using columns that share a common domain but not the primary-foreign key

relationship, and that also works (e.g., we might join customers and salespersons

based on common postal codes, for which there is no relationship in the data

model for the database). The result of a join operation is a single table. Selected

columns from all the tables are included. Each row returned contains data from

rows in the different input tables where values for the common columns match.

What Is an SQL JOIN?

In other guides, you have learned how to write basic SQL queries to retrieve data

from a table. In real-life applications, you would need to fetch data from multiple

tables to achieve your goals. To do so, you would need to use SQL joins. In this

guide, you will learn how to query data from multiple tables using joins.

A JOIN clause is used when you need to combine data from two or more tables

into one data set. Records from both tables are matched based on a condition

(also called a JOIN predicate) you specify in the JOIN clause. If the condition is

met, the records are included in the output. The following explanation of the SQL JOIN concept and the different JOIN types, with examples, is adapted from an article on learnsql.com. So, before we go any further, let's take a look at the tables that we are going to use.

Get to Know the Database


We are going to use tables from a fictional bank database. The first table

is called account and it contains data related to customer bank accounts:

account_id   overdraft_amount   customer_id   type_id   segment
2556889      12000              4             2         RET
1323598795   1550               1             1         RET
2225546      5000               5             2         RET
5516229      6000               4             5         RET
5356222      7500               5             5         RET
2221889      5400               1             2         RET
2455688      12500              50            2         CORP
1322488656   2500               51            1         CORP
1323598795   3100               52            1         CORP
1323111595   1220               53            1         CORP

account table

This table contains 10 records (10 accounts) and five columns:

 account_id – Uniquely identifies each account.

 overdraft_amount – The overdraft limit for each account.

 customer_id – Uniquely identifies each customer.

 type_id – Identifies the type of that account.

 segment – Contains the values ‘RET’ (for retail clients) and ‘CORP’ (for

corporate clients).

The second table is called customer and contains customer-related data:


customer_id   name    lastname   gender   marital_status
1             MARC    TESCO      M        Y
2             ANNA    MARTIN     F        N
3             EMMA    JOHNSON    F        Y
4             DARIO   PENTAL     M        N
5             ELENA   SIMSON     F        N
6             TIM     ROBITH     M        N
7             MILA    MORRIS     F        N
8             JENNY   DWARTH     F        Y

customer table

This table contains eight records and five columns:

 customer_id – Uniquely identifies each customer.

 name – The customer’s first name.

 lastname – The customer’s last name.

 gender– The customer’s gender (M or F).

 marital_status – If the customer is married (Y or N).

Now that we have these two tables, we can combine them to display additional

results related to customer or account data. JOIN can help us to get answers to

questions like:

1. Who owns each account in the account table?

2. How many accounts does Marc Tesco have?

3. How many accounts are owned by a female customer?

4. What is the total overdraft amount for all of Emma Johnson’s accounts?

To answer each of these questions, we need to combine two tables

(account and customer) using a column that appears in both tables (in this


case, customer_id). Once we merge the two tables, we will have account and

customer information in a single output.

Keep in mind that in the account table we have some customers that can’t be

found in the customer table. (Info about corporate clients is stored somewhere

else.) Also, keep in mind that some customer IDs are not present in

the account table; some customers don't have accounts.

There are several ways we can combine two tables. Or, put another way, we can

say that there are several different SQL JOIN types.

SQL’s 4 JOIN Types

SQL JOIN types include:

 INNER JOIN (also known as a ‘simple’ JOIN). This is the most common

type of JOIN.

 LEFT JOIN (or LEFT OUTER JOIN)

 RIGHT JOIN (or RIGHT OUTER JOIN)

 FULL JOIN (or FULL OUTER JOIN)

 Self joins and cross joins are also possible in SQL

Let's dive deeper into the first four SQL JOIN types. I will use an example to

explain the logic and the syntax of each type. Sometimes people use Venn

diagrams when explaining SQL JOIN types. I’m not going to use them here, but if

that’s your thing then check out the article HOW TO LEARN SQL JOINS.


INNER JOIN

INNER JOIN is used to display matching records from both tables. This is also

called a simple JOIN; if you omit the INNER keyword (or any other keyword,

like LEFT, RIGHT, or FULL) and just use JOIN, this is the type of join you’ll get

by default.

There are usually two (or more) tables in a join statement. We call them the left

and right tables. The left table is in the FROM clause – and thus to the left of

the JOIN keyword. The right table is between the JOIN and ON keywords, or to

the right of the JOIN keyword.If the JOIN condition is met in an INNER JOIN, that

record is included in the data set. It can be from either table. If the record does

not match the criteria, it’s not included. The image below shows what would

happen if the color blue was the join criteria for the left and right tables:

Let's take a look at how INNER JOIN works in our example. I'm going to do a

simple JOIN on account and customer to

display account and customer information in one output:

SELECT account.*,
       customer.name,
       customer.lastname,
       customer.gender,
       customer.marital_status
FROM account
JOIN customer
ON account.customer_id = customer.customer_id

Here is a short explanation of what’s going on:

 I’m using JOIN because we are merging

the account and customer tables.


 The JOIN predicate here is defined by equality: account.customer_id =

customer.customer_id

In other words, records are matched by values in the customer_id column:


 Records that share the same customer ID value are matched. (They are

shown in color in the above image.) Records that don’t have a match in

either table (shown in gray) are not included in the result set.

 For records that have a match, all attributes from the account table are

displayed in the result set. The name, last name, gender, and marital

status attributes from the customer table are also displayed.

After running this code, SQL returns the following:

INNER JOIN result

As we mentioned earlier, only colored (matching) records were returned; all

others are discarded. In business terms, we displayed all the retail accounts with

detailed information about their owners. Non-retail accounts were not displayed

because their customer information is not stored in the customer table.

LEFT JOIN

Sometimes you’ll need to keep all records from the left table – even if some don't

have a match in the right table. In the last example, the gray rows were not

displayed in the output. Those are corporate accounts. In some cases, you may

want to have them in the data set, even if their customer data is left empty. If we

would like to return unpaired records from the left table, then we should write

a LEFT JOIN. Below, you can see that the LEFT JOIN returns everything in the

left table and matching rows in the right table.

Here is how the previous query would look if we used LEFT JOIN instead

of INNER JOIN:

SELECT account.*,
       customer.name,
       customer.lastname,
       customer.gender,
       customer.marital_status
FROM account
LEFT JOIN customer
ON account.customer_id = customer.customer_id

The syntax is identical. The result, however, is not the same. Now we can see

the corporate accounts (gray records) in the results:

Left join - account with customer

Notice how attributes like name, last name, gender, and marital status in the last

four rows are populated with NULLs. This is because these gray rows don’t have

matches in the customer table (i.e. customer_id values of 50, 51 ,52 , and 53

are not present in the customer table). Thus, those attributes have been left

NULL in this result.

RIGHT JOIN


Similar to LEFT JOIN, RIGHT JOIN keeps all records from the right table (even if

there is no matching record in the left table). Here’s that familiar image to show

you how it works:


Once again, we use the same example. However, we’ve replaced LEFT

JOIN with RIGHT JOIN:

SELECT account.account_id,
       account.overdraft_amount,
       account.type_id,
       account.segment,
       account.customer_id,
       customer.customer_id,
       customer.name,
       customer.lastname,
       customer.gender,
       customer.marital_status
FROM account
RIGHT JOIN customer
ON account.customer_id = customer.customer_id

The syntax is mostly the same. I’ve made one more small change: In addition

to account.customer_id, I’ve also added customer.customer_id column to the

result set. I did this to show you what happens to records from

the customer table that don't have a match on the left (account) table.


Here is the result:

RIGHT JOIN result

As you can see, all records from the right table have been included in the result

set. Keep in mind:

 Unmatched customer IDs from the right table (numbers 2,3, 6,7, and 8,

shown in gray) have their account attributes set to NULL in this result set.

They are retail customers that don’t have a bank account – and thus no

records in the account table.

 You might expect that the resulting table will have eight records because

that is the total number of records in the customer table. However, this is

not the case. We have 11 records because customer IDs 1, 4, and 5 each

have two accounts in the account table. All possible matches are

displayed.


FULL (OUTER) JOIN

I’ve shown you how to keep all records from the left or right tables. But what if

you want to keep all records from both tables? In our case, you’d want to display

all matching records plus all corporate accounts plus all customers without

accounts. To do this, you can use FULL OUTER JOIN. This JOIN type will pair all matching records and will also display all unmatched records from both tables. Missing attributes will be populated with NULLs. Have a look at the

image below:

Here is the FULL OUTER JOIN syntax:

SELECT account.*,
       CASE WHEN customer.customer_id IS NULL
            THEN account.customer_id
            ELSE customer.customer_id
       END AS customer_id,
       customer.name,
       customer.lastname,
       customer.gender,
       customer.marital_status
FROM account
FULL JOIN customer
ON account.customer_id = customer.customer_id;

Now the result looks like this:

Full outer join result

Notice how the last five rows have account attributes populated with NULLs. This

is because these customers do not have records in the account table. Notice

also how customers 50, 51, 52, and 53 have first or last names and other

attributes from the customer table populated with NULLs. This is because they


don't exist in the customer table. Here, customer_id in the result table is never

NULL because we defined customer_id with a CASE WHEN statement:

CASE WHEN customer.customer_id IS NULL
     THEN account.customer_id
     ELSE customer.customer_id
END AS customer_id

This actually means that customer_id in the result table is a combination

of account.customer_id and customer.customer_id (i.e. when one is NULL, use

the other one). We could also display both columns in the output, but this CASE

WHEN statement is more convenient.

Most Common Questions asked about Joins

Question 1: What is a Natural Join and in which situations is a natural join

used?

Solution:

A Natural Join is a join operation that gives you an output based on the columns common to both of the tables between which the join is implemented. To understand the situations in which a natural join is used, you need to understand the difference between a Natural Join and an Inner Join.

The main difference between the Natural Join and the Inner Join lies in the number of columns returned. Refer below for an example.


Now, if you apply INNER JOIN on these 2 tables, you will see an output as

below:

If you apply NATURAL JOIN, on the above two tables, the output will be as

below:

From the above example, you can clearly see that the number of columns returned by the Inner Join is greater than the number of columns returned by the Natural Join. So, if you wish to get an output with fewer columns, you can use a Natural Join.
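Since the example tables referred to above are not reproduced here, here is a minimal sketch assuming two hypothetical tables, employees(emp_id, name, dept_id) and departments(dept_id, dept_name):

-- INNER JOIN: the join column appears once for each table it comes from
SELECT e.name, e.dept_id, d.dept_id, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

-- NATURAL JOIN: the shared dept_id column appears only once in the output
SELECT name, dept_id, dept_name
FROM employees
NATURAL JOIN departments;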

Question 2: How to map many-to-many relationships using joins?

Solution:

To map many to many relationships using joins, you need to use two JOIN

statements.

For example, if we have three tables(Employees, Projects and Technologies),

and let us assume that each employee is working on a single project. So, one


project cannot be assigned to more than one employee. So, this is basically, a

one-to-many relationship.

Now, similarly, if you consider that, a project can be based on multiple

technologies, and any technology can be used in multiple projects, then this kind

of relationship is a many-to-many relationship.

To use joins for such relationships, you need to structure your database with 2

foreign keys. So, to do that, you have to create the following 3 tables:

 Projects

 Technologies

 projects_to_technologies

The projects_to_technologies table holds the combinations of project-technology

in every row. This table maps the items on the projects table to the items on the

technologies table so that multiple projects can be assigned to one or more

technologies.

Once the tables are created, use the following two JOIN statements to link all the

above tables together:

 projects_to_technologies to projects

 projects_to_technologies to technologies
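A minimal sketch of those two JOIN statements (the column names are assumed for illustration):

SELECT p.project_name,
       t.technology_name
FROM projects p
JOIN projects_to_technologies pt ON pt.project_id   = p.project_id      -- first JOIN
JOIN technologies t              ON t.technology_id = pt.technology_id; -- second JOIN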

Question 3: What is a Hash Join?

Solution:


Hash joins are also a type of joins which are used to join large tables or in an

instance where the user wants most of the joined table rows.

The Hash Join algorithm is a two-step algorithm. Refer below for the steps:

 Build phase: Create an in-memory hash index on the left side input

 Probe phase: Go through the right side input, each row at a time to find

the matches using the index created in the above step.

Question 4: What is Self & Cross Join?

Solution:

Self Join

SELF JOIN in other words is a join of a table to itself. This implies that each row

in a table is joined with itself.

Cross Join

The CROSS JOIN is a type of join in which a join clause is applied to each row of

a table to every row of the other table. Also, when the WHERE condition is used,

this type of JOIN behaves as an INNER JOIN, and when the WHERE condition is

not present, it behaves like a CARTESIAN product.
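As a minimal sketch (the employees table and its manager_id column are hypothetical; account and customer are the sample tables used earlier):

-- Self join: each employee row is paired with the row of its manager
SELECT e.name AS employee, m.name AS manager
FROM employees e
JOIN employees m ON e.manager_id = m.emp_id;

-- Cross join: every account row combined with every customer row (Cartesian product)
SELECT a.account_id, c.customer_id
FROM account a
CROSS JOIN customer c;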

Question 5: Can you JOIN 3 tables in SQL?

Solution:

Yes. To perform a JOIN operation on 3 tables, you need to use 2 JOIN

statements. You can refer to the second question for an understanding of how to

join 3 tables with an example.

What is subquery in SQL?


A subquery is a SQL query nested inside a larger query.

 A subquery may occur in :

o - A SELECT clause

o - A FROM clause

o - A WHERE clause

 The subquery can be nested inside a SELECT, INSERT, UPDATE, or

DELETE statement or inside another subquery.

 A subquery is usually added within the WHERE Clause of another SQL

SELECT statement.

 You can use the comparison operators, such as >, <, or =. The

comparison operator can also be a multiple-row operator, such as IN,

ANY, or ALL.

 A subquery is also called an inner query or inner select, while the

statement containing a subquery is also called an outer query or outer

select.

 The inner query executes first before its parent query so that the results of

an inner query can be passed to the outer query.

You can use a subquery in a SELECT, INSERT, DELETE, or UPDATE statement

to perform the following tasks:

 Compare an expression to the result of the query.

 Determine if an expression is included in the results of the query.


 Check whether the query selects any rows.

Syntax :

 The subquery (inner query) executes once before the main query (outer

query) executes.

 The main query (outer query) uses the subquery result.

SQL Subqueries Example :

In this section, you will learn the requirements of using subqueries. We have the

following two tables 'student' and 'marks' with common field 'StudentID'.

Student Marks 

Now we want to write a query to identify all students who get better marks than

that of the student whose StudentID is 'V002', but we do not know the marks of

'V002'.


- To solve the problem, we require two queries. One query returns the marks

(stored in Total_marks field) of 'V002' and a second query identifies the students

who get better marks than the result of the first query.

First query:

SELECT *  

FROM `marks`  

WHERE studentid = 'V002';

Query result:

The result of the query is 80.

- Using the result of this query, here we have written another query to identify the

students who get better marks than 80. Here is the query :

Second query:

SELECT a.studentid, a.name, b.total_marks

FROM student a, marks b

WHERE a.studentid = b.studentid

AND b.total_marks >80;

Query result:


The above two queries identified students who get better marks than the student whose StudentID is 'V002' (Abhay).

You can combine the above two queries by placing one query inside the other.

The subquery (also called the 'inner query') is the query inside the parentheses.

See the following code and query result:

SQL Code:

SELECT a.studentid, a.name, b.total_marks

FROM student a, marks b

WHERE a.studentid = b.studentid AND b.total_marks >

(SELECT total_marks

FROM marks

WHERE studentid =  'V002');

Query result:

Pictorial Presentation of SQL Subquery:


Subqueries: General Rules

A subquery SELECT statement is very similar to the SELECT statement used to begin a regular or outer query. Here is the syntax of a subquery:

Syntax:

(SELECT [DISTINCT] subquery_select_argument
 FROM {table_name | view_name}
      [{table_name | view_name}] ...
 [WHERE search_conditions]
 [GROUP BY aggregate_expression [, aggregate_expression] ...]
 [HAVING search_conditions])

Subqueries: Guidelines

There are some guidelines to consider when using subqueries:

 A subquery must be enclosed in parentheses. 

 A subquery must be placed on the right side of the comparison operator. 

 Subqueries cannot manipulate their results internally, therefore ORDER

BY clause cannot be added into a subquery. You can use an ORDER BY

clause in the main SELECT statement (outer query) which will be the last

clause.

 Use single-row operators with single-row subqueries. 

 If a subquery (inner query) returns a null value to the outer query, the

outer query will not return any rows when using certain comparison

operators in a WHERE clause.

Type of Subqueries

 Single row subquery: Returns zero or one row.

 Multiple row subquery: Returns one or more rows.

 Multiple column subqueries: Returns one or more columns.


 Correlated subqueries: Reference one or more columns in the outer SQL

statement. The subquery is known as a correlated subquery because the

subquery is related to the outer SQL statement.

 Nested subqueries: Subqueries are placed within another subquery.

Understanding Correlated and Uncorrelated Sub-queries in SQL

Sub-queries are queries within another query.  The result of the inner sub-query

is fed to the outer query, which uses that to produce its outcome. If that outer

query is itself the inner query to a further query, then the query will continue until

the final outer query completes.

There are two types of sub-queries in SQL however, correlated sub-queries and

uncorrelated sub-queries. Let’s take a look at these.

Uncorrelated Sub-query

An uncorrelated sub-query is a type of sub-query where the inner query doesn't

depend upon the outer query for its execution. It can complete its execution as a

standalone query. Let us explain uncorrelated sub-queries with the help of an

example.

Suppose, you have database “schooldb” which has two tables: student and

department.  A department will have many students. This means that the student

table has a column “dep_id” which contains the id of the department to which that

student belongs. Now, suppose we want to retrieve records of all students from

the “Computer” department.


The sub-query used in this case will be uncorrelated sub-query since the inner

query will retrieve the id of the computer department from the department table;

the result of this inner query will be directly fed into the outer query which

retrieves records of students from the student table where “dep_id” column’s

value is equal to value retrieved by inner query.

The inner query which retrieves the id of the department using name can be

executed as standalone query as well.

Correlated Sub-query

A correlated sub-query is a type of query, where inner query depends upon the

outcome of the outer query in order to perform its execution.

Suppose we have a student and department table in “schooldb” as discussed

above. We want to retrieve the name, age and gender of all the students whose

age is greater than the average age of students within their department.

In this case, the outer query will retrieve records of all the students iteratively and

each record is passed to the inner query. For each record, the inner query will

retrieve average age of the department for the student record passed by the

outer query. If the age of the student is greater than average age, the record of

the student will be included in the result; if not, it will be excluded. Let's see this in action.

Preparing the Data

Let’s create a database named “schooldb”. Run the following SQL in your query

window:
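CREATE DATABASE schooldb;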


The above command will create a database named “schooldb” on your database

server.

Next, we need to create a “department” table within the “schooldb” database. The

department table shall have three columns: id, name and capacity. To create

department table, execute following query:

CREATE TABLE department
(
  id INT PRIMARY KEY,
  name VARCHAR(50) NOT NULL,
  capacity INT NOT NULL
)
Next, let's add some dummy data to the table so that we can execute our sub-

queries. Execute the following to create 5 departments: English, Computer, Civil,

Maths and History.

USE schooldb;

INSERT INTO department
VALUES (1, 'English', 300),
       (2, 'Computer', 450),
       (3, 'Civil', 400),
       (4, 'Maths', 400),
       (5, 'History', 300);


Next we need to create a “student” table within our database. The student table

will have five columns: id, name, age, gender, and dep_id.

The dep_id column will act as the foreign key column and will have values from

the id column of the department table. This will create a one to many relationship

between the department and student tables. Execute following query to create

student table.

USE schooldb;

CREATE TABLE student
(
  id INT PRIMARY KEY,
  name VARCHAR(50) NOT NULL,
  gender VARCHAR(50) NOT NULL,
  age INT NOT NULL,
  dep_id INT NOT NULL
)


USE schooldb;

INSERT INTO student
  VALUES (1, 'Jolly', 'Female', 20, 4),
         (2, 'Jon', 'Male', 22, 3),
         (3, 'Sara', 'Female', 25, 4),
         (4, 'Laura', 'Female', 18, 2),
         (5, 'Alan', 'Male', 20, 3),
         (6, 'Kate', 'Female', 22, 2),
         (7, 'Joseph', 'Male', 18, 2),
         (8, 'Mice', 'Male', 23, 1),
         (9, 'Wise', 'Male', 21, 5),
         (10, 'Elis', 'Female', 27, 2);
Notice that the values in the "dep_id" column of the student table exist in the id column of the department table.

Now, let us see examples of both correlated and uncorrelated sub-queries.

Uncorrelated Sub-query Example

Let us execute an uncorrelated sub-query which retrieves records of all the

students who belong to “Computer” department.

USE schooldb;

SELECT * FROM student
WHERE dep_id =
  (
    SELECT id FROM department WHERE name = 'Computer'
  );
The output of the above SQL will be:

id   name     gender   age   dep_id
4    Laura    Female   18    2
6    Kate     Female   22    2
7    Joseph   Male     18    2
10   Elis     Female   27    2

You can see that there are two queries. The inner query retrieves id of the

“Computer” department while the outer query retrieves student records with that

id value in the dep_id column.

We know that in the case of uncorrelated sub-queries the inner query can be

executed as a standalone query and it will still work. Let's check if this is true in this

case. Execute the following query on the server.

SELECT id FROM department WHERE name = 'Computer';


The above query will execute successfully and will return 2, i.e. the id of the "Computer" department. This is an uncorrelated sub-query.

Correlated Sub-query Example


We know that in case of correlated sub-queries, the inner query depends upon

the outer query and cannot be executed as a standalone query.

Let's execute a correlated sub-query that retrieves records of all the students with age greater than the average age within their department, as discussed above.

USE schooldb;

SELECT name, gender, age
  FROM student greater
  WHERE age >
  (SELECT AVG(age)
     FROM student average
     WHERE greater.dep_id = average.dep_id);
The output of the above query will be:

name   gender   age
Kate   Female   22
Elis   Female   27
Jon    Male     22
Sara   Female   25


We know that in the case of a correlated sub-query, the inner query cannot be

executed as a standalone query. You can verify this by executing the following inner query on its own:

SELECT AVG(age)
  FROM student average
  WHERE greater.dep_id = average.dep_id
The above query will throw an error.

Other small differences between correlated and uncorrelated sub-queries are:

1. The outer query executes before the inner query in the case of a

correlated sub-query. On the other hand, in the case of an uncorrelated sub-query, the inner query executes before the outer query.

2. Correlated sub-queries are slower. They take M x N steps to execute

a query, where M is the number of records retrieved by the outer query and N is the number of iterations of the inner query. Uncorrelated sub-queries

complete execution in M + N steps.

SubQuery vs Join in SQL

Any information which you retrieve from the database using subquery can be

retrieved by using different types of joins also. SQL is flexible and it provides

different ways of doing the same thing. Some people find SQL joins confusing and subqueries, especially noncorrelated ones, more intuitive; but in terms of performance, SQL joins are generally more efficient than subqueries.

Important points about SubQuery in DBMS

1. Almost anything you want to do with a subquery can also be done using a join; it is just a matter of choice, and subqueries seem more intuitive to many users.

2. A subquery normally returns a scalar value as a result, or a result from one column if used along with the IN clause.

3. You can use subqueries in four places: as a column in the SELECT clause, in the FROM clause, in the WHERE clause, and in the HAVING clause.

4. In the case of a correlated subquery, the outer query gets processed before the inner query.

That's all about subquery in SQL. It's an important concept to learn and

understand, as both correlated and non-correlated subqueries are essential to solving

SQL query-related problems. They are not just important from the SQL interview

point of view but also from the Data Analysis point of view.

4. Understand the use of SQL in procedural languages, both standard

(e.g., PHP) and proprietary (e.g., PL/SQL).

The transaction controls help manage transaction processing, ensuring that

transactions are either completed or rolled back if errors or problems occur. The


security statements are used to control database access as well as to create

user roles and permissions.

SQL syntax is the coding format used in writing statements

Commonly used SQL statements include SELECT, INSERT, UPDATE, DELETE, CREATE, ALTER, and TRUNCATE.

The first thing to understand about SQL is that SQL isn’t a procedural language,

as are Python, C, C++, C#, and Java. To solve a problem in a procedural

language, you write a procedure — a sequence of commands that performs one

specific operation after another until the task is complete. The procedure may be

a straightforward linear sequence or may loop back on itself, but in either case,

the programmer specifies the order of execution

SQL, on the other hand, is nonprocedural. To solve a problem using SQL, simply

tell SQL what you want (as if you were talking to Aladdin’s genie) instead of

telling the system how to get you what you want. The database management

system (DBMS) decides the best way to get you what you request.

All right. You were just told that SQL is not a procedural language — and that’s

essentially true. However, millions of programmers out there (and you’re

probably one of them) are accustomed to solving problems in a procedural

manner. So, in recent years, there has been a lot of pressure to add some

procedural functionality to SQL — and SQL now incorporates features of a

procedural language: BEGIN blocks, IF statements, functions, and (yes)


procedures. With these facilities added, you can store programs at the server,

where multiple clients can use your programs repeatedly.

To illustrate what is meant by “tell the system what you want,” suppose you have

an EMPLOYEE table from which you want to retrieve the rows that correspond to

all your senior people. You want to define a senior person as anyone older than

age 40 or anyone earning more than $100,000 per year. You can make the

desired retrieval by using the following query:

SELECT * FROM EMPLOYEE WHERE Age > 40 OR Salary > 100000 ;

This statement retrieves all rows from the EMPLOYEE table where either the

value in the Age column is greater than 40 or the value in the Salary column is

greater than 100,000. In SQL, you don’t have to specify how the information is

retrieved. The database engine examines the database and decides for itself

how to fulfill your request. You need only specify what data you want to retrieve.

SQL-on-Hadoop is a class of analytical application tools that combine

established SQL-style querying with newer Hadoop data framework elements.

By supporting familiar SQL queries, SQL-on-Hadoop lets a wider group of

enterprise developers and business analysts work with Hadoop on commodity

computing clusters. Because SQL was originally developed for relational

databases, it has to be modified for the Hadoop 1 model, which uses the Hadoop Distributed File System (HDFS) and MapReduce, or for the Hadoop 2 model, which can work without either HDFS or MapReduce.


The different means for executing SQL in Hadoop environments can be divided

into (1) connectors that translate SQL into a MapReduce format; (2) "push down"

systems that forgo batch-oriented MapReduce and execute SQL within Hadoop

clusters; and (3) systems that apportion SQL work between MapReduce-HDFS

clusters or raw HDFS clusters, depending on the workload.

One of the earliest efforts to combine SQL and Hadoop resulted in the Hive data

warehouse, which featured HiveQL software for translating SQL-like queries into

MapReduce jobs. Other tools that help support SQL-on-Hadoop include BigSQL,

Drill, Hadapt, Hawq, H-SQL, Impala, JethroData, Polybase, Presto, Shark (Hive

on Spark), Spark, Splice Machine, Stinger, and Tez (Hive on Tez).

A (very) little SQL history

SQL originated in one of IBM’s research laboratories, as did relational database

theory. In the early 1970s, as IBM researchers developed early relational DBMS

(or RDBMS) systems, they created a data sublanguage to operate on these

systems. They named the pre-release version of this sublanguage SEQUEL

(Structured English QUEry Language). However, when it came time to formally

release their query language as a product, they found that another company had

already trademarked the product name “Sequel.” Therefore, the marketing

geniuses at IBM decided to give the released product a name that was different

from SEQUEL but still recognizable as a member of the same family. So they

named it SQL, pronounced ess-que-ell. Although the official pronunciation is ess-

que-ell, people had become accustomed to pronouncing it “Sequel” in the early


pre-release days and continued to do so. That practice has persisted to the

present day; some people will say “Sequel” and others will say “S-Q-L,” but they

are both talking about the same thing.

PL/SQL

PL/SQL is a combination of SQL along with the procedural features of

programming languages. It was developed by Oracle Corporation in the early

90's to enhance the capabilities of SQL. PL/SQL is one of three key programming

languages embedded in the Oracle Database, along with SQL itself and Java.

This tutorial will give you a great understanding of PL/SQL so you can proceed with Oracle database and other advanced RDBMS concepts.

The PL/SQL programming language was developed by Oracle Corporation in the late 1980s as a procedural extension language for SQL and the Oracle relational database. Following are certain notable facts about PL/SQL −

PL/SQL is a completely portable, high-performance transaction-processing

language.

PL/SQL provides a built-in, interpreted and OS independent programming

environment.

PL/SQL can also directly be called from the command-line SQL*Plus interface.

Direct calls can also be made from external programming languages to the database.


PL/SQL's general syntax is based on that of the Ada and Pascal programming languages.

Apart from Oracle, PL/SQL is available in the TimesTen in-memory database and in IBM DB2.

Features of PL/SQL

PL/SQL has the following features:

-PL/SQL is tightly integrated with SQL.

-It offers extensive error checking.

-It offers numerous data types.

-It offers a variety of programming structures.

-It supports structured programming through functions and procedures.

-It supports object-oriented programming.

-It supports the development of web applications and server pages.

Advantages of PL/SQL

PL/SQL has the following advantages:

-SQL is the standard database language and PL/SQL is strongly integrated with SQL. PL/SQL supports both static and dynamic SQL. Static SQL supports DML operations and transaction control from PL/SQL blocks, while dynamic SQL allows embedding DDL statements in PL/SQL blocks.


-PL/SQL allows sending an entire block of statements to the database at one

time. This reduces network traffic and provides high performance for the

applications.

-PL/SQL gives high productivity to programmers as it can query, transform, and

update data in a database.

-PL/SQL saves time on design and debugging through strong features, such as exception handling, encapsulation, data hiding, and object-oriented data types.

-Applications written in PL/SQL are fully portable.

-PL/SQL provides a high level of security.

-PL/SQL provides access to predefined SQL packages.

-PL/SQL provides support for object-oriented programming.

-PL/SQL provides support for developing web applications and server pages.

In this chapter, we will discuss the Environment Setup of PL/SQL. PL/SQL is not

a standalone programming language; it is a tool within the Oracle programming

environment. SQL*Plus is an interactive tool that allows you to type SQL and

PL/SQL statements at the command prompt. These commands are then sent to

the database for processing. Once the statements are processed, the results are

sent back and displayed on screen.

To run PL/SQL programs, you should have the Oracle RDBMS Server installed on your machine; this will take care of the execution of the SQL commands. This tutorial uses Oracle 11g, and you can download a trial version of Oracle 11g from the following link −

Download Oracle 11g Express Edition

You will have to download either the 32-bit or the 64-bit version of the installation, as per your operating system. Usually there are two files. We have downloaded the 64-bit version. You will follow similar steps on your operating system, no matter whether it is Linux or Solaris.

win64_11gR2_database_1of2.zip

win64_11gR2_database_2of2.zip

After downloading the above two files, you will need to unzip them into a single directory named database, and under that you will find the following sub-directories −

Oracle Sub Directories

Step 1

Let us now launch the Oracle Database Installer using the setup file. Following is

the first screen. You can provide your email ID and check the checkbox as

shown in the following screenshot. Click the Next button.

Oracle Install 1

Step 2

You will be directed to the following screen; uncheck the checkbox and click the

Continue button to proceed.


Oracle install error

Step 3

Just select the first option Create and Configure Database using the radio button

and click the Next button to proceed.

Oracle Install 2

Step 4

We assume you are installing Oracle for the basic purpose of learning and that

you are installing it on your PC or Laptop. Thus, select the Desktop Class option

and click the Next button to proceed.

Oracle Install 3

Step 5

Provide a location, where you will install the Oracle Server. Just modify the

Oracle Base and the other locations will set automatically. You will also have to

provide a password; this will be used by the system DBA. Once you provide the

required information, click the Next button to proceed.

Oracle Install 4

Step 6

Again, click the Next button to proceed.


Oracle Install 5

Step 7

Click the Finish button to proceed; this will start the actual server installation.

Oracle Install 6

Step 8

This will take a few moments, until Oracle starts performing the required

configuration.

Oracle Install 7

Step 9

Here, Oracle installation will copy the required configuration files. This should

take a moment −

Oracle Configuration

Step 10

Once the database files are copied, you will have the following dialogue box. Just

click the OK button and come out.

Oracle Configuration

Step 11

Upon installation, you will have the following final window.


Oracle Install 8

Final Step

It is now time to verify your installation. At the command prompt, use the

following command if you are using Windows −

sqlplus "/ as sysdba"

You should have the SQL prompt where you will write your PL/SQL commands

and scripts −

PL/SQL Command Prompt

Text Editor

Running large programs from the command prompt may result in inadvertently losing some of your work. It is always recommended to use command files. To use the command files −

Type your code in a text editor, like Notepad, Notepad++, or EditPlus.

Save the file with the .sql extension in the home directory.

Launch the SQL*Plus command prompt from the directory where you created

your PL/SQL file.

Type @file_name at the SQL*Plus command prompt to execute your program.

If you are not using a file to execute the PL/SQL scripts, then simply copy your

PL/SQL code and right-click on the black window that displays the SQL prompt;


use the paste option to paste the complete code at the command prompt. Finally,

just press Enter to execute the code, if it is not already executed.
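As a minimal sketch, a command file saved as hello.sql (a purely illustrative file name) could contain a small anonymous PL/SQL block:

-- hello.sql: a minimal anonymous PL/SQL block run from SQL*Plus
SET SERVEROUTPUT ON

BEGIN
   dbms_output.put_line('Hello from SQL*Plus');
END;
/

Typing @hello at the SQL*Plus prompt then executes the file; SET SERVEROUTPUT ON is needed so the DBMS_OUTPUT text is displayed on screen.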

In this chapter, we will discuss the Data Types in PL/SQL. The PL/SQL variables,

constants and parameters must have a valid data type, which specifies a storage

format, constraints, and a valid range of values. We will focus on the SCALAR

and the LOB data types in this chapter. The other two data types will be covered

in other chapters.

S.No Category & Description

1 Scalar

Single values with no internal components, such as a NUMBER, DATE, or

BOOLEAN.

2 Large Object (LOB)

Pointers to large objects that are stored separately from other data items, such

as text, graphic images, video clips, and sound waveforms.

3 Composite

Data items that have internal components that can be accessed individually. For

example, collections and records.

4 Reference

Pointers to other data items.


PL/SQL Scalar Data Types and Subtypes

PL/SQL Scalar Data Types and Subtypes come under the following categories −

S.No Data Type & Description

1 Numeric

Numeric values on which arithmetic operations are performed.

2 Character

Alphanumeric values that represent single characters or strings of characters.

3 Boolean

Logical values on which logical operations are performed.

4 Datetime

Dates and times.

PL/SQL provides subtypes of data types. For example, the data type NUMBER

has a subtype called INTEGER. You can use the subtypes in your PL/SQL

program to make the data types compatible with data types in other programs

while embedding the PL/SQL code in another program, such as a Java program.

PL/SQL Numeric Data Types and Subtypes

Following table lists out the PL/SQL pre-defined numeric data types and their

sub-types −

S.No Data Type & Description


1 PLS_INTEGER

Signed integer in range -2,147,483,648 through 2,147,483,647, represented in

32 bits

2 BINARY_INTEGER

Signed integer in range -2,147,483,648 through 2,147,483,647, represented in

32 bits

3 BINARY_FLOAT

Single-precision IEEE 754-format floating-point number

4 BINARY_DOUBLE

Double-precision IEEE 754-format floating-point number

5 NUMBER(prec, scale)

Fixed-point or floating-point number with absolute value in range 1E-130 to (but

not including) 1.0E126. A NUMBER variable can also represent 0

6 DEC(prec, scale)

ANSI specific fixed-point type with maximum precision of 38 decimal digits

7 DECIMAL(prec, scale)

IBM specific fixed-point type with maximum precision of 38 decimal digits

8 NUMERIC(prec, scale)

Floating type with maximum precision of 38 decimal digits

9 DOUBLE PRECISION

ANSI specific floating-point type with maximum precision of 126 binary digits

(approximately 38 decimal digits)

10 FLOAT

ANSI and IBM specific floating-point type with maximum precision of 126 binary

digits (approximately 38 decimal digits)

11 INT

ANSI specific integer type with maximum precision of 38 decimal digits

12 INTEGER

ANSI and IBM specific integer type with maximum precision of 38 decimal digits

13 SMALLINT

ANSI and IBM specific integer type with maximum precision of 38 decimal digits

14 REAL

Floating-point type with maximum precision of 63 binary digits (approximately 18

decimal digits)

Following is a valid declaration −



DECLARE

   num1 INTEGER;

   num2 REAL;

   num3 DOUBLE PRECISION;

BEGIN

   null;

END;

When the above code is compiled and executed, it produces the following result

PL/SQL procedure successfully completed

PL/SQL Character Data Types and Subtypes

Following is the detail of PL/SQL pre-defined character data types and their sub-

types −

S.No Data Type & Description

1 CHAR

Fixed-length character string with maximum size of 32,767 bytes


2 VARCHAR2

Variable-length character string with maximum size of 32,767 bytes

3 RAW

Variable-length binary or byte string with maximum size of 32,767 bytes, not

interpreted by PL/SQL

4 NCHAR

Fixed-length national character string with maximum size of 32,767 bytes

5 NVARCHAR2

Variable-length national character string with maximum size of 32,767 bytes

6 LONG

Variable-length character string with maximum size of 32,760 bytes

7 LONG RAW

Variable-length binary or byte string with maximum size of 32,760 bytes, not

interpreted by PL/SQL

8 ROWID

Physical row identifier, the address of a row in an ordinary table

9 UROWID


Universal row identifier (physical, logical, or foreign row identifier)

PL/SQL Boolean Data Types

The BOOLEAN data type stores logical values that are used in logical

operations. The logical values are the Boolean values TRUE and FALSE and the

value NULL.

However, SQL has no data type equivalent to BOOLEAN. Therefore (as the short sketch after this list illustrates), Boolean values cannot be used in −

SQL statements

Built-in SQL functions (such as TO_CHAR)

PL/SQL functions invoked from SQL statements
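A minimal sketch of this limitation (the variable name is illustrative only): a BOOLEAN can be declared and tested inside PL/SQL, but it must be converted to a SQL-compatible type, such as a character string, before its value can be handed to SQL.

DECLARE
   is_active BOOLEAN := TRUE;   -- legal inside PL/SQL
BEGIN
   IF is_active THEN
      dbms_output.put_line('The flag is TRUE');
   END IF;

   -- A BOOLEAN cannot appear in an SQL statement or in TO_CHAR,
   -- so convert it to a string first when the value must leave PL/SQL:
   dbms_output.put_line(CASE WHEN is_active THEN 'TRUE' ELSE 'FALSE' END);
END;
/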

PL/SQL Datetime and Interval Types

The DATE datatype is used to store fixed-length datetimes, which include the

time of day in seconds since midnight. Valid dates range from January 1, 4712

BC to December 31, 9999 AD.

The default date format is set by the Oracle initialization parameter

NLS_DATE_FORMAT. For example, the default might be 'DD-MON-YY', which

includes a two-digit number for the day of the month, an abbreviation of the

month name, and the last two digits of the year. For example, 01-OCT-12.
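As a small sketch, the effect of this format model can be seen by formatting the current date with TO_CHAR:

-- Display today's date using the 'DD-MON-YY' format model
SELECT TO_CHAR(SYSDATE, 'DD-MON-YY') AS today
FROM   dual;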


Each DATE includes the century, year, month, day, hour, minute, and second.

The following table shows the valid values for each field −

YEAR
   Valid datetime values: -4712 to 9999 (excluding year 0)
   Valid interval values: any nonzero integer

MONTH
   Valid datetime values: 01 to 12
   Valid interval values: 0 to 11

DAY
   Valid datetime values: 01 to 31 (limited by the values of MONTH and YEAR, according to the rules of the calendar for the locale)
   Valid interval values: any nonzero integer

HOUR
   Valid datetime values: 00 to 23
   Valid interval values: 0 to 23

MINUTE
   Valid datetime values: 00 to 59
   Valid interval values: 0 to 59

SECOND
   Valid datetime values: 00 to 59.9(n), where 9(n) is the precision of time fractional seconds
   Valid interval values: 0 to 59.9(n), where 9(n) is the precision of interval fractional seconds

TIMEZONE_HOUR
   Valid datetime values: -12 to 14 (the range accommodates daylight savings time changes)
   Valid interval values: not applicable

TIMEZONE_MINUTE
   Valid datetime values: 00 to 59
   Valid interval values: not applicable

TIMEZONE_REGION
   Valid datetime values: found in the dynamic performance view V$TIMEZONE_NAMES
   Valid interval values: not applicable

TIMEZONE_ABBR
   Valid datetime values: found in the dynamic performance view V$TIMEZONE_NAMES
   Valid interval values: not applicable


PL/SQL Large Object (LOB) Data Types

Large Object (LOB) data types refer to large data items such as text, graphic

images, video clips, and sound waveforms. LOB data types allow efficient,

random, piecewise access to this data. Following are the predefined PL/SQL

LOB data types −

BFILE: Used to store large binary objects in operating system files outside the database. Size: system-dependent; cannot exceed 4 gigabytes (GB).

BLOB: Used to store large binary objects in the database. Size: 8 to 128 terabytes (TB).

CLOB: Used to store large blocks of character data in the database. Size: 8 to 128 TB.

NCLOB: Used to store large blocks of NCHAR data in the database. Size: 8 to 128 TB.

PL/SQL User-Defined Subtypes

A subtype is a subset of another data type, which is called its base type. A

subtype has the same valid operations as its base type, but only a subset of its

valid values.

PL/SQL predefines several subtypes in package STANDARD. For example,

PL/SQL predefines the subtypes CHARACTER and INTEGER as follows −


SUBTYPE CHARACTER IS CHAR;

SUBTYPE INTEGER IS NUMBER(38,0);

You can define and use your own subtypes. The following program illustrates

defining and using a user-defined subtype −

DECLARE

   SUBTYPE name IS char(20);

   SUBTYPE message IS varchar2(100);

   salutation name;

   greetings message;

BEGIN

   salutation := 'Reader ';

   greetings := 'Welcome to the World of PL/SQL';

   dbms_output.put_line('Hello ' || salutation || greetings);

END;


When the above code is executed at the SQL prompt, it produces the following

result −

Hello Reader Welcome to the World of PL/SQL

PL/SQL procedure successfully completed.

NULLs in PL/SQL

PL/SQL NULL values represent missing or unknown data and they are not an

integer, a character, or any other specific data type. Note that NULL is not the

same as an empty data string or the null character value '\0'. A null can be

assigned but it cannot be equated with anything, including itself.
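A minimal sketch of this behaviour (the variable name is illustrative only):

DECLARE
   bonus NUMBER;   -- never assigned, so its value is NULL
BEGIN
   IF bonus = NULL THEN                  -- never TRUE: NULL cannot be equated with anything
      dbms_output.put_line('equal to NULL');
   ELSIF bonus IS NULL THEN              -- the correct test for a missing value
      dbms_output.put_line('bonus IS NULL');
   END IF;
END;
/

Only the second branch prints, because the comparison bonus = NULL evaluates to NULL rather than TRUE.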

PHP

The PHP Hypertext Preprocessor (PHP) is a programming language that allows

web developers to create dynamic content that interacts with databases. PHP is

basically used for developing web-based software applications. This tutorial helps you build your base with PHP.

Why to Learn PHP?


PHP started out as a small open source project that evolved as more and more

people found out how useful it was. Rasmus Lerdorf unleashed the first version

of PHP way back in 1994.

PHP is a must for students and working professionals who want to become great software engineers, especially when they are working in the web development domain. Some of the key advantages of learning PHP are listed below:

PHP is a recursive acronym for "PHP: Hypertext Preprocessor".

PHP is a server side scripting language that is embedded in HTML. It is used to

manage dynamic content, databases, session tracking, even build entire e-

commerce sites.

It is integrated with a number of popular databases, including MySQL,

PostgreSQL, Oracle, Sybase, Informix, and Microsoft SQL Server.

PHP is pleasingly zippy in its execution, especially when compiled as an Apache

module on the Unix side. The MySQL server, once started, executes even very

complex queries with huge result sets in record-setting time.

PHP supports a large number of major protocols such as POP3, IMAP, and

LDAP. PHP4 added support for Java and distributed object architectures (COM

and CORBA), making n-tier development a possibility for the first time.

PHP is forgiving: PHP language tries to be as forgiving as possible.


PHP Syntax is C-Like.

Characteristics of PHP

Five important characteristics make PHP's practical nature possible −

Simplicity

Efficiency

Security

Flexibility

Familiarity

PHP functions are similar to those in other programming languages. A function is a piece of code which takes one or more inputs in the form of parameters, does some processing, and returns a value.

You already have seen many functions like fopen() and fread() etc. They are

built-in functions but PHP gives you the option to create your own functions as

well.

There are two parts which should be clear to you −

Creating a PHP Function

Calling a PHP Function


In fact, you hardly need to create your own PHP functions because there are already more than 1,000 built-in library functions covering different areas; you just need to call them according to your requirements.

Please refer to the PHP Function Reference for a complete set of useful functions.

Creating PHP Function

It's very easy to create your own PHP function. Suppose you want to create a PHP function which will simply write a message to the browser when you call it. Such an example would create a function called writeMessage() and then call it just after creating it.

PHP Functions with Parameters

PHP gives you the option to pass parameters to a function. You can pass as many parameters as you like. These parameters work like variables inside your function. A typical example takes two integer parameters, adds them together, and then prints the result.

Passing Arguments by Reference

It is possible to pass arguments to functions by reference. This means that a

reference to the variable is manipulated by the function rather than a copy of the

variable's value.

Any changes made to an argument in these cases will change the value of the original variable. You can pass an argument by reference by adding an ampersand to the variable name in the function definition.

5. Understand common uses of database triggers and stored

procedures

DATABASE TRIGGERS

Because a trigger resides in the database and anyone who has the required privilege can use it, a trigger lets you write a set of SQL statements that multiple applications can use. It lets you avoid redundant code when multiple programs need to perform the same database operation.

You can use triggers to perform the following actions, as well as others that are

not found in this list:

Create an audit trail of activity in the database. For example, you can track

updates to the orders table by updating corroborating information to an audit

table.

Implement a business rule. For example, you can determine when an order

exceeds a customer's credit limit and display a message to that effect.

Derive additional data that is not available within a table or within the database.

For example, when an update occurs to the quantity column of the items table,

you can calculate the corresponding adjustment to the total_price column.


Enforce referential integrity. When you delete a customer, for example, you can

use a trigger to delete corresponding rows that have the same customer number

in the orders table.

Benefits of Triggers

Following are the benefits of triggers.

Generating some derived column values automatically

Enforcing referential integrity

Event logging and storing information on table access

Auditing

Synchronous replication of tables

Imposing security authorizations

Preventing invalid transactions

Types of Triggers in Oracle

Triggers can be classified based on the following parameters.

Classification based on the timing

BEFORE Trigger: It fires before the specified event has occurred.

AFTER Trigger: It fires after the specified event has occurred.

INSTEAD OF Trigger: A special type; you will learn more about it in the topics that follow. (only for DML)



Classification based on the level

STATEMENT level Trigger: It fires one time for the specified event statement.

ROW level Trigger: It fires for each record that got affected in the specified event.

(only for DML)

Classification based on the Event

DML Trigger: It fires when the DML event is specified

(INSERT/UPDATE/DELETE)

DDL Trigger: It fires when the DDL event is specified (CREATE/ALTER)

DATABASE Trigger: It fires when the database event is specified

(LOGON/LOGOFF/STARTUP/SHUTDOWN)
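The general shape of a trigger definition, reconstructed here as a sketch because the original syntax listing is not reproduced in this text, is roughly:

CREATE [OR REPLACE] TRIGGER trigger_name
{BEFORE | AFTER | INSTEAD OF}
{INSERT [OR] UPDATE [OR] DELETE}
ON table_or_view_name
[FOR EACH ROW]
[WHEN (condition)]
DECLARE
   -- optional declarations
BEGIN
   -- executable statements
EXCEPTION
   -- optional exception handlers
END;
/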

Syntax Explanation:

The above syntax shows the different optional statements that are present in

trigger creation.

BEFORE/ AFTER will specify the event timings.

INSERT/UPDATE/LOGON/CREATE/etc. will specify the event for which the

trigger needs to be fired.


ON clause will specify on which object the above-mentioned event is valid. For

example, this will be the table name on which the DML event may occur in the

case of DML Trigger.

Command “FOR EACH ROW” will specify the ROW level trigger.

WHEN clause will specify the additional condition in which the trigger needs to

fire.

The declaration part, execution part, and exception handling part are the same as those of other PL/SQL blocks. The declaration part and exception handling part are optional.

:NEW and :OLD Clause

In a row-level trigger, the trigger fires for each affected row, and sometimes it is required to know the value before and after the DML statement. Oracle provides two clauses in the row-level trigger to hold these values. We can use these clauses to refer to the old and new values inside the trigger body.

:NEW – It holds a new value for the columns of the base table/view during the

trigger execution

:OLD – It holds old value of the columns of the base table/view during the trigger

execution


These clauses should be used based on the DML event: for an INSERT only :NEW is valid, for a DELETE only :OLD is valid, and for an UPDATE both :NEW and :OLD are valid, as the example below shows.
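A minimal sketch of a row-level trigger that uses both clauses (the emp table and its emp_id and salary columns are assumed only for illustration):

CREATE OR REPLACE TRIGGER trg_emp_salary_audit
BEFORE UPDATE OF salary ON emp
FOR EACH ROW
WHEN (NEW.salary <> OLD.salary)          -- no colon is used inside the WHEN clause
BEGIN
   dbms_output.put_line('Employee ' || :OLD.emp_id ||
                        ': salary changed from ' || :OLD.salary ||
                        ' to ' || :NEW.salary);
END;
/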

INSTEAD OF Trigger

"INSTEAD OF trigger" is a special type of trigger. It is used only as a DML trigger, when a DML event is going to occur on a complex view. Consider an example in which a view is made from 3 base tables. When any DML event is issued over this view, it would normally fail because the data is taken from 3 different tables. This is where the INSTEAD OF trigger is used: it modifies the base tables directly instead of modifying the view for the given event.

Example 1: In this example, we are going to create a complex view from two base tables:

Table_1 is the emp table and

Table_2 is the department (dept) table.

Then we are going to see how the INSTEAD OF trigger is used to handle an UPDATE of the location detail issued against this complex view. We are also going to see how :NEW and :OLD are useful in triggers.

Step 1: Creating table ’emp’ and ‘dept’ with appropriate columns

Step 2: Populating the table with sample values

Step 3: Creating view for the above created table


Step 4: Update of view before the instead-of trigger

Step 5: Creation of the instead-of trigger (see the condensed sketch below)

Step 6: Update of view after instead-of trigger

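A condensed sketch of steps 1, 3, and 5, using simplified and purely illustrative table and column definitions, might look like this:

-- Step 1 (condensed): two illustrative base tables
CREATE TABLE dept (
   dept_id   NUMBER PRIMARY KEY,
   dept_name VARCHAR2(30),
   location  VARCHAR2(30)
);

CREATE TABLE emp (
   emp_id    NUMBER PRIMARY KEY,
   emp_name  VARCHAR2(30),
   dept_id   NUMBER REFERENCES dept (dept_id)
);

-- Step 3 (condensed): a complex view joining the two tables
CREATE OR REPLACE VIEW emp_dept_view AS
SELECT e.emp_id, e.emp_name, d.dept_name, d.location
FROM   emp e JOIN dept d ON e.dept_id = d.dept_id;

-- Step 5 (condensed): an INSTEAD OF trigger that redirects an UPDATE of the
-- view's location column to the underlying dept table
CREATE OR REPLACE TRIGGER trg_emp_dept_view_upd
INSTEAD OF UPDATE ON emp_dept_view
FOR EACH ROW
BEGIN
   UPDATE dept
   SET    location = :NEW.location
   WHERE  dept_name = :OLD.dept_name;
END;
/

Once the trigger exists, an UPDATE against emp_dept_view succeeds because the trigger modifies the dept base table directly, which is the behaviour described above; without it, the update on the join view would fail.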

Database Stored Procedure

Database-stored procedures are sets of pre-compiled SQL statements created in the server, called and executed by database applications. It is very simple, and the same result can be achieved by an SQL query.
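As a minimal sketch in Oracle PL/SQL (the procedure, table, and column names are assumed purely for illustration):

-- A pre-compiled procedure stored on the server
CREATE OR REPLACE PROCEDURE raise_salary (
   p_emp_id IN NUMBER,
   p_amount IN NUMBER
) AS
BEGIN
   UPDATE emp
   SET    salary = salary + p_amount
   WHERE  emp_id = p_emp_id;
END raise_salary;
/

-- Any application (or SQL*Plus session) can then execute it simply by name:
BEGIN
   raise_salary(101, 500);
END;
/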

Stored Procedures Advantages

Stored procedures increase the performance of an application. Once created, a stored procedure is compiled and stored in the database catalog. It runs faster than uncompiled SQL commands which are sent from the application.

Stored procedures reduce the traffic between the application and the database server because, instead of sending multiple long uncompiled SQL statements, the application only has to send the stored procedure's name and get the result back.

Stored procedures are reusable and transparent to any application which wants to use them. A stored procedure exposes the database interface to all applications, so developers don't have to program functions which are already supported by a stored procedure in all programs.


Stored procedures are secure. The database administrator can grant rights that control which applications can access which stored procedures in the database catalog, without granting any permission on the underlying database tables.

Stored Procedures Disadvantages

Stored procedures put a high load on the database server in both memory and processors. Instead of being focused on storing and retrieving data, you could be asking the database server to perform a number of logical operations or complex business logic, which is not its role.

Stored procedures mainly contain declarative SQL, so it is very difficult to write a procedure with complex business logic, as you could in application-layer languages such as Java, C#, or C++.

You cannot debug stored procedures in most RDBMSs, including MySQL. There are some workarounds for this problem, but they are still not good enough.

Writing and maintaining stored procedures usually requires a specialized skill set that not all developers possess. This introduces problems in both the application development and maintenance phases.


CHAPTER 7:

DATABASE APPLICATION

DEVELOPMENT

Researched and presented by:

Celino, Ralph Stephen


Rodriguez, Zyra Mae M.


Client-server System

A client-server system is a computing system that is composed of two logical parts: a server, which provides services, and a client, which requests them. The two parts can run on separate machines on a network, allowing users to access powerful server resources from their personal computers. McGraw-Hill (2003)

THREE COMPONENTS OF CLIENT/SERVER SYSTEMS

 Data presentation services

It is the input/output (I/O), or presentation logic, component. This

component is responsible for formatting and presenting data on the user’s screen

or other output device and for managing user input from a keyboard or other

input device. Presentation logic often resides on the client and is the mechanism

with which the user interacts with the system. 

 Input: Keyboard, Mouse

 Output: Monitor, Printer

 Graphical User Interface (GUI): is an interface through which a

user interacts with electronic devices such as computers and

smartphones through the use of icons, menus and other visual

indicators or representations (graphics).


 Processing services

This handles data processing logic, business rules logic, and data

management logic. Processing logic resides on both the client and servers.

 Input/Output processing: includes such activities as data

validation and identification of processing errors.

 Business rules logic: rules that have not been coded at the DBMS level may be coded in the processing component.

 Data management logic: identifies the data necessary for

processing the transaction or query.

 Storage services

The component responsible for data storage and retrieval from the

physical storage devices associated with the application. Storage logic usually

resides on the database server, close to the physical location of the data.

Activities of a DBMS occur in the storage logic component.

 Data storage: is the recording (storing) of information (data).

 Data retrieval: is the process of identifying and extracting data

from a database, based on a query provided by the user or

application.

 Database Management System (DBMS): are software systems

used to store, retrieve, and run queries on data. A DBMS serves


as an interface between an end-user and a database, allowing

users to create, read, update, and delete data in the database.

 DBMS activities include data dictionary management, data storage management, data security management, and data integrity management.

In the fat client, the application processing occurs entirely on the client, whereas in the thin client, this processing occurs primarily on the server. In the distributed example, or what we call the hybrid client, application processing is partitioned between the client and the server.

According to Spacey, J. (n.d.-b), a thin client is software that is primarily designed to communicate with a server. Its features are produced by servers such as a cloud platform. A thick client is software that implements its own features. It may connect to servers, but it remains mostly functional when it is disconnected.

TWO-TIER AND THREE-TIER ARCHITECTURE

 TWO TIER ARCHITECTURE

 It is a client-server application.

 It was built in the 1980s and can support up to 100 users.

 Two-tier architecture is generally divided into two parts: Client

application and Database.

 The client in a two-tier architecture application has the code written

for saving data in the database. 

 The client sends a request to the server, which then processes the request and sends back the data.

 The client handles both the presentation layer (application interface)

and application layer (logical operations), while the server system

handles the database. 


Characteristics of Two-tier Architecture 

These include the advantages and disadvantages of two-tier architecture:

 No intermediate application present: The client directly interacts with the server without the presence of any intermediate application.

 Use of Application Programming Interface (API): The client application communicates with the data layer through a database bridge Application Programming Interface (API). The most common APIs are Open Database Connectivity (ODBC) and ADO.NET for the Microsoft platform (VB.NET and C#), and Java Database Connectivity (JDBC) for use with Java programs.

 Installation of database driver: The database driver is installed on each computer that runs the client application. The driver must be reinstalled on every computer if the database changes, which increases deployment cost.

 Database connection: Each client establishes a separate database connection.

 High network traffic: caused by an increase in the number of data-transfer trips across the physical boundaries of the network. Bhuvana (2006, August 24)


Applications of two-tier architecture

 Software installed on a client machine

THREE-TIER ARCHITECTURE

 It is a web-based application.

 Introduced in the 1990s (proposed in 1995), it accommodates hundreds of users.

 Three-tier Architecture is generally divided into three parts:

Presentation layer (Client tier), Application layer (Business tier) and

Database layer (Data tier).


 In the three-tier architecture, the application logic or process resides in the middle tier; it is separated from the data and the user interface.

Advantages of three-tier architecture

 Maintainability: Because each tier is independent of the other tiers,

updates or changes can be carried out without affecting the

application as a whole.

 Scalability: Because tiers are based on the deployment of layers,

scaling out an application is reasonably straightforward.

 Flexibility: Because each tier can be managed or scaled

independently, flexibility is increased.

 Faster development: Because of the division of work, the web designer does the presentation, the software engineer does the logic, and the DB admin does the data model. Benitamayekar (n.d.)

 Better match of systems to business needs: New modules can be

built to support specific business needs rather than building more

general, complete applications.

 Improved customer service: Multiple interfaces on different clients

can access the same business process.

Disadvantages of three-tier architecture


 High installation cost.

 The structure is more complex compared to two-tier architecture.

Applications

 E-commerce Websites

 Database related Websites

CONNECT DATABASES IN A TWO-TIER APPLICATION

 VB.NET

The VB.NET code shown in figure 1 below uses the ADO.NET data

access framework and .NET data providers to connect to the database.

The .NET Framework has different data providers (or database drivers)

that allow you to connect a program written in a .NET programming

language to a database. Common data providers available in the

framework are for SQL Server and Oracle.

Figure 1-a shows the VB.NET code needed to create a simple form that allows the user to input a name, department number, and student ID.

Figure 1-a: Setup form for receiving user input.


Figure 1-b shows the detailed steps to connect to a database and

issue an INSERT query. By reading the explanations presented in the text

boxes in the figure, you can see how the generic steps for accessing a

database described in the previous section are implemented in the context

of a VB.NET program.

Figure 1-b: Connecting to a database and issuing an INSERT query.

Figure 1-c shows how you would access the database and process the results for a SELECT query. The main difference is that you use the ExecuteReader() method instead of the ExecuteNonQuery() method. The latter is used for INSERT, UPDATE, and DELETE queries. The table that results from running a SELECT query is


captured inside an OracleDataReader object. You can access each row in

the result by traversing the object, one row at a time. Each column in the

object can be accessed by a Get method and by referring to the column’s

position in the query result (or by name). ADO.NET provides two main

choices with respect to handling the result of the query: DataReader (e.g.,

OracleDataReader in Figure 1-c) and DataSet. The primary difference

between the two options is that the first limits us to looping through the

result of a query one row at a time. This can be very cumbersome if the

result has a large number of rows. The DataSet object provides a

disconnected snapshot of the database that we can then manipulate in our

program using the features available in the programming language. Later

in this chapter, we will see how .NET data controls (which use DataSet

objects) can provide a cleaner and easier way to manipulate data in a

program.

Figure 1-c : Sample code snippet for using a select query.

 JAVA

This Java application is actually connecting to the same database as the VB.NET application in Figure 1. Its purpose is to retrieve and print the names of all students in the Student table.


In this example, the Java program is using the JDBC API and an

Oracle thin driver to access the Oracle database. Notice that unlike the

INSERT query shown in the VB.NET example, running an SQL SELECT

query requires us to capture the data inside an object that can

appropriately handle the tabular data. JDBC provides two key

mechanisms for this: the ResultSet and RowSet objects. The difference

between these two is somewhat similar to the difference between the

DataReader and DataSet objects described in the VB.NET example. 

The ResultSet object has a mechanism, called the cursor, that

points to its current row of data. When the ResultSet object is first

initialized, the cursor is positioned before the first row. This is why we

need to first call the next() method before retrieving data. The ResultSet

object is used to loop through and process each row of data and retrieve

the column values that we want to access. In this case, we access the

value in the name column using the rec.getString method, which is a part

of the JDBC API. For each of the common database types, there is a

corresponding get and set method that allows for retrieval and storage of

data in the database. It is important to note that while the ResultSet object

maintains an active connection to the database, depending on the size of

the table, the entire table (i.e., the result of the query) may or may not

actually be in memory on the client machine. How and when data are

transferred between the database and client is handled by the Oracle

driver. By default, a ResultSet object is read-only and can be traversed


only in one direction (forward). However, advanced versions of the

ResultSet object allow scrolling in both directions and can be updateable

as well.

Figure 2-a: Database access from a Java Program.

KEY COMPONENTS OF A WEB APPLICATION

 Database Server

This server hosts the storage logic for the application and hosts the

DBMS. You have read about many of them, including Oracle, Microsoft SQL

Server, Informix, Sybase, DB2, Microsoft Access, and MySQL. The DBMS may

reside either on a separate machine or on the same machine as the Web server.

It can be configured to provide data access for authorized users only. This

type of server keeps the data in a central location that can be regularly backed

up. It also allows users and applications to centrally access the data across the

network. A large number of the databases in your organization can be kept on one server or a group of servers that are specifically configured to protect data and service client requests.

A database server is a machine running database software dedicated to

providing database services. It is a crucial component in the client-server

computing environment where it provides business-critical information requested

by the client systems.

A database server consists of hardware and software that run a database.

 The software part of a database server, or the database instance, is

the back-end database application. The application represents a set

of memory structures and background processes accessing a set of

database files.

 The hardware part of a database server is the server system used for

database storage and retrieval.

 Web Server

The Web server provides the basic functionality needed to receive and

respond to requests from browser clients. These requests use HTTP or HTTPS

as a protocol.

The main job of a web server is to display website content through storing,

processing and delivering webpages to users. Besides HTTP, web servers also

support SMTP (Simple Mail Transfer Protocol) and FTP (File Transfer Protocol),

used for email, file transfer and storage. Gillis, A. S. (2020)


 Application Server

This software provides the building blocks for creating dynamic Web sites

and Web-based applications. Examples include the .NET Framework from

Microsoft; Java Platform, Enterprise Edition (Java EE); and ColdFusion. Also,

while technically not considered an application server platform, software that

enables you to write applications in languages such as PHP, Python, and Perl

also belong to this category.

An application server is a program that resides on the server side, and it's a server program providing the business logic behind any application. This server can be a part of the network or of a distributed network.

process any requests by connecting to the Database and returning the

information back to web servers. Pedamkar, P. (2021)

A web browser is software that allows you to find and view websites on the Internet. Microsoft's Internet Explorer, Mozilla's Firefox, Apple's Safari, Google's Chrome, and Opera are examples of web browsers. Sciencedirect (n.d.)

Information flow:

The database server stores the Database Management System (DBMS)

and the database itself. Its main role is to receive requests from client machines,

search for the required data, and pass back the results. Clients access a

database server through a front-end application that displays the requested data

on the client machine, or through a back-end application that runs on the server


and manages the database. In a master-slave model, the database master

server is the primary data location. Database slave servers are replicas of the

master server that act as proxies. Marijan, B. (2021)

When a web browser, like Google Chrome or Firefox, needs a file that's

hosted on a web server, the browser will request the file by HTTP. When the

request is received by the web server, the HTTP server will accept the request,

find the content and send it back to the browser through HTTP. 

Application servers are basically used in a web-based application

that has 3-tier architecture. The position at which the application server fits

in is described below:

 Tier 1 – This is a GUI interface that resides at the client end and is

usually a thin client (e.g., browser)

 Tier 2 – This is called the middle tier, which consists of the Application

Server.

 Tier 3 – This is the 3rd tier, which consists of backend servers, e.g., a database server.

As we can see, application servers usually communicate with the web server to serve any request that is coming from clients. The client first makes a request, which goes to the web server. The web server then sends it to the middle tier, i.e., the application server, which in turn gets the information from the 3rd tier (e.g., the database server) and sends it back to the web server. The web server then sends the required information back to the client. Different approaches are used to process requests through web servers, some of them being JSP, PHP, and ASP.NET.

Web vs. Application server

Web Server

 Deliver static content

 Content is delivered using the

HTTP protocol only.  

 Serves only web-based

applications. 

Application Server

 Delivers dynamic content

 Provides business logic to application programs using several protocols

(including HTTP). 

 Can serve web and enterprise-based applications. Edpresso Team.

(2021)


General overview of the information flow in a Web application. A user

submitting a Web page request is unaware of whether the request being

submitted is returning a static Web page or a Web page whose content is a

mixture of static information and dynamic information retrieved from the

database. The data returned from the Web server is always in a format that can

be rendered by the browser (i.e., HTML or XML). As shown in Figure, if the Web

server determines that the request from the client can be satisfied without

passing the request on to the application server, it will process the request and

then return the appropriately formatted information to the client machine. 

CONNECT TO DATABASES IN A THREE-TIER WEB APPLICATION

 JSP - (Java Server Pages)

As with a normal page, your browser sends an HTTP request to the web

server.

 The web server recognizes that the HTTP request is for a JSP page

and forwards it to a JSP engine. This is done by using the URL or JSP

page which ends with .jsp instead of .html.

 A part of the web server called the servlet engine loads the Servlet

class and executes it. During execution, the servlet produces an

output in HTML format. The output is further passed on to the web

server by the servlet engine inside an HTTP response.

 The web server forwards the HTTP response to your browser in terms

of static HTML content.


 Finally, the web browser handles the dynamically-generated HTML

page inside the HTTP response exactly as if it were a static page. JSP

- Architecture. (n.d.)

In a PHP three-tier structure:

The biggest difference between a Java web server and PHP is that

PHP doesn't have its own built-in web server. PHP itself is basically one

executable which reads in a source code file of PHP code and

interprets/executes the commands written in that file.

PHP runs on a third-party web server which handles any incoming

requests and invokes the PHP interpreter with the given requested PHP

source code file as argument, then delivers any output of that process

back to the HTTP client. 

API is the acronym for Application Programming Interface, which is

a software intermediary that allows two applications to talk to each other.

When you use an application on your mobile phone, the application

connects to the Internet and sends data to a server. The server then

retrieves that data, interprets it, performs the necessary actions and sends

it back to your phone. The application then interprets that data and

presents you with the information you wanted in a readable way. MySQL is an example of a database that can be hosted in the cloud.


 ASP.NET (Active Server Pages .NET)

ASP.NET hides the complex processes of data access and

provides much higher level of classes and objects through which data is

accessed easily. These classes hide all complex coding for connection,

data retrieving, data querying, and data manipulation. ASP.NET -

Database Access. (n.d.). 

1. Create a web site and add a SqlDataSourceControl on the web

form.

2. Click on the Configure Data Source option.

3. Click on the New Connection button to establish connection with

a database.

4. Once the connection is set up, you may save it for further use.

At the next step, you are asked to configure the select

statement:

5. Select the columns and click next to complete the steps.

Observe the WHERE, ORDER BY, and the Advanced buttons.

These buttons allow you to provide the where clause, order by

clause, and specify the insert, update, and delete commands of

SQL respectively. This way, you can manipulate the data.

6. Add a GridView control on the form. Choose the data source

and format the control using AutoFormat option.

7. After this the formatted GridView control displays the column

headings, and the application is ready to execute.


8. Finally execute the application.

THE PURPOSE OF XML AND ITS USES

XML focuses on the transport of data without managing the appearance or

presentation of the output. XML addresses the issue of representing data in a

structure and format that can both be exchanged over the Internet and be

interpreted by different components (i.e., browsers, Web servers, application

servers). XML Introduction. (n.d.)

Most XML applications will work as expected even if new data is added (or removed). Imagine an application designed to display a version of an XML document with <to>, <from>, <heading>, and <body> elements. Then imagine a version with added <date> and <hour> elements, and a removed <heading>.

 The way XML is constructed, even older versions of the application can still work with the new documents:

Example (what the application displays):

__

Old Version

Note
To: Tove
From: Jani
Reminder
Don't forget me this weekend!

New Version

Note
To: Tove
From: Jani
Date: 2015-09-01 08:30
Don't forget me this weekend!

__
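The underlying XML documents implied by the tags described above would look roughly like this (a reconstruction for illustration, since the markup itself is not shown in the text):

<!-- Old version -->
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

<!-- New version: <date> and <hour> added, <heading> removed -->
<note>
  <date>2015-09-01</date>
  <hour>08:30</hour>
  <to>Tove</to>
  <from>Jani</from>
  <body>Don't forget me this weekend!</body>
</note>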

The XML standard is a flexible way to create information formats and

electronically share structured data via the public Internet, as well as via

corporate networks. 

The tags in the example above (like <to> and <from>) are not defined in

any XML standard. These tags are "invented" by the author of the XML

document. 

For example, XML has offshoots (subclasses) like XBRL (eXtensible Business Reporting Language). This type of XML acts as a standard for naming the accounts of a business; it helps businesses copy, transfer, or communicate data with their counterparts, such as suppliers.


With XML, being a standard of data exchange, data can be available to all

kinds of "reading machines" like people, computers, voice machines, news feeds,

etc., thus making exchange much simpler. 

 XML stands for eXtensible Markup Language 

 XML is a markup language much like HTML 

 XML was designed to store and transport data

 XML was designed to be self-descriptive

XQUERY USE TO QUERY XML DOCUMENTS

 What is XQuery?

XQuery is a technology from the World Wide Web Consortium (W3C)

that's designed to query collections of XML data -- not just XML files, but

anything that can appear as XML, including relational databases. The word "query" entered English in the 16th century as a noun, from the Latin quaere 'ask, seek'. XQuery: Specifications, Articles, Mailing List, and Vendors. (n.d.)

XQuery can be used to:

 Extract information to use in a Web Service 

 Generate summary reports 

 Transform XML data to XHTML 

 Search Web documents for relevant information


XQuery is compatible with several W3C standards, such as XML,

Namespaces, XSLT, XPath, and XML Schema. 

CHAPTER 8:


DATA WAREHOUSING

Researched and presented by:

Corto, Michelle T.
Manalo, Cklint Louisse M.
Romero, Agnes M.

DATA WAREHOUSING

A data warehouse is a database designed to enable business intelligence

activities: it exists to help users understand and enhance their organization's


performance. It is designed for query and analysis rather than for transaction

processing, and usually contains historical data derived from transaction data,

but can include data from other sources. Data warehouses separate analysis

workload from transaction workload and enable an organization to consolidate

data from several sources. This helps in:

 Maintaining historical records and Analyzing the data to gain a better

understanding of the business and to improve the business

A data warehouse is a relational database management system (RDBMS) constructed to meet the requirements of decision support rather than transaction processing. It can be loosely described as any centralized data repository which can be queried for business benefit. It is a database that stores information oriented to satisfying decision-making requests, and a group of decision support technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions. Data warehousing thus supports architectures and tools for business executives to systematically organize, understand, and use their information to make strategic decisions.

In addition to a relational database, a data warehouse environment can

include an extraction, transportation, transformation, and loading (ETL) solution,

statistical analysis, reporting, data mining capabilities, client analysis tools, and

other applications that manage the process of gathering data, transforming it into

useful, actionable information, and delivering it to business users.


To achieve the goal of enhanced business intelligence, the data

warehouse works with data collected from multiple sources. The source data

may come from internally developed systems, purchased applications, third-party

data syndicators and other sources. It may involve transactions, production,

marketing, human resources and more. In today's world of big data, the data may

be many billions of individual clicks on web sites or the massive data streams

from sensors built into complex machinery.

A data warehouse usually stores many months or years of data to support

historical analysis. The data in a data warehouse is typically loaded through an

extraction, transformation, and loading (ETL) process from multiple data sources.

Modern data warehouses are moving toward an extract, load, transform

(ELT) architecture in which all or most data transformation is performed on the

database that hosts the data warehouse. It is important to note that defining the

ETL process is a very large part of the design effort of a data warehouse.

Similarly, the speed and reliability of ETL operations are the foundation of the

data warehouse once it is up and running.

Users of the data warehouse perform data analyses that are often time-

related. Examples include consolidation of last year's sales figures, inventory

analysis, and profit by product and by customer. But time-focused or not, users

want to "slice and dice" their data however they see fit and a well-designed data

warehouse will be flexible enough to meet those demands. Users will sometimes


need highly aggregated data, and other times they will need to drill down to

details. More sophisticated analyses include trend analyses and data mining,

which use existing data to forecast trends or predict futures. The data warehouse

acts as the underlying engine used by middleware business intelligence

environments that serve reports, dashboards and other interfaces to end users.

BASIC CONCEPTS OF DATA WAREHOUSING 

A data warehouse is a subject-oriented, integrated, time-variant, non-

volatile collection of data used in support of management decision-making

processes and business intelligence (Inmon and Hackathorn, 1994). The

meaning of each of the key terms in this definition follows: 

 Subject-Oriented 

A data warehouse is subject oriented as it offers information regarding a

theme instead of companies’ ongoing operations. These subjects can be sales,

marketing, distributions, etc. 

A data warehouse never focuses on the ongoing operations. Instead, it puts

emphasis on modeling and analysis of data for decision making. It also provides

a simple and concise view around the specific subject by excluding data which is

not helpful to support the decision process.

 Integrated 

In Data Warehouse, integration means the establishment of a common

unit of measure for all similar data from dissimilar databases. The data also


needs to be stored in the Data Warehouse in a common and universally

acceptable manner. 

A data warehouse is developed by integrating data from varied sources

like a mainframe, relational databases, flat files, etc. Moreover, it must keep

consistent naming conventions, format, and coding. 

This integration helps in effective analysis of data. Consistency in naming

conventions, attribute measures, encoding structure etc. have to be ensured.

Consider the following example of three different applications, labeled A, B, and C. The information stored in these applications includes Gender, Date, and Balance; however, each application stores its data in a different way.

 In Application A, the gender field stores logical values like M or F.

 In Application B, the gender field is stored as a numerical value.

 In Application C, the gender field is stored in the form of a character value.

 The same is true of the date and balance fields.



However, after the transformation and cleaning process, all of this data is stored in a common format in the data warehouse.
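
As a small sketch of what that transformation step might look like (hypothetical SQL, with made-up table and column names, not from the text), the load routine for Application A could standardize its gender codes while copying rows into the warehouse:

    -- Hypothetical ETL step: convert Application A's M/F codes into one common format
    INSERT INTO dw_customer (customer_id, gender)
    SELECT customer_id,
           CASE gender_code
                WHEN 'M' THEN 'Male'
                WHEN 'F' THEN 'Female'
                ELSE 'Unknown'
           END
    FROM   application_a_customer;

Similar statements, each with its own CASE logic, would map the numeric codes of Application B and the character codes of Application C to the same common values.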

 Time-Variant 

The time horizon for data warehouses is quite extensive compared with

operational systems. The data collected in a data warehouse is recognized with a

particular period and offers information from the historical point of view. It

contains an element of time, explicitly or implicitly. 

One place where data warehouse data display time variance is in the structure of the record key. Every primary key contained within the DW should have, either implicitly or explicitly, an element of time, such as the day, week, or month.

Another aspect of time variance is that once data is inserted in the warehouse, it cannot be updated or changed. All of the historical data, along with the recent data in the data warehouse, play a crucial role in retrieving data for any period of time. If the business wants reports, graphs, and so on for comparison with previous years and for trend analysis, then all of the older data (6 months old, 1 year old, or even older) are required.

 Non-volatile 

The data residing in the data warehouse is permanent, as the name implies. Non-volatile also means that the data in the data warehouse cannot be erased or deleted when new data is inserted into it. In the data warehouse, data is read-only and can only be refreshed at a particular interval of time. Operations such as delete, update, and insert that are done over data in a software application are absent in the data warehouse environment. There are only two types of data operations that can be done in the data warehouse:

 Data Loading 

 Data Access

A data warehouse is not just a consolidation of all the operational

databases in an organization. Because of its focus on business intelligence,

external data, and time-variant data, a data warehouse is a unique kind of

database. Most data warehouses are relational databases designed in a way

optimized for decision support, not operational data processing.

Data warehousing is the process whereby organizations create and

maintain data warehouses and extract meaning from and help inform decision

making through the use of data in the data warehouses. Successful data

warehousing requires following proven data warehousing practices, sound


project management, strong organizational commitment, as well as making the

right technology decisions.

The process of creating data warehouses to store a large amount of data

is named Data Warehousing. Data Warehousing helps to improve the speed and

efficiency of accessing different data sets and makes it easier for company

decision-makers to obtain insights that will help the business and promote

marketing tactics that set them apart from their competitors. We can say that it is a blend of technologies and components which aids the strategic use of data and information. The main goal of data warehousing is to create a consolidated store of historical data that can be retrieved and analyzed to supply helpful insight into

the organization’s operations.

Types of Data Warehousing

There are mainly three types of data warehousing, which are as follows: 

 Enterprise Data Warehouse: Enterprise data warehouse is a centralized

warehouse that offers decision-making support to different departments

across an enterprise. It provides a unified approach for organizing as well as

representing data. With this warehouse at your end, you gain the ability to

classify the data as per the subject and grant the level of access to different

departments accordingly. 

 Operational Data Store: Popularly known as ODS, Operational Data

Store is used when an organization’s reporting needs are not satisfied by a

data warehouse or an OLTP system. In ODS, a data warehouse can be


refreshed in real-time, making it best for routine activities like storing

employees’ records. 

 Data Mart: As part of a data warehouse, Data Mart is particularly

designed for a specific business line like finance, accounts, sales, purchases,

or inventory. The warehouse allows you to collect data directly from the

sources.

HISTORY OF DATA WAREHOUSING

The key discovery that triggered the development of data warehousing

was the recognition of the fundamental differences between operational systems

(sometimes called systems of record because their role is to keep the official,

legal record of the organization) and informational systems. The need to

warehouse data evolved as computer systems became more complex and

needed to handle increasing amounts of information.

Here are some key events in evolution of Data Warehouse- 

 1960- Dartmouth and General Mills in a joint research project, develop the

terms dimensions and facts. 

 1970- AC Nielsen and IRI introduced dimensional data marts for retail sales.

 1983- Teradata Corporation introduces a database management system which is specifically designed for decision support.

 Data warehousing started in the late 1980s when IBM workers Paul Murphy and Barry Devlin developed the Business Data Warehouse.



 1988- Devlin and Murphy published the first article describing the

architecture of a data warehouse.

 1992- Inmon published the first book describing data warehousing, and he

has subsequently become one of the most prolific authors in this field.

 However, the real concept was given by Bill Inmon, who is considered the father of the data warehouse. He wrote about a variety of topics on building, using, and maintaining the warehouse and the Corporate Information Factory.

In essence, the data warehousing idea was planned to support an

architectural model for the flow of information from the operational system to

decisional support environments. The concept attempts to address the various

problems associated with the flow, mainly the high costs associated with it.

In the absence of data warehousing architecture, a vast amount of space

was required to support multiple decision support environments. In large

corporations, it was common for various decision support environments to

operate independently.

THE NEED FOR DATA WAREHOUSING 

Data Warehousing is an increasingly essential tool for business intelligence. It allows organizations to make quality business decisions. The data warehouse provides benefits by improving data analytics; it also helps the organization gain considerable revenue and the strength to compete more strategically in the market. By efficiently providing systematic, contextual data to an organization's business intelligence tools, the data warehouse can help uncover more practical business strategies.

Two major factors drive the need for data warehousing in most organizations

today: 

1. A business requires an integrated, company-wide view of high-quality

information. 

2. The information systems department must separate informational from

operational systems to improve performance dramatically in managing company

data.

Need for a Company-Wide View 

Data in operational systems are typically fragmented and inconsistent, so-

called silos, or islands, of data. They are also generally distributed on a variety of

incompatible hardware and software platforms. For example, one source of

customer data may be located on a UNIX-based server running an Oracle

DBMS, whereas another may be located on a SAP system. Yet, for decision-

making purposes, it is often necessary to provide a single, corporate view of that

information. 

To understand the difficulty of deriving a single corporate view, look at the

simple example shown in Figure 1. This figure shows three tables from three

separate systems of record, each containing similar student data. The STUDENT

DATA table is from the class registration system, the STUDENT EMPLOYEE

table is from the personnel system, and the STUDENT HEALTH table is from a

health center system. Each table contains some unique data concerning


students, but even common data (e.g., student names) are stored using different

formats.

Figure 1. Examples of heterogeneous data

STUDENT DATA
StudentNo LastName MI FirstName Telephone Status …
123-45-6789 Enright T Mark 483-1967 Soph

389-21-4062 Smith R Elaine 283-4195 Jr

STUDENT EMPLOYEE
StudentID Address Dept Hours …
123-45-6789 1218 Elk Drive, Phoenix, AZ 91304 Soc 8

389-21-4062 134 Mesa Road, Tempe, AZ 90142 Math 10

STUDENT HEALTH
StudentName Telephone Insurance ID …
Mark T. Enright 483-1967 Blue Cross 123-45-6789

Elaine R. Smith 555-7828 ? 389-21-4062

Suppose you want to develop a profile for each student, consolidating all

data into a single file format. Some of the issues that you must resolve are as

follows: 

 Inconsistent key structures - The primary key of the first two tables is

some version of the student Social Security number, whereas the primary key

of STUDENT HEALTH is StudentName. 

 Synonym - In STUDENT DATA, the primary key is named StudentNo,

whereas in STUDENT EMPLOYEE it is named StudentID.


 Free-form fields versus structured fields - In STUDENT HEALTH,

StudentName is a single field. In STUDENT DATA, StudentName (a

composite attribute) is broken into its component parts: LastName, MI, and

FirstName. 

 Inconsistent data values - Elaine Smith has one telephone number in

STUDENT DATA but a different number in STUDENT HEALTH. 

 Missing data - The value for Insurance is missing (or null) for Elaine

Smith in the STUDENT HEALTH table.
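
A rough sketch of how such a consolidation could begin is shown below (hypothetical SQL; the table and column names follow Figure 1, but the real effort would also need cleansing rules for the name formats, the synonym keys, and the conflicting telephone numbers):

    -- Hypothetical consolidation of the three systems of record into one student profile
    SELECT d.StudentNo,
           d.LastName,
           d.FirstName,
           d.MI,
           d.Telephone AS RegistrationPhone,
           e.Dept,
           e.Hours,
           h.Insurance
    FROM   STUDENT_DATA d
           LEFT JOIN STUDENT_EMPLOYEE e ON e.StudentID = d.StudentNo
           LEFT JOIN STUDENT_HEALTH   h ON h.ID        = d.StudentNo;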

This simple example illustrates the nature of the problem of developing a

single corporate view but fails to capture the complexity of that task. A real-life

scenario would likely have dozens (if not hundreds) of tables and thousands (or

millions) of records. 

Why do organizations need to bring data together from various systems of

record? Ultimately, of course, the reason is to be more profitable, to be more

competitive, or to grow by adding value for customers. This can be accomplished

by increasing the speed and flexibility of decision making, improving business

processes, or gaining a clearer understanding of customer behavior. For the

previous student example, university administrators may want to investigate if the

health or number of hours students work on campus is related to student

academic performance; if taking certain courses is related to the health of

students; or whether poor academic performers cost more to support, for

example, due to increased health care as well as other costs. In general, certain


trends in organizations encourage the need for data warehousing; these trends

include the following:

 No single system of record 

Almost no organization has only one database. Because of the

heterogeneous needs for data in different operational settings, because of

corporate mergers and acquisitions, and because of the sheer size of many

organizations, multiple operational databases exist. 

 Multiple systems are not synchronized 

It is difficult, if not impossible, to make separate databases consistent. Even if

the metadata are controlled and made the same by one data administrator,

the data values for the same attributes will not agree. This is because of

different update cycles and separate places where the same data are

captured for each system. Thus, to get one view of the organization, the data

from the separate systems must be periodically consolidated and

synchronized into one additional database. We will see that there can be

actually two such consolidated databases—an operational data store and an

enterprise data warehouse. 

 Organizations want to analyze the activities in a balanced way 

Many organizations have implemented some form of a balanced scorecard—

metrics that show organization results in financial, human, customer

satisfaction, product quality, and other terms simultaneously. To ensure that

this multidimensional view of the organization shows consistent results, a

data warehouse is necessary. When questions arise in the balanced


scorecard, analytical software working with the data warehouse can be used

to “drill down,” “slice and dice,” visualize, and in other ways mine business

intelligence. 

 Customer relationship management 

Organizations in all sectors are realizing that there is value in having a total

picture of their interactions with customers across all touch points. Different

touch points (e.g., for a bank, these touch points include ATMs, online

banking, tellers, electronic funds transfers, investment portfolio management,

and loans) are supported by separate operational systems. Thus, without a

data warehouse, a teller may not know to try to cross-sell a customer one of

the bank’s mutual funds if a large, atypical automatic deposit transaction

appears on the teller’s screen. Having a total picture of the activity with a

given customer requires a consolidation of data from various operational

systems. 

 Supplier relationship management 

Managing the supply chain has become a critical element in reducing costs

and raising product quality for many organizations. Organizations want to

create strategic supplier partnerships based on a total picture of their

activities with suppliers, from billing, to meeting delivery dates, to quality

control, to pricing, to support. Data about these different activities can be

locked inside separate operational systems (e.g., accounts payable, shipping

and receiving, production scheduling, and maintenance). ERP systems have

improved this situation by bringing many of these data into one database.


However, ERP systems tend to be designed to optimize operational, not

informational or analytical, processing.

Need to Separate Operational and Informational Systems 

An operational system is a system that is used to run a business in real

time, based on current data. Examples of operational systems are sales order

processing, reservation systems, and patient registration systems. Operational

systems must process large volumes of relatively simple read/write transactions

and provide fast response. Operational systems are also called systems of

record.

Table 1. Comparison of Operational and Informational Systems

Characteristic | Operational Systems | Informational Systems
Primary purpose | Run the business on a current basis | Support managerial decision making
Type of data | Current representation of the state of the business | Historical point-in-time (snapshots) and predictions
Primary users | Clerks, salespersons, administrators | Managers, business analysts, customers
Scope of usage | Narrow, planned, and simple updates and queries | Broad, ad hoc, complex queries and analysis
Design goal | Performance: throughput, availability | Ease of flexible access and use
Volume | Many constant updates and queries on one or a few table rows | Periodic batch updates and queries requiring many or all rows


Informational systems are designed to support decision making based on

historical point-in-time and prediction data. They are also designed for complex

queries or data-mining applications. Examples of informational systems are

systems for sales trend analysis, customer segmentation, and human resources

planning. 

The key differences between operational and informational systems are

shown in Table 1. These two types of processing have very different

characteristics in nearly every category of comparison. In particular, notice that

they have quite different communities of users. Operational systems are used by

clerks, administrators, salespersons, and others who must process business

transactions. Informational systems are used by managers, executives, business

analysts, and (increasingly) by customers who are searching for status

information or who are decision makers. The need to separate operational and

informational systems is based on three primary factors: 

1. A data warehouse centralizes data that are scattered throughout disparate

operational systems and makes them readily available for decision support

applications. 

2. A properly designed data warehouse adds value to data by improving their

quality and consistency. 

3. A separate data warehouse eliminates much of the contention for resources

that results when informational applications are confounded with operational

processing.

Data Warehousing Architectures


A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

The architecture for data warehouses has evolved, and organizations have considerable latitude in creating variations. The first is a three-level architecture that characterizes a bottom-up, incremental approach to evolving the data warehouse; the second is also a three-level data architecture that usually results from a more top-down approach emphasizing more coordination and an enterprise-wide perspective. Even with their differences, there are many common characteristics to these approaches.

Data Warehouse applications are designed to support users' ad hoc data requirements, an activity recently dubbed online analytical processing

(OLAP). These include applications such as forecasting, profiling, summary


reporting, and trend analysis. Data warehouses and their architectures vary

depending upon the elements of an organization's situation.

Independent Data Mart Data Warehousing Environment

The independent data mart architecture for a data warehouse is shown in the

figure below. Building this architecture requires four basic steps (moving left to

right in the figure below):

1. Data are extracted from the various internal and external source system

files and databases. In a large organization, there may be dozens or even

hundreds of such files and databases.

2. The data from the various source systems are transformed and integrated

before being loaded into the data marts. Transactions may be sent to the

source systems to correct errors discovered in data staging. The data

warehouse is considered to be the collection of data marts.

3. The data warehouse is a set of physically distinct databases organized

for decision support. It contains both detailed and summary data.

4. Users access the data warehouse by means of a variety of query

languages and analytical tools. Results (e.g., predictions, forecasts) may be

fed back to data warehouses and operational databases.

Extraction and loading happen periodically—sometimes daily, weekly, or

monthly. Thus, the data warehouse often does not have, nor does it need to

have, current data. Remember, the data warehouse is not (directly) supporting

operational transaction processing, although it may contain transactional data


(but more often summaries of transactions and snapshots of status variables,

such as account balances and inventory levels). For most data warehousing

applications, users are not looking for a reaction to an individual transaction but

rather for trends and patterns in the state of the organization across a large

subset of the data warehouse. At a minimum, five fiscal quarters of data are kept

in a data warehouse so that at least annual trends and patterns can be

discerned. Older data may be purged or archived. We will see later that one

advanced data warehousing architecture, real-time data warehousing, is based

on a different assumption about the need for current data.

Contrary to many of the principles discussed so far in this chapter, the

independent data marts approach does not create one data warehouse. Instead,

this approach creates many separate data marts, each based on data

warehousing, not transaction processing database technologies. A data mart is a

data warehouse that is limited in scope, customized for the decision-making

applications of a particular end-user group. Its contents are obtained either from

independent ETL processes, as shown in Figure 9-2 for an independent data

mart, or are derived from the data warehouse, which we will discuss in the next

two sections. A data mart is designed to optimize the performance forwell-

defined and predicable uses, sometimes as few as a single or a couple of

queries. For example, an organization may have a marketing data mart, a

finance data mart, a supply chain data mart, and so on, to support known

analytical processing. It is possible that each data mart is built using different

tools; for example, a financial data mart may be built using a proprietary


multidimensional tool like Hyperion’s Essbase, and a sales data mart may be

built on a more general-purpose data warehouse platform, such as Teradata,

using MicroStrategy and other tools for reporting, querying, and data

visualization.

Independent data marts are often created because an organization

focuses on a series of short-term, expedient business objectives. The limited

short-term objectives can be more compatible with the comparably lower cost

(money and organizational capital) to implement yet one more independent data

mart. However, designing the data warehousing environment around different

sets of short-term objectives means that you lose flexibility for the long term and

the ability to react to changing business conditions. And being able to react to

change is critical for decision support. It can be organizationally and politically

easier to have separate, small data warehouses than to get all organizational

parties to agree to one view of the organization in a central data warehouse.

Also, some data warehousing technologies have technical limitations for the size

of the data warehouse they can support—what we will call later a scalability

issue. Thus, technology, rather than the business, may dictate a data

warehousing architecture if you first lock yourself into a particular data

warehousing set of technologies before you understand your data warehousing

requirements. We discuss the pros and cons of the independent data mart

architecture compared with its prime competing architecture in the next section.


Dependent Data Mart and Operational Data Store Architecture: A Three-Level Approach

The independent data mart architecture described above has several important limitations:

1. A separate ETL process is developed for each data mart, which can yield costly redundant data and processing efforts.

2. Data marts may not be consistent with one another because they are

often developed with different technologies, and thus they may not provide a

clear enterprise wide view of data concerning important subjects such as

customers, suppliers, and products.

3. There is no capability to drill down into greater detail or into related facts in

other data marts or a shared data repository, so analysis is limited, or at best

very difficult (e.g., doing joins across separate platforms for different data

marts). Essentially, relating data across data marts is a task performed by

users outside the data warehouse. 

4. Scaling costs are excessive because every new application that creates a

separate data mart repeats all the extract and load steps. Usually, operational

systems have limited time windows for batch data extracting, so at some

point, the load on the operations systems may mean that new technology is

needed, with additional costs. 

5. If there is an attempt to make the separate data marts consistent, the cost to

do so is quite high.


One of the most popular approaches to addressing the independent data

mart limitations raised earlier is to use a three-level approach represented by the

dependent data mart and operational data store architecture. Here the new level

is the operational data store, and the data and metadata storage level is

reconfigured. The first and second limitations are addressed by loading the

dependent data marts from an enterprise data warehouse (EDW), which is a

central, integrated data warehouse that is the control point and single “version of

the truth” made available to end users for decision support applications.

Dependent data marts still have a purpose to provide a simplified and high-

performance environment that is tuned to the decision-making needs of user

groups. A data mart may be a separate physical database (and different data

marts may be on different platforms) or can be a logical (user view) data mart

instantiated on the fly when accessed.

A user group can access its data mart, and then when other data are

needed, users can access the EDW. Redundancy across dependent data marts

is planned, and redundant data are consistent because each data mart is loaded

in a synchronized way from one common source of data (or is a view of the data

warehouse). Integration of data is the responsibility of the IT staff managing the

enterprise data warehouse; it is not the end users’ responsibility to integrate data

across independent data marts for each query or application. The dependent

data mart and operational data store architecture is often called a “hub and

spoke” approach, in which the EDW is the hub and the source data systems and

the data marts are at the ends of input and output spokes.


The third limitation is addressed by providing an integrated source for all

the operational data in an operational data store. An operational data store

(ODS) is an integrated, subject-oriented, continuously update-able, current-

valued (with recent history), organization-wide, detailed database designed to

serve operational users as they do decision support processing (Imhoff, 1998;

Inmon, 1998). An ODS is typically a relational database and normalized like

databases in the systems of record, but it is tuned for decision-making

applications.

An ODS typically does not contain “deep” history, whereas an EDW holds

typically a multiyear history of snapshots of the state of the organization. An ODS

may be fed from the database of an ERP application, but because most

organizations do not have only one ERP database and do not run all operations

against one ERP, an ODS is usually different from an ERP database. The ODS

also serves as the staging area for loading data into the EDW. The ODS may


receive data immediately or with some delay from the systems of record,

whichever is practical and acceptable for the decision-making requirements that

it supports.

Different leaders in the field endorse different approaches to data

warehousing. Those that endorse the independent data mart approach argue

that this approach has two significant benefits: 

1. It allows for the concept of a data warehouse to be demonstrated by

working on a series of small projects. 

2. The length of time until there is some benefit from data warehousing is

reduced because the organization is not delayed until all data are centralized.

Logical Data Mart and Real-Time Data Warehouse Architecture

The logical data mart and real-time data warehouse architecture is

practical for only moderate-sized data warehouses or when using high-

performance data warehousing technology, such as the Teradata system.

1. Logical data marts are not physically separate databases but rather

different relational views of one physical, slightly denormalized relational data

warehouse.


2. Data are moved into the data warehouse rather than to a separate staging

area to utilize the high-performance computing power of the warehouse

technology to perform the cleansing and transformation steps.

3. New data marts can be created quickly because no physical database or database technology needs to be created or acquired and no loading routines need to be written (see the sketch after this list).

4. Data marts are always up to date because data in a view are created when the view is referenced; views can be materialized if a user has a series of queries and analyses that need to work off the same instantiation of the data mart.
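
A minimal sketch of this idea, assuming a hypothetical sales_fact table and store_dimension table already exist in the physical warehouse, is simply a relational view (optionally materialized) defined over the warehouse tables:

    -- Hypothetical logical data mart for the sales department, defined as a view
    CREATE VIEW sales_mart AS
    SELECT st.city,
           f.period_key,
           SUM(f.units_sold)   AS total_units,
           SUM(f.dollars_sold) AS total_dollars
    FROM   sales_fact f
           JOIN store_dimension st ON st.store_key = f.store_key
    GROUP BY st.city, f.period_key;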

Whether logical or physical, data marts and data warehouses play

different roles in a data warehousing environment. Although limited in scope, a

data mart may not be small. Thus, scalable technology is often critical. A

significant burden and cost is placed on users when they themselves need to

integrate the data across separate physical data marts (if this is even possible).

As data marts are added, a data warehouse can be built in phases; the easiest


way for this to happen is to follow the logical data mart and real-time data

warehouse architecture.

The real-time data warehouse aspect of the architecture means that the

source data systems, decision support services, and the data warehouse

exchange data and business rules at a near-real-time pace because there is a

need for rapid response (i.e., action) to a current, comprehensive picture of the

organization. The purpose of real-time data warehousing is to know what is

happening, when it is happening, and to make desirable things happen through

the operational systems. For example, a help desk professional answering

questions and logging problem tickets will have a total picture of the customer’s

most recent sales contacts, billing and payment transactions, maintenance

activities, and orders. With this information, the system supporting the help desk

can, based on operational decision rules created from a continuous analysis of

up-to-date warehouse data, automatically generate a script for the professional to

sell what the analysis has shown to be a likely and profitable maintenance

contract, an upgraded product, or another product bought by customers with a

similar profile. A critical event, such as entry of a new product order, can be

considered immediately so that the organization knows at least as much about

the relationship with its

customer as does the

customer.


In addition to the information given above, here are three common architectures in data warehousing:

Data Warehouse Architecture: Basic

Data Warehouse Architecture: With Staging Area

Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System

In data warehousing, an operational system refers to a system that is used to process the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and

every file in the system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in the data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance.

The summarized record is updated continuously as new information is loaded

into the warehouse.

End-User access Tools

The principal purpose of a data warehouse is to provide information to the

business managers for strategic decision-making. These customers interact with

the warehouse using end-client access tools.

The examples of some of the end-user access tools can be:

Reporting and Query Tools

Application Development Tools

Executive Information Systems Tools

Online Analytical Processing Tools

Data Mining Tools

Data Warehouse Architecture: With Staging Area


We must clean and process operational information before putting it into the warehouse.

We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation of operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.

The data warehouse staging area is a temporary location where records from the source systems are copied.


Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups

within our organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are

separated. In this example, a financial analyst wants to analyze historical data for

purchases and sales or mine historical information to make predictions about

customer behavior.


Properties of Data Warehouse Architectures

The following architecture properties are necessary for a data warehouse

system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.

2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume, which has to be managed and processed, and the number of users' requirements, which have to be met, progressively increase.

3. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning the whole system.


4. Security: Monitoring access is necessary because of the strategic data stored in the data warehouse.

5. Administerability: Data Warehouse management should not be complicated.

Characteristics of data warehouse data

To understand and model the data in each of the three layers of the data architecture for a data warehouse, you need to learn some basic characteristics of data as they are stored in data warehouse databases.

Status Versus Event Data

The difference between status data and event data is shown in the figure. The figure shows a typical log entry recorded by a DBMS when processing a business transaction for a banking application. This log entry contains both status and event data: the "before image" and "after image" represent the status of the bank account before and then after a withdrawal. Data representing the withdrawal (or update event) are shown in the middle of the figure.

Transactions are business activities that cause one or more business

events to occur at a database level. An event results in one or more database

actions (create, update, or delete). The withdrawal transaction in the above figure


leads to a single update, which is the reduction in the account balance from 750

to 700. On the other hand, the transfer of money from one account to another

would lead to two actions: two updates to handle a withdrawal and a deposit.

Sometimes non-transactions, such as an abandoned online shopping cart, busy

signal or dropped network connection, or an item put in a shopping cart and then

taken out before checkout, can also be important activities that need to be

recorded in the data warehouse.
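
As a tiny sketch of the withdrawal event described above (hypothetical table and column names), the transaction results in a single database action against the status data, while the DBMS log preserves the before and after images:

    -- Hypothetical withdrawal: one update event that changes the account's status data
    UPDATE account
    SET    balance = balance - 50
    WHERE  account_no = 'A-123';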

Both status data and event data can be stored in a database. However, in

practice, most of the data stored in databases (including data warehouses) are

status data. A data warehouse likely contains a history of snapshots of status

data or a summary (say, an hourly total) of transaction or event data. Event data,

which represent transactions, may be stored for a defined period but are then

deleted or archived to save storage space. Both status and event data are

typically stored in database logs (as represented in the figure) for backup and

recovery purposes.

Transient Versus Periodic Data 

In data warehouses, it is typical to maintain a record of when events

occurred in the past. This is necessary, for example, to compare sales or

inventory levels on a particular date or during a particular period with the

previous year’s sales on the same date or during the same period. Most

operational systems are based on the use of transient data. Transient data are

data in which changes to existing records are written over previous records, thus


destroying the previous data content. Records are deleted without preserving the

previous contents of those records. You can easily visualize transient data by

again referring to Figure 9-6. If the after image is written over the before image,

the before image (containing the previous balance) is lost. However, because

this is a database log, both images are normally preserved. Periodic data are

data that are never physically altered or deleted once added to the store. The

before and after images in Figure 9-6 represent periodic data. Notice that each

record contains a time stamp that indicates the date (and time, if needed) when

the most recent update event occurred.
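
A brief sketch of the difference (hypothetical tables and columns): in a transient store the new value overwrites the old one, while in a periodic store every image is kept with its time stamp:

    -- Transient data: the previous balance is written over and lost
    UPDATE account
    SET    balance = 700
    WHERE  account_no = 'A-123';

    -- Periodic data: each image is retained, stamped with the time of the change
    INSERT INTO account_history (account_no, balance, change_timestamp)
    VALUES ('A-123', 700, CURRENT_TIMESTAMP);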

OTHER DATA WAREHOUSE CHANGES

Besides the periodic changes to data values outlined previously, six other kinds

of changes to a warehouse data model must be accommodated by data

warehousing:

1. New descriptive attributes - For example, new characteristics of products or

customers that are important to store in the warehouse must be

accommodated. Later in the chapter we call these attributes of dimension

tables. This change is fairly easily accommodated by adding columns to

tables and allowing null values for existing rows (if historical data exist in

source systems, null values do not have to be stored).

2. New business activity attributes - For example, new characteristics of an

event already stored in the warehouse, such as a column C for the table in

Figure 9-8, must be accommodated. This can be handled as in item 1, but


is more difficult when the new facts are more refined, such as data

associated with days of the week, not just month and year.

3. New classes of descriptive attributes - This is equivalent to adding new

tables to the database.

4. Descriptive attributes become more refined - For example, data about

stores must be broken down by individual cash registers to understand

sales data. This change is in the grain of the data, an extremely important

topic, which we discuss later in the chapter. This can be a very difficult

change to accommodate.

5. Descriptive data are related to one another - For example, store data are

related to geography data. This causes new relationships, often hierarchical, to

be included in the data model.

6. New source of data - This is a very common change, in which some new

business need causes data feeds from an additional source system or

some new operational system is installed that must feed the warehouse.

This change can cause almost any of the previously mentioned changes,

as well as the need for new extract, transform, and load processes.

It is usually not possible to go back and reload a data warehouse to

accommodate all of these kinds of changes for the whole data history

maintained. But it is critical to accommodate such changes smoothly to enable

the data warehouse to meet new business conditions and information and


business intelligence needs. Thus, designing the warehouse for change is very

important.

In addition, according to the Oracle Help Center, these are the key characteristics of a

data warehouse:

 Some data is denormalized for simplification and to improve performance

 Large amounts of historical data are used

 Queries often retrieve large amounts of data

 Both planned and ad hoc queries are common

 The data load is controlled

In general, fast query performance with high data throughput is the key to a

successful data warehouse.

A data warehouse can be controlled when the user has a shared way of explaining the trends that are introduced as a specific subject. Below are the major characteristics of a data warehouse:

1. Subject-oriented –

A data warehouse is always subject oriented, as it delivers information about a theme instead of the organization's current operations. It is built around a specific theme, which means the data warehousing process is intended to handle a specific, well-defined theme. These themes can be sales, distribution, marketing, etc.

A data warehouse never puts emphasis only on current operations. Instead, it focuses on the modeling and analysis of data to make various decisions. It also delivers an easy and precise view of the particular theme by eliminating data which is not required to make the decisions.

2. Integrated –

It is somewhat the same as subject orientation, in that the data is kept in a reliable, consistent format. Integration means establishing a shared unit of measure for all similar data from the different databases. The data also needs to reside in the data warehouse in a shared and universally accepted manner.

A data warehouse is built by integrating data from various sources, such as a mainframe and a relational database. In addition, it must have reliable naming conventions, formats, and codes. Integration of the data warehouse benefits the effective analysis of data. Reliability in naming conventions, column scaling, encoding structure, etc. should be confirmed. An integrated data warehouse handles the various related subject areas.

3. Time-Variant –

Data is maintained over different intervals of time, such as weekly, monthly, or annually. It establishes various time limits which are structured between the large datasets and are held in the online transaction process (OLTP). The time horizon for the data warehouse is wider-ranging than that of operational systems. The data residing in the data warehouse is identified with a specific interval of time and delivers information from the historical perspective. It comprises elements of time, explicitly or implicitly. Another feature of time variance is that once data is stored in the data warehouse, it cannot be modified, altered, or updated.

4. Non-Volatile –


As the name implies, the data residing in the data warehouse is permanent. It also means that data is not erased or deleted when new data is inserted. The warehouse holds a mammoth quantity of data accumulated from the loads selected by the business.

In the data warehouse, data is read-only and refreshed at particular intervals. This is beneficial in analysing historical data and in comprehending how the business functions. It does not need transaction processing, recovery, or concurrency control mechanisms. Functionalities such as delete, update, and insert that are done in an operational application are absent in the data warehouse environment. Two types of data operations done in the data warehouse are:

 Data Loading

 Data Access

The Derived Data Layer

Derived data is generated from existing data using a mathematical

operation or a data transformation. OLAP Services uses SQL ROLLUP to

generate aggregate data in the data warehouse. Dimension tables, also

called lookup tables, are used to store the dimension members for all levels in

the hierarchy. This is the data layer associated with logical or physical data

marts. It is the layer with which users normally interact for their decision support

applications. Ideally, the reconciled data level is designed first and is the basis for

the derived layer, whether data marts are dependent, independent, or logical. In


order to derive any data mart we might need, it is necessary that the EDW

(Enterprise Data Warehouse) be a fully normalized relational database

accommodating transient and periodic data; this gives us the greatest flexibility to

combine data into the simplest form for all user needs, even those that are

unanticipated when the EDW is designed.

Derived data is generated from existing data using a mathematical

operation or a data transformation. It can be created as part of a database

maintenance operation or generated at run-time in response to a query.
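
For instance, a sketch of how aggregate (derived) data might be produced with SQL ROLLUP, assuming a hypothetical sales_fact table with store and month columns (names invented for illustration):

    -- Hypothetical aggregation: monthly totals per store, a subtotal per store,
    -- and a grand total, all produced in one pass by ROLLUP
    SELECT store_key,
           sales_month,
           SUM(dollars_sold) AS total_sales
    FROM   sales_fact
    GROUP BY ROLLUP (store_key, sales_month);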

The objectives that are sought with derived data are quite different from

the objectives of reconciled data. Typical objectives are the following: 

 Provide ease of use for decision support applications 

 Provide fast response for predefined user queries or requests for

information (information usually in the form of metrics used to gauge the

health of the organization in areas such as customer service, profitability,

process efficiency, or sales growth) 

 Customize data for particular target user groups 

 Support ad hoc queries and data mining and other analytical applications 

To satisfy these needs, we usually find the following characteristics in derived

data: 

 Both detailed data and aggregate data are present: 


a. Detailed data are often (but not always) periodic—that is, they provide a

historical record. 

b. Aggregate data are formatted to respond quickly to predetermined (or

common) queries.

 Data are distributed to separate data marts for different user groups. 

 The data model that is most commonly used for a data mart is a

dimensional model, usually in the form of a star schema, which is a relational-

like model (such models are used by relational online analytical processing

[ROLAP] tools). 

Star Schema

A star schema is a database organizational structure optimized for use in

a data warehouse or business intelligence that uses a single large fact table to

store transactional or measured data, and one or more smaller dimensional

tables that store attributes about the data. It is called a star schema because the

fact table sits at the center of the logical diagram, and the small dimensional

tables branch off to form the points of the star. 

A star schema is a simple database design (particularly suited to ad hoc

queries) in which dimensional

data (describing how data are

commonly aggregated for

reporting) are separated from

fact or event data (describing


business activity). A star schema is one version of a dimensional model (Kimball,

1996a).

A star schema consists of two types of tables: one fact table and one or

more dimension tables. Fact tables contain factual or quantitative data

(measurements that are numerical, continuously valued, and additive) about a

business, such as units sold, orders booked, and so on. Dimension tables hold

descriptive data (context) about the subjects of the business. The dimension

tables are usually the source of attributes used to qualify, categorize, or

summarize facts in queries, reports, or graphs; thus, dimension data are usually

textual and discrete (even if numeric). A data mart might contain several star

schemas with similar dimension tables but each with a different fact table. Typical

business dimensions (subjects) are Product, Customer, and Period.

Components of Star Schema

A Fact Table sits at the center of a star schema database, and each star

schema database only has a single fact table. The fact table contains the specific

measurable (or quantifiable) primary data to be analyzed, such as sales records, logged performance data, or financial data. It may be transactional -- in that rows are added as events happen -- or it may be a snapshot of historical data up to a point in time.

Dimension tables store supporting information to the fact table. Each star

schema database has at least one dimension table, but will often have many.

Each dimension table will relate to a column in the fact table with a dimension

value, and will store additional information about that value.
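
A minimal DDL sketch of a star schema along these lines (hypothetical column choices, loosely following the PRODUCT, PERIOD, STORE, and SALES example discussed next):

    -- Hypothetical dimension tables
    CREATE TABLE product_dimension (
        product_key   INTEGER PRIMARY KEY,   -- surrogate key
        product_name  VARCHAR(50),
        product_size  VARCHAR(10)
    );

    CREATE TABLE period_dimension (
        period_key    INTEGER PRIMARY KEY,
        year_no       INTEGER,
        quarter_no    INTEGER,
        month_no      INTEGER
    );

    CREATE TABLE store_dimension (
        store_key     INTEGER PRIMARY KEY,
        store_name    VARCHAR(50),
        city          VARCHAR(50)
    );

    -- Hypothetical fact table: one row per product, per period, per store
    CREATE TABLE sales_fact (
        product_key   INTEGER REFERENCES product_dimension (product_key),
        period_key    INTEGER REFERENCES period_dimension (period_key),
        store_key     INTEGER REFERENCES store_dimension (store_key),
        units_sold    INTEGER,
        dollars_sold  DECIMAL(12,2),
        dollars_cost  DECIMAL(12,2),
        PRIMARY KEY (product_key, period_key, store_key)
    );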

Star Schema Example

A star schema provides answers to a domain of business questions. For

example, consider the following questions: 

1. Which cities have the highest sales of large products? 

2. What are the average monthly sales for each store manager? 

3. In which stores are we losing money on which products? Does this vary by

quarter? 

A simple example of a star schema that could provide answers to such

questions is shown in Figure 9-10. This example has three dimension tables:

PRODUCT, PERIOD, and STORE, and one fact table, named SALES. The fact

table is used to record three business

facts: total units sold, total dollars sold, and

total dollars cost. These totals are

recorded for each day (the lowest level of

PERIOD) a product is sold in a store.

Could these three questions be

answered from a fully normalized data


model of transactional data? Sure, a fully normalized and detailed database is

the most flexible, able to support answering almost any question. However, more

tables and joins would be involved, data need to be aggregated in standard

ways, and data need to be sorted in an understandable sequence.
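To make this concrete, here is a minimal sketch of the SALES star schema built in SQLite from Python. It is an illustration only: the column lists are assumptions based on the discussion above rather than the exact design of Figure 9-10.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_key  INTEGER PRIMARY KEY,   -- surrogate key
    description  TEXT,
    size         TEXT
);
CREATE TABLE store (
    store_key    INTEGER PRIMARY KEY,
    city         TEXT,
    manager      TEXT
);
CREATE TABLE period (
    period_key   INTEGER PRIMARY KEY,
    year         INTEGER,
    quarter      INTEGER,
    month        INTEGER
);
-- Fact table: its primary key is the composite of the dimension surrogate keys.
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    dollars_cost REAL,
    PRIMARY KEY (product_key, period_key, store_key)
);
""")

# Question 1 ("Which cities have the highest sales?") becomes a join plus GROUP BY:
top_cities = conn.execute("""
    SELECT s.city, SUM(f.dollars_sold) AS total_sales
    FROM sales f JOIN store s ON f.store_key = s.store_key
    GROUP BY s.city
    ORDER BY total_sales DESC
""").fetchall()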

Star Schema Sample Data

Some sample data for this schema are shown in Figure 9-11. From the

fact table, we find (for example) the following facts for product number 110 during period 002:

1. Thirty units were sold in store S1. The total dollar sale was 1500, and total dollar cost was 1200.

2. Forty units were sold in store S3. The total dollar sale was 2000, and total dollar cost was 1200.

Additional detail concerning the dimensions for this example can be

obtained from the dimension tables. For example, in the PERIOD table, we find

that period 002 corresponds to year 2010, quarter 1, month 5. Try tracing the

other dimensions in a similar manner.

Surrogate Key

Surrogate keys are a widely used and accepted design standard in data warehouses. A surrogate key is a sequentially generated unique number attached to each record in a dimension table. It is used to join the fact and dimension tables and is necessary for handling changes in dimension table attributes.

Surrogate keys are typically meaningless integers used to connect the fact

to the dimension tables of a data warehouse.  There are various reasons why we

cannot simply reuse our existing natural or business keys.  Surrogate keys


essentially buffer the data warehouse from the operational environment by

making it immune to any operational changes.  They are used to relate the facts

in the fact table to the appropriate rows in the dimension tables, with the

business keys only occurring in the (much smaller) dimension tables to keep the

link with the identifiers in the operational systems.

 Business keys change, often slowly, over time, and we need to remember

old and new business key values for the same business object. As we will see

in a later section on slowly changing dimensions, a surrogate key allows us to

handle changing and unknown keys with ease.

 Using a surrogate key also allows us to keep track of different nonkey

attribute values for the same production key over time. Thus, if a product

package changes in size, we can associate the same product production key

with several surrogate keys, each for the different package sizes. 

 Surrogate keys are often simpler and shorter, especially when the

production key is a composite key. 

 Surrogate keys can be of the same length and format for all keys, no

matter what business dimensions are involved in the database, even dates. 

The primary key of each dimension table is its surrogate key. The primary

key of the fact table is the composite of all the surrogate keys for the related

dimension tables, and each of the composite key attributes is obviously a foreign

key to the associated dimension table.
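The sketch below illustrates the idea with a hypothetical product dimension: the surrogate key is the primary key, while the production (business) key is kept only as a descriptive attribute, so the same production key can appear on several rows as its nonkey attributes change.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key (meaningless integer)
    product_code TEXT,                               -- production/business key from the source system
    description  TEXT,
    package_size TEXT
)""")

# The same production key appears twice because the package size changed;
# facts recorded before the change keep pointing at the first surrogate key.
conn.execute("INSERT INTO product_dim (product_code, description, package_size) "
             "VALUES ('P-110', 'Paper towels', '2-pack')")
conn.execute("INSERT INTO product_dim (product_code, description, package_size) "
             "VALUES ('P-110', 'Paper towels', '3-pack')")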

Grain of the Fact Table


Fact tables provide the (usually) additive values that act as independent

variables by which dimensional attributes are analyzed. Fact tables are often

defined by their grain. The grain determines the fact table's compound primary key and defines the lowest level of detail that the fact table is divided into. 

The raw data of a star schema are kept in the fact table. All the data in a

fact table are determined by the same combination of composite key elements;

so, for example, if the most detailed data in a fact table are daily values, then all

measurement data must be daily in that fact table, and the lowest level of

characteristics for the period dimension must also be a day. Determining the

lowest level of detailed fact data stored is arguably the most important and

difficult data mart design step. The level of detail of this data is specified by the

intersection of all of the components of the primary key of the fact table. This

intersection of primary keys is called the grain of the fact table. Determining the

grain is critical and must be determined from business decision-making needs

(i.e., the questions to be answered from the data mart). There is always a way to

summarize fact data by aggregating using dimension attributes, but there is no

way in the data mart to understand business activity at a level of detail finer than

the fact table grain.

Duration of the Database


As in the case of the EDW or ODS, another important decision in the

design of a data mart is the amount of history to be kept; that is, the duration of

the database. The natural duration is about 13 months or 5 calendar quarters,

which is sufficient to see annual cycles in the data. Some businesses, such as

financial institutions, have a need for longer durations. Older data may be difficult

to source and cleanse if additional attributes are required from data sources.

Even if sources of old data are available, it may be most difficult to find old values

of dimension data, which are less likely than fact data to have been retained. Old

fact data without associated dimension data at the time of the fact may be

worthless.

Size of the Fact Table

As you would expect, the grain and duration of the fact table have a direct

impact on the size of that table. We can estimate the number of rows in the fact

table as follows: 

1. Estimate the number of possible values for each dimension associated

with the fact table (in other words, the number of possible values for each

foreign key in the fact table). 

2. Multiply the values obtained in the first step after making any necessary

adjustments.

Let’s apply this approach to the star schema shown in Figure 9-11.

Assume the following values for the dimensions: 

 Total number of stores = 1000 

 Total number of products = 10,000 


 Total number of periods = 24 (2 years’ worth of monthly data)

Although there are 10,000 total products, only a fraction of these products

are likely to record sales during a given month. Because item totals appear in the

fact table only for items that record sales during a given month, we need to adjust

this figure. Suppose that on average 50 percent (or 5000) items record sales

during a given month. Then an estimate of the number of rows in the fact table is

computed as follows: 

 Total rows = 1000 stores X 5000 active products X 24 months

= 120,000,000 rows (!)

Thus, in our relatively small example, the fact table that contains two

years’ worth of monthly totals can be expected to have well over 100 million

rows. This example clearly illustrates that the size of the fact table is many times

larger than the dimension tables. For example, the STORE table has 1000 rows,

the PRODUCT table 10,000 rows, and the PERIOD table 24 rows. If we know the

size of each field in the fact table, we can further estimate the size (in bytes) of

that table. The fact table (named SALES) in Figure 9-11 has six fields. If each of

these fields averages four bytes in length, we can estimate the total size of the

fact table as follows:

 Total size = 120,000,000 rows X 6 fields X 4 bytes/field 

= 2,880,000,000 bytes (or 2.88 gigabytes)

The size of the fact table depends on both the number of dimensions and

the grain of the fact table. Suppose that after using the database shown in Figure

9-11 for a short period of time, the marketing department requests that daily


totals be accumulated in the fact table. (This is a typical evolution of a data mart.)

With the grain of the table changed to daily item totals, the number of rows is

computed as follows: 

 Total rows = 1000 stores X 2000 active products X 720 days (2 years)

= 1,440,000,000 rows

In this calculation, we have assumed that 20 percent of all products record

sales on a given day. The database can now be expected to contain well over 1

billion rows. The database size is calculated as follows: 

 Total size = 1,440,000,000 rows X 6 fields X 4 bytes/field 

= 34,560,000,000 bytes (or 34.56 gigabytes)
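The arithmetic above is easy to script. The short Python sketch below simply repeats the same estimates so they can be re-run with different assumptions about stores, active products, and grain.

# Monthly grain: 50 percent of the 10,000 products record sales in a given month.
stores, active_products, months = 1000, 5000, 24
monthly_rows = stores * active_products * months      # 120,000,000 rows
monthly_bytes = monthly_rows * 6 * 4                  # 6 fields x 4 bytes per field
print(monthly_rows, monthly_bytes / 1e9, "GB")        # 2.88 GB

# Daily grain: 20 percent of products record sales on a given day, 720 days kept.
daily_rows = 1000 * 2000 * 720                        # 1,440,000,000 rows
print(daily_rows, daily_rows * 6 * 4 / 1e9, "GB")     # 34.56 GB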

Modeling Date and Time

Because data warehouses and data marts record facts about dimensions

over time, date and time (henceforth simply called date) is always a dimension

table, and a date surrogate key is always one of the components of the primary

key of any fact

table. Because a

user may want to

aggregate facts on

many different

aspects of date or

different kinds of

dates, a date


dimension may have many nonkey attributes. Also, because some

characteristics of dates are country or event specific (e.g., whether the date is a

holiday or there is some standard event on a given day, such as a festival or

football game), modeling the date dimension can be more complex than

illustrated so far. 

Modeling Dates

The figure above shows a typical design for the date dimension. As we have

seen before, a date surrogate key appears as part of the primary key of the fact

table and is the primary key of the date dimension table. The nonkey attributes of

the date dimension table include all of the characteristics of dates that users use

to categorize, summarize, and group facts that do not vary by country or event.
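As a rough sketch, a date dimension like this can be generated programmatically. The attribute list below is an assumption; real designs add fiscal periods, holiday flags, and other country- or event-specific columns.

from datetime import date, timedelta

def build_date_dimension(start, end):
    rows, d = [], start
    while d <= end:
        rows.append({
            "date_key":    int(d.strftime("%Y%m%d")),  # surrogate key, e.g. 20100105
            "full_date":   d.isoformat(),
            "year":        d.year,
            "quarter":     (d.month - 1) // 3 + 1,
            "month":       d.month,
            "day_of_week": d.strftime("%A"),
            "is_weekend":  d.weekday() >= 5,
        })
        d += timedelta(days=1)
    return rows

date_dim = build_date_dimension(date(2010, 1, 1), date(2011, 12, 31))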

VARIATIONS OF THE STAR SCHEMA

The simple star schema introduced earlier is adequate for many

applications. However, various extensions to this schema are often required to

cope with more complex modeling problems.

Multiple Fact Tables


Multiple-fact, multiple-grain queries in relational data sources occur when

a table containing dimensional

data is joined to multiple fact

tables on different key columns.

It is often desirable for

performance or other reasons to

define more than one fact table

in a given star schema. For

example, suppose that various users require different levels of aggregation (in

other words, a different table grain). Performance can be improved by defining a

different fact table for each level of aggregation. The obvious trade-off is that

storage requirements may increase dramatically with each new fact table. More

commonly, multiple fact tables are needed to store facts for different

combinations of dimensions, possibly for different user groups.

Conformed Dimension 

In data warehousing, a conformed dimension is a dimension that has the same

meaning to every fact with which it relates. Conformed dimensions allow facts

and measures to be categorized and described in the same way across multiple

facts and/or data marts, ensuring consistent reporting across the enterprise.

More formally, a conformed dimension is one or more dimension tables associated with two or more fact tables for which the dimension tables have the same business meaning and primary key with


each fact table. Conformed dimensions are dimensions that are shared by

multiple stars. They are used to compare the measures from each star schema.

Figure 9-13 illustrates a typical situation of multiple fact tables with two related

star schemas. In this example, there are two fact tables, one at the center of

each star: 

1. Sales—facts about the sale of a product to a customer in a store on a

date 

2. Receipts—facts about the receipt of a product from a vendor to a

warehouse on a date 

As is common, data about one or more business subjects (in this case, Product

and Date) need to be stored in dimension tables for each fact table, Sales and

Receipts. Two approaches have been adopted in this design to handle shared

dimension tables. In one case, because the description of the product is quite

different for sales and receipts, two separate product dimension tables have

been created. On the other hand, because users want the same descriptions of

dates, one date dimension table is used. In each case, we have created a

conformed dimension, meaning that the dimension means the same thing with

each fact table and, hence, uses the same surrogate primary keys. Even when

the two star schemas are stored in separate physical data marts, if dimensions

are conformed, there is a potential for asking questions across the data marts

(e.g., Do certain vendors recognize sales more quickly, and are they able to


supply replenishments with less lead time?). In general, conformed dimensions

allow users to do the following: 

 Share nonkey dimension data 

 Query across fact tables with consistency  

 Work on facts and business subjects for which all users have the same

meaning.

Factless Fact Table

A factless fact table is a fact table that does not have any measures.  It is

essentially an intersection of dimensions (it contains nothing but dimensional

keys). Factless fact tables are a simple collection of dimensional keys that define transactions or describe conditions for the time period of the fact. There are two types of factless fact tables: one captures an event, and one describes conditions.

The most common example of a factless fact table is student attendance in a

class. As you can see

from the dimensional

diagram below the

FACT_ATTENDANCE

is an amalgamation of

the DATE_KEY, the

STUDENT_KEY, and

the CLASS_KEY.


As you can see there is nothing we can measure about a student’s attendance at

a class. The student was there and the attendance was recorded or the student

was not there and no record is recorded. It is a fact, plain and simple. There is a

derivation of this fact where you can always load the full roster of individuals

registered for the class and add a flag stating the person was in attendance.
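A minimal sketch of that attendance example as a factless fact table: the table holds dimensional keys only, with no measure columns, so analysis amounts to counting rows.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_attendance (
    date_key    INTEGER,
    student_key INTEGER,
    class_key   INTEGER,
    PRIMARY KEY (date_key, student_key, class_key)   -- no measures at all
)""")

# Counting attendance is simply counting rows, for example per class:
per_class = conn.execute("""
    SELECT class_key, COUNT(*) AS attendances
    FROM fact_attendance
    GROUP BY class_key
""").fetchall()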

In conclusion, factless fact tables are important dimensional data structures used

to convey transactional information which contain no measures. These tables are

occasionally necessary for capturing important dimensional relationships which

are critical to meeting the defined business reporting requirements.

Normalizing Dimension Tables 

Fact tables are fully normalized because each fact depends on the whole

composite primary key and nothing but the composite key. However, dimension

tables may not be normalized. Most data warehouse experts find this acceptable

for a data mart optimized and simplified for a given user group, so that all the

dimension data are only one join away from associated facts. (Remember that

this can be done with logical data marts, so duplicate data do not need to be

stored.) Sometimes, as with any other relational database, the anomalies of a

denormalized dimension table cause add, update, and delete problems. In this

section, we address various situations in which it makes sense or is essential to

further normalize dimension tables.

Multivalued Dimensions

A multivalued dimension arises when the relationship between the dimension members and the facts is many-to-many, which means the dimension members are at a lower granularity than the facts. A fact table row should normally relate to exactly one row of each dimension, so we introduce a bridge table when we need to relate multiple dimension values to one fact record.

There are situations when your data needs to represent a many-to-many relationship such that your dimension members are at a lower grain than related

facts; aka multivalued dimension.  In these cases, a single fact record should

relate to multiple dimension values.  Here are a few examples from the Kimball

Group.

 Patients can have multiple diagnoses.

 Students can have multiple majors.

 Consumers can have multiple hobbies or interests.

 Commercial customers can have multiple industry classifications.

 Employees can have multiple skills or certifications.

 Products can have multiple optional features.


 Bank accounts can have multiple customers.

Multivalued dimension

There may be a need for facts to be qualified by a set of values for the same

business subject. For example, consider the hospital example in Figure 9-15. In

this situation, a particular hospital charge and payment for a patient on a date

(e.g., for all foreign keys in the Finances fact table) is associated with one or

more diagnoses. (We indicate this with a dashed M:N relationship line between

the Diagnosis and Finances tables.) We could pick the most important diagnosis

as a component key for the Finances table, but that would mean we lose


potentially important information about other diagnoses associated with a row.

Or, we could design the Finances table with a fixed number of diagnosis keys,

more than we think is ever possible to associate with one row of the Finances

table, but this would create null components of the primary key for many rows,

which violates a property of relational databases.

The best approach (the normalization approach) is to create a table for an

associative entity between Diagnosis and Finances, in this case the Diagnosis

group table. (Thus, the dashed relationship in the Figure is not needed.) In the

data warehouse database world, such an associative entity table is called a

“helper table,” and we will see more examples of helper tables as we progress

through subsequent sections. A helper table may have nonkey attributes (as can

any table for an associative entity); for example, the weight factor in the

Diagnosis group table of Figure above indicates the relative role each diagnosis

plays in each group, presumably normalized to a total of 100 percent for all the

diagnoses in a group. Also note that it is not possible for more than one Finances

row to be associated with the same Diagnosis group key; thus, the Diagnosis

group key is really a surrogate for the composite primary key of the Finances fact

table.
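The sketch below is one possible rendering of this normalization approach for the hospital example; the table and column names are assumptions based on the description of Figure 9-15.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diagnosis (
    diagnosis_key INTEGER PRIMARY KEY,
    description   TEXT
);
-- The "helper" (associative) table between Finances and Diagnosis.
CREATE TABLE diagnosis_group (
    diagnosis_group_key INTEGER,
    diagnosis_key       INTEGER REFERENCES diagnosis(diagnosis_key),
    weight_factor       REAL,   -- relative role of each diagnosis within its group
    PRIMARY KEY (diagnosis_group_key, diagnosis_key)
);
-- The fact table references one diagnosis group instead of many diagnoses.
CREATE TABLE finances (
    patient_key         INTEGER,
    date_key            INTEGER,
    diagnosis_group_key INTEGER,
    charge              REAL,
    payment             REAL,
    PRIMARY KEY (patient_key, date_key, diagnosis_group_key)
);
""")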

Hierarchies 

Many times a dimension in a star schema forms a natural, fixed depth hierarchy.

For example, there are geographical hierarchies (e.g., markets within a state,

states within a region, and regions within a country) and product hierarchies

(packages or sizes within a product, products within bundles, and bundles within


product groups). When a dimension participates in a hierarchy, a database

designer has two basic choices:

1. Include all the information for each level of the hierarchy in a single

denormalized dimension table for the most detailed level of the hierarchy,

thus creating considerable redundancy and update anomalies. Although it

is simple, this is usually not the recommended approach. 

2. Normalize the dimension into a nested set of a fixed number of tables with

1:M relationships between them. Associate only the lowest level of the

hierarchy with the fact table. It will still be possible to aggregate the fact

data at any level of the hierarchy, but now the user will have to perform

nested joins along the hierarchy or be given a view of the hierarchy that is

prejoined.

Fixed product hierarchy

When the depth of the hierarchy can be fixed, each level of the hierarchy is a

separate dimensional entity. Some hierarchies can more easily use this scheme

than can others. Consider the product hierarchy in this Figure. Here each product

is part of a product family (e.g., Crest with Tartar Control is part of Crest), and a

product family is part of a product category (e.g., toothpaste), and a category is


part of a product group (e.g., health and beauty). This works well if every product

follows this same hierarchy. Such hierarchies are very common in data

warehouses and data marts.
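A minimal sketch of the second choice, normalizing the fixed product hierarchy into a chain of tables with 1:M relationships (snowflaking); the names are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_group    (group_key    INTEGER PRIMARY KEY, group_name    TEXT);
CREATE TABLE product_category (category_key INTEGER PRIMARY KEY, category_name TEXT,
                               group_key    INTEGER REFERENCES product_group(group_key));
CREATE TABLE product_family   (family_key   INTEGER PRIMARY KEY, family_name   TEXT,
                               category_key INTEGER REFERENCES product_category(category_key));
-- Only the lowest level (product) is joined to the fact table; aggregating by
-- category or group requires nested joins up the chain, or a prejoined view.
CREATE TABLE product          (product_key  INTEGER PRIMARY KEY, product_name  TEXT,
                               family_key   INTEGER REFERENCES product_family(family_key));
""")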

Slowly Changing Dimensions

Slowly Changing Dimensions (SCD) are dimensions that change slowly over time, rather than on a regular, time-based schedule. In a data warehouse

there is a need to track changes in dimension attributes in order to report

historical data. In other words, implementing one of the SCD types should enable

users to assign the proper dimension's attribute value for a given date. Examples

of such dimensions could be: customer, geography, employee.

There are many approaches to deal with SCD. The most popular are:

 Type 0 - The passive method

 Type 1 - Overwriting the old value

 Type 2 - Creating a new additional record

 Type 3 - Adding a new column

Type 0 - The passive method. In this method no special action is performed upon

dimensional changes. Some dimension data can remain the same as it was first

time inserted, others may be overwritten.

Type 1 - Overwriting the old value. In this method no history of dimension

changes is kept in the database. The old dimension value is simply overwritten

by the new one. This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g., removal of special characters, correcting spelling errors).

Before the change: 

After the change: 

Type 2 - Creating a new additional record. In this methodology all history of

dimension changes is kept in the database. You capture attribute change by

adding a new row with a new surrogate key to the dimension table. Both the prior

and new rows contain as attributes the natural key (or other durable identifier).

Also 'effective date' and 'current indicator' columns are used in this method.

There could be only one record with current indicator set to 'Y'. For 'effective

date' columns, i.e. start_date and end_date, the end_date for the current record

usually is set to value 9999-12-31. Introducing changes to the dimensional model

in Type 2 can be a very expensive database operation, so it is not recommended

to use it in dimensions where a new attribute could be added in the future.

Before the change: 


After the change: 
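Since the before/after tables for Type 2 are shown only as figures, here is a rough sketch of the same change in SQL (SQLite syntax, with an invented customer dimension and city change): the current row is expired and a new row with a new surrogate key is inserted.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id  TEXT,                               -- durable natural key
    city         TEXT,
    start_date   TEXT,
    end_date     TEXT,
    current_flag TEXT
);
INSERT INTO customer_dim (customer_id, city, start_date, end_date, current_flag)
VALUES ('C-001', 'Caloocan', '2009-01-01', '9999-12-31', 'Y');
""")

# The customer moves: expire the current row, then add a new row with a new surrogate key.
conn.execute("""UPDATE customer_dim
                SET end_date = '2010-05-31', current_flag = 'N'
                WHERE customer_id = 'C-001' AND current_flag = 'Y'""")
conn.execute("""INSERT INTO customer_dim (customer_id, city, start_date, end_date, current_flag)
                VALUES ('C-001', 'Quezon City', '2010-06-01', '9999-12-31', 'Y')""")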

Type 3 - Adding a new column. In this type usually only the current and previous

value of dimension is kept in the database. The new value is loaded into the

'current/new' column and the old one into the 'old/previous' column. Generally

speaking, history is limited to the number of columns created for storing historical

data. This is the least commonly needed technique.

Before the change: 

After the change: 

Ten Essential Rules of Dimensional Modeling

1. Use atomic facts: Eventually, users want detailed data, even if their initial

requests are for summarized facts. 


2. Create single-process fact tables: Each fact table should address the

important measurements for one business process, such as taking a

customer order or placing a material purchase order. 

3. Include a date dimension for every fact table: A fact should be

described by the characteristics of the associated day (or finer) date/time to

which that fact is related. 

4. Enforce consistent grain: Each measurement in a fact table must be

atomic for the same combination of keys (the same grain). 

5. Disallow null keys in fact tables: Facts apply to the combination of key

values, and helper tables may be needed to represent some M:N

relationships.

6. Honor hierarchies: Understand the hierarchies of dimensions and

carefully choose to snowflake the hierarchy or denormalize into one

dimension. 

7. Decode dimension tables: Store descriptions of surrogate keys and

codes used in fact tables in associated dimension tables, which can then be

used to report labels and query filters. 

8. Use surrogate keys: All dimension table rows should be identified by a

surrogate key, with descriptive columns showing the associated production

and source system keys. 

9. Conform dimensions: Conformed dimensions should be used across

multiple fact tables. 


10. Balance requirements with actual data: Unfortunately, source data may

not precisely support all business requirements, so you must balance what is

technically possible with what users want and need.

Big Data and Columnar Databases

Big Data

Big Data is an ill-defined term applied to databases whose size strains the ability

of commonly used relational DBMSs to capture, manage, and process the data

within a tolerable elapsed time.

Big data basically refers to data that is large in volume and complex in structure. This data can be structured, semi-structured, or unstructured, and it cannot be processed by traditional data processing software and databases. Operations such as analysis and manipulation are performed on the data, which companies then use for intelligent decision making. Big data is a very powerful asset in today's world and can be used to tackle business problems by supporting intelligent decision making.

Big data is a combination of structured, semistructured and unstructured data

collected by organizations that can be mined for information and used in machine

learning projects, predictive modeling and other advanced analytics applications. 

Systems that process and store big data have become a common component

of data management architectures in organizations, combined with tools that

support big data analytics uses. Big data is often characterized by the three V's:


 the large volume of data in many environments;

 the wide variety of data types frequently stored in big data systems; and

 the velocity at which much of the data is generated, collected and

processed.

Concept of 5V’s

Big data refers to data that is so large, fast or complex that it’s difficult or

impossible to process using traditional methods. The act of accessing and storing

large amounts of information for analytics has been around for a long time.

Volume

Volume, the first of the 5 V's of big data, refers to the amount of data that exists.

Volume is like the base of big data, as it is the initial size and amount of data that

is collected. If the volume of data is large enough, it can be considered big data.

What is considered to be big data is relative, though, and will change depending

on the available computing power that's on the market.

Velocity

The next of the 5 V's of big data is velocity. It refers to how quickly data is

generated and how quickly that data moves. This is an important aspect for

companies that need their data to flow quickly, so it's available at the right times

to make the best business decisions possible.


An organization that uses big data will have a large and continuous flow of data

that is being created and sent to its end destination. Data could flow from

sources such as machines, networks, smartphones or social media. This data

needs to be digested and analyzed quickly, and sometimes in near real time.

As an example, in healthcare, there are many medical devices made today to

monitor patients and collect data. From in-hospital medical equipment to

wearable devices, collected data needs to be sent to its destination and analyzed

quickly.

In some cases, however, it may be better to have a limited set of collected data

than to collect more data than an organization can handle -- since this can lead

to slower data velocities.

Variety

The next V in the 5 V's of big data is variety. Variety refers to the diversity

of data types. An organization might obtain data from a number of different data

sources, which may vary in value. Data can come from sources in and outside an

enterprise as well. The challenge in variety concerns the standardization and

distribution of all data being collected.

Collected data can be unstructured, semi-structured or structured in nature.

Unstructured data is data that is unorganized and comes in different files or

formats. Typically, unstructured data is not a good fit for a mainstream relational


database because it doesn't fit into conventional data models. Semi-structured

data is data that has not been organized into a specialized repository but has

associated information, such as metadata. This makes it easier to process than

unstructured data. Structured data, meanwhile, is data that has been organized

into a formatted repository. This means the data is made more addressable for

effective data processing and analysis.

Veracity

Veracity is the fourth V in the 5 V's of big data. It refers to the quality and

accuracy of data. Gathered data could have missing pieces, may be inaccurate

or may not be able to provide real, valuable insight. Veracity, overall, refers to the

level of trust there is in the collected data.

Data can sometimes become messy and difficult to use. A large amount of data

can cause more confusion than insights if it's incomplete. For example,

concerning the medical field, if data about what drugs a patient is taking is

incomplete, then the patient's life may be endangered.

Both value and veracity help define the quality and insights gathered from data.

Value

The last V in the 5 V's of big data is value. This refers to the value that big data

can provide, and it relates directly to what organizations can do with that

collected data. Being able to pull value from big data is a requirement, as the


value of big data increases significantly depending on the insights that can be

gained from them.

Organizations can use the same big data tools to gather and analyze the data,

but how they derive value from that data should be unique to them.

Why Is Big Data Important?

The importance of big data doesn’t simply revolve around how much data you

have. The value lies in how you use it. By taking data from any source and

analyzing it, you can find answers that  1) streamline resource management, 2)

improve operational efficiencies, 3) optimize product development, 4) drive new

revenue and growth opportunities and 5) enable smart decision making. When

you combine big data with high-performance analytics, you can accomplish

business-related tasks such as:

 Determining root causes of failures, issues and defects in near-real time.

 Spotting anomalies faster and more accurately than the human eye.

 Improving patient outcomes by rapidly converting medical image data into

insights.

 Recalculating entire risk portfolios in minutes.

 Sharpening deep learning models' ability to accurately classify and react

to changing variables.

 Detecting fraudulent behavior before it affects your organization.


Companies use big data in their systems to improve operations, provide better

customer service, create personalized marketing campaigns and take other

actions that, ultimately, can increase revenue and profits. Businesses that use it

effectively hold a potential competitive advantage over those that don't because

they're able to make faster and more informed business decisions.

For example, big data provides valuable insights into customers that companies

can use to refine their marketing, advertising and promotions in order to increase

customer engagement and conversion rates. Both historical and real-time data

can be analyzed to assess the evolving preferences of consumers or corporate

buyers, enabling businesses to become more responsive to customer wants and

needs.

Big data is also used by medical researchers to identify disease signs and risk

factors and by doctors to help diagnose illnesses and medical conditions in

patients. In addition, a combination of data from electronic health records, social

media sites, the web and other sources gives healthcare organizations and

government agencies up-to-date information on infectious disease threats or

outbreaks.

Here are some more examples of how big data is used by organizations:

 In the energy industry, big data helps oil and gas companies identify

potential drilling locations and monitor pipeline operations; likewise, utilities

use it to track electrical grids.


 Financial services firms use big data systems for risk management

and real-time analysis of market data.

 Manufacturers and transportation companies rely on big data to manage

their supply chains and optimize delivery routes.

 Other government uses include emergency response, crime prevention

and smart city initiatives.

Columnar Databases

A column-oriented DBMS or columnar DBMS is a database management

system (DBMS) that stores data tables by column rather than by row. Practical

use of a column store versus a row store differs little in the relational

DBMS world. Both columnar and row databases can use traditional database

query languages like SQL to load data and perform queries. Both row and

columnar databases can become the backbone in a system to serve data for

common extract, transform, load (ETL) and data visualization tools. However, by

storing data in columns rather than rows, the database can more precisely

access the data it needs to answer a query rather than scanning and discarding

unwanted data in rows.

A columnar database stores data by columns rather than by rows, which makes it

suitable for analytical query processing, and thus for data warehouses.


A columnar database is optimized for fast retrieval of columns of data, typically in

analytical applications. Column-oriented storage for database tables is an

important factor in analytic query performance because it drastically reduces the

overall disk I/O requirements, and reduces the amount of data you need to load

from disk.

Like other NoSQL databases, column-oriented databases are designed to scale

“out” using distributed clusters of low-cost hardware to increase throughput,

making them ideal for data warehousing and Big Data processing.

A columnar database stores data of each column independently. This allows to

read data from disks only for those columns that are used in any given query.

The cost is that operations that affect whole rows become proportionally more

expensive. The synonym for a columnar database is a column-oriented database

management system. ClickHouse is a typical example of such a system.

Key columnar database advantages are:

 Queries that use only a few columns out of many.

 Aggregating queries against large volumes of data.

 Column-wise data compression.


Columnar Database example

In a columnar database, all the values in a column are physically grouped

together. For example, all the values in column 1 are grouped together; then all

values in column 2 are grouped together; etc. The data is stored in record order,

so the 100th entry for column 1 and the 100th entry for column 2 belong to the

same input record. This enables individual data elements, such as customer

name to be accessed in columns as a group, rather than individually row-by-row.

Here is an example of a simple database table with four columns and three rows.

In a columnar DBMS, the data would be stored like this:

0411,0412,0413;Moriarty,Richards,Diamond;Angela,Jason,Samantha;52.35,325.82,25.50.

In a row-oriented DBMS, the data would be stored like this:

0411,Moriarty,Angela,52.35;0412,Richards,Jason,325.82;0413,Diamond,Samantha,25.50.
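A minimal Python sketch of the same contrast, using the sample rows above: the column store groups all values of one column together, so a query that needs only one column can skip the rest.

rows = [
    ("0411", "Moriarty", "Angela",    52.35),
    ("0412", "Richards", "Jason",    325.82),
    ("0413", "Diamond",  "Samantha",  25.50),
]

# Row-oriented layout: each record is kept together.
row_store = list(rows)

# Column-oriented layout: all values of one column are grouped together, in
# record order, so the 2nd entry of every list belongs to the same input record.
column_store = {
    "account":    [r[0] for r in rows],
    "last_name":  [r[1] for r in rows],
    "first_name": [r[2] for r in rows],
    "amount":     [r[3] for r in rows],
}

total_amount = sum(column_store["amount"])   # touches only the "amount" column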

NoSQL


Short for “Not only SQL,” NoSQL is a class of database technology used to store

and access textual and other unstructured data, using more flexible structures

than the rows and columns format of relational databases. The major purpose of

using a NoSQL database is for distributed data stores with humongous data

storage needs. NoSQL is used for Big data and real-time web apps. For

example, companies like Twitter, Facebook and Google collect terabytes of user

data every single day. Carlo Strozzi introduced the NoSQL concept in 1998.

NoSQL databases (aka "not only SQL") are non-tabular databases and store

data differently than relational tables. NoSQL databases come in a variety of

types based on their data model. The main types are document, key-value, wide-

column, and graph. They provide flexible schemas and scale easily with large

amounts of data and high user loads.
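As a small illustration of the flexible, document-style data model, the sketch below uses plain Python dictionaries to stand in for documents; a real document database such as MongoDB stores and queries JSON-like documents of this shape natively.

# Two "documents" in the same collection need not share the same fields.
customers = [
    {"_id": 1, "name": "Angela Moriarty", "city": "Caloocan",
     "orders": [{"sku": "P-110", "qty": 2}]},
    {"_id": 2, "name": "Jason Richards",
     "loyalty_tier": "gold"},          # a different shape, with no schema change needed
]

# A simple query: customers that have at least one order.
with_orders = [c for c in customers if c.get("orders")]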

Why NoSQL?

The concept of NoSQL databases became popular with Internet giants like

Google, Facebook, Amazon, etc. who deal with huge volumes of data. The

system response time becomes slow when you use RDBMS for massive

volumes of data.

To resolve this problem, we could “scale up” our systems by upgrading our

existing hardware. This

process is expensive. 

The alternative for this issue

is to distribute database load


on multiple hosts whenever the load increases. This method is known as “scaling

out.” 

NoSQL databases are non-relational, so they scale out better than relational databases, as they are designed with web applications in mind.

Brief history of NoSQL databases

NoSQL databases emerged in the late 2000s as the cost of storage dramatically

decreased. Gone were the days of needing to create a complex, difficult-to-

manage data model in order to avoid data duplication. Developers (rather than

storage) were becoming the primary cost of software development, so NoSQL

databases optimized for developer productivity.

As storage costs rapidly decreased, the amount of data that applications needed

to store and query increased. This data came in all shapes and sizes

— structured, semi-structured, and polymorphic — and defining the schema in

advance became nearly impossible. NoSQL databases allow developers to store

huge amounts of unstructured data, giving them a lot of flexibility.

Additionally, the Agile Manifesto was rising in popularity, and software engineers

were rethinking the way they developed software. They were recognizing the

need to rapidly adapt to changing requirements. They needed the ability to iterate


quickly and make changes throughout their software stack — all the way down to

the database. NoSQL databases gave them this flexibility.

Cloud computing also rose in popularity, and developers began using public

clouds to host their applications and data. They wanted the ability to distribute

data across multiple servers and regions to make their applications resilient, to

scale out instead of scale up, and to intelligently geo-place their data. Some

NoSQL databases like MongoDB provide these capabilities.

NoSQL database features

Each NoSQL database has its own unique features. At a high level, many

NoSQL databases have the following features:

 Flexible schemas

 Horizontal scaling

 Fast queries due to the data model

 Ease of use for developers

The User-Interface

User Interface

 The means by which the user and a computer system interact, in

particular the use of input devices and software. 

 The purpose of a UI is to enable a user to effectively control a computer or

machine they are interacting with. 


 A successful user interface should be intuitive (not require training to

operate), efficient (not create additional or unnecessary friction) and user-

friendly (be enjoyable to use).

A variety of tools are available to query and analyze data stored in data

warehouses and data marts. These tools may be classified as follows: 

 Traditional query and reporting tools 

 OLAP, MOLAP, and ROLAP tools 

 Data visualization tools 

 Business performance management and dashboard tools 

 Data-mining tools 

Traditional query and reporting tools include spreadsheets, personal computer

databases, and report writers and generators.

Role of Metadata 

The first requirement for building a user-friendly interface is a set of metadata

that describes the data in the data mart in business terms that users can easily

understand. 

The metadata associated with data marts are often referred to as a “data

catalog,” “data directory,” or some similar term. Metadata serve as kind of a

“yellow pages” directory to the data in the data marts. The metadata should allow

users to easily answer questions such as the following: 


1. What subjects are described in the data mart? (Typical subjects are

customers, patients, students, products, courses, and so on.) 

2. What dimensions and facts are included in the data mart? What is the

grain of the fact table? 

3. How are the data in the data mart derived from the enterprise data

warehouse data? What rules are used in the derivation? 

4. How are the data in the enterprise data warehouse derived from

operational data? What rules are used in this derivation? 

5. What reports and predefined queries are available to view the data? 

6. What drill-down and other data analysis techniques are available? 

7. Who is responsible for the quality of data in the data marts, and to whom

are requests for changes made?

Online Analytical Processing (OLAP) Tools

A specialized class of tools has been developed to provide users with

multidimensional views of their data. Such tools also usually offer users a

graphical interface so that they can easily analyze their data. In the simplest

case, data is viewed as a simple three dimensional cube. 

Online analytical processing (OLAP) is the use of a set of query and reporting

tools that provides users with multidimensional views of their data and allows

them to analyze the data using simple windowing techniques. The term online

analytical processing is intended to contrast with the more traditional term online

transaction processing (OLTP). 


Online Analytical Processing Server (OLAP) is based on the multidimensional

data model. It allows managers and analysts to get insight into the information through fast, consistent, and interactive access. OLAP is actually

a general term for several categories of data warehouse and data mart access

tools (Dyché, 2000). 

Relational OLAP (ROLAP) tools use variations of SQL and view the database

as a traditional relational database, in either a star schema or another normalized

or denormalized set of tables. ROLAP tools access the data warehouse or data

mart directly. 

Multidimensional OLAP (MOLAP) tools load data into an intermediate

structure, usually a three- or higher-dimensional array (hypercube). We illustrate

MOLAP in the next few sections because of its popularity. It is important to note

with MOLAP that the data are not simply viewed as a multidimensional

hypercube, but rather a MOLAP data mart is created by extracting data from the

data warehouse or data mart and then storing the data in a specialized separate

data store through which data can be viewed only through a multidimensional

structure. Other, less-common categories of OLAP tools are database OLAP

(DOLAP), which includes OLAP functionality in the DBMS query language (there

are proprietary, non-ANSI standard SQL systems that do this), and hybrid OLAP

(HOLAP), which allows access via both multidimensional cubes or relational

query languages.


OLAP Operations

 Cube slicing – slicing the data cube to produce a simple two-dimensional table or view.

 Drill-down – analyzing a given set of data at a finer level of detail.

Slicing a data cube

In the Figure, this slice is for the product named shoes. The resulting table shows

the three measures (units,

revenues, and cost) for this

product by period (or month).

Other views can easily be

developed by the user by means

of simple “drag and drop”

operations. This type of operation

is often called slicing and dicing

the cube.


An example of drill-down is shown in Figure 9-22. Figure 9-22a shows a

summary report for the total sales of three package sizes for a given brand of

paper towels: 2-pack, 3-pack, and 6-pack. However, the towels come in different

colors, and the analyst wants a further breakdown of sales by color within each of

these package sizes. Using an OLAP tool, this breakdown can be easily obtained

using a “point-and-click” approach with a mouse device.

The result of the drill-down is shown in Figure 9-22b. Notice that a drill-down

presentation is equivalent to adding another column to the original report. (In this

case, a column was added for the attribute color.) 
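A rough sketch of both operations using pandas; because the shoe and paper-towel figures are not reproduced here, the numbers below are invented purely for illustration.

import pandas as pd

sales = pd.DataFrame({
    "product": ["Shoes", "Shoes", "Towels", "Towels", "Towels"],
    "month":   ["Jan",   "Feb",   "Jan",    "Jan",    "Feb"],
    "color":   ["Black", "Brown", "White",  "Blue",   "White"],
    "units":   [120,      95,      200,      150,      180],
    "revenue": [6000.0,  4750.0,  3000.0,   2250.0,   2700.0],
})

# Slice: hold the product dimension at "Shoes" to get a simple two-dimensional view.
shoe_slice = (sales[sales["product"] == "Shoes"]
              .groupby("month")[["units", "revenue"]].sum())

# Drill down: add the finer-grained "color" attribute to the grouping, which is
# equivalent to adding another column to the original report.
drill_down = sales.groupby(["product", "color"])[["units", "revenue"]].sum()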

Data Mining 

Knowledge discovery, using a sophisticated blend of techniques from traditional

statistics, artificial intelligence, and computer graphics.

It is the process of finding patterns and correlations within large data sets to

identify relationships between data. Data mining tools allow a business

organization to predict customer behavior. Data mining tools are used to build

risk models and detect fraud. Data mining is used in market analysis and

management, fraud detection, corporate analysis and risk management.

The goals of data mining are threefold: 

1. Explanatory To explain some observed event or condition, such as why

sales of pickup trucks have increased in Colorado


2. Confirmatory To confirm a hypothesis, such as whether two-income

families are more likely to buy family medical coverage than single-income

families 

3. Exploratory To analyze data for new or unexpected relationships, such as

what spending patterns are likely to accompany credit card fraud

Business Performance Management

Business Performance Management (BPM) refers to the mechanisms

companies put in place to measure performance and communicate results

internally and externally. The goal of BPM is to use current and historical

performance data to improve future performance and decision making.

A business performance management (BPM) system allows managers to

measure, monitor, and manage key activities and processes to achieve

organizational goals. Dashboards are often used to provide an information

system in support of BPM.Dashboards, just as those in a car or airplane cockpit,

include a variety of displays to show different aspects of the organization. Often

the top dashboard, an executive dashboard,

is based on a balanced scorecard, in which

different measures show metrics from

different processes and disciplines, such as

operations efficiency, financial status,

customer service, sales, and human


resources. Each display of a dashboard will address different areas in different

ways.

For example, Figure 9-25 is a simple dashboard for one financial measure,

revenue. The left panel shows dials about revenue over the past three years,

with needles indicating where these measures fall within a desirable range. Other

panels show more details to help a manager find the source of out-of-tolerance

measures.

Data Visualization 

Data visualization is the representation of data in graphical and multimedia

formats for human analysis. Benefits of data visualization include the ability to

better observe trends and patterns and to identify correlations and clusters. Data

visualization is often used in conjunction with data mining and other analytical

techniques.

In essence, data visualization is a way to show multidimensional data not as

numbers and text but as graphs. Thus, precise values are often not shown, but

rather the intent is to more readily show relationships between the data.


CHAPTER 9:

DATA WAREHOUSING- 

MODERN PRINCIPLES AND

METHODOLOGIES

Researched and presented by:

Donnabelle M. Durante
Shiela Mae E. Rosano


What Is a Decision Support System?


A decision support system (DSS) is a computerized program used to

support determinations, judgments, and courses of action in an organization or a

business. A DSS sifts (filters) through and analyzes massive amounts of data,

compiling comprehensive (complete) information that can be used to solve

problems in decision-making.

Typical information used by a DSS includes target or projected revenue,

sales figures or past ones from different time periods, and other inventory- or

operations-related data.

A decision support system gathers and analyzes data, synthesizing

(combining) it to produce comprehensive information reports. In this way, as an

informational application, a DSS differs from an ordinary operations application,

whose function is just to collect data. The DSS can either be completely

computerized or powered by humans. In some cases, it may combine both. The

ideal systems analyze information and actually make decisions for the user. At

the very least, they allow human users to make more informed decisions at a

quicker pace.

DSS Primary Purpose

The primary purpose of using a DSS is to present information to the

customer in an easy-to-understand way. A DSS system is beneficial because it

can be programmed to generate many types of reports, all based on user

specifications. For example, the DSS can generate information and output its


information graphically, as in a bar chart that represents projected revenue or as

a written report. As technology continues to advance, data analysis is no longer

limited to large, bulky mainframe computers. Since a DSS is essentially an

application, it can be loaded on most computer systems, whether on desktops or

laptops. Certain DSS applications are also available through mobile devices. The

flexibility of the DSS is extremely beneficial for users who travel frequently. This

gives them the opportunity to be well-informed at all times, providing the ability to

make the best decisions for their company and customers on the go or even on

the spot.

FIVE CATEGORIES OF DSS

1. Communication-driven

Its purpose is to help conduct a meeting or to let users

collaborate. The most common technology used to deploy the DSS is a

web or client server. 

Example:

 Chats and instant messaging software such as Messenger, online collaboration and net meeting systems using Google Meet or Zoom

2. Data-driven 

Most data-driven DSSs are targeted at managers, staff and also

product/service suppliers. It is used to query a database or data


warehouse to seek specific answers for specific purposes. It is deployed

via a main frame system, client/server link, or via the web.

 Example: Computer-based databases that have a query system, in particular a GIS

 A geographic information system (GIS) is a computer system for

capturing, storing, checking, and displaying data related to

positions on Earth’s surface. GIS can use any information that

includes location, data about people such as population, income,

education level and information about landscape, different kinds

of soil and so much more.

GIS is not limited just to geologists who

study earthquake faults. Many retail businesses use GIS to help

them determine where to locate a new store. Marketing companies

use GIS to decide to whom to market stores and restaurants, and

where that marketing should be.  

3. Document-driven 

Document-driven DSS is a computerized support system that

integrates a variety of storage and processing technologies to provide

document retrieval and analysis. It is intended to assist in decision

making. 


Examples:  Policies and procedures, product specifications, catalogs and

corporate historical documents, including minutes of meetings, corporate

records, and important correspondence. 

4. Knowledge-driven 

Knowledge-driven DSSs are a catch-all (covering a variety of things) category spanning a broad range of systems whose users are typically within the organization setting them up, but which may also include others interacting with the

organization. These systems contain specialized problem-solving

expertise wherein the “expertise” consists of knowledge about a particular

domain. 

Example: 

TaxAct is a system that supports online tax filing. It contains

information (tips) that can help one to improve his or her tax outcome and

financial wellness.

5. Model-driven

  In general, model-driven DSS use complex financial, simulation,

optimization or multi-criteria models to provide decision support. Model-

driven DSS use data and parameters provided by decision makers to aid

them in analyzing a particular situation. In other words, it is a system that

provides managers with models and analysis capabilities that can be used

during the process of making a decision.


Example: Optimization Spreadsheet DSS, wherein the decision variables

are the quantities of TVs, stereos and speakers to build. The objective

function is to maximize total profits. The constraints are from the parts

inventory. Managers should be able to determine the best way to use the

resources. Managers need to determine what “best” means, but usually it

implies maximizing profits or minimizing costs. Optimization may be

incorporated in a DSS used routinely in a firm or a management scientist

may build an optimization model for a special decision support study.
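A minimal sketch of such a product-mix model using scipy.optimize.linprog; the per-unit profits and parts-inventory figures below are hypothetical and are not taken from the text.

from scipy.optimize import linprog

# Decision variables: quantities of TVs, stereos, and speakers to build.
profit = [75, 50, 35]                 # hypothetical profit per unit
c = [-p for p in profit]              # linprog minimizes, so negate to maximize

# Hypothetical parts-inventory constraints (parts used per unit <= parts on hand).
A_ub = [
    [1, 1, 0],                        # chassis
    [1, 0, 0],                        # picture tubes (TVs only)
    [2, 2, 1],                        # speaker cones
    [2, 1, 1],                        # electronics boards
]
b_ub = [450, 250, 800, 600]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3, method="highs")
tvs, stereos, speakers = res.x
print(round(-res.fun, 2), "maximum total profit")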

The four (4) Modern Data Warehouse Architectures



What is modern data architecture?

Data architecture is the structure (how data is being organized) of your

data assets (anything valuable) developed with a vision of how those assets and

your information systems will inevitably interact with one another. This includes

planning how data in a system will be created, processed, stored, and

transmitted. 

Over time, data architecture has undergone several paradigm shifts

related to new technologies and business demands. Modern data architecture as

we know it has been significantly impacted by the concurrent evolution of big

data, machine learning or Artificial Intelligence, and cloud computing platforms. In

other words, modern data architecture is designed proactively with scalability

(system's ability to handle a growing amount of work) and flexibility in mind,

anticipating complex data needs.

Companies are increasingly moving towards cloud-based data

warehouses instead of traditional on-premise systems that involve the use of

physical servers (computers) located on-site and owned, managed and

maintained by your organization. This is largely because cloud-based data warehouses are quicker and cheaper to set up. There is no need to spend more

to purchase physical hardware, maintain and upgrade hardware, in addition to

running necessary systems such as power and cooling. And lastly, cloud-based

data warehouse architectures can typically perform complex analytical queries

much faster because they use massively parallel processing (MPP), a term that


means using a large number of computer processors to simultaneously perform a

set of coordinated computations in parallel.

FOUR (4) MODERN DATA WAREHOUSE ARCHITECTURES 

1. Multiple Parallel Processing (MPP) Architectures

MPP architecture enables vast scale and distributed computing, a model in which components of a software system are shared among multiple computers. MPP basically uses a "shared-nothing" design: there are numerous physical nodes, each running its own instance or task. This makes it much faster in terms of performance compared to traditional architectures.

Example: Amazon Redshift

Amazon Redshift uses MPP architecture, breaking up large data sets into chunks which are assigned to slices

within each node. Queries perform faster because the compute nodes process

queries in each slice simultaneously. The Leader Node aggregates the results

and returns them to the client application.

Client applications, such as analytics tools, can directly connect to

Redshift using open source PostgreSQL JDBC and ODBC drivers. Analysts can

thus perform their tasks directly on the Redshift data.

Amazon Redshift requires computing resources to be provisioned and set

up in the form of clusters, which contain a collection of one or more nodes. Each

node has its own CPU, storage, and RAM. A leader node compiles queries and

transfers them to compute nodes, which execute the queries.

On each node, data is stored in chunks, called slices. Redshift uses

a columnar storage, meaning each block of data contains values from a single

column across a number of rows, instead of a single row with values from

multiple columns.
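As a small illustration of how an analytics client can query Redshift directly, the sketch below uses the third-party psycopg2 PostgreSQL driver (Redshift speaks the PostgreSQL wire protocol). The cluster endpoint, credentials, and the sales table are placeholders, not real values.

# Minimal sketch: querying a Redshift cluster from an analytics client,
# assuming a placeholder endpoint, credentials, and a hypothetical "sales" table.
import psycopg2  # third-party PostgreSQL driver

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,            # Redshift's default port
    dbname="dev",
    user="analyst",
    password="***",
)

with conn.cursor() as cur:
    # The leader node compiles this query and distributes it to the
    # compute nodes, which scan their slices in parallel.
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region;")
    for region, total in cur.fetchall():
        print(region, total)

conn.close()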

2. Multi-Structured Data 

Interprets Big Data (data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency) together with an Analytics Infrastructure (a concept that comprises many technologies and services supporting the essential process of extracting value from data) for multiple data stores


with a polyglot persistence strategy. A polyglot persistence database is

used when it is necessary to solve a complex problem by breaking that

problem into segments and applying different database models. 

Example: 

An e-commerce website which sells products online (Shopee, Lazada) will use a NoSQL store for storing the session state (a record of what users do while browsing the app) of users shopping on the website, while the payment system, which captures the credit card information, persists it to a relational database like Oracle. In a similar fashion you can

implement different services to use different data stores and avoid building

a monolith (single massive) application where one database failure can

lead to the entire business going down. The need for polyglot data stores

is not just for high availability but also for scalability demands of an

internet-scale application.
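The following minimal sketch illustrates the polyglot persistence idea in Python. It assumes a local Redis server for the session state and uses SQLite as a stand-in for the relational payment store; the keys, table, and values are illustrative only.

# Minimal sketch of polyglot persistence, assuming a Redis instance for
# session state and SQLite standing in for the relational payment store.
import json
import sqlite3
import redis  # third-party client

# NoSQL key-value store: fast, schema-less session state with an expiry.
sessions = redis.Redis(host="localhost", port=6379)
sessions.setex("session:42", 1800, json.dumps({"cart": ["sku-1", "sku-7"]}))

# Relational store: payments need ACID guarantees and a fixed schema.
payments = sqlite3.connect("payments.db")
payments.execute(
    "CREATE TABLE IF NOT EXISTS payment (id INTEGER PRIMARY KEY, "
    "order_id TEXT, amount REAL, card_token TEXT)"
)
payments.execute(
    "INSERT INTO payment (order_id, amount, card_token) VALUES (?, ?, ?)",
    ("ORD-42", 1299.00, "tok_abc"),
)
payments.commit()
payments.close()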

3. Lambda Architecture

Lambda architecture proposes a simpler, elegant paradigm that is designed to tame complexity while being able to store and effectively process large amounts of data. In the context of big data scenarios, Lambda architecture is a

frequently used form of architecture in IT system landscapes when it comes to

reconciling the requirements of two different user groups. On the one hand, there

are users who have always had to process and evaluate data of high quality.

These are usually enriched with additional, calculated key figures. The “classic”

users need the data for specific key dates in departments such as reporting,

accounting, risk or controlling. On the other hand, there are users with a short-

term need for information who have to react quickly to events. This can be the

defective ATM for the maintenance technician, but also the next boycott call for a

certain company in the social media for a stock trader.

Lambda architecture is used to solve the problem of computing

arbitrary functions. The lambda architecture itself is composed of 3 layers:

3.1. Batch Layer 

New data comes continuously, as a feed to the data system. It gets

fed to the batch layer and the speed layer simultaneously. It looks at all

the data at once and eventually corrects the data in the stream layer. 

Here we can find lots of ETL and a traditional data warehouse. This layer

is built using a predefined schedule, usually once or twice a day. The

batch layer has two very important functions:

 To manage the master dataset (data about the business

entities that provide context for business transactions)

 To pre-compute (initial computation) the batch views. 


3.2. Serving Layer

The outputs from the batch layer in the form of batch views and

those coming from the speed layer in the form of near real-time views

(users see data that is only a few seconds old) get forwarded to the serving layer. This layer indexes the batch views so that they can be queried with low latency on an ad-hoc basis.

3.3. Speed Layer 

This layer handles the data that are not already delivered in the

batch view due to the latency of the batch layer. In addition, it only deals

with recent data in order to provide a complete view of the data to the user

by creating real-time views.

The query application reads data from the text file where the batch

layer stored its results. It combines and then sorts the data. 

Example: 


This is a Lambda Architecture implementation that focuses on a few common tools, namely Hive, Spark, and Kafka. The pre-system is an SAP Bank Analyzer 9 on a

HANA database.

The program (1) for loading the market data receives JSON files from the

ECB Statistical Data Warehouse via a REST call. These files are then parsed

(analyzed) to extract and re-bundle the relevant data. Using the Kafka Java API, a Kafka producer is implemented, which writes the data, formatted as a JSON string, into a Kafka topic.

 (2) Since only the latest version of the market data is needed, such a

topic is an easy-to-use key-value store. Of course, this step can also be done

directly in Spark and you can also skip the caching of the data in Kafka Topics.

However, the focus was to test as many interfaces as possible with a simple use

case. In addition, the traceability of older calculations is ensured in this way.

The main program for loading cash flows (3) was developed using the

Spark-Java-API. Two versions of the program were created for this purpose, one

for stream processing and a second for batch processing. Thanks to the

possibility to use Spark-Streaming for batch processing via the trigger setting

“One-Time-Micro-Batch”, the implementation and maintenance effort is limited.

Most of the code can be used for both cases. The processing mode is simply

selected as needed via a configuration file. Such a single processing brings all

known advantages of the Spark streaming library, such as the automatic

recovery of the query, based on the created checkpoints, in case of an unintentional system shutdown or crash. In addition, however, all advantages of batch processing

are retained, such as the reduction of costs through targeted cluster startup and

shutdown.

Using the Spark-API, the HANA database (4) is accessed and the latest

record is retrieved. The recognition runs over a column with a continuous integer

of the datatype Long, which is generated from the timestamp of the data set. This

detour had to be taken because the SAP timestamp is not compatible with the

Spark timestamp in this case. The loading is then done in so-called microbatch

requests, which are sent to the HANA DB at certain time intervals and retrieve all

data since the last microbatch by querying the number just described. This

process is done and managed automatically by Spark. In the case of a

conventional Spark batch retrieval, all data from the last processed time stamp

would be retrieved, but would then have to be managed and stored by the user.

The Spark Streaming API does this automatically using the checkpoint files, as

explained above.

The latest market data is directly loaded from the aforementioned Kafka

Topic (5) via the Spark-Kafka implementation and is provided to the FTP library

for discounting cash flows. The library interpolates the grid points of the yield

curve to the due dates of the cash flow and discounts the cash flow accordingly.

The resulting dataframe is then checked with the help of a delta library for

changes of records already available in Hive and applies a filter if necessary. For

this purpose, the contents of the relevant fields are hashed and compared with


the values in the target table. If there is a match, the corresponding row is filtered

out of the dataframe. 

The result including the hash values is written to a partitioned hive table (6) by

Spark. The partitioning by month and year helps to keep the performance of

reading the data for the delta comparison as high as possible.
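A minimal sketch of the "One-Time-Micro-Batch" idea described above is shown below, using PySpark's Structured Streaming with the Kafka connector. The broker address, topic name, and file paths are placeholders, and this is only an outline of the pattern, not the actual project code.

# Minimal sketch of a streaming query that can also run as a single micro-batch,
# assuming PySpark with the Kafka connector; topic and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lambda-cashflow-sketch").getOrCreate()

# Read the market-data topic; each record value is a JSON string.
market = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribe", "market-data")                 # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    .select(col("value").cast("string").alias("json"))
)

# trigger(once=True) runs the same streaming query as a single micro-batch,
# and checkpointing lets the query recover automatically after a crash.
query = (
    market.writeStream.format("parquet")
    .option("path", "/data/market")                     # placeholder
    .option("checkpointLocation", "/chk/market")        # placeholder
    .trigger(once=True)
    .start()
)
query.awaitTermination()

Because the same streaming query can be run continuously or triggered once, one code path covers both the speed layer and the batch layer, which is the maintenance advantage described above.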

4. Hybrid Architecture

Utilize existing on-premises data structures. Hybrid architecture is a combination of on-premises sources and cloud sources. For most companies, a hybrid cloud is certainly an essential component of cloud adoption. Therefore, selecting the right cloud sources as part of a well-planned Hybrid Integration Platform strategy benefits your company, with business benefits as the end goal.

Use Cloud services for Advanced Analytics. For instance, the

architecture of a hybrid cloud typically includes an Infrastructure-as-a-

Service (IaaS) platform. IaaS is one of the three main categories of cloud

computing services, alongside software as a service (SaaS) and platform

as a service (PaaS), that provides virtualized computing resources over

the internet.

The main Infrastructure-as-a-Service platforms are Amazon Web

Services (AWS), Microsoft Azure and Google Cloud platform. A private

cloud is one in which resources are dedicated to a single organization; these can be hosted on premises or off premises. Lastly, hybrid cloud management requires a wide area

network (a telecommunications network that extends over a large

geographic area for the primary purpose of computer networking) to

connect the public and private clouds.

The Main Infrastructure-as-a-Service Platform:


The Data Staging and ETL

The data staging area sits between the data source(s) and the data target(s), which are often data warehouses, data marts, or other data repositories. In other words, it is a temporary storage area

between the data sources and a data warehouse. 


Data staging areas are often transient (temporary) in nature, with their contents being erased prior to running an ETL process or immediately following successful completion of an ETL process.

ETL is a process of data integration that encompasses three steps —

extraction, transformation, and loading. In a nutshell, an ETL system takes large

volumes of raw data from multiple sources, converts it for analysis, and loads

that data into your warehouse. 

THE ETL PROCESS

Extraction

In the first step, extracted data sets come from a source, say for example

from SQL server into a staging area. The staging area acts as a buffer between

the data warehouse and the source data. Since data may be coming from

multiple different sources, it is likely in various formats and directly transferring

the data to the warehouse may result in corrupted data. The staging area is used

for data cleansing and organization.


Transformation

The data cleaning and organization stage is the transformation stage. All of

that data from multiple source systems will be normalized and converted to a

single system format — improving data quality and compliance. ETL yields

transformed data through different methods such as cleaning, filtering, joining,

sorting, splitting, deduplication and summarization.

Loading

Finally, data that has been extracted to a staging area and transformed is

loaded into your data warehouse. Depending upon your business needs, data

can be loaded in batches or all at once. The exact nature of the loading will

depend upon the data source, ETL tools, and various other factors.
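The three ETL steps can be illustrated with a minimal Python sketch. It uses SQLite files as stand-ins for the source system and the warehouse; the table and column names are illustrative only.

# Minimal end-to-end ETL sketch, assuming a SQLite source database and a
# SQLite "warehouse" as stand-ins; table and column names are hypothetical.
import sqlite3

source = sqlite3.connect("orders_source.db")
warehouse = sqlite3.connect("warehouse.db")

# Extract: pull raw rows from the source system into a staging list.
staged = source.execute(
    "SELECT customer_name, order_date, amount FROM raw_orders"
).fetchall()

# Transform: cleanse and normalize in the staging area
# (trim names, standardize dates, drop rows with missing amounts).
cleaned = [
    (name.strip().title(), date[:10], float(amount))
    for name, date, amount in staged
    if amount is not None
]

# Load: write the transformed rows into the warehouse table.
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS fact_orders "
    "(customer_name TEXT, order_date TEXT, amount REAL)"
)
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
warehouse.commit()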

Multidimensional Model

A multidimensional model views data in the form of a data-cube. A data

cube enables data to be modeled and viewed in multiple dimensions. It is defined

by dimensions and facts. The dimensions are the perspectives or entities

concerning which an organization keeps records. 

For example, a shop may create a sales data warehouse to keep records

of its sales for the dimensions time, item, and location. These dimensions allow the shop to keep track of things such as monthly sales of items and the locations


at which the items were sold. Each dimension has a table related to it, called a

dimensional table.

Consider the data of a shop for items sold per quarter in the city of Delhi.

The data is shown in the table. In this 2D representation, the sales for Delhi are

shown for the time dimension (organized in quarters) and the item dimension

(classified according to the types of item sold). The fact or measure displayed is rupees_sold (in thousands).


Now, suppose we want to view the sales data with a third dimension: for example, the data according to time and item, as well as location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data

are shown in the table. The 3D data of the table are represented as a series of

2D tables.


Conceptually, it may also be represented by the same data in the form of

a 3D data cube, as shown in fig:
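A minimal sketch of the same idea using the third-party pandas library is shown below; the sales figures are made up. Each pivot table is one 2-D slice of the cube, and grouping by all three dimensions gives the cells of the 3-D cube.

# Minimal data-cube sketch with made-up sales figures (in thousands of rupees)
# for the time, item, and location dimensions.
import pandas as pd

sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["TV", "Phone", "TV", "Phone", "TV", "Phone"],
    "location": ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "rupees_sold": [605, 825, 680, 952, 818, 931],
})

# One 2-D slice of the cube: time x item for the city of Delhi.
delhi_slice = sales[sales["location"] == "Delhi"].pivot_table(
    index="quarter", columns="item", values="rupees_sold", aggfunc="sum")
print(delhi_slice)

# Grouping by all three dimensions gives the cells of the 3-D data cube.
print(sales.groupby(["location", "quarter", "item"])["rupees_sold"].sum())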

Benefits of Using Multidimensional Solutions

The primary reason for building an Analysis Services multidimensional

model is to achieve fast query performance against business data. A

multidimensional model is composed of cubes and dimensions that can be

annotated and extended to support complex query constructions. BI developers

create cubes to support fast response times, and to provide a single data source

for business reporting. Given the growing importance of business intelligence

across all levels of an organization, having a single source of analytical data

ensures that discrepancies are kept to a minimum, if not eliminated entirely.


Another important benefit to using Analysis Services multidimensional

databases is integration with commonly used BI reporting tools such as Excel,

Reporting Services, and PerformancePoint, as well as custom applications and

third-party solutions.

META-DATA

Metadata can be explained in a few ways:

 Data that provide information about other data.

 Metadata summarizes basic information about data, making finding &

working with particular instances of data easier.

 Metadata can be created manually to be more accurate, or automatically

and contain more basic information.

In short, metadata is important. I like to answer this "what is metadata"

question as such: metadata is a shorthand representation of the data to which

they refer. If we use analogies, we can think of metadata as references to data.

Think about the last time you searched Google. That search started with the

metadata you had in your mind about something you wanted to find. You may

have begun with a word, phrase, meme, place name, slang or something else.

The possibilities for describing things seem endless. Certainly metadata schema

can be simple or complex, but they all have some things in common.

EXAMPLE


A simple example of metadata for a document might include a collection of

information like the author, file size, the date the document was created, and

keywords to describe the document. Metadata for a music file might include the

artist's name, the album, and the year it was released.
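A minimal sketch of such metadata as simple key/value pairs is shown below; the field names and values are illustrative only.

# Minimal sketch of document and music-file metadata as key/value pairs;
# the field names and values are hypothetical.
document_metadata = {
    "author": "J. Dela Cruz",
    "file_size_kb": 482,
    "created": "2021-03-15",
    "keywords": ["budget", "quarterly", "draft"],
}

music_metadata = {
    "artist": "Sample Artist",
    "album": "Sample Album",
    "year_released": 1999,
}

# A search tool can match on the metadata without opening the files themselves.
query = "budget"
if query in document_metadata["keywords"]:
    print("Document matches the search term:", query)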

4 Stages of Data Warehouses

Stage 1: Offline Database

In their earliest stages, many companies have databases. The data is forwarded from the day-to-day operational systems to an external server for storage. Unless extrapolated and manually analyzed, this data sits where it is and does not impact ongoing business functions. Transactions such as loading or processing of data have no effect from an operational standpoint.

Offline Database, lets users search for numbers even without being

connected to the Internet. - 

https://glosbe.com 


 Offline Operational Database: This is the initial stage where data is simply copied to a server from an operational system. It is done so that data loading, processing, and reporting do not affect the performance of the operational system.

Stage 2: Offline Data Warehouse


While not entirely up-to-date, offline Data Warehouses regularly update their

content from existing operational systems. By emphasizing reporting-oriented

data structures, the organized data meets the particular objectives of the Data

Warehouse.

 Offline Data Warehouse: In this stage, all the data warehouses are

updated on a regular time cycle from the operational database to get

actionable business insights.

Stage 3: Real-time Data Warehouse

Real-time data warehouses gather information through event-based triggers in operational systems. Often, these come in the form of transactions such as

airline bookings or bank balances.

 Real-time Data Warehouse: In this stage, data warehouses are updated

based on transaction or event basis. Whenever a transaction takes place

in an operational database, it is updated in the data warehouse.

Stage 4: Integrated Data Warehouse

In the Integrated Data Warehouse, daily activities are passed back to the operational system continuously. Integrated Data Warehouses are the ideal data warehouse stage, with the data not just readily available but also updated and accurate.


 Integrated Data Warehouse: This is the final stage where all the

transactions which are used daily by the organization are passed back into

the operational system. Each transaction that takes place in the

operational database is updated in the warehouse simultaneously.

Accessing Data Warehouses

Storage is a fairly simple choice. You can host your data warehouse on-

premises, in the cloud, or use a hybrid approach. On-premises hosting

is, according to some, on its way out. Cloud hosting is much cheaper and more

flexible because you’re renting space on another’s server. You don’t need to run

maintenance, you can expand and cut back as needed, and there is an ever-

expanding set of features added each year. Bridging the gap between these two

approaches is hybrid hosting, which, as we mentioned before, is the preferred

choice for companies migrating from on-premises to cloud hosting.

To get data into your data warehouse, you need to use a type

of software commonly called ETL software. Extract, transform, load (ETL) is a

process where the data is extracted, made ready for use, then loaded into the

data warehouse.

Of course, data warehouses don’t run themselves. Labor is a significant part of

keeping a data warehouse running because it’s not just a system; it’s a “full-

fledged…architecture” that requires experts to set up and manage.

What is OLAP?


OLAP (Online Analytical Processing) was introduced into the business

intelligence (BI) space over 20 years ago, in a time where computer hardware

and software technology weren’t nearly as powerful as they are today. OLAP

introduced a groundbreaking way for business users (typically analysts) to easily

perform multidimensional analysis of large volumes of business data.

Aggregating, grouping, and joining data are the most difficult types of

queries for a relational database to process. The magic behind OLAP derives

from its ability to pre-calculate and pre-aggregate data. Otherwise, end users

would be spending most of their time waiting for query results to be returned by

the database.

Vendors offer a variety of OLAP products that can be grouped into three

categories: multidimensional OLAP (MOLAP), relational OLAP (ROLAP), and

hybrid OLAP (HOLAP). Here is a breakdown of the differences between them. 

What is ROLAP?

ROLAP stands for Relational Online Analytical Processing. ROLAP stores

data in columns and rows (also known as relational tables) and retrieves the

information on demand through user submitted queries. A ROLAP database can

be accessed through complex SQL queries to calculate information. ROLAP can

handle large data volumes, but the larger the data, the slower the processing

times. 
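As a small illustration of the ROLAP approach, the sketch below runs an on-demand aggregation over relational tables, with SQLite standing in for the relational store; the star-schema tables and figures are illustrative only.

# Minimal sketch of ROLAP-style, on-demand aggregation over relational tables,
# using SQLite as a stand-in; the schema and values are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE fact_sales (item_id INTEGER, quarter TEXT, amount REAL);
    INSERT INTO dim_item VALUES (1, 'TV'), (2, 'Phone');
    INSERT INTO fact_sales VALUES (1, 'Q1', 605), (2, 'Q1', 825),
                                  (1, 'Q2', 680), (2, 'Q2', 952);
""")

# ROLAP computes the aggregate at query time rather than pre-summarizing it.
rows = db.execute("""
    SELECT d.item_name, f.quarter, SUM(f.amount) AS total_sales
    FROM fact_sales f JOIN dim_item d ON d.item_id = f.item_id
    GROUP BY d.item_name, f.quarter
""").fetchall()
print(rows)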

What is MOLAP?

MOLAP stands for Multidimensional Online Analytical Processing. MOLAP

uses a multidimensional cube that accesses stored data through various


combinations. Data is pre-computed, pre-summarized, and stored (a difference

from ROLAP, where queries are served on-demand).

 Its speedy data retrieval makes it the best for “slicing and dicing” operations.

One major disadvantage of MOLAP is that it is less scalable than ROLAP, as it

can handle a limited amount of data.

What is HOLAP?

HOLAP stands for Hybrid Online Analytical Processing. As the name

suggests, the HOLAP storage mode connects attributes of both MOLAP and

ROLAP. Since HOLAP involves storing part of your data in a ROLAP store and

another part in a MOLAP store, developers get the benefits of both. 

With this use of the two OLAPs, the data is stored in both

multidimensional databases and relational databases. The decision to access

one of the databases depends on which is most appropriate for the requested

processing application or type. This setup allows much more flexibility for

handling data. For theoretical processing, the data is stored in a multidimensional

database. For heavy processing, the data is stored in a relational database. 

Problems of Data Warehousing

The problems associated with developing and managing a data warehousing are

as follows:

Underestimation of resources of data loading

Sometimes we underestimate the time required to extract, clean, and load

the data into the warehouse. It may take a significant proportion of the total


development time, although some tools exist that help reduce the time and effort spent on this process.

Required data not captured

In some cases, data that may be very important for the data warehouse's purpose is not captured by the source systems. For example, the date of registration for a property may not be used in the source system, but it may be very important for analysis purposes.

High maintenance

Data warehouses are high-maintenance systems. Any reorganization of the business processes and the source systems may affect the data warehouse, and this results in high maintenance costs.

Data ownership

Data warehousing may change the attitude of end users toward the ownership of data. Sensitive data owned by one department has to be loaded into the data warehouse for decision-making purposes, but sometimes this results in reluctance from that department, which may hesitate to share the data with others.


CHAPTER 10:

DATA QUALITY AND INTEGRATION

Researched and presented by:

Gabotero, Stephanie S.
Tiolo, Michelle Anne M.


What Is a Data Governance?

Data governance is a set of processes and procedures aimed at

managing the data within an organization with an eye toward high-level

objectives such as availability, integrity and compliance with regulations.

Data governance oversees data access policies by measuring risk and

security exposures (Leon, 2007). Data governance is a function that has to be

jointly owned by IT and the business. Successful data governance will require

support from upper management in the firm. A key role in enabling success of

data governance in an organization is that of a data steward.

Data steward 

A person assigned the responsibility of ensuring that organizational

applications properly support the organization’s enterprise goals for data quality.

A good data steward has:

1. a strong interest in managing information as a corporate resource,

2. an in-depth understanding of the business of the organization, and 

3. good negotiation skills.

The Sarbanes-Oxley Act of 2002

 The Sarbanes-Oxley Act of 2002 has made it imperative that

organizations undertake actions to ensure data accuracy, timeliness, and

consistency (Laurent, 2005).


 The Sarbanes-Oxley Act of 2002 is a federal law that established

sweeping auditing and financial regulations for public companies.

 Lawmakers created the legislation to help protect shareholders,

employees and the public from accounting errors and fraudulent financial

practices. Auditors, accountants and corporate officers became

accountable for the new set of rules.

Establishment of a business information advisory committee consisting of

representatives from each major business unit who have the authority to make

business policy decisions can contribute to the establishment of high data quality

(Carlson, 2002; Moriarty, 1996). 

A data governance program needs to include the following:

1. Sponsorship from both senior management and business units

2. A data steward manager to support, train, and coordinate the data

stewards.

3. Data stewards for different business units, data subjects, source systems,

or combinations of these elements.

4. A governance committee, headed by one person, but composed of data

steward managers, executives and senior vice presidents, IT leadership

(e.g., data administrators), and other business leaders, to set strategic

goals, coordinate activities, and provide guidelines and standards for all

enterprise data management activities.


The goals of data governance are:

1. Transparency

2. Increasing the value of data maintained by the organization

Managing data quality

 The importance of high-quality data cannot be overstated.

 The data that serves as the foundation of these systems must be good

data, and if the data are bad—the systems fail.

 High-quality data—that is, data that are accurate, consistent, and available

in a timely fashion—are essential to the management of organizations

today.

 According to a leading provider of technology for data quality and integration, data quality is important to:

o Minimize IT project risk

 Dirty data can cause delays and extra work on

information systems projects, especially those that

involve reusing data from existing systems.

 Make timely business decisions 

o The ability to make quick and informed business

decisions is compromised when managers do not

have high-quality data or when they lack confidence

in their data.


 Ensure regulatory compliance

o Not only is quality data essential for SOX and Basel II (Europe)

compliance, quality data can also help an organization in justice,

intelligence, and antifraud activities.

 Expand the customer base 

o Being able to accurately spell a customer’s name or to accurately

know all aspects of customer activity with your organization will help

in up-selling and cross-selling new business.

Redman (2004) summarizes data quality as “fit for their intended uses in

operations, decision making, and planning.” In other words, this means that data

are free of defects and possess desirable features (relevant, comprehensive,

proper level of detail, easy to read, and easy to interpret). 

Characteristics of Quality Data (Loshin and Russom, 2006):

1. Uniqueness

- Uniqueness means that each entity exists no more than once

within the database, and there is a key that can be used to uniquely

access each entity.

2. Accuracy

- Accuracy has to do with the degree to which any datum correctly

represents the real-life object it models.

3. Consistency


- Consistency means that values for data in one data set

(database) are in agreement with the values for related data in another

data set (database).

4. Completeness 

- Completeness refers to data having assigned values if they need

to have values.

5. Timeliness

- Timeliness means meeting the expectation for the time between

when data are expected and when they are readily available for use

6. Currency

- Currency is the degree to which data are recent enough to be

useful.

7. Conformance

 Conformance refers to whether data are stored, exchanged, or

presented in a format that is as specified by their metadata.

8. Referential integrity

 Data that refer to other data need to be unique and satisfy

requirements to exist


External Data Sources

 Much of an organization's data originates outside the organization, where there is less control over the data sources to comply with expectations of the receiving organization.

Redundant data storage and inconsistent metadata

 Many organizations have allowed the uncontrolled proliferation of

spreadsheets, desktop databases, legacy databases, data marts, data

warehouses, and other repositories of data.

Data Entry Problems

 User interfaces that do not take advantage of integrity controls—such as

automatically filling in data, providing drop-down selection boxes, and

other improvements in data entry control— are tied for the number-one

cause of poor data.


Lack of Organizational Commitment

 For a variety of reasons, many organizations simply have not made the

commitment or invested the resources to improve their data quality.

Data Quality Improvement

Implementing a successful quality improvement program will require the

active commitment and participation of all members of an organization.

Get the Business Buy-In

 Data quality initiatives need to be viewed as business imperatives rather

than as an IT project.

Conduct a Data Quality Audit

 An organization without an established data quality program should begin

with an audit of data to understand the extent and nature of data quality

problems.

Establish a Data Stewardship Program

 As pointed out in the section on data governance, stewards are held

accountable for the quality of the data for which they are responsible.

Improve Data Capture Processes

 As noted earlier, lax data entry is a major source of poor data quality, so

improving data capture processes is a fundamental step in a data quality

improvement program.

For simplicity, we summarize what Inmon recommends only for the original data

capture step:


i. Enter as much of the data as possible via automatic means, not human entry.

ii. Where data must be entered manually, ensure that it is selected from

preset options.

iii. Use trained operators when possible.

iv. Follow good user interface design principles that create consistent screen layouts, easy-to-follow navigation paths, clear data entry masks and formats, minimal use of obscure codes, and so on.

v. Immediately check entered data for quality against data in the database, so use triggers and user-defined procedures liberally to make sure that only high-quality data enter the database; when questionable data are entered, immediate and understandable feedback should be given to the operator, questioning the validity of the data (see the sketch after this list).
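A minimal sketch of point v is shown below, using SQLite as a stand-in for the database; the table, CHECK rule, and trigger are illustrative only.

# Minimal sketch of checking entered data against the database at entry time,
# assuming a SQLite stand-in; the table, CHECK rule, and trigger are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL,
        birth_year  INTEGER CHECK (birth_year BETWEEN 1900 AND 2025)
    );

    -- Trigger gives immediate, understandable feedback on questionable data.
    CREATE TRIGGER check_email BEFORE INSERT ON customer
    WHEN instr(NEW.email, '@') = 0
    BEGIN
        SELECT RAISE(ABORT, 'email address must contain @');
    END;
""")

try:
    db.execute("INSERT INTO customer (email, birth_year) VALUES ('bad-address', 1985)")
except sqlite3.DatabaseError as err:
    print("Rejected at entry time:", err)   # feedback to the operator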

Apply Modern Data Management Principles and Technology

 Powerful software is now available that can assist users with the technical

aspects of data quality improvement.

Apply TQM Principles and Practices

 Data quality improvements should be considered as an ongoing effort and

not treated as one-time projects

Summary of Data Quality

Ensuring the quality of data that enters databases and data warehouses is

essential if users are to have confidence in their systems.


Master Data Management 

If one were to examine the data used in applications across a large

organization, one would likely find that certain categories of data are referenced

more frequently than others across the enterprise in operational and analytical

system.

 Master data management (MDM) 

o refers to the disciplines, technologies, and methods to

ensure the currency, meaning, and quality of reference data

within and across various subject areas (Imhoff and White,

2006).

o Master data can be as simple as a list of acceptable city

names and abbreviations.


o MDM can also be realized in specialized forms.

3 popular architectures

1. Identity Registry Approach

o the master data remain in their source systems, and applications refer to the registry to determine where the agreed-upon source of particular data is located.

2. Integration Hub Approach


o data changes are broadcast (typically asynchronously) through a

central service to all subscribing databases.

3. Persistent Approach

o one consolidated record is maintained, and all applications draw on

that one “golden record” for the common data.

DATA INTEGRATION: OVERVIEW

Data Integration

 It is the process of combining data from different sources into a single,

unified view.

 In a typical data integration process, the client sends a request to the

master server for data. The master server then intakes the needed data

from internal and external sources. The data is extracted from the

sources, then consolidated into a single, cohesive data set. This is served

back to the client for use.

 The end location needs to be flexible enough to handle lots of different

kinds of data at potentially large volumes.

Other ways to consolidate data are as follows (White, 2000):

Application Integration

 It creates connectors between two or more applications so they can work

with one another.


 Each individual application has a particular way it emits and accepts data,

and this data moves in smaller volumes.

 You only need to enter your data into a system once; then your information will flow automatically into all your other connected systems, and they will take action automatically.

Business Process Integration

 Is a crucial technique for supporting inter-organizational business

interoperability.

 Achieved by tighter coordination of activities across business processes

(e.g., selling and billing) so that applications can be shared and more

application integration can occur.

 With the help of BPI, companies can digitally connect, communicate, and collaborate with customers, suppliers, partners, service vendors, and all other players in the supply chain.

User interaction integration

 Achieved by creating fewer user interfaces that feed different data

systems.

Three techniques form the building blocks of any data integration

approach:

1. Data Consolidation


 It is the classic data integration process leveraging ETL technology; the two terms are sometimes used interchangeably.

 It involves combining data from disparate sources, removing its

redundancies, cleaning up any errors, and aggregating it within a

single data store like a data warehouse.

 The main idea of data consolidation is to provide end users with all critical data in one place for the most detailed reporting and analysis possible.

2. Data Federation

 It is a software process that allows multiple databases to function

as one.

 This provides a single source of data for front-end applications without actually bringing the data all into one physical, centralized database.


 It vastly simplifies querying and analyzing information, and it

eliminates the need for users to directly access source systems,

which reduces the challenges involved with administering security

access to multiple systems.

 A main advantage of the federation approach is access to current data.

3. Data Propagation

 It is the use of the application to replicate the data from one location (source) to another location (destination).

 It is supported by Enterprise Application Integration (EAI) and

Enterprise Data Replication (EDR).

 This is commonly used for real-time business transactions. EDR

sends massive amounts of data between the databases, rather than between applications, using database triggers and logs.

 The major advantage of the data propagation approach to data

integration is the near-real-time cascading of data changes

throughout the organization.


Characteristics of Data After ETL:

Reconciled data:

 Detailed
 Historical
 Normalized
 Comprehensive
 Timely
 Quality controlled

Operational data:

 Transient
 Not normalized
 Generally restricted in scope to a particular application
 Often of poor quality

Data Reconciliation Process

 It is responsible for transforming operational data to reconciled data.

 It helps you extract accurate and reliable information about the state of the industry process from raw measurement data.

 It also helps you produce a single consistent set of data representing the most likely process operation.

Data reconciliation occurs in two stages during the process of filling an

enterprise data warehouse:

1. During an initial load, when the EDW is first created.

2. During subsequent updates to keep the EDW current and/or to expand it.


Data Reconciliation Process

1. Mapping and Metadata Management

 This mapping could be shown graphically or in a simple matrix

with rows as source data elements, columns as data warehouse

table columns, and the cells as explanations of any reformatting,

transformations, and cleansing actions to be done. 

2. Extract

 is the act or process of retrieving data out of data sources for

further data processing or data storage.

The two generic types of data extracts are:

 Static Extract


 all the data currently available in the source system is

extracted.

 Incremental Extract

 the data which have changed since the last data extraction took place are extracted (see the sketch after this list).
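A minimal sketch contrasting the two extract types is shown below, using SQLite as a stand-in for the source system; the table and the stored "last extract" timestamp are illustrative only.

# Minimal sketch contrasting a static extract with an incremental extract,
# assuming a SQLite source and a hypothetical "orders" table.
import sqlite3

source = sqlite3.connect("orders_source.db")

# Static extract: pull everything currently available in the source table.
full_snapshot = source.execute("SELECT * FROM orders").fetchall()

# Incremental extract: pull only rows changed since the previous extraction.
last_extracted_at = "2024-01-31 23:59:59"   # normally read from ETL metadata
changed_rows = source.execute(
    "SELECT * FROM orders WHERE updated_at > ?", (last_extracted_at,)
).fetchall()

print(len(full_snapshot), "rows in the static extract")
print(len(changed_rows), "rows in the incremental extract")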

3. Cleanse

 involves detecting such errors and repairing them and

preventing them from occurring in the future.

 uses pattern recognition and AI techniques to upgrade data

quality.

 Fixing errors like misspellings, erroneous dates, incorrect field

usage, mismatched addresses, missing and duplicate data.

 Also: decoding, reformatting, time stamping, conversion, key

generation, merging, error detection/logging, locating missing

data.

4. Load And Index

 is to load the selected data into the target data warehouse and

to create the necessary indexes.

The two basic modes for loading data to the target EDW:

 Refresh mode

 Bulk rewriting of target data at periodic intervals.


 Update mode

 only changes in source data are written to data

warehouse.

Data transformation

 is at the very center of the data reconciliation process.

 involves converting data from the format of the source operational

systems to the format of the enterprise data warehouse.

 the goal of data transformation is to convert the data format from the

source to the target system.

Data transformation functions

1. Record-level functions

 The most important record-level functions are selection, joining, normalization, and aggregation (see the sketch after this list).

 Selection

 The process of partitioning data according to predefined criteria.

 Joining

 The process of combining data from various sources into a single

table or view.

 Normalization

 Is the process of decomposing relations with anomalies to produce smaller, well-structured relations.


 Aggregation

 is the process of transforming data from a detailed level to a

summary level.
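A minimal sketch of selection, joining, and aggregation is shown below, using SQLite as a stand-in; the schema and figures are illustrative only (normalization is a design activity and is not shown here).

# Minimal sketch of record-level functions (selection, joining, aggregation),
# using SQLite as a stand-in; the schema and values are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'NCR'), (2, 'Visayas');
    INSERT INTO orders VALUES (10, 1, 500), (11, 1, 250), (12, 2, 900);
""")

# Selection: partition the data according to a predefined criterion.
ncr_customers = db.execute(
    "SELECT * FROM customer WHERE region = 'NCR'").fetchall()

# Joining: combine data from several sources into a single view.
joined = db.execute("""
    SELECT c.region, o.amount
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id
""").fetchall()

# Aggregation: transform detailed data to a summary level.
summary = db.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id
    GROUP BY c.region
""").fetchall()

print(ncr_customers, joined, summary, sep="\n")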

2. Field-level function

 converts data from a given format in a source record to a different format in the target record (see the sketch after this list).

Two types of field-level function

 Single-field transformation - converts data from a single source

field to a single target field.

a. Basic field transformation – in general, a transformation that translates data from its old form to a new form.

b. Algorithmic transformation – uses a formula or logical expression to transform the data.


c. Table lookup – it uses a separate table keyed by source record code.

 Multi-field transformation 

- converts data from one or more source fields to one or

more target fields.

- is very common in data warehouse applications.

- may involve more than one source record and/or more than

one target record.

Two types of multi-field transformation:

a. Many sources to one target


b. One source to many targets
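A minimal sketch of these field-level transformations in Python is shown below; the codes, formula, and field names are illustrative only.

# Minimal sketch of field-level transformations; all names and values are hypothetical.

# Basic single-field transformation: translate a value from old form to new.
def to_upper_name(name: str) -> str:
    return name.strip().upper()

# Algorithmic transformation: apply a formula to derive the target field.
def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32) * 5.0 / 9.0

# Table lookup: a separate table keyed by the source record code.
PRODUCT_CODE_LOOKUP = {"TV": "Television", "ST": "Stereo", "SP": "Speaker"}

# Multi-field transformation (many sources to one target): combine fields.
def full_address(street: str, city: str, postal_code: str) -> str:
    return f"{street}, {city} {postal_code}"

source_record = {"name": " juan dela cruz ", "temp_f": 98.6, "code": "TV",
                 "street": "Biglang Awa St.", "city": "Caloocan", "zip": "1400"}

target_record = {
    "name": to_upper_name(source_record["name"]),
    "temp_c": round(fahrenheit_to_celsius(source_record["temp_f"]), 1),
    "product": PRODUCT_CODE_LOOKUP[source_record["code"]],
    "address": full_address(source_record["street"], source_record["city"],
                            source_record["zip"]),
}
print(target_record)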


CHAPTER 11:

DATA AND DATABASE


ADMINISTRATION

Researched and presented by:

Garcia, Janah G.
Notario, Don
Trias, Angela B.


The roles of data and the database administrators 

A data administrator is the one responsible for managing the data that is relevant to be stored in the database. A data administrator is more of a business role with some technical responsibilities (also called a data analyst); it is a high-level function responsible for the overall management of data resources in an organization, including maintaining corporate-wide definitions and standards. The head of data administration is a senior-level person who is required to have a high level of both managerial and technical skill. A data administrator is focused on the business but should also understand database technology.

Responsibilities:

 Filters out relevant data

 Monitors the data flow throughout the organization

 Designs concept-based data models

 Analyzes and breaks down the data so it can be understood by non-technical people

A database administrator (DBA) is a person who has knowledge of database technology and controls the design and use of the organization's databases. The DBA provides the necessary technical support for implementing the database, covering the design, development, testing, and operational phases. A DBA does not need to be a business person, but must understand the business well enough to administer the database effectively.


Responsibilities:

1. Deciding the hardware device- the DBA is responsible for deciding which hardware is suitable for the company, considering its cost, performance, and efficiency.

2. Managing data integrity- the DBA needs to protect the data from unauthorized use.

3. Deciding data recovery and backup methods- the DBA needs to back up the entire database in case of a breach, and also to recover the data in case of loss.

4. Tuning database performance- upgrading the performance of the database to make it faster and more convenient for all authorized users.

5. Capacity issues- the DBA needs to know the maximum limit for storing data.

6. Database design- the DBA is responsible for physical design, external model design, and integrity control.

7. Database accessibility- the DBA writes subschemas to secure database accessibility, so that only authorized users can access the data they are entitled to.

8. Deciding validation checks on data- the DBA needs to validate and check the data to keep it accurate and consistent.

9. Monitoring performance- the DBA monitors CPU and memory usage to make sure the system works well.

10. Deciding the content of the database- the DBA decides the structure of the database files.

11. Providing help and support to users- the DBA is also responsible for helping users who do not know how to operate the system.

12. Database implementation- the DBA implements the database system before anyone can use it.

13. Improving query processing performance- queries made by users need to be performed speedily, so the DBA improves query processing by tuning performance.

The open-source movement and database management 

The open source movement is a term that refers to open source software. Open source software is code that people can modify and share because its design is publicly accessible to anyone. Source code is the part of software that most computer users never see. Programmers who can access a computer program's source code can improve the program by adding features or fixing broken parts. Examples of open source software are LibreOffice and the GNU Image Manipulation Program.

LibreOffice is free and open source; it contains applications for word processing, spreadsheets, presentations, database management, and graphic editing. It is compatible with other office productivity suites such as Microsoft Office, and it runs on Microsoft Windows, macOS, and Linux.


Open source software is often cheaper, more flexible, and has more longevity because it is developed by a community rather than a single author or company.

What is the value of open source? 

The most common reasons why people choose open source:

1. Peer review- the open source code is free and accessible to all, which is why it is actively checked and improved by peer programmers.

2. Transparency- open source helps to track and check whether there are any changes in the code.

3. Reliability- the open source code is constantly updated through an active open source community.

4. Flexibility- it helps to solve problems in your business with the help of the open source community and peer programmers.

5. Lower cost- free and accessible.

6. No vendor lock-in- because it is free to use, you can take your open source code anytime and anywhere.

7. Open collaboration- active open source communities can help you find new solutions to a problem.

A database management system is a software package that generally manipulates the data itself, as well as the data format, field names, and record and file structures. 

Components of a DBMS


1. Storage engine- it is used to store data; it can use additional components to store the data. 

2. Metadata catalog- sometimes called a system catalog or database

dictionary, the DBMS uses this to verify the user who request for the data.

The metadata catalog can include information about database objects,

schemas, programs, security, performance, communication and other

environmental details about the databases it manages.

3. Database access language- the DBMS must provide an API to access the data.

*An API (Application Programming Interface) is software that allows two applications to communicate (a middleman). For example, when you sign into your Facebook account using your phone, the mobile application tells the API to retrieve your Facebook account, and Facebook then returns your account information to the mobile application.

4. Lock manager- locks are required to make sure that multiple users can't access and change the same data simultaneously.

5. Log manager- records all data changes to make sure that the records are accurate and efficient. The DBMS uses the log manager during shutdown and startup to ensure data integrity.

6. Data utilities- include reorganization, run stats, backup and copy, recover, integrity check, load data, unload data, and repair database.

Benefits of using a DBMS


Central storage and management of data within the DBMS provides the

following:

 data abstraction and independence;

 data security;

 a locking mechanism for concurrent access;

 an efficient handler to balance the needs of multiple applications using the

same data;

 the ability to swiftly recover from crashes and errors;

 strong data integrity capabilities;

 logging and auditing of activity;

 simple access using a standard API; and

 uniform administration procedures for data.

An example of this is commercial airlines, which rely on a DBMS for data-intensive applications such as scheduling flight plans and managing customer flight reservations.

Managing data security 

Data security is sometimes called computer security, system security, or information security. Data security comprises the measures that need to be taken to prevent any unauthorized access to the information in computers, databases, or on the web. It also prevents corruption or modification of that information.

Data Security Protecting Against


 Security hackers: people who intend to steal, protest, or gather information in a computer system.

 Malware: a shortened name for "malicious software"; it is used to gain access to files even when they can be opened only by an authorized user, and it can also cause damage to a computer or computer system.

 Computer viruses: a form of malware that uses written code so the virus can spread from one computer or computer system to another; this can damage the computer and the data stored on it.

The 2017 WannaCry Ransomware Attack Was One Of The Most Widespread

Computer Infections Ever, And WannaCry Attacks Continue Today.

 The WannaCry ransomware epidemic of 2017 disrupted hospitals, banks

and communications companies worldwide.

 Four years later, cybercriminals renewed efforts to deploy WannaCry

ransomware during the COVID-19 pandemic.

 Companies can take steps to prevent infection, with software updates

being most important.

WannaCry ransomware is an example of crypto ransomware, a type of malicious software used by cybercriminals to extort money. WannaCry takes your data hostage: it either locks you out of your computer so you can't use it (locker ransomware) or encrypts your valuable files so you can't open or read them (crypto ransomware). WannaCry targets computers whose operating system is Microsoft Windows. 


Data security management is used to ensure that the organization's data is not accessed or corrupted by unauthorized users. A data security management plan includes planning, implementation of the plan, and verification and updating of the plan's components.

Here are some basics of data security that are often included in any data security

management plan:

1. Backups- ensure that you have another copy of all the file to easily

recovery in case that there might be happen like breach, computer viruses

or damage in the computer.

2. Data masking- which some sensitive data or information is obscured

3. Data Erasure- a method when all the data in the computer is wiped clean

or overwritten when the equipment is sold or discarded

4. Encrypted-  the process which the data is scrambled and encoded, only

the another entity can decode the data using encryption key

5. Authentication- using username and password of every user to identify

who access the computer system

6. One time password-  the password that only work in one network session

or transaction

7. Electronic security token- need to have a physical device that serve as

electronic key and a password to access the data or information

8. Two factor authentication-  requires a two method authentication

401 | P a g e
UNIVERSITY OF CALOOCAN CITY
Biglang Awa St. Grace Park East, Caloocan City

9. Transparent data encryption (TDE) - a method that encrypts the actual database files, so an intruder who accesses the data from a different server cannot read or use it.

10. Cloud access security broker - software that sits between the users of a cloud service and the cloud applications; it monitors activity and ensures that the user's security policies are followed.

11. Big data security - securing extremely large amounts of data adds another level of security through dedicated security tools; Hadoop, for example, can be used to store and process extremely large data sets.

12. Payment security, mobile app security, web browser security, email security - using special security features that work to prevent unauthorized access.
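To make the encryption idea in item 4 concrete, here is a minimal sketch in Python. It assumes the third-party cryptography package is installed; the key handling, the sample record, and all variable names are illustrative only, not part of any particular product.

    from cryptography.fernet import Fernet

    # Generate a symmetric key. In practice the key must be stored securely,
    # for example in a key management service, never alongside the data.
    key = Fernet.generate_key()
    cipher = Fernet(key)

    record = b"card_number=4111111111111111;owner=Juan Dela Cruz"

    token = cipher.encrypt(record)   # scrambled, unreadable ciphertext
    print(token)

    # Only a holder of the same key can decode the data again.
    print(cipher.decrypt(token))

Without the key, the stored token is meaningless to an attacker, which is the property that items 4 and 9 rely on.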

Database software and data security features 

Database software is used to create, edit, and maintain database files and

records, enabling easier file and record creation, data entry, data editing,

updating, and reporting. The software also handles data storage, backup and

reporting, multi-access control, and security. Strong database security is

especially important today, as data theft becomes more frequent. Database

software is sometimes also referred to as a “database management system”

(DBMS). It’s primarily used for storing, modifying, extracting and searching for

information within a database. Database software is also used to


implement cybersecurity measures to protect against malware, viruses and other

security threats.

Database software makes data management simpler by enabling users to store

data in a structured form and then access it. It typically has a graphical interface

to help create and manage the data and, in some cases, users can construct

their own databases by using database software.

A database typically requires a comprehensive database software program

known as a database management system (DBMS). A DBMS serves as an

interface between the database and its end users or programs, allowing users to

retrieve, update, and manage how the information is organized and optimized. A

DBMS also facilitates oversight and control of databases, enabling a variety of

administrative operations such as performance monitoring, tuning, and backup

and recovery.

Most database software includes a graphical user interface (GUI) consisting of

structured fields and tabular forms that give users a centralized view of the data

present in a database and the tools to manipulate and query it. Structured Query

Language (SQL) commands are also typically used to interact with databases

through the software. Administrators input SQL queries to prompt the system to

perform an action, such as retrieving a specific set of data. However, there are

also databases that use other means for retrieving information in addition to SQL.
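As a small illustration of issuing SQL through database software, the sketch below uses Python's built-in sqlite3 module as the DBMS; the table and column names are invented for the example.

    import sqlite3

    # An in-memory SQLite database stands in for a full database server.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, course TEXT)")
    conn.execute("INSERT INTO student (name, course) VALUES (?, ?)", ("Ana", "BSIT"))
    conn.commit()

    # An SQL query prompts the system to retrieve a specific set of data.
    for row in conn.execute("SELECT id, name FROM student WHERE course = ?", ("BSIT",)):
        print(row)
    conn.close()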

The most widely-used databases consist of a basic set of columns and rows that

display information retrieved using SQL. However, more complex software has


been developed in recent years to accommodate the massive amounts of unique

data collected by organizations, especially enterprises. These tools are multi-

layered, use a variety of query languages and support more storage formats,

such as XML.

Database software is available both as a commercial product and open

source software. Commercial options often have the advantage of vendor

support. While open-source software may lack this support, it makes up for that with greater customization and free downloads.

Database software exists to protect the information in the database and ensure

that it’s both accurate and consistent. Its functions include storage, backup and

recovery, and presentation and reporting. It can also help your team with multi-

user access control, security management, and database communication.

Database Challenges

 Absorbing significant increases in data volume

The explosion of data coming in from sensors, connected machines, and

dozens of other sources keeps database administrators scrambling to

manage and organize their companies’ data efficiently.

 Ensuring data security

Data breaches are happening everywhere these days, and hackers are

getting more inventive. It’s more important than ever to ensure that data is

secure but also easily accessible to users.

 Keeping up with demand


In today’s fast-moving business environment, companies need real-time

access to their data to support timely decision-making and to take advantage

of new opportunities.

 Managing and maintaining the database and infrastructure

Database administrators must continually watch the database for problems

and perform preventative maintenance, as well as apply software upgrades

and patches. As databases become more complex and data volumes grow,

companies are faced with the expense of hiring additional talent to monitor

and tune their databases.

 Removing limits on scalability

A business needs to grow if it’s going to survive, and its data management

must grow along with it. But it’s very difficult for database administrators to

predict how much capacity the company will need, particularly with on-

premises databases.

 Ensuring data residency, data sovereignty, or latency requirements

Some organizations have use cases that are better suited to run on-premises.

In those cases, engineered systems that are pre-configured and pre-

optimized for running the database are ideal. Customers achieve higher

availability, greater performance and up to 40% lower cost with Oracle

Exadata, according to Wikibon's recent analysis.

BENEFITS OF DATABASE SOFTWARE


 Data availability: Traversing through large stores of data in a single

database can be time-consuming and labor-intensive. Database

software makes this information readily available by providing the ability

to input queries to direct you to the exact data you’re searching for.

 Minimized redundancy: Users commonly work on the same projects

within multiple locations in a database. This can end up creating

multiple copies of the same file, leading to data redundancy. This was

particularly an issue with file-based data management systems, the

predecessor to database software. This can cause confusion when

searching for and organizing data and consumes valuable storage

space. Database software reduces redundancy by controlling

information stored in a variety of locations.

 Improved data security: Security should always be a top concern

when it comes to stored data. Database software can authorize or

block user access to views of protected data within an application, also called subschemas. It can also give access to specific

functions of a database depending on assigned roles. For example,

only system administrators and others with high-level access are able

to modify the database or alter user access. Authorizing access

typically involves using unique passwords for each user.

 Backup and Recovery: Database software has the ability to

regularly backup the data from a database and store it in a safe

location in the event of an outage or data breach. It can then use these


backups to automatically recover and restore the database to its

previous state.

 Analytics: Database software can collect valuable analytics, such as

what information users access, the frequency at which they access it,

potential security threats and other hiccups in the system. This

information is then visualized through the GUI so administrators can

easily gain insights and make data-driven decisions to improve

efficiency.

USER ROLES

Part of what allows database software to improve efficiency and maintain security

is the ability to assign roles to users that authorize or restrict access to certain

portions of a network. This ensures that users only have access to the assets

they need to do their job. The primary roles include the following:

 Administrators: This role has the highest level of access to the

database. They are able to view and manage the most sensitive

information, modify other users’ access, alter security protocols and

more.

 Programmers: In order to build and

modify applications, programmers require special permissions. They

can install new applications, modify application functionality and in

some cases remove them altogether.


 End users: These users typically have the most restricted access. At most, they can retrieve, update, share, and delete information only in the applications that are essential to their jobs. In some cases, they are confined to read-only access, which allows them to view information but not to manipulate or delete it.

 Applications and programs: Aside from human users, programs also

need to access databases to retrieve and transmit information. Setting

permissions for how these programs access data is also an important

aspect of network security. The level of permissions for programs can

mirror those of different users stated above.

USER INTERACTION

 Building tables and forms: In order to add and organize files in a

database, database software is used to create fields and data entry

forms. When new files are added, they are indexed according to

programmer-defined parameters, such as name, type and length. Data

entry forms are created to input this information for each file. This

information is used by the software to determine where files are stored

and how they can be accessed.

 Updating and editing data: After data is stored, it will likely need to be

regularly updated or edited with new information. Database software

offers an ‘Edit’ mode to make these changes. However, each file will


have restrictions on who can edit data according to assigned user

permissions.

 View and query data: Besides storing data, one of the primary uses of

database software is to quickly and easily find relevant information.

Queries are used to search through a database and retrieve data.

 Reporting: Most database software has the capability to track

database activity. It also has features that allow users to pull this

information into reports that can be used to make data-driven business

decisions.

TYPES OF DATABASE SOFTWARE

There are multiple different types of database software that are typically broken

down into six categories:

 Analytical database software: This tool is used to gather and

compare data to assess the performance of different assets, such as

website traffic, employee productivity or business goals.

 Data warehouse software: This software acts as a large repository

that can pull and store data from a variety of databases. Data sets from

these different databases can then be compared to find inconsistencies

to improve data integrity.

 Distributed database software: Administrators can use this tool to

manage information from multiple databases from a centralized system.


 End user database software: Designed for the smaller scale, end

user database software stores information used by single users.

 External database software: This software acts as a central location

for multiple users to access the same information, typically over

the internet.

 Operational database software: Users can use this tool to manage or

modify data in real time.

TYPES OF DATABASE SOFTWARE TECHNOLOGY

 Relational database management system (RDBMS): this traditional

database technology can be applied to most use cases, and as a

result, is a very popular option. Information is presented in rows and

columns and allows for easy querying using SQL. RDBMS are mostly

used to store relatively simple information, such as contact information

and user identities. This technology is also highly scalable making it a

good option for large organizations. It can be hosted on-premises, in

the cloud and on hybrid-cloud systems.

 NoSQL: This is the second most common database technology next to

RDBMS. The name of this technology stands for “not only SQL.”

Standard SQL language can be used but it also supports a variety of

data models, such as key-value, document, columnar and graph


formats, as opposed to just rows and columns. The purpose of this

design is to allow it to handle evolving data structures.

 In-memory database management system (IMDBMS): Rather than

focusing on a variety of use cases or data structures, the main goal

of in-memory database tools is to provide fast response times and

improved performance.

 Columnar database management system (CDBMS): This technology

was mainly designed for data warehouses. These systems typically

store large amounts of very similar data. So a data structure composed

of mostly columns is a more straightforward solution to maintaining a

database.

 Cloud-based database management system: Cloud database

technology is gaining popularity as many organizations are shifting to a

cloud-based or hybrid cloud infrastructure. They are highly scalable

and maintenance is often provided by the cloud service.

ON-PREMISE VS. HOSTED DATABASE SOFTWARE

Database software can be delivered in two ways depending on an organization’s

infrastructure. On-premise software is deployed at an organization’s physical

location on hardware-based servers. It’s typically managed by the company’s

internal IT department. On-premise database software generally allows for more

customization.


The other option is cloud-hosting delivered as SaaS. One large benefit

depending on an organization’s resources is that the software is typically

maintained by the service provider, freeing up IT teams to focus on other efforts.

It is also more scalable than on-premise software, as it’s not limited by hardware.

TOP DATABASE SOFTWARE VENDORS

Database software is used for a number of reasons across many industries.

Because they have so many uses, there are dozens of database software

programs available. Here are a few of the most popular:

Microsoft SQL Server: Microsoft’s SQL server is one of the oldest players in the

game, first released in 1989. It’s mainly used for Windows-based systems but

also supports Linux operating systems (OS).

Oracle RDBMS: This tool is one of the most popular database software options

for enterprise organizations as it can support large databases but maintains good

performance. It can support Windows, Linux and UNIX systems

IBM DB2: IBM DB2 was also an early contender in the database software space,

introduced in 1983. It’s praised for its simple deployment, installation and

operation. It also supports Windows, Linux and UNIX systems.

Altibase: This is an open-source database software solution but is also a high

performing, enterprise-grade tool. It uses an in-memory database to offer high

speeds and is one of the few solutions that provides scale-out technology and

sharding.

MySQL: MySQL is an open-source relational database tool. It’s common for web

hosting providers to bundle MySQL with their offerings making it a popular tool


for web developers. It can handle robust sets of data but its relatively simple

deployment and management make it a good option for smaller organizations

and independent web developers as well.

AmazonRDS: As an offering from Amazon Web Services (AWS), Amazon

Relational Database Service (AmazonRDS) is a cloud-based database-as-a-

service (DBaaS). It offers high scalability, dedicated secure connections and it

creates and stores backups automatically.

SQL Developer: This tool was built with flexibility in mind. It can integrate with a

number of other database tools and supports queries in a variety of formats,

including XML, HTML, PDF, or Excel.

Knack: Released in 2010, Knack is a relatively new database software tool. It’s

another DBaaS offering that is easy to use. It allows users to structure, connect

and extend data without the need for any coding. It’s already gained a notable

portfolio of clients, such as Spotify, Capital One and Intel.

Using databases to improve business performance and decision-making

Using databases and other computing and business intelligence tools, organizations can now leverage the data they collect to run more efficiently,

enable better decision-making, and become more agile and scalable. Optimizing

access and throughput to data is critical to businesses today because there is

more data volume to track. It’s critical to have a platform that can deliver the

performance, scale, and agility that businesses need as they grow over time.

Databases provide a significant boost to business capabilities: by automating expensive, time-consuming manual processes, they free up business users to become more proactive with their data. By having direct control over the ability to

create and use databases, users gain control and autonomy while still

maintaining important security standards.

How autonomous technology is improving database management

Self-driving databases use cloud-based technology and machine learning to

automate many of the routine tasks required to manage databases, such as

tuning, security, backups, updates, and other routine management tasks. With

these tedious tasks automated, database administrators are freed up to do more

strategic work. The self-driving, self-securing, and self-repairing capabilities of

self-driving databases are poised to revolutionize how companies manage and

secure their data, enabling performance advantages, lower costs, and improved

security.

What is database security?

Database security refers to the range of tools, controls, and measures designed

to establish and preserve database confidentiality, integrity, and availability. 

Database security is a complex and challenging endeavor that involves all

aspects of information security technologies and practices. It’s also naturally at

odds with database usability. The more accessible and usable the database, the


more vulnerable it is to security threats; the more invulnerable the database is to

threats, the more difficult it is to access and use.

Database security must address and protect the following:

 The data in the database

 The database management system (DBMS)

 Any associated applications

 The physical database server and/or the virtual database server and the

underlying hardware

 The computing and/or network infrastructure used to access the database

Encryption

When data is encrypted, it is transformed using an algorithm to make it

unreadable to anyone without the decryption key. The general idea is to make

the effort of decrypting so difficult as to outweigh the advantage to a hacker of

accessing the unauthorized data. There are two situations where data encryption

can be deployed: data in transit and data at rest. In a database context, data “at

rest” encryption protects data stored in the database, whereas data “in transit”

encryption is used for data being transferred over a network.

Encrypting data at rest is undertaken to prohibit “behind the scenes” snooping for

information. When the data at rest is encrypted, even if a hacker surreptitiously

gains access to the data behind the scenes, without the decryption key the data

is meaningless. Data at rest encryption most commonly is supported by using


built-in functions, a DBMS feature such as Oracle Transparent Data Encryption,

or through an add-on encryption product.

Encrypting data in transit protects against network packet sniffing. If the data is

encrypted before it is sent over the network and decrypted upon receipt at its

destination, it is protected along its journey. Anyone nefariously attempting to

access the data while en route will receive only encrypted data. And again,

without the decryption key, the data cannot be deciphered. Data in transit

encryption most commonly is supported using DBMS system parameters and

commands or through an add-on encryption product.

Label-Based Access Control

A growing number of DBMSs offer label-based access control (LBAC), which

delivers more fine-grained control over authorization to specific data in the

database. With LBAC, it is possible to support applications that need a more

granular security scheme. LBAC can be set up to specify who can read and

modify data in individual rows and/or columns.

LBAC is not for every application; it is geared more for top-secret, governmental,

and similar types of data. Setting up such a security scheme is virtually

impossible without LBAC. 

Any attempted access to a protected column when the LBAC credentials do not

permit that access will fail. If users try to read protected rows not allowed by their

LBAC credentials, the DBMS simply acts as if those rows do not exist. This is


important because sometimes even the knowledge that the data exists (without

being able to access it) must be protected.

Data Masking

Data masking is the process of protecting sensitive information in databases from

inappropriate visibility by replacing it with gibberish or realistic but not real data.

Protecting sensitive data using data masking can prevent fraud, identity theft,

and other types of criminal activities. 

A good data masking solution should offer the ability to mask using multiple

techniques. Common techniques include substitution, shuffling, number and data

variance, nulling out, encryption, and table-to-table synchronization. Data

masking is supported by many DBMS offerings as well as by third-party

products. 
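For illustration, the sketch below applies one common masking technique, substitution, to a card number; the field format and function name are assumptions made for the example, not features of any particular DBMS or masking product.

    import re

    def mask_card_number(value: str) -> str:
        # Substitution masking: keep only the last four digits visible.
        digits = re.sub(r"\D", "", value)
        return "X" * (len(digits) - 4) + digits[-4:]

    print(mask_card_number("4111-1111-1111-1111"))   # XXXXXXXXXXXX1111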

Staying Up-to-Date

Be sure to keep up-to-date on the latest security requirements and capabilities of

your DBMS. Understand what is available to you and what you may need to

augment with additional tools. 

5. Database back-up and recovery 

Database Backup 

A database backup is stored data, that is, a copy of the data. It is a safeguard against unexpected data loss and application errors and protects the database against data loss. If the original data is lost, it can be reconstructed from the backup.

The backups are divided into two types, Physical Backup and Logical Backup 

1. Physical backups 

Physical backups are backups of the physical files used in storing and recovering your database. A physical backup is a copy of the files that store database information, kept in some other location such as disk or offline storage like magnetic tape. Physical backups are the foundation of the recovery mechanism in the database and provide the minute details of transactions and modifications to the database.

2. Logical backup 

Logical Backup contains logical data which is extracted from a database. It

includes backup of logical data like views, procedures, functions, tables, etc. It is

a useful supplement to physical backups in many circumstances but not a

sufficient protection against data loss without physical backups, because logical

backup provides only structural information. 

Importance Of Backups

Planning and testing backups helps protect against failure of media, the operating system, software, and any other kind of failure that causes a serious data crash. It also determines the speed and success of the recovery.

Methods of Backup 

The different methods of backup in a database are: 


Full Backup - This method takes a lot of time as the full copy of the database is

made including the data and the transaction records. 

Transaction Log - Only the transaction logs are saved as the backup in this

method. To keep the backup file as small as possible, the previous transaction

log details are deleted once a new backup record is made. 

Differential Backup - This is similar to a full backup in that it stores both the data and the transaction records. However, only the information that has changed since the last full backup is saved. Because of this, differential backups lead to smaller files.
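As a small, concrete illustration of a full backup, the sketch below uses the online backup API of Python's built-in sqlite3 module (available in Python 3.7 and later); the file names and table are placeholders.

    import sqlite3

    # Source database; normally this would already exist on disk.
    src = sqlite3.connect("inventory.db")
    src.execute("CREATE TABLE IF NOT EXISTS item (id INTEGER PRIMARY KEY, name TEXT)")
    src.commit()

    # Full backup: copy every page of the source database into the backup file.
    dst = sqlite3.connect("inventory_backup.db")
    with dst:
        src.backup(dst)
    dst.close()
    src.close()

Transaction-log and differential backups depend on features of the specific DBMS and are not shown here.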

Common causes of Failures in a Database:

1. System Crash 

A system crash occurs when there is a hardware or software failure or an external factor such as a power failure. Data in secondary memory is generally not affected when the system crashes, and checkpoints limit how much committed work has to be redone after such a failure.

2. Transaction Failure 

A transaction failure affects only a few tables or processes and is caused by logical errors in the code, or by system errors such as deadlock or the unavailability of the system resources needed to execute the transaction.

3. Network Failure 


A network failure occurs when the communication network connecting a client-server configuration or a distributed database system fails.

4. Disk Failure 

Disk Failure occurs when there are issues with hard disks like formation of

bad sectors, disk head crash, unavailability of disk etc. 

5. Media Failure - Catastrophic Event 

Media failure is the most dangerous failure because it takes more time to recover from than any other kind of failure. A disk controller or disk head crash is a typical example of media failure, as are natural disasters like floods, earthquakes, and power failures that damage the data.

6. User Error 

Normally, user error is the biggest cause of data destruction or corruption

in a database. To rectify the error, the database needs to be restored to the point

in time before the error occurred. 

Redundancy 

Data redundancy is a condition created within a database or data storage

technology in which the same piece of data is held in two separate places. This

can mean two different fields within a single database, or two different spots in

multiple software environments or platforms. Whenever data is repeated, this

basically constitutes data redundancy. This can occur by accident, but is also

done deliberately for backup and recovery purposes. 

Hardware redundancy 


Hardware redundancy is achieved by providing two or more physical copies of a

hardware component. When other techniques, such as use of more reliable

components, manufacturing quality control, test, design simplification, etc., have

been exhausted, hardware redundancy may be the only way to improve the

dependability of a system. 

What Is Recovery? 

Recovery is the process of restoring a database to the correct state in the

event of a failure. It ensures that the database is reliable and remains in

consistent state in case of a failure. 

Database recovery can be classified into two parts; 

1. Rolling Forward applies redo records to the corresponding data blocks. 

2. Rolling Back applies rollback segments to the datafiles. It is stored in

transaction tables. 

There are two methods that are primarily used for database recovery. These are:

 Log based recovery - In log-based recovery, logs of all database transactions are stored in a secure area so that, in case of a system failure, the database can recover the data. All log information, such as the time of the transaction and its data, should be stored before the transaction is executed (a simplified sketch follows after this list).

 Shadow paging - In shadow paging, after the transaction is completed, its

data is automatically stored for safekeeping. So, if the system crashes in


the middle of a transaction, changes made by it will not be reflected in the

database.
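The sketch below is a deliberately simplified illustration of the log-based idea, not a real DBMS algorithm: every change is appended to a log before it is applied, so after a failure the log can be replayed to rebuild the data. All names are invented for the example.

    log = []    # in a real system this is durable storage, written before the data
    data = {}

    def write(txn_id, key, value):
        # Write-ahead: record the change in the log first, then apply it.
        log.append({"txn": txn_id, "key": key, "value": value})
        data[key] = value

    def recover(log_records):
        # After a failure, replay the log to reconstruct the committed state.
        rebuilt = {}
        for record in log_records:
            rebuilt[record["key"]] = record["value"]
        return rebuilt

    write("T1", "balance", 500)
    write("T2", "balance", 450)
    print(recover(log))   # {'balance': 450}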

6. Controlling concurrent access

Concurrency Control

Concurrency control is a database management system (DBMS) concept that is used to address the conflicts that occur in a multi-user system. Concurrency control, when applied to a DBMS, is meant to coordinate simultaneous transactions while preserving data integrity.

Concurrent access is quite easy if all users are just reading data, since there is no way they can interfere with one another. However, any practical database has a mix of READ and WRITE operations, and hence concurrency is a challenge.

Potential problems of Concurrency

Lost updates - occur when multiple transactions select the same row and update the row based on the value selected (see the sketch after this list).

Uncommitted dependency issues - occur when a second transaction selects a row that is being updated by another transaction (dirty read).

Non-repeatable read - occurs when a second transaction tries to access the same row several times and reads different data each time.

Incorrect summary issue - occurs when one transaction takes a summary over the values of all the instances of a repeated data item while a second transaction updates a few instances of that specific data item. In that situation, the resulting summary does not reflect a correct result.
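The lost update problem above is easy to reproduce with two concurrent writers; the sketch below (plain Python with invented names, standing in for two database transactions) shows the unsafe read-modify-write pattern and a locked version that prevents the anomaly.

    import threading

    balance = 0
    lock = threading.Lock()

    def deposit_unsafe(amount):
        global balance
        current = balance            # both writers may read the same value...
        balance = current + amount   # ...so one update can overwrite the other (lost update)

    def deposit_safe(amount):
        global balance
        with lock:                   # mutual exclusion: one writer at a time
            balance = balance + amount

    threads = [threading.Thread(target=deposit_safe, args=(1,)) for _ in range(1000)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(balance)   # always 1000 with the lock; deposit_unsafe can lose updates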


Reasons for using concurrency control methods in a DBMS:

1. To apply isolation through mutual exclusion between conflicting transactions

2. To resolve read-write and write-write conflict issues

3. To preserve database consistency by constraining how transactions may execute

4. To help ensure serializability

Concurrency Control Protocols

Different concurrency control protocols offer different benefits between the

amount of concurrency they allow and the amount of overhead that they impose. 

Following are the Concurrency Control techniques in DBMS:

Lock Based Protocols in DBMS is a mechanism in which a transaction cannot

Read or Write the data until it acquires an appropriate lock. Lock based protocols

help to eliminate the concurrency problem in DBMS for simultaneous

transactions by locking or isolating a particular transaction to a single user.

Binary Locks: A binary lock on a data item can be in either a locked or an unlocked state.

1. Shared Lock (S):

A shared lock is also called a Read-only lock. With the shared lock, the data item

can be shared between transactions. This is because you will never have

permission to update data on the data item.

2. Exclusive Lock (X):

With the Exclusive Lock, a data item can be read as well as written. This is

exclusive and can’t be held concurrently on the same data item. X-lock is


requested using lock-x instruction. Transactions may unlock the data item after

finishing the ‘write’ operation.

3. Simplistic Lock Protocol

This type of lock-based protocol allows transactions to obtain a lock on every

object before beginning operation. Transactions may unlock the data item after

finishing the ‘write’ operation.

4. Pre-claiming Locking

Pre-claiming lock protocol helps to evaluate operations and create a list of

required data items which are needed to initiate an execution process. In the

situation when all locks are granted, the transaction executes. After that, all locks

release when all of its operations are over.

Starvation

Starvation is the situation when a transaction needs to wait for an indefinite

period to acquire a lock.

Following are the reasons for Starvation:

When the waiting scheme for locked items is not properly managed

In the case of resource leak

The same transaction is selected as a victim repeatedly

Deadlock

Deadlock refers to a specific situation where two or more processes are waiting

for each other to release a resource or more than two processes are waiting for

the resource in a circular chain.


Two Phase Locking Protocol also known as 2PL protocol is a method of

concurrency control in DBMS that ensures serializability by applying a lock to the

transaction data which blocks other transactions to access the same data

simultaneously. 

Two Phase Locking protocol helps to eliminate the concurrency problem in

DBMS.

This locking protocol divides the execution phase of a transaction into three parts. In the first phase, when the transaction begins to execute, it requests permission for the locks it needs. In the second part, the transaction acquires all the locks. The third phase starts when the transaction releases its first lock; in this phase, the transaction cannot demand any new locks and only releases the locks it has already acquired.

The Two-Phase Locking protocol allows each transaction to make a lock or

unlock request in two steps:

Growing Phase: In this phase transaction may obtain locks but may not release

any locks.

Shrinking Phase: In this phase, a transaction may release locks but not obtain

any new lock
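A toy sketch of the two-phase rule (grow, then shrink) is shown below; the class and method names are invented, and a real lock manager would also handle lock modes, wait queues, and deadlock detection.

    class TwoPhaseTransaction:
        def __init__(self):
            self.held = set()
            self.shrinking = False   # once any lock is released, no more may be acquired

        def lock(self, item):
            if self.shrinking:
                raise RuntimeError("2PL violated: cannot lock after the first unlock")
            self.held.add(item)      # growing phase

        def unlock(self, item):
            self.shrinking = True    # the shrinking phase begins with the first release
            self.held.discard(item)

    t = TwoPhaseTransaction()
    t.lock("A")
    t.lock("B")
    t.unlock("A")
    try:
        t.lock("C")                  # violates two-phase locking
    except RuntimeError as err:
        print(err)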

Strict Two-Phase Locking Method


Strict-Two phase locking system is almost similar to 2PL. The only difference is

that Strict-2PL never releases a lock after using it. It holds all the locks until the

commit point and releases all the locks at one go when the process is over.

Centralized 2PL

In Centralized 2 PL, a single site is responsible for lock management process. It

has only one lock manager for the entire DBMS.

Primary copy 2PL

In the primary copy 2PL mechanism, many lock managers are distributed to different

sites. After that, a particular lock manager is responsible for managing the lock

for a set of data items. When the primary copy has been updated, the change is

propagated to the slaves.

Distributed 2PL

In this kind of two-phase locking mechanism, Lock managers are distributed to all

sites. They are responsible for managing locks for data at that site. If no data is

replicated, it is equivalent to primary copy 2PL. Communication costs of

Distributed 2PL are quite higher than primary copy 2PL.

Timestamp based Protocol in DBMS is an algorithm which uses the System

Time or Logical Counter as a timestamp to serialize the execution of concurrent

transactions. The timestamp-based protocol ensures that conflicting read and write operations are executed in timestamp order.

The older transaction is always given priority in this method. It uses system time

to determine the time stamp of the transaction. This is the most commonly used

concurrency protocol.


Lock-based protocols help you to manage the order between the conflicting

transactions when they will execute. Timestamp-based protocols manage

conflicts as soon as an operation is created.

Validation based Protocol in DBMS, also known as the Optimistic Concurrency Control Technique, is a method of avoiding concurrency conflicts between transactions. In this protocol, local copies of the transaction data are updated rather than the data itself, which results in less interference during execution of the transaction.

The Validation based Protocol is performed in the following three phases:

 Read Phase

 Validation Phase

 Write Phase

Read Phase: In the Read Phase, the data values from the database can be read

by a transaction but the write operation or updates are only applied to the local

data copies, not the actual database.

Validation Phase

In Validation Phase, the data is checked to ensure that there is no violation of

serializability while applying the transaction updates to the database.

Write Phase

In the Write Phase, the updates are applied to the database if the validation is

successful, else; the updates are not applied, and the transaction is rolled back.

Characteristics of Good Concurrency Protocol

An ideal concurrency control DBMS mechanism has the following objectives:

Must be resilient to site and communication failures.


It allows the parallel execution of transactions to achieve maximum concurrency.

Its storage mechanisms and computational methods should be modest to

minimize overhead.

It must enforce some constraints on the structure of atomic actions of

transactions.

DATA DICTIONARIES AND REPOSITORIES

A data dictionary (also called an information repository) is a mini database management system that manages metadata. It is a repository of information about a database that documents the data elements of that database. It describes the

meanings and purposes of data elements within the context of a project, and

provides guidance on interpretation, accepted meanings and representation. A

Data Dictionary also provides metadata about data elements. The metadata

included in a Data Dictionary can assist in defining the scope and characteristics

of data elements, as well as the rules for their usage and application. A data

dictionary is a collection of descriptions of the data objects or items in a data

model for the benefit of programmers and others who need to refer to them.

Often a data dictionary is a centralized metadata repository. A first step in

analyzing a system of interactive objects is to identify each one and its

relationship to other objects. This process is called data modeling and results in a

picture of object relationships. After each data object or item is given a

descriptive name, its relationship is described, or it becomes part of some

structure that implicitly describes relationship. The type of data, such as text or

image or binary value, is described, possible predefined default values are listed


and a brief textual description is provided. This data collection can be organized

for reference into a book called a data dictionary.
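As a small illustration of metadata about data, the sketch below asks SQLite to describe the columns of a table; the table itself is invented, and other DBMSs expose similar information through system catalogs such as INFORMATION_SCHEMA.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        hired_on DATE)""")

    # PRAGMA table_info returns one row of metadata per column: its name,
    # declared data type, nullability, default value and primary-key flag,
    # much like an entry in a data dictionary.
    for column in conn.execute("PRAGMA table_info(employee)"):
        print(column)
    conn.close()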

Types of data dictionaries

There are two types of data dictionaries. Active and passive data dictionaries

differ in level of automatic synchronization.

• Active data dictionaries. These are data dictionaries created within the databases they describe; they automatically reflect any updates or changes in their host databases. This avoids any discrepancies between the data dictionaries and

their database structures.

• Passive data dictionaries. These are data dictionaries created as new

databases -- separate from the databases they describe -- for the purpose of

storing data dictionary information. Passive data dictionaries require an additional

step to stay in sync with the databases they describe and must be handled with

care to ensure there are no discrepancies.

Data dictionary components

Specific contents in a data dictionary can vary. In general, these components are

various types of metadata, providing information about data.

• Data object listings (names and definitions)

• Data element properties (such as data type, unique identifiers, size,

nullability, indexes and optionality)


• Entity-relationship diagrams (ERD)

• System-level diagrams

• Reference data

• Missing data and quality-indicator codes

• Business rules (such as for validation of data quality and schema objects)

Pros and cons of data dictionaries

Data dictionaries can be a valuable tool for the organization and management of

large data listings. Other pros include:

• Provides organized, comprehensive list of data

• Easily searchable

• Can provide reporting and documentation for data across multiple

programs

• Simplifies the structure for system data requirements

• No data redundancy

• Maintains data integrity across multiple databases

• Provides relationship information between different database tables

• Useful in the software design process and test cases

Though they provide thorough listings of data attributes, data dictionaries may be

difficult to use for some users. Other cons include:



• Functional details not provided

• Not visually appealing

• Difficult to understand for non-technical users

Why Use a Data Dictionary?

Data Dictionaries are useful for a number of reasons. In short, they:

• Assist in avoiding data inconsistencies across a project

• Help define conventions that are to be used across a project

• Provide consistency in the collection and use of data across multiple

members of a research team

• Make data easier to analyze

• Enforce the use of Data Standards

What Are Data Standards and Why Should I Use Them?

Data Standards are rules that govern the way data are collected, recorded, and

represented. Standards provide a commonly understood reference for the

interpretation and use of data sets.

By using standards, researchers in the same disciplines will know that the way

their data are being collected and described will be the same across different

projects. Using Data Standards as part of a well-crafted Data Dictionary can help


increase the usability of your research data, and will ensure that data will be

recognizable and usable beyond the immediate research team.

TUNING THE DATABASE FOR PERFORMANCE

Databases are the guts of an application; without them, you're left with just skins

and skeletons, which aren't as useful on their own. Therefore, the overall

performance of any app is largely dependent on database performance. There

are dozens of factors that affect performance including how indexes are used,

how queries are structured and how data is modeled. Consequently, making

minor adjustments to any of these elements can have a large impact. Database

performance tuning refers to the various ways database administrators can

ensure databases are running as efficiently as possible. Typically, this refers to

tuning SQL Server or Oracle queries for enhanced performance. The goal of

database tuning is to reconfigure the operating systems according to how they’re

best used, including deploying clusters, and working toward optimal database

performance to support system function and end-user experience. Poor database

performance bogs down operations, and as the lifeblood of a business,

companies can’t afford barriers to data access. One of the best ways to navigate

past performance issues is by getting a regular database performance audit. Just

like a car needs standard tuning and maintenance, database engines and the

environments they reside in need to be assessed and serviced to ensure things

are working as they should and performing optimally. Database tuning can be an

incredibly difficult task, particularly when working with large-scale data where


even the most minor change can have a dramatic (positive or negative) impact

on performance. In mid-sized and large companies, most database tuning will be

handled by a Database Administrator (DBA). But there are plenty of developers

who have to perform DBA-like tasks; meanwhile, DBAs often struggle to work

well with developers.

Why should you perform database performance tuning?

Tuning the databases enhances the performance but it is only the first step in

keeping applications running smoothly. The purpose of database tuning is to

organize data in a way that makes retrieving information much easier. Without

database performance tuning, we could face problems every time we run

queries, even the response is incorrect or the query takes too long to perform.

10 Database Performance Tuning Best Practices

1. Keep statistics up to date

Table statistics are used to generate optimal execution plans. If the performance

tuning tool is using out-of-date statistics, the plan won’t be optimized for the

current situation.

2. Don’t use leading wildcards

Leading wildcards in parameters force a full table scan, even if there is an

indexed field inside the table. If the database engine must scan all the rows in a

table to find what it’s looking for, the delivery speed of your query results suffers.

Other queries may suffer as well, since scanning all of that data into memory will


cause the CPU utilization to spike and not allow other queries any time in

memory.

3. Avoid SELECT *

This tip is particularly important if you have a large table (think hundreds of

columns and millions of rows). If an application only needs a few columns,

include them individually instead of wasting time querying for all the data. Again,

reading extra data will cause CPU utilization to spike and memory to be

thrashed. You should check the Page Life Expectancy (PLE) to make sure you

are not having this issue.
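To make tips 2 and 3 concrete, the sketch below compares an anti-pattern query with a friendlier one, using Python's built-in sqlite3 module; the schema is invented, and the exact plan text varies between database engines and versions.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA case_sensitive_like = ON")   # lets SQLite consider the index for prefix LIKE
    conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, last_name TEXT, email TEXT)")
    conn.execute("CREATE INDEX idx_last_name ON customer (last_name)")

    # Leading wildcard plus SELECT *: the index cannot help, so the whole table is scanned.
    print(conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM customer WHERE last_name LIKE '%son'").fetchall())

    # No leading wildcard and an explicit column list: the optimizer can use the index instead.
    print(conn.execute(
        "EXPLAIN QUERY PLAN SELECT customer_id, last_name FROM customer "
        "WHERE last_name LIKE 'son%'").fetchall())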

4. Use constraints

Constraints are an effective way to speed up queries and help the SQL

optimizer come up with a better execution plan, but the improved performance

comes at the cost of the data requiring more memory. The increased query

speed may be worth it depending on the business objective, but it’s important to

be aware of the price.

5. Look at the actual execution plan, not the estimated plan

The estimated execution plan is helpful when you are writing queries because it

gives you a preview of how the plan will run, but it is blind to parameter data

types which could be wrong. To get the best results when performance tuning,

it’s often better to review the actual execution plan because it uses the latest,

most accurate statistics.


6. Adjust queries by making one small change at a time

Making too many changes at once tends to muddy the waters. A better, more

efficient approach to query tuning is to make changes with the most expensive

operations first and work from there.

7. Adjust indexes to reduce I/O

Before you dive into troubleshooting I/O directly, first try adjusting indexes and

query tuning. Consider using a covering index that includes all the columns in the

query; this reduces the need to go back to the table, since all the columns can be obtained

from the index. Adjusting indexes and query tuning have a high impact on almost

all areas of performance, so when they are optimized, many other performance

issues resolve as well.
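As a sketch of the covering-index idea in tip 7 (invented names, SQLite syntax; the plan text differs between engines):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

    # The index contains every column the query needs, so the table itself
    # never has to be visited; the index "covers" the query.
    conn.execute("CREATE INDEX idx_orders_cover ON orders (customer_id, total)")

    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT customer_id, total FROM orders WHERE customer_id = 7"
    ).fetchall()
    print(plan)   # SQLite typically reports a SEARCH ... USING COVERING INDEX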

8. Analyze query plans

Utilizing artificial intelligence to analyze your execution plan and determine how

to change it helps databases execute operations more efficiently.

9. Compare optimized and original SQL

When optimizing SQL queries, be sure to highlight changes in the SQL statement

so you can compare the original statement with the optimized version. Gather a

baseline metric such as logical I/O to compare against as you tune. Don’t make

any changes until you are sure the optimized version is accurate (i.e., includes

current statistics) and really does improve performance.

10. Automate SQL optimization



Automated SQL optimization tools not only analyze your SQL statement but can

also automatically rewrite it or optimize indexes until it finds the variation that

creates the most improvement in the execution time of the query.

DATA AVAILABILITY

Data availability is a measure of how often your data is available to be used,

whether by your own organization, or by one of your partners. It is desirable to

have your data available 24x7x365, which will permit your business to run

uninterrupted. Unexpected issues and interruptions are inevitable when dealing

with data management, so designing a system that can work around those

issues while still delivering data is essential. Data availability is primarily used to

create service level agreements (SLA) and similar service contracts, which define

and guarantee the service provided by third-party IT service providers.

Availability has to do with the accessibility and continuity of information.

Information with low availability concerns may be considered supplementary

rather than necessary. 

Information with high availability concerns is considered critical and must be

accessible in order to prevent negative impact on University activities. It is the

ability to guarantee reliable access to data. Organizations must keep crucial data

available and shorten data outage times as much as possible. To achieve data

availability, organizations must be able to quickly repair all hardware failures and

maintain backups.  Typically, data availability calls for implementing products,

services, policies and procedures that ensure that data is available in normal and


even in disaster recovery operations. This is usually done by implementing

data/storage redundancy, data security, network optimization, and

more. Storage area networks (SAN), network attached storage and RAID-based

storage systems are popular storage management technologies for ensuring

data availability.

Data Availability Challenges

There are several issues that can affect the availability of your data:

Host server failures—if the server that stores your data fails, your data will

become unavailable.

Storage failures—if your physical storage device fails, you can no longer access

the data it stores.

Network crash—if the network crashes, the host server becomes inaccessible

along with the data stored on it.

Poor data quality—low-quality datasets may contain incomplete, inconsistent, or

redundant data, which could be useless for your IT operations.

Data compatibility issues—data that is usable and working on a specific platform

or environment might not be on another.

Legacy data—data that is too outdated can become unusable. You can use data

transformation tools to make older data readily accessible, but these do not

always work.


Best practices to follow to combat data availability challenges include:

• Redundancy and backups. Backing up data is an essential aspect of data

availability. Data backups should be stored in separate locations or in a

distributed network. This way, if data is lost or corrupted, it can be restored

quickly. Storage devices are often set up in a redundant array of independent

disks (RAID) configuration.

• The use of data loss prevention tools. DLP tools can help mitigate data

breaches and damage to data centers.

• Erasure coding. This data protection method breaks data into fragments, expands them, and then encodes them with redundant data pieces (a simple sketch follows this list). The data is then

stored across a set of different locations or storage devices. If a drive fails or data

becomes corrupted, the data can be reconstructed from the segments stored on

the other drives.

• Following retention policies and procedures. If data or devices are no

longer needed, they should be either archived or securely disposed of.

• Automatically switching to backups. Flexibility can be added by

automatically switching to a backup or failover environment if a drive fails or data

is lost.
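The sketch below illustrates the simplest form of the erasure-coding idea from the list above: single-parity XOR, as used in RAID-style redundancy. Real erasure codes (for example Reed-Solomon) are more sophisticated, and the fragments here are invented.

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    # Two data fragments of equal length plus one redundant parity fragment.
    fragment_1 = b"CUSTOMER"
    fragment_2 = b"RECORDS!"
    parity = xor_bytes(fragment_1, fragment_2)

    # If one fragment is lost, it can be rebuilt from the survivor and the parity.
    rebuilt_fragment_2 = xor_bytes(fragment_1, parity)
    print(rebuilt_fragment_2)   # b'RECORDS!'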

