Performance Tuning On DWS

H
e
r
e
a
r
e
s
o
m
e
PERFORMANCE EVALUATION AND

OPTIMIZATION TECHNIQUES OF DESIGNING
ORACLE DB SCHEMAS AND QUERIES FOR
DATAWAREHOUSE0
o
f
t
h
e
b
e
n
e
f
i
t
s
o
f
a
d
a
t
a
w
a
r
e
h
o
u
s
e
:
o
W
i
t
h
by
Bimal Kinkar Das
ABSTRACT :
Online Transaction Processing (OLTP) applications such as order processing,
process claims, checking accounts etc are important systems that are used to run the
organization. As an organization grows larger, hundreds of computer applications are needed
to support various business processes. But as the business grows more complex and spread
globally, business executives need more information to stay competitive and to make
strategic decisions. The operational systems, although needed for running the system could
not provide the strategic information. Thus, the need arises for Data warehousing which is a
paradigm intended to provide strategic information. OLTP systems are not suitable for
decision support
applications
because
their
emphasis
is
providing
ACID (Atomicity,
Consistency, Isolation and Durability) semantics. Hence performace of these systems is poor
for decision support applications.
Decision support applications use Online Analytical Processing(OLAP) which is a recent and
important application of database systems.Large organizations spend billions on Data
warehousing to make their business turn and they expect their Decision support systems to
provide them with the timely information. Decision support systems and applications used
for extracting strategic information need enterprise view of data. The data warehouses used
by these systems typically maintain a lot of information which grow over time. Due to the
large volume of data, the queries fired on the database may take hours before returning the
results. Also, OLAP queries are complex and can take many hours if executed directly on raw
data.
Thus, in this dissertation we look into the ways to minimize the time to provide information.
To minimize the time, we need to find the ways so that the performance of the database
used for the data warehouse implementation is high. To achieve this we have :
Compared , evaluated and found the best schema i.e. the star schema design for
implementing data warehouse
Compared, evaluated and found the ways to write queries with best response time and
throughput.
Broad Academic Area of Work: Data Warehousing

Key words: Data Warehousing , Oracle , Schema For data warehouse , Query optimization.
ACKNOWLEDGEMENTS
I take this opportunity to thank all those who have helped me in my dissertation
work.
Bimal kinkar Das

kinkarbimal@gmail.com
CONTENTS
Title Page
Certificate
Abstract
Acknowledgment
List of Figures and tables
Contents
References
Appendix
i
ii
iii
iv
v
vi
vii
vii
TABLE OF CONTENTS
INTRODUCTION ....................................................................................................................................1
1.1 Introduction to Data warehousing....................................................................................................1
1.2 Need for Data Warehousing.............................................................................................................2
1.3 Architecture of Data warehouse.......................................................................................................4
1.4 Attributes of Data Warehouse..........................................................................................................6
1.5 Data Warehouse vs Data Marts........................................................................................................9
Top Down Approach (Creating data warehouse first):.....................................................................9
Bottom Up Approach (Creating data marts first):............................................................................9
1.6 Advantages of Data warehouse.....................................................................................................10
1.7 Disadvantages of Data warehouse.................................................................................................11
SCHEMA ALTERNATIVES IN DATAWAREHOUSE..........................................................................12
2.1 Introduction...................................................................................................................................12
2.2 Dimension Modeling Schemas......................................................................................................13
2.3 Star Schema...................................................................................................................................15
2.4 Snowflake Schema........................................................................................................................17
INTRODUCTION TO TECHNOLOGIES USED..................................................................................19
3.1 Introduction to SQL......................................................................................................................19
3.1.1 What Can SQL do?................................................................................................................19
3.1.2 Syntax for Basic SQL Queries:..............................................................................................20
PERFORMANCE EVALUATION OF SCHEMAS................................................................................23
4.1 Introduction...................................................................................................................................23
4.2 Creating the star schema................................................................................................................24
4.4 Creating a snowflake schema........................................................................................................30
4.5 Data Model for the Snowflake Schema.........................................................................................31
4.6 Creating Tables.............................................................................................................................36

4.7 Data for the tables..........................................................................................................................36
4.8 Populating the data into warehouse tables using SQL Loader......................................................36
4.9 Performance evaluation of Star and Snowflake Schema..............................................................37
4.10 Timing Measurements................................................................................................................39
4.11 STAR OR SNOWFLAKE...........................................................................................................45
PERFORMANCE EVALUATION AND OPTIMIZATION OF QUERIES...........................................46
5.1 Understanding Oracle Optimizer and Explain plan......................................................................46
5.1.1 The Execution Plan................................................................................................................47
5.1.2 Cardinality.............................................................................................................................49
5.1.3 ACCESS METHOD..............................................................................................................51
5.1.4 JOIN METHOD....................................................................................................................54
5.1.5 JOIN ORDER........................................................................................................................58
5.1.6 Partition Pruning....................................................................................................................59
5.2 Star Transformation (Dimensions and Fact with bitmap indexes)................................................61
5.3 Understanding Optimizer Hints....................................................................................................66
5.4 Use of Analytical functions...........................................................................................................69
5.5 Understanding Query Rewrite.......................................................................................................74
Datawarehouse Aggregates : Materialized Views..........................................................................74
Query Rewrite................................................................................................................................77
Dimensions.....................................................................................................................................78
Example1........................................................................................................................................79
Example 2.......................................................................................................................................82
Scenario for Sales Data Mart..........................................................................................................86
CONCLUSION AND FUTURE WORK.................................................................................................92
Conclusion............................................................................................................................92
Future Work...........................................................................................................................93
REFERENCES........................................................................................................................................94
APPENDIX..............................................................................................................................................95
1. SAMPLE DATA FOR TABLES ....................................................................................................95
2. CREATE TABLES........................................................................................................................103
3. GENERATE CONTROL FILE DYNAMICALLY FOR SQL* LOADER...................................113
4. DATA LOADING USING SQL* LOADER.................................................................................114
5 LOGS FOR SCHEMA PERFORMANCE EVALUATION...........................................................116
Checklist of items for the Final Dissertation Report.............................................................................139
List Of Figures
1 : Basic Data
warehouse...........................................................................................5
2 : Complex Data
warehouse......................................................................................5
3 : General Star
Schema..........................................................................................16
4 : General Snowflake
Schema..................................................................................17
5 : STAR Schema for Sales Data
Mart.........................................................................24
6 : Data Model for the Star Schema...........................................................................25
7 : SNOWFLAKE schema for SALES data
mart.............................................................30
8 : Normalized Product
Dimension.............................................................................30
9 : Normalized Customer
Dimension..........................................................................31
10 : Normalized Geography
Dimension........................................................................31
11 : Schema Performance Comparision........................................................................44
12 : Representation of typical non repeating data in star and snowflake
schema................45
13 : Representation of typical repeating data in star and snowflake
schema......................45
14 : Aggregation setup for
SALES................................................................................86
List Of Tables
1 : Data model for SALES.........................................................................................26
2 : Data model for SALES_AGG.................................................................................27
3 : Data model for PRODUCT.....................................................................................27
4 : Data model for CUSTOMER..................................................................................28
5 : Data model for STORE........................................................................................ 28
6 : Data model for GEOGRAPHY.................................................................................29
7 : Data model for CAL_MASTER...............................................................................29
8 : Data model for PROD..........................................................................................32
9 : Data model for PROD_1.......................................................................................32
10: Data model for PROD_2.......................................................................................32
13: Data model for Cust............................................................................................33
14: Data model for Cust_1........................................................................................34
18: Data model for Geo.............................................................................................35
19: Data model for Geo_1.........................................................................................35

22: Result of performance of designing schemas.........................................................43
Chapter 1
INTRODUCTION
1.1 Introduction to Data warehousing
Data Warehouse is a term to describe a system used in an organization to collect data,
most of which are transactional data such as purchase records and etc., from one or
more data sources such as the database of a transactional system, into a central data
location, the Data Warehouse, and later report those data generally in an aggregated
way to business users in an organization. Data warehouses contain a wide variety of data
that present a coherent picture of business conditions at a single point in time that is
Business Driven, Market Focused and Technology Based.
This system consists of an ETL tool, a database, a reporting tool and other facilitating
tools such as a data modeling tool.
A Data warehouse (DW) is a database used for reporting. The data is offloaded from the
operational systems for reporting. The data may pass through an operational data store
for additional operations before it is used in the data warehouse for reporting. A data
warehouse maintains its functions in three layers: staging, integration and access.
Staging is used to store raw data for use by developers (analysis and support).
Integration layer is used to integrate data and to have a level of abstraction from
users.
Access layer is for getting data out for users. This definition of data warehouse
focuses on data storage.
The main source of the data is cleaned, transformed, cataloged and made available for
use by managers and other business professionals for data mining, online analytical
processing, market research and decision support.
1.2 Need for Data Warehousing

Every industry, from retail chain stores to manufacturing enterprises, from financial
institutions to government departments, and from airline companies to utility businesses,
need data warehousing to do business analysis and make strategic decisions.
As an organizations grow larger, hundreds of computer applications were needed to
support various business processes. But as the business grows more complex and spread
globally, business executives needed more information to stay competitive and to make
strategic decisions. Executives and managers who are responsible for keeping the
enterprise competitive need information to take appropriate decisions. They need
information to formulate business strategies, establish goals, set objectives and monitor
results. Information is needed to:
Get in depth knowledge of their company's performance.
Learn about key business factors and how they affect one another.
Monitor how business factors change over time.
Compare their company's performance relative to competition and industry

benchmarks.
In short, strategic information is required for continued health and survival of an

organization. The operational systems, although needed for running the system,
important as they were, could not provide strategic information. Businesses, therefore,
were compelled to turn to new ways of getting strategic information.
Over the past some decades, companies have accumulated tons and tons of data about
their operations. Information is said to double every 18 months. If the organizations have
such huge quantities, why cant
executives and managers use this data for making
strategic decisions? This is because the available data is not readily usable for strategic
decision making. These large quantities of data are very useful and good for running the
business operations, but hardly amenable for use in making decisions about business
strategies and objectives.
The fact is that for nearly two decades or more, IT departments have been attempting to
Provide information to key personnel in their companies for making strategic decisions.
Sometimes an IT department could produce ad hoc reports from a single application. In
most cases, the reports would need data from multiple systems, requiring the writing of
extract programs to create intermediary files that could be used to produce the ad hoc
reports.
Most of these attempts by IT in the past ended in failure. The users could not clearly
define what they wanted in the first place. Once they saw the first set of reports, they
wanted more data in different formats. The chain continued. This was mainly because of
the very nature of the process of making strategic decisions. Information needed for
strategic decision making has to be available in an interactive manner. The user must be
able to query online, get results, and query some more. The information must be in a
format suitable for analysis.
What is a basic reason for the failure of all the previous attempts by IT to provide
strategic information? What has IT been doing all along? The fundamental reason for
the inability to provide strategic information is that we have been trying all
along to provide strategic information from the operational systems. These
operational systems such as order processing, inventory control, claims processing,
outpatient billing, and so on are not designed or intended to provide strategic
information. If we need the ability to provide strategic information, we must get the
information from altogether different types of systems. Only specially designed decision
support systems or informational systems can provide strategic information.
Thus, we do need different types of decision support systems to provide strategic
information. The type of information needed for strategic decision making is different
from that available from operational systems. We need a new type of system
environment for the purpose of providing strategic information for analysis and
monitoring performance.
This new system environment that users desperately need to obtain strategic information
happens to be the new paradigm of data warehousing. Enterprises that are building data
warehouses are actually building this new system environment. This new environment is
kept separate from the system environment supporting the day-to-day operations. The
data warehouse essentially holds the business intelligence for the enterprise to enable
strategic decision making. The data warehouse is the only viable solution. We have
clearly seen that solutions based on the data extracted from operational systems are all
totally unsatisfactory.
Basically, data warehouse is a simple concept, it involves different functions: data
extraction, the function of loading the data, transforming the data, storing the data, and
providing user interfaces.The end result is the creation of a new computing environment
for the purpose of providing the strategic information every enterprise needs desperately.
1.3 Architecture of Data warehouse

Data warehouses and their architectures can vary depending upon the specifics of each
organization's situation.
In the most basic architecture for a data warehouse the data is fed from one or more
source systems, and end users directly access the data warehouse.
Figure 1: Basic Data warehouse

There is a more complex data warehouse environment. In addition to central database
there is a staging system used to cleanse and integrate data as well as multiple data
marts, which are systems designed for a particular line of business.
Figure 2: Complex Data warehouse
The main components of a complex data warehouse are:

Data Acquisition ( Data Source and Data Staging)
The main functions of Data Acquisition component are Extraction, Transformation
and Staging. Data is selected from different data sources using automated extract
procedures after providing different filters and is reformatted as per requirement.
Then this data is mapped to data required for the data warehouse repository.
Various transformations like data type conversion, demoralization, aggregation,
merge and purge etc. are performed. This data is then loaded to the staging area
and audit trail is preserved for each and every data item.
Data Storage
The main function of Data storage component is to load the data using either full
refresh or incremental loads. The main services of this component is to optimize
the load process , provide backup and recovery, handle security and data
archival, monitor and fine tune the database.
Information Delivery
Information delivery component takes the data from the data storage area and
includes:
User desktop query sessions
Information printed on adhoc reports
Data stores of query and reporting tools for repeated usage
Downstream systems such as EIS and Data mining
OLAP
1.4 Attributes of Data Warehouse

The characteristics of a data warehouse are :
Subject Oriented
Integrated
Non Volatile
Time Variant
Subject Oriented:
A data warehouse is organized around subjects. Subject orientation presents the data
in a format that is consistent and much clearer for end users to understand. In
contrast operational data such as order processing and manufacturing databases are
organized around business activities or functional areas. They are typically optimized
to serve a single static application. The functional separation of applications causes
companies to store identical information in multiple locations. The duplicated
information's format is usually inconsistent. For data warehouse subjects could be
Product, Customers and Orders as opposed to Purchasing and Payroll.
Data warehouses are designed to help you and analyze your data. For example, you
might want to learn more about your company's sales data. To do this, you could
build a warehouse concentrating on sales. In this warehouse, you could answer
questions like Who was our best Customer for this item last year? This kind of
focus on a topic, sales , in this case, is what is meant by subject oriented.
Integrated:
Integration of data within a warehouse is accomplished by dictating consistency in
format naming etc. Operational databases, for historic reasons, often have major
inconsistencies in data representation.
For example, a set of operational databases may represent male and female by
m and f, by 1 and 2 , by x and y. Frequently the inconsistencies are more
complex . Hence cleansing operations convert various representations of data into
one format so that they can be accommodated into a single schema. By definition,
data is always maintained in a consistent fashion in a data warehouse..Data
warehouses need to have the data from disparate sources put into a consistent
format. This means that naming conflicts have to be resolved and problems like data
being in different units of measure must be resolved.
Non Volatile:
Non Volatility, another primary aspect of data warehouses, means that after the
informational data is loaded into the data warehouse, changes, inserts or deletes are
rarely performed. Data is read only in most of the cases. The loaded data is
transformed data that originated from the operational databases. The data
warehouse is subsequently reloaded or more likely appended on a periodic basis with
new, transformed or summarized data from operational databases. Apart from the
loading process, the information contained in the data warehouse generally remains
static. The property of non volatility permits a data warehouse to be heavily
optimized for query processing.
Non volatile means that the data should not change once entered into the data
warehouse. This is logical because the purpose of a warehouse is to analyze what
has occurred.
Time Variant:
Data warehouse are time variant in the sense that they maintain both historical and
(nearly) current data. Operational databases, in contrast, contain only the most
current, up to date values. Furthermore, they generally maintain this information for
no more than one year (and often much less). By, comparison, data warehouses
contain data that is generally loaded from operational databases daily, weekly or
monthly and then typically maintained for a period of 3 to 5 years. This aspect marks
a major difference between the two types of environments. Historical information is
of high importance to decision makers. They often want to understand trends and
relationships between data. For example, the product manager for a soft drink maker
may want to see the relationship between coupon promotions and sales. This type of
information is typically impossible to determine with an operational database that
contains only current data. Most business analysis requires analyzing trends.
Because of this, analysts tend to need large amounts of data. This is very much in
contrast to OLTP systems , where performance requirements demand historical data
to be moved to an archive.
1.5 Data Warehouse vs Data Marts

A data warehouse incorporates information about many subject areas, often the entire
enterprise while the data mart focuses on one or more subject areas. The data mart
represents only a portion of an enterprise's data perhaps data related to a business unit
or work group. Typically, a data mart's data is targeted to a smaller audience of end
users or used to present information on a smaller scope. A data warehouse shows the
corporate view of the data where as a data mart provides a structure to suit
departmental view of the data.
The most common topic of debate is whether to create a data mart or to create a data
warehouse first. There are advantages and disadvantages of both the approach.
Top Down Approach (Creating data warehouse first):
Advantages:
It shows an enterprise view of data.
It is inherently architect-ed not a union of disparate data marts
It is a single , central storage of data.
There are centralized rules and control.
It provides quick results if implemented with iterations.
Disadvantages:
This approach takes longer to build.

It has high exposure to risk and failure.
Creating data warehouse needs cross functional skills.
Bottom Up Approach (Creating data marts first):

Advantages:
Faster and easy implementation.
Less risk of failure.
Inherently incremental , can schedule important data marts first.
Favorable returns on POC.
Disadvantages:
Permeates redundant data.
Inconsistent and irreconcilable data.
Proliferates unmanageable interfaces
Each data mart has its own narrow view of data.
1.6 Advantages of Data warehouse

Some of the benefits of a data warehouse are:
With data warehousing, you can provide a common data model for different
interest areas regardless of data's source. In this way, it becomes easier to report
and analyze information.
Many inconsistencies are identified and resolved before loading of information in

data warehousing. This makes the reporting and analyzing process simpler.
The best part of data warehousing is that the information is under the control of
users, so that in case the system gets purged over time, information can be easily
and safely stored for longer time period.
Because of being different from operational systems, a data warehouse helps in

retrieving data without slowing down the operational system.
Data warehousing enhances the value of operational business applications and

Customer relationship management systems.
Data warehousing also leads to proper functioning of support system applications

like trend reports, exception reports and the actual performance analyzing reports.
Precisely, a data warehouse system proves to be helpful in providing collective

information to all its users. It is mainly created to support different analysis, queries that
need extensive searching on a larger scale.
10
1.7 Disadvantages of Data warehouse

However, there are considerable disadvantages involved in moving data from multiple,
often highly disparate, data sources to one data warehouse that translate into long
implementation time, high cost, lack of flexibility, dated information, and limited
capabilities:
Major data schema transforms from each of the data sources to one schema in the
data warehouse, which can represent more than 50% of the total data warehouse
effort .
Data owners lose control over their data, raising ownership (responsibility and
accountability), security and privacy issues.
Long initial implementation time and associated high cost .
Adding new data sources takes time and associated high cost .
Limited flexibility of use and types of users - requires multiple separate data
marts for multiple uses and types of users .
Typically, data is static and dated .
Difficult to accommodate changes in data types and ranges, data source schema,
indexes and queries .
Typically, cannot actively monitor changes in data .
11
Chapter 2
SCHEMA ALTERNATIVES IN DATAWAREHOUSE

2.1 Introduction
The description of the database is called the database schema . A database schema is
specified during the database design and it describes how the database is organized on
the disk. As mentioned before OLTP database schemas are usually optimized for the
recording of business transactions and use various normalization forms. Normalization of
data is looked on as a process during which unsatisfactory relational schemas are
decomposed by breaking up their attributes into smaller relations that posses desirable
properties. Normalization has several advantages for data integrity (e.g. preventing
update anomalies) in an OLTP environment, but works strongly against a DSS
environment as follows:
Normalization process takes descriptions of entities from the business domain and
breaks them apart into Number of small tables. Breaking up entities into small
tables also reduces the amount of redundant data stored in the database since
these each table can be joined with other tables to form more than one business
entity. Reducing data redundancy means that the same piece of data will only be
stored in the database once. By having the data stored only once, the problem of
updating multiple copies disappears when the data changes. For DSS the main
problem is that the business entities will be distributed among a Number of tables
which will then need to be joined together in order to form the things that you will
want to use in your analysis. These joins that have to be performed can each
require a significant amount of processing by the database system.
Because of normalization, answering simple business questions will require joining

large Number of tables. Multiple join path may exist between any set of tables as
tables share multiple keys and can be joined in different ways. Hence business
users must have good knowledge as to how to join the tables.
12
Cryptic naming conventions and data formats. One of the primary reasons why
business users cannot read normalized schema is that it encourages cryptic,
shorthand naming conventions for columns and tables.
Hence the data warehouses use highly denormalized schemas to provide instant access
without having to perform a large Number of joins which is the major cause of poor
performance.
2.2 Dimension Modeling Schemas

A database comprises of some tables, these tables and the relationships among all the
tables in the database , views, indexes, and synonyms is collectively called the database
schema. Although there are many different schema designs, data warehouses
use a
dimensional schema design.

A dimensional schema physically separates the measures that quantify the business from
the descriptive elements (also called dimensions) that describe and categorize the
business. The dimensional schema is a physical or logical schema. A physical dimensional
schema is typically represented in the form of a star or snowflake schema, where the
objects in the star or snowflake schema are actually database tables. The dimensional
schema can even take the form of a single table or view, where all the facts and
dimensions are simply in different columns of that table or view. In a logical dimensional
schema, the fact, measures, and dimensions are represented as entities and attributes
that are independent of a database vendor and can therefore be transformed to a
physical dimensional schema for any database vendor.
Thus, we may use dimensional modeling to create queries that answer business
questions. Typically, a query calculates some measure of performance over several
business dimensions.
As described above there are two types of tables in a dimensional schema:
Dimension tables: These tables contain descriptive textual information and the
classification attributes which are used for analysis. Dimension attributes are used as a
source of most of the interesting constraints in the data warehouse queries, and they are
13
virtually always the source of the row headers in the SQL answers set.
Fact tables: These table contains one or more numerical measures, or facts that occur
for the combination of a multi part primary key made up of two or more foreign keys
from the dimension tables.
To design a dimensional schema the following design model is used:
Choose the business process
Declare the Grain
Identify the dimensions
Identify the Fact
CHOOSE THE BUSINESS PROCESS

The basics in the design build on the actual business process which the data
warehouse should cover. Therefore the first step in the model is to describe the
business process which the model builds on. This could for instance be a sales
situation in a retail store.
DECLARING THE GRAIN
After describing the Business Process, the next step in the design is to declare the
grain of the model. The grain of the model is the exact description of what the
dimensional model should be focusing on. This could for instance be An individual
line item on a Customer slip from a retail store. To clarify what the grain means, you
should pick the central process and describe it with one sentence. Furthermore the
grain (sentence) is what you are going to build your dimensions and fact table from.
You might find it necessary to go back to this step to alter the grain due to new
information gained on what your model is supposed to be able to deliver.
IDENTIFY THE DIMENSIONS
The third step in the design process is to define the dimensions of the model. The
dimensions must be defined within the grain from the second step of the 4-step
14
process. Dimensions are the foundation of the fact table, and is where the data for
the fact table is collected. Typically dimensions are nouns like date, store, inventory
etc. These dimensions are where all the data is stored. For example, the date
dimension could contain data such as year, month and weekday.
IDENTIFY THE FACTS
After defining the dimensions, the next step in the process is to make keys for the
fact table. This step is to identify the numeric facts that will populate each fact table
row. This step is closely related to the business users of the system, since this is
where they get access to data stored in the data warehouse. Therefore most of the
fact table rows are numerical, additive figures such as quantity or cost per unit, etc.
The two types of dimensional schemas used in data warehouse are :
Star Schema
Snowflake Schema
The star or the snowflake schema design represents data as an array where each
dimension is a subject around which analysis is performed.
2.3 Star Schema

The star schema is the simplest data warehouse schema. It is called a star
schema because the entity-relationship diagram of this schema resembles a star, with
points radiating from a central table. The center of the star consists of a large fact table
and the points of the star are the dimension tables. Each dimension table is joined to the
fact table using a primary key to foreign key join, but the dimension tables are not joined
to each other.
15
Figure 3: General Star Schema

The main advantages of star schemas are:
They are easy to understand as the data is organized around subjects.
Since data is in highly de-normalized format, browsing through data is easy.
It provide a direct and intuitive mapping between the business entities being
analyzed by end users.
It is very easy to define dimensional hierarchies.
Provide highly optimized performance for typical star queries.
Maintenance required for the data warehouse is low if we use integers and
surrogate keys for joining the tables.
The main disadvantage of star schema is :
Since data is not normalized, there is a lot of redundancy. Hence representation

requires more space as compared to the snowflake schema.
Star schemas are used for both simple data marts and very large data warehouses.
16
2.4 Snowflake Schema

The snowflake schema is extension of the star schema where each point in the star
explodes into more points. In this schema, the dimension tables are normalized as
compared to the star schema. Tables in a snowflake schema are usually normalized to
the third normal form. Each dimension table represents exactly one level in a hierarchy.
A snowflake schema can have any Number of dimensions and each dimension can have
any Number of levels.
Figure 4: General Snowflake Schema

The advantage provided by the snowflake schema are improvements in the query
performance due to minimized disk storage for the data and improved performance by
joining smaller normalized tables rather than large de-normalized ones.
17
Many consistency problems due to updates are also solved because normalization lowers
the granularity of the dimension. However, since the snowflake schema has more tables,
even browsing through the data requires joining tables to get the complete information,
hence performance is poor.
It can also give performance for the aggregation queries if data storage has very low
redundancy.
The main advantage of a snowflake schema is that it may improve the performance if
smaller tables are joined due to which it is easy to maintain and increase flexibility.
The main disadvantage of snowflake schema is that it increases the Number of tables an
end user must work with making the queries much more difficult to create and execute
because more tables need to be joined.
18
Chapter 3
INTRODUCTION TO TECHNOLOGIES USED

3.1 Introduction to SQL
Structured Query Language, commonly abbreviated to SQL and pronounced as sequel,
is not a conventional computer programming language in the normal sense of the phrase.
With SQL you don't write applications, utilities, batch processes, GUI interfaces or any of
the other types of program for which you'd use languages such as Visual Basic, C++,
Java etc. Instead SQL is a language used exclusively to create, manipulate and
interrogate databases. SQL is about data and results, each SQL statement returns a
result, whether that result be a query, an update to a record or the creation of a
database table.
SQL itself makes no references to the underlying databases which it can access, which
means that it is possible to have a SQL engine which can address relational databases,
non-relational (flat) databases, even spreadsheets. However, SQL is most often used to
address a relational database, which is what some people refer to as a SQL database.
3.1.1 What Can SQL do?

SQL can execute queries against a database
SQL can retrieve data from a database
SQL can insert records in a database
SQL can update records in a database
SQL can delete records from a database
SQL can create new databases
SQL can create new tables in a database
19
SQL can create stored procedures in a database

SQL can set permissions on tables, procedures, and views
3.1.2 Syntax for Basic SQL Queries:
Creating a Database:CREATE DATABASE dbname;
Creating Schema:CREATE SCHEMA schema_name
Dropping Schema :DROP SCHEMA schema_name
Creating Tables With SQL:CREATE TABLE name( col1 datatype, col2 datatype, , constraint_declare1);
where constraint_declare :: = [ CONSTRAINT constraint_name ]
PRIMARY KEY ( col1, col2, ... ) |
FOREIGN KEY ( col1, col2, ... ) REFERENCES f_table [ ( col1, col2, ... ) ] |
UNIQUE ( col1, col2, ... ) |
CHECK ( expression )
Drop Tables:DROP TABLE name;
Alter table:ALTER TABLE tablename ADD [COLUMN] column_name

ALTER TABLE tablename ADD constraint_declare
ALTER TABLE tablename DROP [COLUMN] column_name
ALTER TABLE tablename DROP CONSTRAINT constraint_name
ALTER TABLE tablename DROP PRIMARY KEY
Inserting Data:INSERT INTO target [(field1[, field2[, ...]])]
20
VALUES (value1[, value2[, ...]);
Updating Data:UPDATE table SET newvalue WHERE criteria
Deleting Data:DELETE [table.*] FROM table WHERE criteria;
Querying The Database:SELECT columns FROM table(s) WHERE criteria;
The Where Clause:SELECT * FROM tablename WHERE Dept="Development";
Sorting Query Results:SELECT * FROM tablename ORDER BY columnname [desc]
Joining Data:SELECT a.col1 , a.col2 , b.col1 . b.col2

FROM a INNER JOIN b ON a.col1 = b.col1
WHERE criteria;
Aggregates:SELECT col1, Avg(tablename.col3) AS col3

FROM tablename
GROUP BY col1
HAVING Avg(tablename.col3) <=64;
Creating a view:CREATE VIEW view_name [ ( column_name1, column_name2, ... ) ]

AS SELECT * FROM....
Dropping a view:DROP VIEW view_name
Creating a sequence:CREATE SEQUENCE name

[ INCREMENT increment_value ]
[ MINVALUE minimum_value ]
[ MAXVALUE maximum_value ]
[ START start_value ]
21
[ CACHE cache_value ]
[ CYCLE ]
Dropping a sequence:DROP SEQUENCE name
There is a lot more functionality available in SQL but the above commands serve as the
basis for writing the complex SQL queries.
22
Chapter 4
PERFORMANCE EVALUATION OF SCHEMAS

4.1 Introduction
In this chapter, we analyze the performance of both the schemas the star and the
snowflake schemas for various values of data repetitions. The performance can be
evaluated by using the command set timing on that returns the value of time and
calculates the time elapsed in a query.
Consider a data warehouse for a retail chain. This warehouse contains several data marts
like Sales , Purchase, Finance , Store Operationsl etc. Our focus in this analysis would be
on the Sales data mart. We create a dimensional model for this data mart with fact
table as Sales which measures the quantity sold , net invoice value , gross invoice value
in local currency and net invoice value , gross invoice value in US dollars based on the
dimensions. The dimension along which the sales data is measured are Customer who
made the transaction , Store from which the transaction was done , Product that was
bought , Geography where the store exists and the Date on which the transaction was
done and the data was loaded into the sales table. The sales table is a partitioned and
subpartitioned table so as to improve performance of the queries fired on the Sales fact
table.
The Product dimension will have Product Id, Product Line (Prod_1) , Product Brand
(Prod_2) , Product Sub category (Prod_3) and Product Category (Prod_4). The Customer
dimension will have Customer Id , Customer Organization (Cust_1) , Customer
Department (Cust_2) , Customer Location (Cust_3) and Customer Region(Cust_4) .The
geography dimension will have Geography , State (Geo_1) ,
Country (Geo_2) and
Region (Geo_3).The Cal_Master Dimension contains date, week, month , quarter and
year details for three years.The store dimension contains the store and its manager's
details.
23
We will create a star and a snowflake schema for the sales data mart and then compare
the performance of both.
4.2 Creating the star schema
Figure 5: STAR Schema for Sales Data Mart
24
4.3 Data Model for the Star Schema
Figure 6: Data Model for the Star Schema

TABLE: SALES
Partition by Range (day_id) subpartition by List (fact_type)
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
References
Fact_Id
Number(8)
Fact_type
Varchar2(10)
Transaction_date
Date
Cal_master(date)
disable novalidate
rely
Day_Id
Number
Cal_master(day_id)
disable novalidate
25
rely
Load_date
Date
Cal_master(date)
disable novalidate
rely
Billing_Type
Varchar2(10)
Cust_Id
Number(80)
Customer (Cust_id)
disable novalidate
rely
Prod_Id
Number(80)
Product(Prod_id)
disable novalidate
rely
Geo_Id
Number(80)
Geography(geo_id)
disable novalidate
rely
Store_ID
Number(80)
Store(Store_id)
disable novalidate
rely
Base_Unit_Qty
Number(80)
Net_Invoice_value
Number(80)
Gross_Invoice_Valu
e
Number(80)
Net_Invoice_USD
Number(80)
Gross_Invoice_USD
Number(80)
TABLE 1: Data model for SALES

TABLE: SALES_AGG
Partition by Range (day_id) subpartition by List (fact_type)
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
References
Fact_type
Varchar2(10)
Transaction_date
Date
Cal_master(date)
disable novalidate rely
Day_Id
Number
Cal_master(day_id)
Load_date
Date
Cal_master(date)
Billing_Type
Varchar2(10)
Cust_Id
Number(80)
Customer (Cust_id)
Prod_Id
Number(80)
Product(Prod_id)
26
Geo_Id
Number(80)
Geography(geo_id)
Store_ID
Number(80)
Store(Store_id) disable
novalidate rely
Base_Unit_Qty
Number(80)
Net_Invoice_value
Number(80)
Gross_Invoice_Valu
e
Number(80)
Net_Invoice_USD
Number(80)
Gross_Invoice_USD
Number(80)
TABLE 2: Data model for SALES_AGG

TABLE: PRODUCT
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
Prod_Id
Number(8)
Prod_Name
Varchar2(10)
Prod_desc
Varchar2(100)
Prod_1_Id
Number(8)
Prod_1_Name
Varchar2(10)
Prod_1_desc
Varchar2(100)
Prod_2_Id
Number(8)
Prod_2_Name
Varchar2(10)
Prod_2_desc
Varchar2(100)
Prod_3_Id
Number(8)
Prod_3_Name
Varchar2(10)
Prod_3_desc
Varchar2(100)
Prod_4_Id
Number(8)
Prod_4_Name
Varchar2(10)
Prod_4_desc
Varchar2(100)
Prod_level
Number
Brand_desc
Varchar2(100)
Prod_size_desc
Varchar2(100)
Last_updt_date
Date
TABLE 3: Data model for PRODUCT

TABLE: CUSTOMER
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
27
Cust_Id
Number(8)
Cust_Name
Varchar2(10)
Cust_desc
Varchar2(100)
Cust_1_Id
Number(8)
Cust_1_Name
Varchar2(10)
Cust_1_desc
Varchar2(100)
Cust_2_Id
Number(8)
Cust_2_Name
Varchar2(10)
Cust_2_desc
Varchar2(100)
Cust_3_Id
Number(8)
Cust_3_Name
Varchar2(10)
Cust_3_desc
Varchar2(100)
Cust_4_Id
Number(8)
Cust_4_Name
Varchar2(10)
Cust_4_desc
Varchar2(100)
Cust_level
Number
Cust_grade
Varchar2(100)
Cust_peer_rank
Varchar2(100)
trade_chanel_curr_id
Number
trade_chanel_hist_id
Number
Last_updt_date
Date
TABLE 4: Data model for CUSTOMER

TABLE: STORE
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
Store_Id
Number(8)
Store_Name
Varchar2(10)
Store_desc
Varchar2(100)
Store_Location
Number(8)
Store_Manager_Id
Number(8)
Store_Manager_Name
Varchar2(10)
TABLE 5: Data model for STORE

TABLE: GEOGRAPHY
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
Geo_Id
Number(8)
28
Geo_Name
Varchar2(10)
Geo_desc
Varchar2(100)
Geo_1_Id
Number(8)
Geo_1_Name
Varchar2(10)
Geo_1_desc
Varchar2(100)
Geo_2_Id
Number(8)
Geo_2_Name
Varchar2(10)
Geo_2_desc
Varchar2(100)
Geo_3_Id
Number(8)
Geo_3_Name
Varchar2(10)
Geo_3_desc
Varchar2(100)
Geo_level
Number
Last_updt_date
Date
TABLE 6: Data model for GEOGRAPHY

TABLE: CAL_MASTER
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
Day_Id
Number(8)
Date
Date
Day_Name
Varchar2(10)
Week_Id
Number(1)
Mth_Id
Number(6)
Mth_name
Varchar2(10)
Qtr_Id
Varchar2(10)
Qtr_Name
Varchar2(10)
Year_Id
Number(4)
Year_name
Varchar2(10)
Holiday_Indicator
Varchar2(1)
TABLE 7: Data model for CAL_MASTER
29
4.4 Creating a snowflake schema
Figure 7: SNOWFLAKE schema for SALES data mart
30
Figure 8: Normalized Product Dimension
Figure 9: Normalized Customer Dimension
31
Figure 10: Normalized Geography Dimension
4.5 Data Model for the Snowflake Schema

Fact tables SALES , aggregated Fact Table SALES_AGG and dimensions Cal_Master and
Store are same as that for the star schema. The dimension tables Geography, Customer
and Product have been normalized.
TABLE: PROD
COLUMN NAME
PRIMARY
KEY
NULL
DATA TYPE
REFERENCES
Prod_Id
Number(8)
Prod_Name
Varchar2(10)
Prod_desc
Varchar2(100)
Prod_1_Id
Number(8)
Prod_1(Prod_1_id)
disable novalidate
rely
Prod_2_Id
Number(8)
Prod_2(Prod_2_id)
disable novalidate
rely
Prod_level
Number
Brand_desc
Varchar2(100)
Prod_size_desc
Varchar2(100)
32
Last_updt_date
Date
TABLE 8: Data model for PROD

TABLE: PROD_1
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
Prod_1_Id
Number(8)
Prod_1_Name
Varchar2(10)
Prod_1_desc
Varchar2(100)
TABLE 9: Data model for PROD_1

TABLE: PROD_2
COLUMN NAME
PRIMARY
KEY
NULL
DATA TYPE
Prod_2_Id
Number(8)
Prod_2_Name
Varchar2(10)
Prod_2_desc
Varchar2(100)
Prod_3_Id
Number(8)
REFERENCES
Prod_3(Prod_3_id)
disable novalidate
rely
TABLE: PROD_3
COLUMN NAME
PRIMARY
KEY
NULL
DATA TYPE
Prod_3_Id
Number(8)
Prod_3_Name
Varchar2(10)
Prod_3_desc
Varchar2(100)
Prod_4_Id
Number(8)
REFERENCES
Prod_4(Prod_4_id)
disable novalidate
rely

TABLE: PROD_4
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
Prod_4_Id
Number(8)
Varchar2(10)
Prod_4_Name
33
Prod_4_desc
Varchar2(100)

TABLE: Cust
COLUMN NAME
PRIMARY
KEY
NULL
DATA TYPE
REFERENCES
Cust_Id
Number(8)
Cust_Name
Varchar2(10)
Cust_desc
Varchar2(100)
Cust_1_Id
Number(8)
Cust_1(cust_1_id)
disable novalidate
rely
Cust_2_Id
Number(8)
Cust_2(cust_2_id)
disable novalidate
rely
Cust_level
Number
Cust_grade
Varchar2(100)
Cust_peer_rank
Varchar2(100)
trade_chanel_curr_i
d
Number
trade_chanel_hist_i
d
Number
Last_updt_date
Date
TABLE 13: Data model for Cust

TABLE: Cust_1
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
Cust_1_Id
Number(8)
Cust_1_Name
Varchar2(10)
Cust_1_desc
Varchar2(100)
TABLE 14: Data model for Cust_1

TABLE: Cust_2
COLUMN NAME
PRIMARY
KEY
NULL
DATA TYPE
Cust_2_Id
Number(8)
Cust_2_Name
Varchar2(10)
Cust_2_desc
Varchar2(100)
REFERENCES
34
Cust_3_Id
Number(8)
Cust_3(cust_3_id)
disable novalidate
rely

TABLE: Cust_3
COLUMN NAME
PRIMARY
KEY
NULL
DATA TYPE
Cust_3_Id
Number(8)
Cust_3_Name
Varchar2(10)
Cust_3_desc
Varchar2(100)
Cust_4_Id
Number(8)
REFERENCES
Cust_4(cust_4_id)
disable novalidate
rely

TABLE: Cust_4
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
Cust_4_Id
Number(8)
Cust_4_Name
Varchar2(10)
Cust_4_desc
Varchar2(100)

TABLE: Geo
COLUMN NAME
PRIMARY
KEY
NULL
DATA TYPE
Geo_Id
Number(8)
Geo_Name
Varchar2(10)
Geo_desc
Varchar2(100)
Geo_1_Id
Number(8)
Geo_level
Number
Last_updt_date
Date
REFERENCES
geo_1(geo_1_id)
disable novalidate
rely
TABLE 18: Data model for Geo

TABLE: Geo_1
COLUMN NAME
PRIMARY
KEY
NULL
DATA TYPE
Geo_1_Id
Number(8)
REFERENCES
35
Geo_1_Name
Varchar2(10)
Geo_1_desc
Varchar2(100)
Geo_2_Id
Number(8)
geo_2(geo_2_id)
disable novalidate
rely
TABLE 19: Data model for Geo_1

TABLE: Geo_2
COLUMN NAME
PRIMARY
KEY
NULL
DATA TYPE
Geo_2_Id
Number(8)
Geo_2_Name
Varchar2(10)
Geo_2_desc
Varchar2(100)
Geo_3_Id
Number(8)
REFERENCES
geo_3(geo_3_id)
disable novalidate
rely

TABLE: Geo_3
COLUMN NAME
PRIMARY KEY
NULL
DATA TYPE
Geo_3_Id
Number(8)
Geo_3_Name
Varchar2(10)
Geo_3_desc
Varchar2(100)
4.6 Creating Tables

After the data model for both Star and Snowflake schema has been created we need to
create these tables in the Database.
Please see appendix 2. CREATE TABLES for the DDLs of all the tables along with their
referential constraints and partition and subpartition creation.
4.7 Data for the tables

After creating the tables we need to create the dummy data for tables of both Star and
Snowflake Schema for our analysis.
Please see appendix 1. DATA FOR TABLES showing screen shots of the data to be
36
loaded into the tables. These screenshots provide an overview of the data. To view full
data to be loaded into the tables please see attached Data For Tables.zip
Data for tables.zip
4.8 Populating the data into warehouse tables using SQL Loader
To load the data created for the tables as given in 4.7 we will use SQL* Loader utility.
SQL LOADER is an Oracle utility used to load data into table given a datafile which has
the records that need to be loaded. SQL*Loader takes data file, as well as a control file,
to insert data into the table. The SQL*Loader control file contains information that
describes how the data will be loaded. It contains the table name, column datatypes,
field delimiters, etc.
To avoid the errors its better to generate control file dynamically. The SQL code for
generating the control file dynamically is given in appendix 3 : GENERATE CONTROL
FILES DYNAMICALLY FOR SQL*LOADER
Also given are the sample control files , the commands to run SQL Loader and the data
files that needs to be loaded into the sales data mart in appendix 4 : LOADING DATA
USING SQL* LOADER. To view all the control files and data files please see attached
Data loading using SQL Loader.zip
D a t a L o a d in g u s in g S Q L L o a d e r . z ip
4.9 Performance evaluation of Star and Snowflake Schema

The performance of both the star and snowflake schema is analyzed by using the time
measurements by increasing the number of rows in the centralized fact tables. The
timing measurements can be done by using set timing on command that is a system
37
call that note the time elapsed by a query.

Consider the following Business requirements for which we need to fire SQL queries on
the Data Mart.
1. To find count of all the products that were sold in the 'APAC' geography till date.
STAR: select sum(base_unit_qty) from sales, geography where

sales.geo_id=geography.geo_id and geo_3_name='APAC';
SNOWFLAKE: select sum(base_unit_qty) from sales s,geo g , geo_1 g1 ,

geo_2 g2 , geo_3 g3 where s.geo_id=g.geo_id and g.geo_1_id=g1.geo_1_id
and g1.geo_2_id=g2.geo_2_id and g2.geo_3_id=g3.geo_3_id and
g3.geo_3_name='APAC';
2. To count total net invoice value of 'Hair Care' products in 'India'.
STAR: select sum(net_invoice_value) from sales s , product p , customer c

where s.prod_id=p.prod_id and c.cust_id=s.cust_id and
c.cust_3_name='IN' and p.prod_3_name='Hair Care';
SNOWFLAKE: select sum(net_invoice_value) from sales_snow s , prod p ,

cust c , prod_2 p2, prod_3 p3, cust_2 c2 , cust_3 c3
where s.prod_id=p.prod_id and c.cust_id=s.cust_id
and p.prod_2_id=p2.prod_2_id and p2.prod_3_id=p3.prod_3_id
and c.cust_2_id=c2.cust_2_id and c2.cust_3_id=c3.cust_3_id
and p3.prod_3_name='Hair Care'and c3.cust_3_name='India' ;
3. To select names of all the shampoos that were sold till date.
STAR: select distinct prod_name from sales s , product p where s.prod_id =

p.prod_id and p.prod_1_name ='Shampoo';
SNOWFLAKE: select distinct prod_name from sales s , prod p , prod_1 p1

where s.prod_id = p.prod_id and p1.prod_1_id=p.prod_1_id and
p1.prod_1_name='Shampoo';
4. To count the products in each region that were bought by the customers belonging
38
to the same region.
STAR:select sum(base_unit_qty) ,
store_location from sales s , geography g , customer c, store st
where s.cust_id = c.cust_id
and s.geo_id = g.geo_id and
s.store_id = st.store_id and
upper(st.store_location) = upper(c.cust_3_desc) and
upper(st.store_location) = upper(g.geo_2_name) group by store_location;
SNOWFLAKE: select sum(base_unit_qty) ,

store_location from sales_snow s , geo g , cust c, store st , cust_2 c2 , cust_3
c3 , geo_1 g1 , geo_2 g2 where s.cust_id = c.cust_id
and s.geo_id = g.geo_id and s.store_id = st.store_id and
c.cust_2_id = c2.cust_2_id and c2.cust_3_id = c3.cust_3_id and
g.geo_1_id = g1.geo_1_id and g1.geo_2_id = g2.geo_2_id and
upper(st.store_location) = upper(c3.cust_3_name) and
upper(st.store_location) = upper(g2.geo_2_name) group by store_location;
5. To select toothpaste products that were regular shipments and were sold by
manual billing in year 2012.
STAR: select * from sales s , product p where billing_type='MN' and

to_char(load_date,'YYYY') ='2012' and fact_type='BR' and
s.prod_id = p.prod_id and p.prod_1_name='Toothpaste';
SNOWFLAKE: select * from sales_snow s , prod p , prod_1 p1 where

billing_type='MN' and to_char(load_date,'YYYY') ='2012' and
fact_type='BR' and s.prod_id = p.prod_id and p1.prod_1_id = p.prod_1_id
and p1.prod_1_name='Toothpaste';
4.10 Timing Measurements
39
The whole experiment is done with different number of tuples with different type of
queries and we are showing a query result to prove our considerations. We start with
approximately 2 lakh records and keep on doubling the number of records used for
analysis till 6 million records.
Given below are the screenshots for the results of queries for 2 lakh records. For other
cases the logs have been given in the appendix 5: LOGS FOR SCHEMA PERFORMANCE
EVALUATION.
The result of the analysis are shown in Table 22
Query1: Count of products sold in APAC
40
Query2: Sales Of Hair Care Products in India
Query3: Shampoos that were sold till date
41
Query4: Products in each region that were bought by the customers belonging to the
42
same region
Query5: To select toothpaste products that were regular shipments and were
sold by manual billing in year 2012.
43
For other cases the logs have been given in the appendix 5: LOGS FOR SCHEMA
PERFORMANCE EVALUATION
The timings for the queries are as given in table 22. Also a graph representing the
performance comparision is given in figure 11
Query #
No. of records
Star Schema (In

milliseconds)
Snowflake Schema (In

milliseconds)
Q1
2,14,248
39
98
Q2
2,14,248
32
66
Q3
2,14,248
61
132
Q4
2,14,248
71
131
Q5
2,14,248
12
25
Q1
4,28,496
10
12
Q2
4,28,496
10
Q3
4,28,496
12
15
Q4
4,28,496
10
32
Q5
Q1
Q2
Q3
Q4
Q5
Q1
4,28,496
8,56,992
8,56,992
8,56,992
8,56,992
8,56,992
34,27,968
1
20
20
15
21
1
932
2
64
28
17
59
1
628
Q2
34,27,968
321
631
Q3
34,27,968
168
240
Q4
34,27,968
246
269
Q5
Q1
Q2
Q3
Q4
Q5
34,27,968
68,55,936
68,55,936
68,55,936
68,55,936
68,55,936
3
1208
334
331
306
3
3
1308
482
300
494
4
Table 22 : Result of performance of designing schemas.
44
1400
1200
STAR
SNOWFLAKE
TIMINGS (MILLISECONDS)
1000
800
600
400
200
0
2 LAKH
2 LAKH
2 LAKH
2 LAKH
2 LAKH
4 LAKH
4 LAKH
4 LAKH
4 LAKH
4 LAKH
8 LAKH
8 LAKH
8 LAKH
8 LAKH
8 LAKH
3 MILLION
3 MILLION
3 MILLION
3 MILLION
3 MILLION
6 MILLION
6 MILLION
6 MILLION
6 MILLION
6 MILLION
NUMBER OF RECORDS
Figure 11: Schema Performance Comparision
As we have seen that star schema performs better than snowflake schema in most
of the cases. In general, the snowflake schema takes more time than star schema
because it takes more number of joins of the dimension tables to get the same
information.
45
4.11 STAR OR SNOWFLAKE

Figure 11 shows the star and snowflake representation for one set of data. This data
does not have any repetitions hence representing it in snowflake requires additional
space for storing 6 keys. Hence snowflake representation for this kind of data requires
more space and queries take longer time for computing the joins of various tables
whereas in star schema only browsing through whole table is needed.
Figure 12: Representation of typical non repeating data in star and snowflake schema.
Figure-12 shows the the star and snowflake schema for another set of data. This set of
data has repetitions therefore snowflake representation is more compact as it avoid
duplicating the long character attributes in all the records. In this case snowflake will
perform better than the star schema.
Figure 13: Representation of typical repeating data in star and snowflake schema.
46
Chapter 5
PERFORMANCE EVALUATION AND OPTIMIZATION OF QUERIES

The data warehouses typically maintain a lot of information which grow over time. Due to
the large volume of data, the queries fired on the database may take hours before
returning the results. Since there is a need for the right information at right time to make
right decisions, any delay in getting the information may impact the business. Thus,
there is a need to
find the ways to optimize the queries in such a way so as to
minimizes the time to provide information.
5.1 Understanding Oracle Optimizer and Explain plan

Query Optimization is the process of choosing the most efficient way to execute a Query
so that it executes with high response time and throughput. Oracle database has a built
in optimizer that decides the best possible execution plan for each query. This Optimizer
uses costing methods, cost-based optimizer (CBO), or internal rules, rule-based
optimizer (RBO), to determine the most efficient way of producing the result of the
query. Generally the cost based optimizer (CBO) is used for making these decisions
based on the statistical information it has about the data. It is also possible that the
optimizer will generate sub-optimal plans for some queries . In those cases the first step
in diagnosing why the Optimizer picked the sub-optimal plan is to visually inspect the
execution plans.
Main components of a CBO are:
Query Transformer : The main aim of query transformer is to determine if it is

advantageous to change the form of the query so that it enables generation of a
better query plan.
Estimator : The main aim of estimator is to determine the overall cost of the plan
using the statistics if available so as to improve the degree of accuracy.
Plan Generator : The main function of the plan generator is to try out different
47
possible plans for a given query and pick the one that has the lowest cost.
5.1.1 The Execution Plan

An execution plan shows the detailed steps necessary to execute a SQL statement. These
steps are expressed as a set of database operators that consume and produce rows. The
order of the operators and their implementations is decided by the query optimizer using
a combination of query transformations and physical optimization techniques. To
generate explain plan in Oracle without executing the query we use the following
command set autotrace traceonly explain or we may use DBMS_XPLAN.DISPLAY
('PLAN_TABLE', Statement ID , Format ) function.
Given below is the explain plan generated using both the options.
48
49
The main components of a explain plan are:

Cardinality It signifies the estimate of the number of rows for each of the operation.
Access method It signifies the way in which the data is being accessed
Join method It signifies the method used to join tables with each other.
Join type It signifies the type of join between the tables i.e. inner , outer or semi join
etc.
Join order It signifies the order in which the tables are joined to each other.
Partition pruning It signifies if only the necessary partitions being accessed or not.
5.1.2 Cardinality
The optimizer determines the cardinality for an operation by using the table and column
level statistics. e.g. Consider there are 100 rows in the sales table and there are 5
different products , then for a query selecting data only from sales with equality
predicate on product will return cardinality as 20 i.e. total number of rows in the table by
the number of distinct values in the column
50
However , this is not the accurate estimate i.e. we may get incorrect cardinality
estimates even when the basic table and column statistics are up to date. This may be
because of data skew i.e. there may not be even distribution of the data values in a
column.
TECHNIQUE #1 Create a histogram manually for the best cardinality estimates
SQL > Exec DBMS_STATS.GATHER_TABLE_STATS(SCHEMA,TABLE_NAME, method_opt=>'FOR
COLUMNS SIZE 254 COLUMN_NAME);
The presence of a histogram changes the formula used by the Optimizer to determine the
cardinality estimate.
TECHNIQUE #2 Use the GATHER_PLAN_STATISTICS hint for the best cardinality

estimates
51
5.1.3 ACCESS METHOD

There are various ways in which data is accessed from each table
Full table scan : Reads all rows from a table and filters out those that do not meet the
where clause predicates. A full table scan will use multi block IO.A full table scan is
selected if a large portion of the rows in the table must be accessed, no indexes exist or
the ones present cant be used or if the cost is the lowest.
Full table scan is faster than the index scan when large number of blocks are to be
retrieved as it makes cheaper I/O calls. Full table scans are cheaper than index range
scans when accessing a large fraction of the blocks in a table. This is because full table
scans can use larger I/O calls, and making fewer large I/O calls is cheaper than making
many smaller calls.
TECHNIQUE #3 Use the hint /* +FULL(table alias) */ when ever more than 30
% of the table data is to be accessed.
Oracle does I/O by blocks. Therefore, the optimizer's decision to use full table scans is
influenced by the percentage of blocks accessed, not rows. This is called the index
52
clustering factor. If blocks contain single rows, then rows accessed and blocks accessed
are the same.
Access by ROWID:
The rowid of a row specifies the datafile and data block containing the row and the
location of the row in that block. Locating a row by specifying its rowid is the fastest way
to retrieve a single row, because the exact location of the row in the database is specified
The optimizer uses rowid after retrieving the rowid from an index. Access by rowid does
not need to follow every index scan. If the index contains all the columns needed for the
statement, then table access by rowid might not occur.
Index Unique scan:
This scan returns, at most, a single rowid. Oracle performs a unique scan if a statement
contains a UNIQUE or a PRIMARY KEY constraint that guarantees that only a single row is
53
accessed.
Index Range scan:

An index range scan is a common operation for accessing selective data. Oracle accesses
adjacent index entries and then uses the ROWID values in the index to retrieve the
corresponding rows from the table. The optimizer uses a range scan when it finds one or
more leading columns of an index specified in conditions, such as col1 = :b1 ;col1 < :b1
or col1 > :b1
54
TECHNIQUE #4 Use the hint /* +INDEX(alias index_name) */ to specify the

index to use rather than full table scan
Suppose that the column prod_id in table SALES has a skewed distribution. The column
has histograms, so the optimizer knows about the distribution. However, with a bind
variable, the optimizer does not know the value and could choose a full table scan. In
this case you should use hint /* + INDEX (prod_id_idx) */
5.1.4 JOIN METHOD
55
The join method describes how data from two data producing operators will be joined
together. e.g. The table product and sales are joined using HASH join in the example
given below.
Hash Joins
Build a hash table (buckets) based on the join keys of the outer table.
Calculate hash value of the join keys of the inner table.
Join the rows of the two tables through the value of the hash buckets.
When to use:
Full table scan is preferred on both inner table and outer table for less logical reads. And
The number of rows from the outer table is not huge.
TECHNIQUE #5 Use hash join when one of the sources to be joined is small and
56
other is large and there is a lot of duplicate data in the join key column.
Nested Loops joins
Start from the rows in the outer table.
Search for matching rows in the inner table through an index.

When to use:
The inner table is fairly small (less than 1M rows). Or

If the inner table is big, the rows pulled from it will be much less.
The rows in outer table, the index, and the inner table are synchronized with low
clustering factor.
TECHNIQUE #6 Use nested loop joins when both the sources are small , there is
57
a good driving - driven relationship between the outer and inner table and there
is an efficient way of accessing the second table (for example an index look up).
Sort Merge joins
Sort the rows by the join keys of the outer table.
Sort the rows by the join keys of the inner table.
Join the rows of the two tables through the sorts.

When to use:
It needs to sort both tables. So, hash join is more preferable. This method is more used
when at least one of the joined parties is an inline view with sort operation.
So, this
method is rarely used.

TECHNIQUE #7 Use sort merge join when any of the tables to be joined are
already sorted OR if there is an index on one of the tables that will eliminate
one of the sorts.
Cartesian join
The optimizer joins every row from one data source with every row from the other data
source, creating a Cartesian product of the two sets. Typically this is only chosen if the
tables involved are small or if one or more of the tables does not have a join conditions
to any other table in the statement. Cartesian joins are not common, so it can be a sign
of problem with the cardinality estimates, if it is selected for any other reason.
58
5.1.5 JOIN ORDER

The join order is the order in which the tables are joined together in a multi-table SQL
statement. To determine the join order in an execution plan look at the indentation of the
tables in the operation column. In Figure below the SALES and CUSTOMER table are
equally indented and both of them are more indented than the GEOGRAPHY table.
Therefore the SALES and CUSTOMER table will be joined first using a hash join and the
result of that join will then be joined to the GEOGRAPHY table.
The join order is determined based on cost, which is strongly influenced by the
cardinality estimates and the access paths available. The Optimizer will also always
adhere to some basic rules:
The Optimizer can determine the joins that result in at most one row based on UNIQUE
and PRIMARY KEY constraints on the tables and calculates it first.
When outer joins are used the table without the outer join operator must come after
the table with the outer join operator to ensure all of the additional rows that dont
satisfy the join condition can be added to the result set correctly.
When a subquery has been converted into an antijoin or semijoin, the tables from the
subquery must come after those tables in the outer query block to which they were
59
connected or correlated.
If view merging is not possible all tables in the view will be joined before joining to the
tables outside the view.
TECHNIQUE #8 If using nested-loop join, always start from the larger table with
full table scan, and join to the smaller table with index, unless the smaller table
is very small, or there will be very few rows in the larger table needed.
TECHNIQUE #9 When using hash join with parallel, use pq_distribute hint to
make sure the row distribution among the parallel query servers is right.
5.1.6 Partition Pruning
Oracle partitioning is a method of breaking up a very large table and its associated
indexes into smaller pieces. The primary purpose of partitioning is faster query access to
improve performance . This is accomplished via partition pruning (elimination) . Partition
pruning is visible in an execution plan in the PSTART and PSTOP columns. The PSTART
column contains the number of the first partition that will be accessed and PSTOP column
contains the number of the last partition that will be accessed.
Other benefits of partitioning would be :
The ability of Oracle to use parallelism to resolve queries since multiple partitions
can be queried simultaneously.
Allows easy archiving of data.
Partitioning gives the ability to quickly remove large amounts of data without
fragmenting a table
For Example:
Consider the Sales table that contains records of all orders for the last 3 years and the
table has been partitioned on monthly basis , a
query requesting sales for a quarter
would access only 4 partitions rather than the entire table.
60
There are cases when a word or letters appear in the PSTART and PSTOP columns instead
of a number. For example you may see the word KEY appears in these columns. This
indicates that it was not possible at parse time to identify, which partitions would be
accessed by the query but the Optimizer believes that partition pruning will occur at
execution time (dynamic pruning). This happens when there is an equality predicate on
the partitioning key column that contains a function. For example DAY_ID = SYSDATE.
TECHNIQUE # 10 Partition the tables (especially fact tables) based on the

column that provides a logical partitioning key for the table and is used
61
maximum times in the where clause of the queries. This would help in partition
pruning and thus enhancing the speed of the query.
5.2 Star Transformation (Dimensions and Fact with bitmap indexes)

A star transformation is a methodology used to join tables to greatly improve
performance of queries on dimensional data warehouse schemas. The star transformation
relies upon implicitly rewriting (or transforming) the SQL of the original star query. The
end user never needs to know any of the details about the star transformation. Oracle's
cost-based optimizer automatically chooses the star transformation where appropriate.
Oracle processes a star query using two basic phases:
The first phase retrieves exactly the necessary rows from the fact table (the result
set). Because this retrieval utilizes bitmap indexes, it is very efficient.
The second phase joins this result set to the dimension tables.
Star Transformation with a Bitmap Index

A prerequisite of the star transformation is that there be a single-column bitmap index on
every join column of the fact table. These join columns include all foreign key columns.
For example, the sales table of the star schema should have the bitmap indexes on the
prod_id, cust_id, store_id, geo_id , and day_id columns.
Consider the following star query:
SELECT st.Store_Manager_Name, c.cust_3_name, t.Qtr_Name,
SUM(s.net_invoice_value) sales_amount
FROM sales s, cal_master t, customers c, store st
WHERE s.day_id = t.day_id
AND s.cust_id = c.cust_id
AND s.store_id = ch.store_id
AND c.cust_4_name = 'APAC'
AND st.store_desc in ('Small Sized Store')
AND t.calendar_quarter_desc IN ('1st quarter 12','2nd quarter 12')
GROUP BY st.Store_Manager_Name, c.cust_3_name, t.Qtr_Name;
62
Oracle processes this query in two phases. In the first phase, Oracle uses the bitmap
indexes on the foreign key columns of the fact table to identify and retrieve only the
necessary rows from the fact table. That is, Oracle will retrieve the result set from the
fact table using essentially the following query:
SELECT ... FROM sales
WHERE day_id IN
(SELECT day_id FROM cal_master
WHERE calendar_quarter_desc IN ('1st quarter 12','2nd quarter 12'))
AND cust_id IN
(SELECT cust_id FROM customers WHERE cust_4_name = 'APAC')
AND store_id IN
(SELECT store_id FROM store WHERE store_desc in ('Small Sized Store'));
As given above, the original star query has been transformed into this subquery
representation. This method of accessing the fact table leverages the strengths of
Oracle's bitmap indexes.
In this star query, a bitmap index on day_id is used to identify the set of all rows in the
fact table corresponding to sales in '1st quarter 12'. This set is represented as a bitmap .
A similar bitmap is retrieved for the fact table rows corresponding to the sale from '2st
quarter 12'. The bitmap OR operation is used to combine this set of 1st quarter 12 sales
with the set of 2st quarter 12 sales.
Also set operations will be done for the customer dimension and the store dimension. At
this point in the star query processing, there are three bitmaps. Each bitmap corresponds
to a separate dimension table, and each bitmap represents the set of rows of the fact
table that satisfy that individual dimension's constraints.
These three bitmaps are combined into a single bitmap using the bitmap AND operation
Once the result set is identified, the bitmap is used to access the actual data from the
sales table. Only those rows that are required for the end user's query are retrieved from
the fact table. At this point, Oracle has effectively joined all of the dimension tables to
the fact table using bitmap indexes. This technique provides excellent performance
because Oracle is joining all of the dimension tables to the fact table with one logical join
63
operation, rather than joining each dimension table to the fact table independently.
The second phase of this query is to join these rows from the fact table (the result set) to
the dimension tables. Oracle will use the most efficient method for accessing and joining
the dimension tables.
A hash join is often the most efficient algorithm for joining the dimension tables. The final
answer is returned to the user once all of the dimension tables have been joined.
Execution Plan for a Star Transformation with a Bitmap Index
The following typical execution plan might result from "Star Transformation with a Bitmap
Index:
SELECT STATEMENT
SORT GROUP BY
HASH JOIN
TABLE ACCESS FULL
STORE
HASH JOIN
TABLE ACCESS FULL
CUSTOMER
HASH JOIN
TABLE ACCESS FULL
CAL_MASTER
PARTITION RANGE ITERATOR

TABLE ACCESS BY LOCAL INDEX ROWID
SALES
BITMAP CONVERSION TO ROWIDS

BITMAP AND
BITMAP MERGE
BITMAP KEY ITERATION
BUFFER SORT
TABLE ACCESS FULL
BITMAP INDEX RANGE SCAN
CUSTOMER
SALES_CUST_BIX
BITMAP MERGE
BUFFER SORT
TABLE ACCESS FULL
STORE
SALES_STORE_BIX
BITMAP MERGE
64

BUFFER SORT
TABLE ACCESS FULL
CAL_MASTER
SALES_CAL_BIX
In this plan, the fact table is accessed through a bitmap access path based on a bitmap
AND, of three merged bitmaps. The three bitmaps are generated by the BITMAP MERGE
row source being fed bitmaps from row source trees underneath it. Each such row source
tree consists of a BITMAP KEY ITERATION row source which fetches values from the
subquery row source tree, which in this example is a full table access. For each such
value, the BITMAP KEY ITERATION row source retrieves the bitmap from the bitmap
index. After the relevant fact table rows have been retrieved using this access path, they
are joined with the dimension tables and temporary tables to produce the answer to the
query.
When to use:
Only with Star schema model with bitmap indexes built on the fact table for the
foreign key columns. And
The dimension tables for the predicated column search should not be too big. And
There are not too many rows (less than 500K) which will be pulled in the fact
table.
There are less tables which need to be joined back by the selected rows for other
attributes.
Star transformation is not supported for tables with any of the following characteristics:
Queries with a table hint that is incompatible with a bitmap access path
Queries that contain bind variables
Tables with too few bitmap indexes. There must be a bitmap index on a fact table
column for the optimizer to generate a subquery for it.
Remote fact tables. However, remote dimension tables are allowed in the
subqueries that are generated.
65
Anti-joined tables
Tables that are already used as a dimension table in a subquery
Tables that are really unmerged views, which are not view partitions
The star transformation may not be chosen by the optimizer for the following cases:
Tables that have a good single-table access path
Tables that are too small for the transformation to be worthwhile
In addition, temporary tables will not be used by star transformation under the following
conditions:
The database is in read-only mode
The star query is part of a transaction that is in serializable mode
TECHNIQUE # 11 Use star transformation to enhance the performance of the

star queries fired on the database by creatingllocal
bitmap index on all the
foreign keys in the fact table.
5.3 Understanding Optimizer Hints

Optimizer hints can be used with SQL statements to alter execution plans.For example,
66
you might know that a certain index is more selective for certain queries. Based on this
information, you might be able to choose a more efficient execution plan than the
optimizer. In such a case, use hints to force the optimizer to use the optimal execution
plan.
You can use hints to specify the following:
The optimization approach for a SQL statement
The goal of the cost-based optimizer for a SQL statement
The access path for a table accessed by the statement
The join order for a join statement
A join operation in a join statement
The following syntax shows hints contained in both styles of comments that Oracle
supports within a statement block.
{DELETE|INSERT|SELECT|UPDATE} /*+ hint [text] [hint[text]]... */
where + causes Oracle to interpret the comment as a list of hints

Commonly used Hints are:
use_nl(a,b): Join Table a and Table b with nested loop join.
use_hash(a,b): Join Table a and Table b with hash join.
Ordered : Join the tables in the order as in the from clause.
leading(a) : Join the tables starting from Table a.
parallel(a,4) : Full table scan Table a with 4 parallel query sessions.
parallel_index(a,4) : Access the indexes of Table a with 4 parallel

sessions. This only applies to local indexes, or index fast full scans.
full(a) : Access Table a with full table scan.
index(a,indx) : Access Table a with its index named indx.
star_transformation : If possible, use a star transformation execution plan.
67
fact(a) : If use star transformation, pick Table a as the fact table.
no_index(a,indx) : Do not use index indx of Table a to access the table.
no_merge(vw) : Do not merge the in-line view v into the main query.
cardinality(a,1000) : Tell the optimizer that there will be 1000 rows of Table a
being selected.
index_ffs(a,indx) : Instead of accessing the Table a, fast-full scan its index, indx.
append: Used after insert, force the data to be loaded abvoe HWM (direct load).
no_append: Used after insert, force the data to be loaded to the empty space
below HWM first
and_equal(a,indx1,indx2,indx5): merge those single column indexes of table a
Rewrite :The rewrite hint forces the cost-based optimizer to rewrite a query in
terms of materialized views, when possible, without cost consideration.
Example:
As
given
in
the
figure,
after
specifying
the
hint
/*+
index(SALES_CUST_ID_BX) */ we can inform the optimizer to use the index to improve

the performance.
Similarly as given in figure, we can use hint /*+ USE_NL(s,p) */ to inform optimizer to
use nested-loop join rather than hash join between tables sales and product.
68
69
5.4 Use of Analytical functions

Data warehouse queries revolve around analytical calculations like moving averages ,
ranking , cumulative sums , lag , lead etc. To write efficient queries for such situations
Analytical functions are used. The queries written using analytical functions are faster
than the queries written in native SQL.
The general syntax of Analytical function is
Function ( arg1 , arg2 ,....) over ( [partition by < col 1 ,... > ] [Order by <col1 , ..>]
[Window_Clause] )
The OVER analytic_clause is used to indicate that the function operates on a query result
set.That is, it is computed after the FROM, WHERE, GROUP BY, and HAVING clauses
The PARTITION BY clause is used to partition the query result set into groups based on
one or more value_expr.
The order_by_clause is used to specify how data is ordered within a partition.
Analytical functions enable :
Ranking and percentile : Calculating ranks, percentiles, and n-tiles of the values
in a result set.
Moving window calculations :Calculating cumulative and moving aggregates.

Works with these functions: SUM, AVG, MIN, MAX, COUNT, VARIANCE, STDDEV,
FIRST_VALUE, LAST_VALUE, and new statistical functions. Note that the DISTINCT
keyword is not supported in windowing functions except for MAX and MIN.
Lag/ Lead analysis :Finding a value in a row a specified number of rows from a
current row.
First / Last analysis :First or last value in an ordered group.
Linear regression statistics:Calculating linear regression and other statistics

(slope, intercept, and so on).
Given below are some sample queries that can be rewritten using Analytical functions
rather than normal SQL to improve performance.
70
Example 1:Find Top-3 customers within each geography based on sales

Basic SQLor PLSQL :
SELECT geo_id,cust_id, sales,
RANK(geo_id,sales) rnk
FROM (
SELECT geo_id, cust_id,
SUM(net_invoice_value) sales
FROM sales
GROUP geo_id, cust_id
ORDER BY region, SUM(sales) DESC)
WHERE RANK(geo_id, sales) <= 3;
Analytical Functions:
SELECT * FROM
(SELECT geo_id, cust_id,
SUM(sales) sum_sales,
RANK() OVER
(PARTITION BY geo_id
ORDER BY sum(sales) DESC) rank
FROM sales
GROUP BY geo_id, cust_id )
WHERE rank <= 3;
Technique # 12 Use Analytical Ranking Functions to improve performance.

Example 2: Compute the moving sum of net invoice value of each product within the last
year
Basic SQLor PLSQL :
SELECT s1.prod_id,s1.transaction_date , SUM(s2.net_invoice_value) as m_avg
FROM sales s1, sales s2
WHERE s1.prod_id = s2.prod_id AND
s2.transaction_date <= s1.transaction_date AND
71
s2. transaction_date >= ADD_MONTHS(s1.transaction_date , -12)

GROUP BY s1.prod_id, s1.transaction_date ;
SELECT prod_id,transaction_date
SUM(SUM(sales)) OVER
(PARTITION BY prod_id ORDER BY transaction_date
RANGE INTERVAL 1 YEAR PRECEDING)
m_avg
FROM sales
GROUP BY prod_id, s1.transaction_date
Technique # 13 Use Analytical Window Functions like ROWS / RANGE

INTERVAL to improve performance.
Example 3: Find products and its sales with maximum annual sales for each year
Basic SQLor PLSQL :
CREATE VIEW V1
AS SELECT prod_id, year(transaction_date) yr,
SUM(sales) sum_sales
FROM sales
GROUP BY prod_id, year(transaction_date) ;
CREATE VIEW V2
AS SELECT yr, MAX(sum_sales) max_sales
FROM V1
GROUP BY yr;
SELECT prod_id, year, sum_sales
FROM V1, V2
WHERE V1.year = V2.year
AND V1.sum_sales = V2.max_sales
SELECT *
72
FROM (SELECT year(transaction_date), prod_id,

MAX(SUM(sales)) OVER
(PARTITION BY prod_id)
max_sales
FROM salesTable
GROUP BY year(transaction_date), prod_id)
WHERE sum_sales = max_sales;
Technique # 13 Use Analytical Reporting Functions to improve performance.
Example 4:Give the Top-10 products for states which contribute more than 25% of
regional sales.
Basic SQL or PLSQL :
CREATE VIEW V1
AS SELECT g.geo_3_name, g.geo_1_name, SUM(sales)
sum_sales
FROM sales s, geography g where s.geo_id = g.geo_id
GROUP BY .geo_3_name, g.geo_1_name
CREATE VIEW V2
AS SELECT g.geo_3_name, SUM(sales) sum_sales
FROM sales s, geography g where s.geo_id = g.geo_id
GROUP BY g.geo_3_name;
SELECT g.geo_3_name, g.geo_1_name, prod_id,
RANK(g.geo_3_name, g.geo_1_name, X.s_sales) rank
FROM (
SELECT g.geo_3_name, g.geo_1_name, product,
FROM sales s, geography , V1, V2
WHERE s.geo_id = g.geo_id
s. geo_3_name = V1. geo_3_name
73
AND s. geo_1_name = V1. geo_1_name

AND V1. geo_3_name = V2. geo_3_name
AND V1.sum_sales > 0.25 * V2.sum_sales
GROUP BY geo_3_name,geo_1_name, product
ORDER BY geo_3_name,geo_1_name,SUM(sales) DESC)
X
WHERE
RANK(geo_3_name, geo_1_name, X.sum_sales) <= 10;
SELECT *
FROM (SELECT geo_3_name, geo_1_name, prod_id,
SUM(SUM(sales)) over
(PARTITION BY geo_3_name) a,
SUM(SUM(sales)) OVER
(PARTITION BY geo_3_name, geo_1_name) b,
RANK() OVER
(PARTITION BY geo_3_name, geo_1_name
ORDER BY SUM(sales) DESC) rank
FROM sales
GROUP BY geo_3_name, geo_1_name, prod_id)
WHERE b >= 0.25 * a AND rank <= 10;
Technique # 13 Use Analytical Reporting Functions to improve performance.
Example 5:For each product, compare sales of a year to its previous year
Basic SQL or PLSQL :
CREATE VIEW V1
AS SELECT prod_id, year(transaction_date) yr,
FROM sales
GROUP BY product, year(transaction_date);
74
SELECT prod_id, a.mn, a.sum _sales b.sum _sales dif_sales

FROM V1 a LEFT OUTER JOIN V1 b
ON (a. prod_id = b. prod_id
AND a.yr = b.yr + 1);
SELECT prod_id, year(transaction_date) yr,
SUM(sales) - LAG(SUM(sales), 1) OVER
(PARTITION BY product
ORDER BY year(date)) dif
FROM sales
GROUP BY product, month(date);
Technique # 14 Use Analytical Lag/Lead Functions to improve performance.

There are many other analytical functions that can be used to improve the queries fired
on the database.
5.5 Understanding Query Rewrite

Datawarehouse Aggregates : Materialized Views
Materialized view is one of the techniques employed in data warehouses to improve
performance is the creation of summaries, or aggregates. They are a special kind of
aggregate view which improves query execution times by calculating expensive joins and
aggregation operations prior to execution, and storing the results in a table in the
database. For example, a table may be created which would contain the sum of sales by
region and by product.
Materialized views are typically referred to as summaries since they store summarized
data. So a materialized view is used to eliminate overhead associated with expensive
joins or aggregations for a large or important class of queries. Aggregates or Materialized
views become available automatically when they are ready for use i.e. queries fired on
75
the database will automatically use the aggregates to improve query performance.
For Example , we may create a materialized view to create aggregated sales data for
every month.
Create materilaized view sales_mv
BUILD IMMEDIATE
REFRESH FAST ON DEMAND
enable query rewrite
as
select cal_master.mth_id, sum(sales.net_invoice_value)
from sales, cal_master
where sales.transaction_date = cal_master.date
group by cal_master.mth_id;
A materialized view definition can include any number of aggregates, as well as any
number of joins.The existence of a materialized view is transparent to SQL applications,
so we can create or drop materialized views at any time without affecting the validity of
SQL applications. A materialized view consumes storage space. The contents of the
materialized view must be maintained when the underlying detail tables are modified.
The types of materialized views are:
Materialized Views with Aggregates : Materialized views that preaggregate the data
present in the fact tables.
Materialized Views Containing Only Joins : Some materialized views contain only
joins and no aggregates.The advantage of creating this type of materialized view is
that expensive joins will be precalculated.
Nested Materialized Views :A nested materialized view is a materialized view

whose definition is based on another materialized view. A nested materialized view
can reference
other relations in the
database in addition
to referencing
materialized views.
Build Methods:
BUILD IMMEDIATE : Create the materialized view and then populate it with data.
76
BUILD DEFERRED : Create the materialized view definition but do not populate it
with data.
ENABLE QUERY REWRITE : You also must specify the ENABLE QUERY REWRITE
clause if the materialized view is to be considered available for rewriting queries.
Refresh Mode:
ON COMMIT : Refresh occurs automatically when a transaction that modified one

of the materialized view's detail tables commits.
ON DEMAND : Refresh occurs when a user manually executes one of the available
refresh
procedures
contained
in
the
DBMS_MVIEW
package
(REFRESH,
REFRESH_ALL_MVIEWS, REFRESH_DEPENDENT)
Refresh Options:
COMPLETE : Refreshes by recalculating the materialized view's defining query
FAST:
Applies incremental changes to refresh the materialized view using the
information logged in the materialized view logs. This requires materailized view
logs too be created as given below:
CREATE MATERIALIZED VIEW LOG ON sales

WITH ROWID
(prod_id, cust_id, day_id, net_invoice_value)
INCLUDING NEW VALUES;
FORCE: Applies FAST refresh if possible; otherwise, it applies COMPLETE refresh
NEVER: Indicates that the materialized view will not be refreshed with the Oracle
refresh mechanisms
Query Rewrite
When base fact tables like Sales contain large amount of data, it is an expensive and
time consuming process to compute the joins and to calculate the aggregates requires
for the analysis purpose. In such cases, queries can take minutes or even hours to return
77
the answer which is not at all acceptable in datawarehousing environments. Because

materialized views contain already precomputed aggregates and joins, Oracle
employs an extremely powerful process called query rewrite to quickly answer the
query using materialized views.
One of the major benefits of creating and maintaining materialized views is the ability
to take advantage of query rewrite, which transforms a SQL statement expressed in
terms of tables or views into a statement accessing one or more materialized views
that are defined on the detail tables. The transformation is transparent to the end user
or application, requiring no intervention and no reference to the materialized view in
the SQL statement.
The optimizer use query rewrite only when a certain number of conditions are met:
Query rewrite must be enabled for the session.

A materialized view must be enabled for query rewrite.
The rewrite integrity level should allow the use of the materialized view. For
example, if a materialized view is not fresh then the materialized view is not used.
Either all or part of the results requested by the query must be obtainable from
the precomputed result stored in the materialized view or views.
To determine this, the optimizer may depend on some of the data relationships declared
by the user using constraints and dimensions. Such data relationships include
hierarchies, referential integrity, and uniqueness of key data, and so on.
To enable query rewrite for a session set:
OPTIMIZER_MODE="ALL_ROWS", "FIRST_ROWS", or "CHOOSE"

With tables analyzed, ensures that the cost-based optimizer is used,
which is a requirement to get Query Rewrite.
QUERY_REWRITE_ENABLED = True
Turns on query rewrite.
QUERY_REWRITE_INTEGRITY
enforced
(default
mode)
or
trusted
or
stale_tolerated
78
Advises how fresh a materialized view must be to be eligible for

query rewrite.
To make query rewite work some tips are as follows:
Materialized views must be created with enable query rewrite option.
Dimension tables and Fact table must have Primary and foreign key relationship
with foreign key created in Novalidate Rely mode.
Database dimensions must be created for all the dimensions on which the
aggregation or rollup needs to takes place.
Fact tables and dimension tables should similarly guarantee that each fact table
row joins with one and only one dimension table row.
For each table, create a bitmap index for each key column, and create one local
index that includes all the key columns.
Partition and index the materialized view like the fact tables.
Dimensions
Dimensions are another method by which we can give even more information to Oracle.
Consider the example of sales table.In sales there is a load date and a customerid. The
load date points to another table cal_master that gives full details of what month the
load date was in, what quarter and what fiscal year the load date is in.
Now, suppose we create a materialized view that stores aggregated sales information at
the quarterly level. We know that load date implies month, month implies quarter and
quarter implies year but Oracle doesn't know this. Using a database object called a
dimension, we can alert Oracle to these facts and it will use them to rewrite queries.
A dimension declares a parent/child relationship between columns of the table. We can
use it to inform Oracle that within a row of a table - the mth_id column implies the value
you'll find in the qtr_id column, the qtr_id column implies the value you'll find in the
year_id column and so on.
Given below is the database dimension for Time dimension:
79
create dimension cal_db_dim

level day_id is (cal_master.day_id)
level week_id is (cal_master.week_id)
level mth_id is (cal_master.mth_id)
level qtr_id is (cal_master.qtr_id)
level year_id is (cal_master.year_id)
hierarchy time_rollup( day_id child of week_id child of th_id child of qtr_id child of
year_id ) ;
Example1
We have our base fact table as Sales table. The sales table contains data based on the
transaction date. First gather stats for the table sales and cal_master so that Oracle has
estimate of data present in the tables.
exec dbms_stats.gather_table_stats( user, 'SALES', cascade=>true );
exec dbms_stats.gather_table_stats( user, 'CAL_MASTER', cascade=>true );
Consider we have the query that need sales data aggregate on the quarterly basis.
We have create a materialized view sales_c_mth_mv as given below i.e. we already have
data aggregated on monthly level. So we should use this materialized view to group the
sales data on quarterly basis for better performance.
Consider the Query to aggregate the sales data based on month :

select cal_master.mth_id, sum(sales.net_invoice_value)
80

group by cal_master.mth_id;
The expalin plan of the query is as shown below:
As seen from the explain plan the materialized view containing the aggregated data is
accessed rather than base fact 'SALES' table.
Consider the Query to aggregate the sales data based on quarter :

select cal_master.qtr_id, sum(sales.net_invoice_value)
group by cal_master.qtr_id;
The explain plan of the query is as shown below:
81
As seen above rather than using the Materialized view SALES_C_MTH_MV base fact table
'SALES' is being used. This happened because Oracle does not have knowledge of the
fact that Month can be grouped to Quarter. To provide oracle this information we create
database dimensions.
Now we create a time hierechy i.e. CAL_DB_DIM database dimension on the cal_master
table as shown below:
82
We again rereun the same query to get data grouped on quartely basis. The explain plan
as given below shows that after creation of dimension SALES_C_MTH_MV is being used
in place of SALES base fact table.
select cal_master.qtr_id, sum(sales.net_invoice_value)
group by cal_master.qtr_id
Also we noticed that logical reads reduced from 650250 to 12 . This is a very high
performance gain.
Example 2
Suppose we want to get aggregated sales by year , customer location and product sub
category. This type of aggregation is very commonly used in the datawarehouses.
83
Consider we have a materialized view sales_c2_p2_mth_mv containing aggregated sales

by month , customer department and product brand.
Now if we run a query to aggregate sales by month , customer department and product
brand , query rewite will work and materialized view SALES_C2_P2_MTH_MV will be used
, but as given in the example above if we try to aggregate by year , customer location
and product sub category , query rewrite will not work until dimensions have been
created.
This is illustrated below:
84
85
Create dimensions for Product and Customer :

CUST_DB_DIM
CREATE DIMENSION CUST_DB_DIM
LEVEL Cust_Id is (customer.cust_id)
LEVEL Cust_2_Id is (customer.cust_2_id)
HIERARCHY geog_rollup (
cust_id child of
cust_2_id child of
cust_3_id child of
cust_4_id
attribute Cust_Id determines( cust_name , cust_desc)

attribute Cust_2_Id determines( cust_2_name , cust_2_desc)
attribute Cust_4_Id determines( cust_4_name , cust_4_desc);
PROD_DB_DIM
CREATE DIMENSION PROD_DB_DIM
LEVEL Prod_Id is (Product.Prod_id)
LEVEL Prod_2_Id is (Product.Prod_2_id)
HIERARCHY prod_rollup (
Prod_id child of
Prod_2_id child of
Prod_3_id child of
Prod_4_id
attribute Prod_Id determines( Prod_name , Prod_desc)

attribute Prod_2_Id determines( Prod_2_name , Prod_2_desc)
86
attribute Prod_4_Id determines( Prod_4_name , Prod_4_desc);
As Shown above the performance using Materialized views and dimensions have
increased many folds i.e. we have reduced the logical reads from 854215 to 14.
Scenario for Sales Data Mart
Consider a scenario in sales data mart where most of the queries are fired to select
aggregated sales by customer location , product brand or customer country and product
category etc. Thus in this kind of scenario we can pre create some materialized views
containg the aggregated data that select the data from the other materailized views for
our sales datamart e.g. In Figure 14 The materialized view SALES_C2_G2_MV contains
data aggregated on cust_2_id and geo_2_id. The materialized view SALES_C3_G3_MV
selects data directly from
materialized view SALES_C2_G2_MV as it already has
aggregated data on a lower level.
87
Figure 14: Aggregation setup for SALES.

We can also create mini dimensions that are smaller versions of the large dimension
tables so as to reduce the time required to create the materialized views.
PROD_3_MDIM
insert into PROD_3_DIM select Prod_Id,Prod_name , prod_desc , Prod_1_id,
Prod_1_name, prod_1_desc, Prod_2_id, Prod_2_name , prod_2_desc ,Prod_3_id
Prod_3_name , prod_3_desc from product;
PROD_2_MDIM
insert into PROD_2_DIM select Prod_Id,Prod_name , prod_desc , Prod_1_id,
Prod_1_name, prod_1_desc, Prod_2_id, Prod_2_name , prod_2_desc
from
PROD_3_MDIM;
CUST_3_MDIM
insert into CUST_3_DIM select cust_Id,cust_name , cust_desc ,
cust_1_id,
cust_1_name, cust_1_desc, cust_2_id, cust_2_name , cust_2_desc , cust_3_id,
cust_3_name , cust_3_desc from customer;
CUST_2_MDIM
insert into CUST_2_DIM select
cust_Id,cust_name , cust_desc ,
cust_1_id,
cust_1_name, cust_1_desc, cust_2_id, cust_2_name , cust_2_desc from
CUST_3_MDIM;
GEO_2_MDIM
88
insert into Geo_2_DIM select Geo_Id,Geo_name , Geo_desc ,

Geo_1_id,
Geo_1_name, Geo_1_desc, Geo_2_id, Geo_2_name , Geo_2_desc from organization;
Refresh queries for materialized views and dimensions

SALES_P3_MV
create or replace materialized view SALES_P3_MV
on prebuilt table
select
Fact_type, Transaction_date, Day_Id, Load_date,
Billing_Type, Cust_Id, Prod_3_Id, Geo_Id, Trade_channel_id,
sum(Base_Unit_Qty) as Base_Unit_Qty,
sum(Net_Invoice_value) as Net_Invoice_value,
sum(Gross_Invoice_Value) as Gross_Invoice_Value,
sum(Net_Invoice_USD) as Net_Invoice_USD,
sum(Gross_Invoice_USD) as Gross_Invoice_USD
from
sales_agg, PROD_3_MDIM
where
sales_agg.prod_id = PROD_3_MDIM.prod_id
group by
Fact_type, Transaction_date, Day_Id, Load_date,
Billing_Type, Cust_Id, Prod_3_Id, Geo_Id, Trade_channel_id
SALES_P2_C2_MV
create or replace materialized view SALES_P2_C2_MV
on prebuilt table
select
Fact_type, Transaction_date, Day_Id, Load_date,Billing_Type,
Cust_2_Id, Prod_2_Id, Geo_Id, Trade_channel_id,
89
from
SALES_P3_MV, CUST_2_MDIM , PROD_3_MDIM
where
SALES_P3_MV.prod_3_id = PROD_3_MDIM.prod_3_id
and SALES_P3_MV.cust_id = CUST_2_MDIM.cust_id
group by
Fact_type, Transaction_date,Day_Id, Load_date, Billing_Type,
Cust_2_Id, Prod_2_Id, Geo_Id, Trade_channel_id
SALES_C3_G2_MV
create or replace materialized view SALES_C3_G2_MV
on prebuilt table
select
Cust_3_Id, Prod_Id, Geo_2_ID, Trade_channel_id,
from sales_agg, CUST_3_MDIM, ORG_2_MDIM
where sales_agg.cust_id = CUST_3_MDIM.cust_id
and sales_agg.geo_id=GEO_2_MDIM.geo_id
group by
Cust_3_Id, Prod_Id, Geo_2_ID, Trade_channel_id
SALES_C2_G2_MV
create or replace materialized view SALES_C2_G2_MV
on prebuilt table
select
Fact_type,Transaction_date, Day_Id, Load_date, Billing_Type,
Cust_2_Id, Prod_Id, Geo_2_ID, Trade_channel_id,
90
from SALES_C3_O2_MV, CUST_3_MDIM
where SALES_C3_O2_MV.cust_3_id = CUST_3_MDIM.cust_3_id
group by
Fact_type,Transaction_date, Day_Id, Load_date, Billing_Type,
Cust_2_Id, Prod_Id, Geo_2_ID, Trade_channel_id
GEO_DB_DIM
CREATE DIMENSION GEO_DB_DIM
LEVEL
Geo_Id is (Geography.Geo_id)
Geo_2_Id is (Geography.Geo_1_id)
Geo_3_Id is (Geography .Geo_2_id)
Geo_4_Id is (Geography .Geo_3_id)
HIERARCHY geo_rollup (
Geo_id child of
Geo_1_id child of
Geo_2_id child of
Geo_3_id
attribute Geo_Id determines( Geo_name , Geo_desc)

attribute Geo_1_Id determines( Geo_1_name , Geo_1_desc)
CUST_DB_DIM
CREATE DIMENSION CUST_DB_DIM
LEVEL Cust_Id is (customer.cust_id)
HIERARCHY geog_rollup (
91
cust_id child of
cust_2_id child of
cust_3_id child of
cust_4_id
attribute Cust_Id determines( cust_name , cust_desc)

attribute Cust_4_Id determines( cust_4_name , cust_4_desc);
PROD_DB_DIM
CREATE DIMENSION PROD_DB_DIM
LEVEL Prod_Id is (Product.Prod_id)
HIERARCHY prod_rollup (
Prod_id child of
Prod_2_id child of
Prod_3_id child of
Prod_4_id
attribute Prod_Id determines( Prod_name , Prod_desc)

attribute Prod_4_Id determines( Prod_4_name , Prod_4_desc);
Thus using these mini dimensions and Materialized views we can perform aggregation
and set the appropriate settings for the optimizer for query rewrite to work and hence
improve the performance.
Chapter 6
CONCLUSION AND FUTURE WORK
Conclusion
92
In this dissertation we have investigated some of the techniques related to the

performance of designing schemas that are involved in the design of a data warehouse
and comparing the performance of the related schemas (star and snowflake).We have
also looked into the various schema alternatives for datawarehousing and showed that
for most of the cases performance of star schema is good.
Since all the data in the star schema is organized in a single dimension table , it is easy
to browse through the data which might be of interest to many business users.
Snowflake schema is highly normalized and can be used when attributes are very long
and redundancy is very high but at the expenses of poor browsing performance as the
data is distributed across several dimension tables.
Since data warehouses are used for OLAP and OLAP queries heavily use aggregation, it is
important that the response time of the queries should be low for interactive data
analysis applications. Thus we have looked into the techniques used toi optimize the
performance of the queries.
Designing a schema and optimizing the Queries was implemented and tested using
Oracle (SQL).
Future Work
There are several ways in which workpresented in this dissertation can be extended. In
this
dissertationwe have only focussed on the creation of data warehouse and
optimization techniques for the queries fired on the databases.
93
We assumed that data is already in the clean format but operational databases often
have major inconsistencies in data repreenation.
Hence additional work is needed in the Data extraction component that extracts the data
from various legacy sources which is to be stored in the data warehouse.
Data Cleaning component dictates consistency in the format of data which is collected
from various sources so that it can be merged under a unified schema.
These components are integral part of datawarehouse systems hence future work can be
focussed on these parts.
REFERENCES
1. A.Berson and S.JSmith. "Data Warehousing , Data Mining and OLAP". Mc
Graw-Hill, 1997.
2. G.Graefe. "Query evaluation Techniques for Large Databases". ACM Com
putting Surveys", June 1993 (73170).
94
3. W.H. Immon, J.D. Welch and Kathrine L. Glassey. "Managing the

DataWarehouse". John Wileyand Sons, 1996.
4. Stephen Peterson. "Stars: A pttern Language for Query Optimized
Schema". Sequent Computer Systems White paper, 1994.
5. Patil, Preeti S.; Srikantha Rao; Suryakant B. Patil (2011). "Optimization of
Data Warehousing System: Simplification in Reporting and Analysis".
International Journal of Computer Applications (Foundation of Computer
Science) (3337).
6. Kimball, Ralph. "The Data Warehouse Toolkit ". Wiley Publications, 1996
7. Martyn Richard Jones, Akshaya Bhatia, jdemers. "Data Warehouse
Fundamentals". Can be accessed via
http://it.toolbox.com/wiki/index.php/Data_Warehouse_Fundamentals
8. Connie Dialeris Green , "Oracle9i Database Performance Tuning Guide and
Reference Release 2 (9.2) Part Number A96533-02" , Copyright 2000,
2002 Oracle Corporation , October 2002.
9. Immanuel Chan , "Oracle Database Performance Tuning Guide, 10g
Release 2 (10.2) Part Number B14211-03" , Copyright 2000, 2008,
Oracle. , 2008.
10.
"Oracle9i Data Warehousing GuideRelease 2 (9.2) Part Number
A96520-01" , Copyright , Oracle. Can be accessed via link

http://docs.oracle.com/cd/B10501_01/server.920/a96520/concept.htm
#49759
11.
http://www.oracle.com/technetwork/database/focus-areas/bi-
datawarehousing/index.html
12.
http://ezinearticles.com/?Why-Do-We-Need-Data-Warehousing?
&id=2495158
13.
http://www.cs.wmich.edu/~yang/teach/cs595/han/ch2.pdf
APPENDIX
1. SAMPLE DATA FOR TABLES
Given below is the sample of the data in the PRODUCT table
95
96
Given below is the sample of the data in the CUSTOMER table
97
98
Given below is the sample of the data in the GEOGRAPHY table
99
Given below is the sample of the data in the STORE table
100
Given below is the sample of the data in the CAL_MASTER table
101
Given below is the sample of the data in the SALES table
102
2. CREATE TABLES
103
Given below is the DDL for all the tables that have been created for Star and Snowflake
Schema
------------------------ SALES------------------------
create table SALES (

Fact_Id
Number(8) ,
Fact_type
Varchar2(10)
Transaction_date
Date
Day_Id
Number(8)
Load_date
not null,
not null,
Date
not null,
Billing_Type
Varchar2(10)
not null,
Cust_Id
Number(80)
not null,
Prod_Id
Number(80)
not null,
Geo_Id
Number(80)
not null,
Store_ID
Number(80)
Base_Unit_Qty
not null ,
not null,
Number(80),
Net_Invoice_value
Number(80),
Gross_Invoice_Value
Number(80),
Net_Invoice_USD
Number(80),
Gross_Invoice_USD
Number(80)
)
Partition by Range (day_id)
subpartition by List (fact_type)
SUBPARTITION TEMPLATE(
SUBPARTITION BR ,
SUBPARTITION B5 ,
SUBPARTITION BK
PARTITION
(Z_LAST_PARTITION VALUE LESS THAN (MAXVALUE) );
alter table sales add constraint sales_pk primary key (Fact_id);

alter table sales add constraint sales_trdate_fk foreign key (transaction_date) references Cal_master(date) disable
novalidate rely;
alter table sales add constraint sales_dayid_fk foreign key (day_id) references Cal_master(day_id) disable novalidate rely;
alter table sales add constraint sales_loaddate_fk foreign key (load_date) references Cal_master(date) disable novalidate
rely;
alter table sales add constraint sales_custid_fk foreign key (cust_id) references Customer(cust_id) disable novalidate rely;
alter table sales add constraint sales_prod_fk foreign key (prod_id) references Product(prod_id) disable novalidate rely;
alter table sales add constraint sales_geoid_fk foreign key (geo_id) references Geography(geo_id) disable novalidate rely;
alter table sales add constraint sales_storeid_fk foreign key (store_id) references Store(store_id) disable novalidate rely;
104
-------------------- SALES_AGG ----------------------------
create table SALES_AGG (

Fact_type
Varchar2(10)
Transaction_date
Date
not null,
Day_Id
Number(8)
not null,
Load_date
Date
not null,
Billing_Type
Varchar2(10)
not null,
Cust_Id
Number(80)
not null,
Prod_Id
Number(80)
not null,
Geo_Id
Number(80)
not null,
Store_ID
Number(80)
Base_Unit_Qty
not null ,
not null,
Number(80),
Net_Invoice_value
Number(80),
Gross_Invoice_Value
Number(80),
Net_Invoice_USD
Number(80),
Gross_Invoice_USD
Number(80)
)
Partition by Range (day_id)
subpartition by List (fact_type)
SUBPARTITION TEMPLATE(
SUBPARTITION BR ,
SUBPARTITION B5 ,
SUBPARTITION BK
PARTITION
(Z_LAST_PARTITION VALUE LESS THAN (MAXVALUE) );
alter table sales_agg add constraint sales_agg_pk primary key (Fact_type,cust_id,prod_id,geo_id,store_id);

alter table sales_agg add constraint sales_agg_trdate_fk foreign key (transaction_date) references Cal_master(date)
disable novalidate rely;
alter table sales_agg add constraint sales_agg_dayid_fk foreign key (day_id) references Cal_master(day_id) disable
novalidate rely;
alter table sales_agg add constraint sales_agg_loaddate_fk foreign key (load_date) references Cal_master(date) disable
novalidate rely;
alter table sales_agg add constraint sales_agg_custid_fk foreign key (cust_id) references Customer(cust_id) disable
novalidate rely;
alter table sales_agg add constraint sales_agg_prod_fk foreign key (prod_id) references Product(prod_id) disable
novalidate rely;
alter table sales_agg add constraint sales_agg_geoid_fk foreign key (geo_id) references Geography(geo_id) disable
novalidate rely;
alter table sales_agg add constraint sales_agg_storeid_fk foreign key (store_id) references Store(store_id) disable
novalidate rely;
105
---------------------PRODUCT---------------------------
create table PRODUCT

(
Prod_Id
Number(8),
Prod_Name
Varchar2(50)
Prod_desc
Varchar2(100) ,
Prod_1_Id
Number(8)
not null,
Prod_1_Name
Varchar2(50)
not null,
Prod_1_desc
Varchar2(100),
Prod_2_Id
Number(8)
not null,
Prod_2_Name
Varchar2(50)
not null,
Prod_2_desc
Varchar2(100),
Prod_3_Id
Number(8)
not null,
Prod_3_Name
Varchar2(50)
not null,
Prod_3_desc
Varchar2(100),
Prod_4_Id
Number(8)
not null,
Prod_4_Name
Varchar2(50)
not null,
Prod_4_desc
Varchar2(100),
Prod_level
Number
Brand_desc
Varchar2(100),
Prod_size_desc
Varchar2(100),
Last_updt_date
Date
not null,
not null,
not null
alter table product add constraint product_pk primary key (prod_id);
---------------------CUSTOMER---------------------------
create table CUSTOMER

(
cust_Id
Number(8),
cust_Name
Varchar2(50)
not null,
cust_desc
Varchar2(100) ,
cust_1_Id
Number(8)
not null,
cust_1_Name
Varchar2(50)
not null,
cust_1_desc
Varchar2(100),
cust_2_Id
Number(8)
not null,
cust_2_Name
Varchar2(50)
not null,
cust_2_desc
Varchar2(100),
106
cust_3_Id
Number(8)
not null,
cust_3_Name
Varchar2(50)
not null,
cust_3_desc
Varchar2(100),
cust_4_Id
Number(8)
not null,
cust_4_Name
Varchar2(50)
not null,
cust_4_desc
Varchar2(100),
cust_level
Number
Cust_grade
Varchar2(100) not null,
not null,
Cust_peer_rank
Number not null,
Number,
Last_updt_date
Date not null
alter table customer add constraint customer_pk primary key (cust_id);
---------------------STORE--------------------------create table STORE(

Store_Id
Number(8)
not null,
Store_Name
Varchar2(50)
Store_desc
Varchar2(100)
Store_Location
varchar2(10)
Store_Manager_Id
Store_Manager_Name
not null,
,
,
Number(50) not null
Varchar2(10) not null
alter table store add constraint store_pk primary key (store_id);
---------------------GEOGRAPHY---------------------------
create table geography

(
geo_Id
Number(8),
geo_Name
Varchar2(50)
not null,
geo_desc
Varchar2(100) ,
geo_1_Id
Number(8)
not null,
geo_1_Name
Varchar2(50)
not null,
geo_1_desc
Varchar2(100),
geo_2_Id
Number(8)
not null,
geo_2_Name
Varchar2(50)
not null,
107
geo_2_desc
Varchar2(100),
geo_3_Id
Number(8)
not null,
geo_3_Name
Varchar2(50)
not null,
geo_3_desc
Varchar2(100),
geo_level
Number
Last_updt_date
not null,
Date not null
alter table geography add constraint geography_pk primary key (geo_id);

-----------------------CAL_MASTER-------------------------
create table cal_master

(
Day_Id
Number(8) not null,
Date
date
not null,
Day_Name
Varchar2(20)not null,
Week_Id
Number(1)
not null,
Mth_Id
Number(6)
not null,
Mth_name
Qtr_Id
Qtr_Name
Year_Id
Number(4)
not null,
Year_name
Holiday_Indicator
Varchar2(1)
not null
alter table cal_master add constraint cal_master_pk primary key (day_id);
-------------------PROD-----------------------------
create table PROD

(
Prod_Id
Number(8),
Prod_Name
Prod_desc
Varchar2(100) ,
Prod_1_Id
Number(8)
not null,
Prod_2_Id
Number(8)
not null,
Prod_level
Number not null,
Brand_desc
Varchar2(100),
Prod_size_desc
Varchar2(100),
108
Last_updt_date
Date
not null
alter table prod add constraint prod_pk primary key (prod_id);

alter table prod add constraint prod_prod_fk foreign key (prod_1_id) references PROD_1(prod_1_id) ;
alter table prod add constraint prod_prod_fk foreign key (prod_2_id) references PROD_2(prod_2_id) ;
------------------PROD_1------------------------------
create table PROD_1

(
Prod_1_Id
Number(8)
Prod_1_Name
Varchar2(50)
Prod_1_desc
Varchar2(100)
not null,
not null,
alter table prod_1 add constraint prod_1_pk primary key (prod_1_id);
------------------PROD_2------------------------------
create table PROD_2

(
Prod_2_Id
Number(8)
Prod_2_Name
Varchar2(50)
Prod_2_desc
Varchar2(100),
Prod_3_Id
Number(8)
not null,
not null,
not null,

alter table prod_2 add constraint prod_2_prod_fk foreign key (prod_3_id) references PROD_3(prod_3_id) ;
------------------PROD_3------------------------------
create table PROD_3

(
Prod_3_Id
Number(8)
Prod_3_Name
Varchar2(50)
Prod_3_desc
Varchar2(100),
Prod_4_Id
Number(8)
not null,
not null,
not null,
109

alter table prod_3 add constraint prod_3_prod_fk foreign key (prod_4_id) references PROD_4(prod_4_id) ;
------------------PROD_4-----------------------------create table PROD_4

(
Prod_4_Id
Number(8)
Prod_4_Name
Varchar2(50)
Prod_4_desc
Varchar2(100)
not null,
not null,
--------------------CUST---------------------------
create table CUST

(
cust_Id
Number(8),
cust_Name
Varchar2(50)
cust_desc
Varchar2(100) ,
cust_1_Id
Number(8)
not null,
cust_2_Id
Number(8)
not null,
cust_level
Number
not null,
Cust_grade
Varchar2(100)
not null,
Cust_peer_rank
Varchar2(100)
Number
Number,
Last_updt_date
Date
not null,
not null,
not null,
not null
alter table cust add constraint cust_pk primary key (cust_id);

alter table cust add constraint cust_cust_1_fk foreign key (cust_1_id) references cust_1(cust_1_id) ;
alter table cust add constraint cust_cust_2_fk foreign key (cust_2_id) references cust_2(cust_2_id) ;
---------------------CUST_1---------------------------
create table CUST_1

(
110
cust_1_Id
Number(8)
cust_1_Name
Varchar2(50)
cust_1_desc
Varchar2(100))
not null,
not null,
alter table cust_1 add constraint cust_1_pk primary key (cust_1_id);
---------------------CUST_2---------------------------
create table CUST_2

(
cust_2_Id
Number(8)
not null,
cust_2_Name
Varchar2(50)
not null,
cust_2_desc
Varchar2(100),
cust_3_Id
Number(8)
not null

alter table cust_2 add constraint cust_2_cust_3_fk foreign key (cust_3_id) references cust_3(cust_3_id) ;
---------------------CUST_3---------------------------
create table CUST_3

(
cust_3_Id
Number(8)
not null,
cust_3_Name
Varchar2(50)
not null,
cust_3_desc
Varchar2(100),
cust_4_Id
Number(8)
not null

alter table cust_3 add constraint cust_3_cust_4_fk foreign key (cust_4_id) references cust_4(cust_4_id) ;
---------------------CUST_4---------------------------
create table CUST_4

(
cust_4_Id
Number(8)
not null,
cust_4_Name
Varchar2(50)
not null,
cust_4_desc
Varchar2(100)
111
---------------------GEO---------------------------
create table geo

(
geo_Id
Number(8),
geo_Name
Varchar2(50)
geo_desc
Varchar2(100) ,
geo_1_Id
Number(8)
not null,
geo_level
Number
not null,
Last_updt_date
not null,
Date not null
)
---------------------GEO_1---------------------------
create table geo_1

(
geo_1_Id
Number(8)
not null,
geo_1_Name
Varchar2(50)
not null,
geo_1_desc
Varchar2(100)
geo_2_Id
Number(8)
not null
alter table geo_1 add constraint geo_1_pk primary key (geo_1_id);

alter table geo_1 add constraint geo_1_geo_2_fk foreign key (geo_2_id) references geo_2(geo_2_id) ;
---------------------GEO_2---------------------------
create table geo_2

(
geo_2_Id
Number(8)
not null,
geo_2_Name
Varchar2(50)
not null,
geo_2_desc
Varchar2(100),
geo_3_Id
Number(8)
not null

alter table geo_2 add constraint geo_2_geo_3_fk foreign key (geo_3_id) references geo_3(geo_3_id) ;
112
---------------------GEO_3---------------------------
create table geo_3

(
geo_3_Id
Number(8)
not null,
geo_3_Name
Varchar2(50)
not null,
geo_3_desc
Varchar2(100),
geo_4_Id
Number(8)
not null
-------------------CREATE PARTITIONS IN SALES------------------------------
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20110201,
PARTITION
Z_LAST_PARTITION);
(20110201)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20110301,
PARTITION
Z_LAST_PARTITION);
(20110301)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20110401,
PARTITION
Z_LAST_PARTITION);
(20110401)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20110501,
PARTITION
Z_LAST_PARTITION);
(20110501)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20110601,
PARTITION
Z_LAST_PARTITION);
(20110601)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20110701,
PARTITION
Z_LAST_PARTITION);
(20110701)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20110801,
PARTITION
Z_LAST_PARTITION);
(20110801)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20110901,
PARTITION
Z_LAST_PARTITION);
(20110901)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20111001,
PARTITION
Z_LAST_PARTITION);
(20111001)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20111101,
PARTITION
Z_LAST_PARTITION);
(20111101)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20111201,
PARTITION
Z_LAST_PARTITION);
(20111201)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20120101,
PARTITION
Z_LAST_PARTITION);
(20120101)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20120201,
PARTITION
Z_LAST_PARTITION);
(20120201)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20120301,
PARTITION
Z_LAST_PARTITION);
(20120301)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20120401,
PARTITION
Z_LAST_PARTITION);
(20120401)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20120501,
PARTITION
Z_LAST_PARTITION);
(20120501)
INTO
113
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20120601,
PARTITION
Z_LAST_PARTITION);
(20120601)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20120701,
PARTITION
Z_LAST_PARTITION);
(20120701)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20120801,
PARTITION
Z_LAST_PARTITION);
(20120801)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20120901,
PARTITION
Z_LAST_PARTITION);
(20120901)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20121001,
PARTITION
Z_LAST_PARTITION);
(20121001)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20121101,
PARTITION
Z_LAST_PARTITION);
(20121101)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20121201,
PARTITION
Z_LAST_PARTITION);
(20121201)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20130101,
PARTITION
Z_LAST_PARTITION);
(20130101)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20130201,
PARTITION
Z_LAST_PARTITION);
(20130201)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20130301,
PARTITION
Z_LAST_PARTITION);
(20130301)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20130401,
PARTITION
Z_LAST_PARTITION);
(20130401)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20130501,
PARTITION
Z_LAST_PARTITION);
(20130501)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20130601,
PARTITION
Z_LAST_PARTITION);
(20130601)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20130701,
PARTITION
Z_LAST_PARTITION);
(20130701)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20130801,
PARTITION
Z_LAST_PARTITION);
(20130801)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20130901,
PARTITION
Z_LAST_PARTITION);
(20130901)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20131001,
PARTITION
Z_LAST_PARTITION);
(20131001)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20131101,
PARTITION
Z_LAST_PARTITION);
(20131101)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20131201,
PARTITION
Z_LAST_PARTITION);
(20131201)
INTO
alter
table
sales
(PARTITION
SPLIT PARTITION
Z_LAST_PARTITION
AT
P_20140101,
PARTITION
Z_LAST_PARTITION);
(20140101)
INTO
3. GENERATE CONTROL FILE DYNAMICALLY FOR SQL* LOADER

The process given below shows the way to generate control file dynamically:
set echo of ver of feed of pages 0
accept tname prompt 'Enter Name of Table: '
114
accept dformat prompt 'Enter Format to Use for Date Columns: '
spool &tname..ctl
select 'LOAD DATA'|| chr (10) ||
'INFILE ''' || lower (table_name) || '.csv''' || chr (10) ||
'INTO TABLE '|| table_name || chr (10)||
'FIELDS TERMINATED BY '','''||chr (10)||
'TRAILING NULLCOLS' || chr (10) || '('
from user_tables
where table_name = upper ('&tname');
select decode (rownum, 1, ' ', ' , ') ||
rpad (column_name, 33, ' ') ||
decode (data_type,
'VARCHAR2', 'CHAR NULLIF ('||column_name||'=BLANKS)',
'FLOAT', 'DECIMAL EXTERNAL NULLIF('||column_name||'=BLANKS)',
'NUMBER', decode (data_precision, 0,
'INTEGER EXTERNAL NULLIF ('||column_name||
'=BLANKS)', decode (data_scale, 0,
'INTEGER EXTERNAL NULLIF ('||
column_name||'=BLANKS)',
'DECIMAL EXTERNAL NULLIF ('||
column_name||'=BLANKS)')),
'DATE', 'DATE "&dformat" NULLIF ('||column_name||'=BLANKS)', null)
from user_tab_columns
where table_name = upper ('&tname')
order by column_id;
select ')'
from dual;
spool of
4. DATA LOADING USING SQL* LOADER

We can load data into tables using SQL Loader. We need control file that specifies the
details of the loading process and the data file containing the data that needs to be
loaded. Then we need to run the command to direct SQL Loader to load the data into the
tables. We are using the flat file i.e. comma separated file as the data file. Example of
loading data into Customer table is given below. Similary data can be loaded into all the
tables.
115
Sample Control File for CUSTOMER table generated using process given in
appendix 3
LOAD DATA
INFILE 'D:\Gunjan\ctl\files\customer.csv'
INTO TABLE CUSTOMER
FIELDS TERMINATED BY ','
TRAILING NULLCOLS
( CUST_ID
INTEGER EXTERNAL NULLIF (CUST_ID=BLANKS)
, CUST_NAME
CHAR NULLIF (CUST_NAME=BLANKS)
, CUST_DESC
, CUST_1_ID
CHAR NULLIF (CUST_DESC=BLANKS)

INTEGER EXTERNAL NULLIF (CUST_1_ID=BLANKS)
, CUST_1_NAME
CHAR NULLIF (CUST_1_NAME=BLANKS)
, CUST_1_DESC
, CUST_2_ID
CHAR NULLIF (CUST_1_DESC=BLANKS)

, CUST_2_NAME
, CUST_2_DESC
, CUST_3_ID

, CUST_3_NAME
, CUST_3_DESC
, CUST_4_ID

, CUST_4_NAME
, CUST_4_DESC
, CUST_LEVEL
, CUST_GRADE
, CUST_PEER_RANK
, TRADE_CHANEL_CURR_ID
, TRADE_CHANEL_HIST_ID
, LAST_UPDT_DATE

DECIMAL EXTERNAL NULLIF (CUST_LEVEL=BLANKS)
CHAR NULLIF (CUST_GRADE=BLANKS)
CHAR NULLIF (CUST_PEER_RANK=BLANKS)
DECIMAL EXTERNAL NULLIF (TRADE_CHANEL_CURR_ID=BLANKS)
DECIMAL EXTERNAL NULLIF (TRADE_CHANEL_HIST_ID=BLANKS)
DATE "MM/DD/YY" NULLIF (LAST_UPDT_DATE=BLANKS)
Sample data file for CUSTOMER table

1,"Burton","Male",1,"Cakewalk","IT company",1,"Accounting","Accounting
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
2,"Simone","Female ",1,"Cakewalk","IT company",1,"Accounting","Accounting
3,"Hyatt","Male",1,"Cakewalk","IT company",1,"Accounting","Accounting
4,"Zia","Male",1,"Cakewalk","IT company",1,"Accounting","Accounting
5,"Rama","Female ",1,"Cakewalk","IT company",1,"Accounting","Accounting
116
6,"Deirdre","Female ",1,"Cakewalk","IT company",1,"Accounting","Accounting
7,"Zeph","Male",1,"Cakewalk","IT company",1,"Accounting","Accounting
8,"Kylee","Male ",1,"Cakewalk","IT company",1,"Accounting","Accounting
9,"Drew","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
10,"Fay","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
11,"Lacota","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
12,"Larissa","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
13,"Griffith","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
14,"James","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
15,"Salvador","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
16,"Indira","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
17,"Kelsey","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
18,"Stephanie","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
19,"Dexter","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
20,"Otto","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
21,"Dawn","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
22,"Alec","Male ",1,"Cakewalk","IT company",2,"Advertising","Advertising
23,"Piper","Male",1,"Cakewalk","IT company",5,"Customer Service","Customer Service
Department",2,"CH","CHINA",1,"APAC","Asia Pacific",4,"C",4,2,2,01/25/12
24,"Karly","Female ",1,"Cakewalk","IT company",5,"Customer Service","Customer
Service Department",2,"CH","CHINA",1,"APAC","Asia Pacific",4,"C",4,2,2,01/25/12
25,"Briar","Male",1,"Cakewalk","IT company",5,"Customer Service","Customer Service
Department",2,"CH","CHINA",1,"APAC","Asia Pacific",4,"C",4,2,2,01/25/12
26,"Laura","Female ",1,"Cakewalk","IT company",5,"Customer Service","Customer
Service Department",2,"CH","CHINA",1,"APAC","Asia Pacific",4,"C",4,2,2,01/25/12
Command to Load data into CUSTOMER table
D:\oracle\product\10.2.0\db_1\BIN\sqlldr scott/*****@ORCL
control=D:\gunjan\ctl\CUSTOMER.ctl log=D:\gunjan\ctl\customer.log
117
5 LOGS FOR SCHEMA PERFORMANCE EVALUATION

SQL> create sequence seq_snow start with 214250 increment by 1 nocache nocycle;
Sequence created.
Elapsed: 00:00:00.00
SQL> create sequence seq start with 214250 increment by 1 nocache nocycle;
Sequence created.
Elapsed: 00:00:00.00
SQL> insert into sales select seq.nextval
2 FACT_TYPE ,
3 TRANSACTION_DATE
4 DAY_ID
5 LOAD_DATE ,
6 BILLING_TYPE ,
7 CUST_ID
8 PROD_ID
9 GEO_ID
10 STORE_ID
11 BASE_UNIT_QTY
12 NET_INVOICE_VALUE ,
13 GROSS_INVOICE_VALUE ,
14 NET_INVOICE_USD
15 GROSS_INVOICE_USD from sales;

214248 rows created.
Elapsed: 00:01:05.53
SQL> insert into sales_snow select seq_snow.nextval
2 FACT_TYPE ,
3 TRANSACTION_DATE
4 DAY_ID
5 LOAD_DATE ,
6 BILLING_TYPE ,
7 CUST_ID
8 PROD_ID
9 GEO_ID
10 STORE_ID
11 BASE_UNIT_QTY
118
14 NET_INVOICE_USD
15 GROSS_INVOICE_USD from sales_snow;

Elapsed: 00:00:42.79
SQL> select count(*) from sales;
COUNT(*)
---------428496
Elapsed: 00:00:00.09
SQL> select count(*) from sales_snow;
COUNT(*)
---------428496
Elapsed: 00:00:00.04
SQL> select sum(base_unit_qty) from sales, geography where sales.geo_id=geography.geo_id and geo_3_name='APAC';
SUM(BASE_UNIT_QTY)
-----------------4618978
Elapsed: 00:00:00.10
SQL> select sum(base_unit_qty) from sales_snow s,geo g , geo_1 g1 , geo_2 g2 , geo_3 g3 where
2
s.geo_id=g.geo_id and g.geo_1_id=g1.geo_1_id
and g1.geo_2_id=g2.geo_2_id
and g3.geo_3_name='APAC';
SUM(BASE_UNIT_QTY)
-----------------4618978
Elapsed: 00:00:00.12
SQL>
SQL> select sum(net_invoice_value) from sales s , product p , customer c where s.prod_id=p.prod_id and
2 c.cust_id=s.cust_id and c.cust_3_name='IN' and p.prod_3_name='Hair Care';
SUM(NET_INVOICE_VALUE)
---------------------21470160
Elapsed: 00:00:00.06
SQL> select sum(net_invoice_value) from sales_snow s , prod p , cust c , prod_2 p2, prod_3 p3, cust_2 c2 , cust_3 c3
2 where s.prod_id=p.prod_id and c.cust_id=s.cust_id
3
and p.prod_2_id=p2.prod_2_id
and p2.prod_3_id=p3.prod_3_id
5 and
119
6
7
c.cust_2_id=c2.cust_2_id
and c2.cust_3_id=c3.cust_3_id
8 and p3.prod_3_name='Hair Care'and c3.cust_3_name='India' ;

---------------------21470160
Elapsed: 00:00:00.10
SQL> select distinct prod_name from sales s , product p where s.prod_id = p.prod_id and p.prod_1_name ='Shampoo';
PROD_NAME
-------------------------------------------------Moist
NO MORE TANGLES Shampoo
NATURAL Baby Shampoo
NO MORE TANGLES Leave-in Conditioner
NO MORE TANGLES Detangling Spray
Baby Shampoo with Natural Lavender
Instant Freeze
Baby Shampoo
Aussome Volume
Sprunch
Hair Insurance
PROD_NAME
-------------------------------------------------Sydney Smooth
KIDS NO MORE TANGLES
Cleanse & Mend
Baby Daily Face & Body Lotion SPF 40
Opposites Attract
Mega
KIDS HEAD-TO-TOE Body Wash Tropical Blast
NO MORE TANGLES Extra Conditioning Shampoo
Sun-Touched Shine
KIDS E-Z GRIP SOAPBerry Breeze
21 rows selected.
Elapsed: 00:00:00.12
SQL> select distinct prod_name from sales s , prod p , prod_1
p1.prod_1_id=p.prod_1_id and p1.prod_1_name='Shampoo';
p1
where s.prod_id = p.prod_id and
PROD_NAME
-------------------------------------------------Moist
120

Instant Freeze
Baby Shampoo
Aussome Volume
Sprunch
Hair Insurance
PROD_NAME
-------------------------------------------------Sydney Smooth
Cleanse & Mend
Opposites Attract
Mega
Sun-Touched Shine
21 rows selected.
Elapsed: 00:00:00.15
SQL> select sum(base_unit_qty) ,
2 store_location from sales s , geography g , customer c, store st
3 where s.cust_id = c.cust_id
4 and s.geo_id = g.geo_id and
5 s.store_id = st.store_id and
6 upper(st.store_location) = upper(c.cust_3_desc) and
7 upper(st.store_location) = upper(g.geo_2_name) group by store_location;
SUM(BASE_UNIT_QTY) STORE_LOCA
------------------ ---------102000 India
56720 Canada
8560 Europe
Elapsed: 00:00:00.10
2 store_location from sales_snow s , geo g , cust c, store st , cust_2 c2 , cust_3 c3 , geo_1 g1 , geo_2 g2
3 where s.cust_id = c.cust_id
121
4 and s.geo_id = g.geo_id and

5 s.store_id = st.store_id and
6 c.cust_2_id = c2.cust_2_id and
7 c2.cust_3_id = c3.cust_3_id and
8 g.geo_1_id = g1.geo_1_id and
9 g1.geo_2_id = g2.geo_2_id and
10 upper(st.store_location) = upper(c3.cust_3_name) and
11 upper(st.store_location) = upper(g2.geo_2_name) group by store_location;
------------------ ---------102000 India
56720 Canada
8560 Europe
Elapsed: 00:00:00.32
SQL> select count(*) from sales s , product p where billing_type='MN' and to_char(load_date,'YYYY') ='2012' and
fact_type='BR' and
2 s.prod_id = p.prod_id and p.prod_1_name='Toothpaste';
COUNT(*)
---------80
Elapsed: 00:00:00.01
SQL> select count(*) from sales_snow s , prod p , prod_1 p1 where billing_type='MN' and to_char(load_date,'YYYY')
='2012' and fact_type='BR' and
2 s.prod_id = p.prod_id and p1.prod_1_id = p.prod_1_id and p1.prod_1_name='Toothpaste';
COUNT(*)
---------80
Elapsed: 00:00:00.02
SQL>
2
FACT_TYPE ,
TRANSACTION_DATE
DAY_ID
LOAD_DATE ,
BILLING_TYPE ,
CUST_ID
PROD_ID
GEO_ID
10
STORE_ID
11
BASE_UNIT_QTY
12
NET_INVOICE_VALUE ,
122
13
GROSS_INVOICE_VALUE ,
14
NET_INVOICE_USD
15
GROSS_INVOICE_USD from sales;

Elapsed: 00:01:18.95
2
FACT_TYPE ,
TRANSACTION_DATE
DAY_ID
LOAD_DATE ,
BILLING_TYPE ,
CUST_ID
PROD_ID
GEO_ID
10
STORE_ID
11
BASE_UNIT_QTY
12
NET_INVOICE_VALUE ,
13
14
NET_INVOICE_USD
15
GROSS_INVOICE_USD from sales_snow;

Elapsed: 00:01:08.12
SQL>
SQL>
SQL>
SQL>
SUM(BASE_UNIT_QTY)
-----------------9237956
Elapsed: 00:00:00.20
2
3

SUM(BASE_UNIT_QTY)
-----------------9237956
Elapsed: 00:00:01.04
123
2
SUM(BASE_UNIT_QTY)
-----------------9237956
Elapsed: 00:00:00.28
SUM(BASE_UNIT_QTY)
-----------------9237956
Elapsed: 00:00:00.20
2
and
---------------------42940320
Elapsed: 00:00:00.15
2
c.cust_id=s.cust_id and c.cust_3_name='IN' and p.prod_3_name='Hair Care';
---------------------42940320
Elapsed: 00:00:00.14
SQL> select distinct prod_name from sales_snow s , prod p , prod_1 p1
PROD_NAME
-------------------------------------------------Moist
124

Instant Freeze
Baby Shampoo
Aussome Volume
Sprunch
Hair Insurance
PROD_NAME
-------------------------------------------------Sydney Smooth
Cleanse & Mend
Opposites Attract
Mega
Sun-Touched Shine
21 rows selected.
Elapsed: 00:00:00.17
PROD_NAME
-------------------------------------------------Moist
Instant Freeze
Baby Shampoo
Aussome Volume
Sprunch
Hair Insurance
PROD_NAME
-------------------------------------------------Sydney Smooth
Cleanse & Mend
125

Opposites Attract
Mega
Sun-Touched Shine
21 rows selected.
Elapsed: 00:00:00.15
SQL>
2
store_location from sales_snow s , geo g , cust c, store st , cust_2 c2 , cust_3 c3 , geo_1 g1 , geo_2 g2
c.cust_2_id = c2.cust_2_id and
c2.cust_3_id = c3.cust_3_id and
g.geo_1_id = g1.geo_1_id and
g1.geo_2_id = g2.geo_2_id and
10
11

------------------ ---------204000 India
113440 Canada
17120 Europe
Elapsed: 00:00:00.59
2
------------------ ---------204000 India
113440 Canada
17120 Europe
Elapsed: 00:00:00.21
126
SQL>
SQL>
COUNT(*)
---------160
Elapsed: 00:00:00.01
fact_type='BR' and
COUNT(*)
---------160
Elapsed: 00:00:00.01
2
3
FACT_TYPE ,
TRANSACTION_DATE
DAY_ID
LOAD_DATE ,
BILLING_TYPE ,
CUST_ID
PROD_ID
GEO_ID
10
STORE_ID
11
BASE_UNIT_QTY
12
NET_INVOICE_VALUE ,
13
14
NET_INVOICE_USD
15
SQL> /
SQL> /
127
SQL>
2 FACT_TYPE ,
3 TRANSACTION_DATE
4 DAY_ID
5 LOAD_DATE ,
6 BILLING_TYPE ,
7 CUST_ID
8 PROD_ID
9 GEO_ID
10 STORE_ID
11 BASE_UNIT_QTY
14 NET_INVOICE_USD
SQL> /
SQL> /
SQL> commit;
Commit complete.
SQL> set timing on

SQL> select count(*) from sales;
COUNT(*)
---------3427968
Elapsed: 00:00:01.01
128
SQL> select count(*) from sales_snow;
COUNT(*)
---------3427968
Elapsed: 00:00:03.54
2
SUM(BASE_UNIT_QTY)
-----------------36951824
Elapsed: 00:00:14.92
SQL>
SUM(BASE_UNIT_QTY)
-----------------36951824
Elapsed: 00:00:10.28
2
---------------------171761280
Elapsed: 00:00:04.81
SQL>
2
4
5
6
and
129
7
8
---------------------171761280
Elapsed: 00:00:10.31
PROD_NAME
-------------------------------------------------Moist
Instant Freeze
Baby Shampoo
Aussome Volume
Sprunch
Hair Insurance
PROD_NAME
-------------------------------------------------Sydney Smooth
Cleanse & Mend
Opposites Attract
Mega
Sun-Touched Shine
21 rows selected.
Elapsed: 00:00:02.48
130

PROD_NAME
-------------------------------------------------Moist
Instant Freeze
Baby Shampoo
Hair Insurance
Sprunch
Aussome Volume
PROD_NAME
-------------------------------------------------Sydney Smooth
Cleanse & Mend
Opposites Attract
Mega
Sun-Touched Shine
21 rows selected.
Elapsed: 00:00:04.00
SQL>
2
131
------------------ ---------816000 India
453760 Canada
68480 Europe
Elapsed: 00:00:04.06
2
10
11

------------------ ---------816000 India
453760 Canada
68480 Europe
Elapsed: 00:00:04.29
SQL>
fact_type='BR' and
COUNT(*)
---------640
Elapsed: 00:00:00.03
132
COUNT(*)
---------640
Elapsed: 00:00:00.03
SQL>
2
FACT_TYPE ,
TRANSACTION_DATE
DAY_ID
LOAD_DATE ,
BILLING_TYPE ,
CUST_ID
PROD_ID
GEO_ID
10
STORE_ID
11
BASE_UNIT_QTY
12
NET_INVOICE_VALUE ,
13
14
NET_INVOICE_USD
15
Elapsed: 00:02:40.41
2 FACT_TYPE ,
3 TRANSACTION_DATE
4 DAY_ID
5 LOAD_DATE ,
6 BILLING_TYPE ,
7 CUST_ID
8 PROD_ID
9 GEO_ID
10 STORE_ID
11 BASE_UNIT_QTY
14 NET_INVOICE_USD
133
Elapsed: 00:02:42.42
SQL> commit;
Commit complete.
Elapsed: 00:00:00.00
SUM(BASE_UNIT_QTY)
-----------------73903648
Elapsed: 00:00:19.68
2
SUM(BASE_UNIT_QTY)
-----------------73903648
Elapsed: 00:00:21.48
2
---------------------343522560
Elapsed: 00:00:05.34
2
and
134
7
8
---------------------343522560
Elapsed: 00:00:07.62
PROD_NAME
-------------------------------------------------Moist
Instant Freeze
Baby Shampoo
Aussome Volume
Sprunch
Hair Insurance
PROD_NAME
-------------------------------------------------Sydney Smooth
Cleanse & Mend
Opposites Attract
Mega
Sun-Touched Shine
21 rows selected.
Elapsed: 00:00:05.31
135

PROD_NAME
-------------------------------------------------Moist
Instant Freeze
Baby Shampoo
Hair Insurance
Sprunch
Aussome Volume
PROD_NAME
-------------------------------------------------Sydney Smooth
Cleanse & Mend
Opposites Attract
Mega
Sun-Touched Shine
21 rows selected.
Elapsed: 00:00:06.00
2
136
------------------ ---------1632000 India
907520 Canada
136960 Europe
Elapsed: 00:00:06.06
2
10
11

------------------ ---------1632000 India
907520 Canada
136960 Europe
Elapsed: 00:00:08.14
fact_type='BR' and
COUNT(*)
---------1280
Elapsed: 00:00:00.03
COUNT(*)
----------
137
1280
Elapsed: 00:00:00.04
SQL> spool of
138

Performance Tuning On DWS

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Performance Tuning On DWS

Uploaded by

Copyright:

Available Formats

H

PERFORMANCE EVALUATION AND

Online Transaction Processing (OLTP) applications such as order processing,

Broad Academic Area of Work: Data Warehousing

Bimal kinkar Das

4.6 Creating Tables.............................................................................................................................36

20: Data model for Geo_2.........................................................................................35

1.2 Need for Data Warehousing

Get in depth knowledge of their company's performance.

Monitor how business factors change over time.

Compare their company's performance relative to competition and industry

In short, strategic information is required for continued health and survival of an

executives and managers use this data for making

1.3 Architecture of Data warehouse

Figure 1: Basic Data warehouse

Figure 2: Complex Data warehouse

The main components of a complex data warehouse are:

User desktop query sessions

Information printed on adhoc reports

Data stores of query and reporting tools for repeated usage

Downstream systems such as EIS and Data mining

1.4 Attributes of Data Warehouse

1.5 Data Warehouse vs Data Marts

It shows an enterprise view of data.

It is inherently architect-ed not a union of disparate data marts

It is a single , central storage of data.

There are centralized rules and control.

It provides quick results if implemented with iterations.

This approach takes longer to build.

Bottom Up Approach (Creating data marts first):

Faster and easy implementation.

Less risk of failure.

Inherently incremental , can schedule important data marts first.

Favorable returns on POC.

Permeates redundant data.

Inconsistent and irreconcilable data.

Proliferates unmanageable interfaces

Each data mart has its own narrow view of data.

1.6 Advantages of Data warehouse

Many inconsistencies are identified and resolved before loading of information in

Because of being different from operational systems, a data warehouse helps in

Data warehousing enhances the value of operational business applications and

Data warehousing also leads to proper functioning of support system applications

Precisely, a data warehouse system proves to be helpful in providing collective

1.7 Disadvantages of Data warehouse

Long initial implementation time and associated high cost .

Typically, data is static and dated .

Typically, cannot actively monitor changes in data .

SCHEMA ALTERNATIVES IN DATAWAREHOUSE

Because of normalization, answering simple business questions will require joining

2.2 Dimension Modeling Schemas

dimensional schema design.

Choose the business process

Declare the Grain

Identify the dimensions

Identify the Fact

CHOOSE THE BUSINESS PROCESS

2.3 Star Schema

Figure 3: General Star Schema

They are easy to understand as the data is organized around subjects.

Since data is in highly de-normalized format, browsing through data is easy.

It is very easy to define dimensional hierarchies.

Provide highly optimized performance for typical star queries.