Performance Tuning On DWS
by
Bimal Kinkar Das
ABSTRACT:
Operational systems that process claims, maintain checking accounts, etc. are important systems that are used to run the organization. As an organization grows larger, hundreds of computer applications are needed to support its various business processes. But as the business grows more complex and spreads globally, business executives need more information to stay competitive and to make strategic decisions. The operational systems, although needed for running the business, cannot provide this strategic information. Thus the need arises for data warehousing, a paradigm intended to provide strategic information. OLTP systems are not suitable for decision support applications because their emphasis is on providing ACID (Atomicity, Consistency, Isolation and Durability) semantics. Hence the performance of these systems is poor for decision support applications.
Decision support applications use Online Analytical Processing (OLAP), which is a recent and important application of database systems. Large organizations spend billions on data warehousing to run their business, and they expect their decision support systems to provide them with timely information. Decision support systems and applications used for extracting strategic information need an enterprise-wide view of data. The data warehouses used by these systems typically maintain a lot of information, which grows over time. Due to the large volume of data, queries fired on the database may take hours before returning results. Also, OLAP queries are complex and can take many hours if executed directly on raw data.
Thus, in this dissertation we look into ways to minimize the time needed to provide information. To minimize this time, we need to find ways to make the performance of the database used for the data warehouse implementation high. To achieve this we have:
Compared, evaluated and found the best schema, i.e. the star schema design, for implementing the data warehouse.
Compared, evaluated and found ways to write queries with the best response time and throughput.
ACKNOWLEDGEMENTS
I take this opportunity to thank all those who have helped me in my dissertation
work.
CONTENTS
Title Page
Certificate
Abstract
Acknowledgment
List of Figures and Tables
Contents
References
Appendix
TABLE OF CONTENTS
INTRODUCTION ....................................................................................................................................1
1.1 Introduction to Data warehousing....................................................................................................1
1.2 Need for Data Warehousing.............................................................................................................2
1.3 Architecture of Data warehouse.......................................................................................................4
1.4 Attributes of Data Warehouse..........................................................................................................6
1.5 Data Warehouse vs Data Marts........................................................................................................9
Top Down Approach (Creating data warehouse first):.....................................................................9
Bottom Up Approach (Creating data marts first):............................................................................9
1.6 Advantages of Data warehouse.....................................................................................................10
1.7 Disadvantages of Data warehouse.................................................................................................11
SCHEMA ALTERNATIVES IN DATA WAREHOUSE..........................................................................12
2.1 Introduction...................................................................................................................................12
2.2 Dimension Modeling Schemas......................................................................................................13
2.3 Star Schema...................................................................................................................................15
2.4 Snowflake Schema........................................................................................................................17
INTRODUCTION TO TECHNOLOGIES USED..................................................................................19
3.1 Introduction to SQL......................................................................................................................19
3.1.1 What Can SQL do?................................................................................................................19
3.1.2 Syntax for Basic SQL Queries:..............................................................................................20
PERFORMANCE EVALUATION OF SCHEMAS................................................................................23
4.1 Introduction...................................................................................................................................23
4.2 Creating the star schema................................................................................................................24
4.4 Creating a snowflake schema........................................................................................................30
4.5 Data Model for the Snowflake Schema.........................................................................................31
List Of Figures
1 : Basic Data warehouse...........................................................................5
2 : Complex Data warehouse......................................................................5
3 : General Star Schema..........................................................................16
4 : General Snowflake Schema..................................................................17
5 : STAR Schema for Sales Data Mart.........................................................24
6 : Data Model for the Star Schema...........................................................25
7 : SNOWFLAKE Schema for SALES Data Mart.............................................30
8 : Normalized Product Dimension.............................................................30
9 : Normalized Customer Dimension..........................................................31
10 : Normalized Geography Dimension........................................................31
11 : Schema Performance Comparison........................................................44
12 : Representation of typical non-repeating data in star and snowflake schema...45
13 : Representation of typical repeating data in star and snowflake schema........45
14 : Aggregation setup for SALES...............................................................86
List Of Tables
1 : Data model for SALES.........................................................................................26
2 : Data model for SALES_AGG.................................................................................27
3 : Data model for PRODUCT.....................................................................................27
4 : Data model for CUSTOMER..................................................................................28
5 : Data model for STORE........................................................................................ 28
6 : Data model for GEOGRAPHY.................................................................................29
7 : Data model for CAL_MASTER...............................................................................29
8 : Data model for PROD..........................................................................................32
9 : Data model for PROD_1.......................................................................................32
10: Data model for PROD_2.......................................................................................32
11: Data model for PROD_3.......................................................................................33
12: Data model for PROD_4.......................................................................................33
13: Data model for Cust............................................................................................33
14: Data model for Cust_1........................................................................................34
15: Data model for Cust_2........................................................................................34
16: Data model for Cust_3........................................................................................34
17: Data model for Cust_4........................................................................................34
18: Data model for Geo.............................................................................................35
19: Data model for Geo_1.........................................................................................35
Chapter 1
INTRODUCTION
1.1 Introduction to Data warehousing
Data Warehouse is a term used to describe a system that collects data in an organization, most of which is transactional data such as purchase records, from one or more data sources, such as the databases of transactional systems, into a central data location, the Data Warehouse, and later reports those data, generally in an aggregated way, to business users in the organization. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time and are Business Driven, Market Focused and Technology Based.
This system consists of an ETL tool, a database, a reporting tool and other facilitating
tools such as a data modeling tool.
A Data warehouse (DW) is a database used for reporting. The data is offloaded from the
operational systems for reporting. The data may pass through an operational data store
for additional operations before it is used in the data warehouse for reporting. A data
warehouse maintains its functions in three layers: staging, integration and access.
Staging is used to store raw data for use by developers (analysis and support).
The integration layer is used to integrate data and to provide a level of abstraction from users.
The access layer is for getting data out for users. This definition of data warehouse focuses on data storage.
The main source of the data is cleaned, transformed, cataloged and made available for
use by managers and other business professionals for data mining, online analytical
processing, market research and decision support.
Why can the available data not be used for making strategic decisions? This is because the available data is not readily usable for strategic decision making. These large quantities of data are very useful and good for running the business operations, but hardly amenable for use in making decisions about business strategies and objectives.
The fact is that for nearly two decades or more, IT departments have been attempting to provide information to key personnel in their companies for making strategic decisions.
Sometimes an IT department could produce ad hoc reports from a single application. In
most cases, the reports would need data from multiple systems, requiring the writing of
extract programs to create intermediary files that could be used to produce the ad hoc
reports.
Most of these attempts by IT in the past ended in failure. The users could not clearly
define what they wanted in the first place. Once they saw the first set of reports, they
wanted more data in different formats. The chain continued. This was mainly because of
the very nature of the process of making strategic decisions. Information needed for
strategic decision making has to be available in an interactive manner. The user must be
able to query online, get results, and query some more. The information must be in a
format suitable for analysis.
What is a basic reason for the failure of all the previous attempts by IT to provide
strategic information? What has IT been doing all along? The fundamental reason for
the inability to provide strategic information is that we have been trying all
along to provide strategic information from the operational systems. These
operational systems such as order processing, inventory control, claims processing,
outpatient billing, and so on are not designed or intended to provide strategic
information. If we need the ability to provide strategic information, we must get the
information from altogether different types of systems. Only specially designed decision
support systems or informational systems can provide strategic information.
Thus, we do need different types of decision support systems to provide strategic
information. The type of information needed for strategic decision making is different
from that available from operational systems. We need a new type of system
environment for the purpose of providing strategic information for analysis and
monitoring performance.
This new system environment that users desperately need to obtain strategic information
happens to be the new paradigm of data warehousing. Enterprises that are building data
warehouses are actually building this new system environment. This new environment is
kept separate from the system environment supporting the day-to-day operations. The
data warehouse essentially holds the business intelligence for the enterprise to enable
strategic decision making. The data warehouse is the only viable solution. We have
clearly seen that solutions based on the data extracted from operational systems are all
totally unsatisfactory.
Basically, the data warehouse is a simple concept. It involves different functions: data extraction, loading the data, transforming the data, storing the data, and providing user interfaces. The end result is the creation of a new computing environment for the purpose of providing the strategic information every enterprise desperately needs.
Subject Oriented:
A data warehouse is organized around subjects. Subject orientation presents the data
in a format that is consistent and much clearer for end users to understand. In
contrast operational data such as order processing and manufacturing databases are
organized around business activities or functional areas. They are typically optimized
to serve a single static application. The functional separation of applications causes
companies to store identical information in multiple locations. The duplicated
information's format is usually inconsistent. For data warehouse subjects could be
Product, Customers and Orders as opposed to Purchasing and Payroll.
Data warehouses are designed to help you analyze your data. For example, you might want to learn more about your company's sales data. To do this, you could build a warehouse concentrating on sales. In this warehouse, you could answer questions like "Who was our best customer for this item last year?" This kind of focus on a topic, sales in this case, is what is meant by subject oriented.
Integrated:
Integration of data within a warehouse is accomplished by dictating consistency in format, naming, etc. Operational databases, for historic reasons, often have major inconsistencies in data representation.
For example, a set of operational databases may represent male and female by "m" and "f", by 1 and 2, or by "x" and "y". Frequently the inconsistencies are more complex. Hence cleansing operations convert the various representations of data into one format so that they can be accommodated in a single schema. By definition, data is always maintained in a consistent fashion in a data warehouse. Data warehouses need to have the data from disparate sources put into a consistent format. This means that naming conflicts have to be resolved, and problems like data being in different units of measure must be resolved.
Non Volatile:
Non Volatility, another primary aspect of data warehouses, means that after the
informational data is loaded into the data warehouse, changes, inserts or deletes are
rarely performed. Data is read only in most of the cases. The loaded data is
transformed data that originated from the operational databases. The data
warehouse is subsequently reloaded or more likely appended on a periodic basis with
new, transformed or summarized data from operational databases. Apart from the
loading process, the information contained in the data warehouse generally remains
static. The property of non volatility permits a data warehouse to be heavily
optimized for query processing.
Non volatile means that the data should not change once entered into the data
warehouse. This is logical because the purpose of a warehouse is to analyze what
has occurred.
Time Variant:
Data warehouses are time variant in the sense that they maintain both historical and (nearly) current data. Operational databases, in contrast, contain only the most current, up-to-date values. Furthermore, they generally maintain this information for no more than one year (and often much less). By comparison, data warehouses
contain data that is generally loaded from operational databases daily, weekly or
monthly and then typically maintained for a period of 3 to 5 years. This aspect marks
a major difference between the two types of environments. Historical information is
of high importance to decision makers. They often want to understand trends and
relationships between data. For example, the product manager for a soft drink maker
may want to see the relationship between coupon promotions and sales. This type of
information is typically impossible to determine with an operational database that
contains only current data. Most business analysis requires analyzing trends.
Because of this, analysts tend to need large amounts of data. This is very much in
contrast to OLTP systems , where performance requirements demand historical data
to be moved to an archive.
Advantages:
With data warehousing, you can provide a common data model for different
interest areas regardless of data's source. In this way, it becomes easier to report
and analyze information.
The best part of data warehousing is that the information is under the control of its users, so that even if the source system data gets purged over time, the information can be easily and safely stored for a longer time period.
Disadvantages:
Major data schema transformations from each of the data sources to one schema in the data warehouse can represent more than 50% of the total data warehouse effort.
Data owners lose control over their data, raising ownership (responsibility and accountability), security and privacy issues.
Adding new data sources takes time and has an associated high cost.
Limited flexibility of use and types of users: multiple separate data marts are required for multiple uses and types of users.
It is difficult to accommodate changes in data types and ranges, data source schema, indexes and queries.
Chapter 2
SCHEMA ALTERNATIVES IN DATA WAREHOUSE
The normalization process takes descriptions of entities from the business domain and breaks them apart into a number of small tables. Breaking up entities into small tables also reduces the amount of redundant data stored in the database, since each table can be joined with other tables to form more than one business entity. Reducing data redundancy means that the same piece of data will only be stored in the database once. By having the data stored only once, the problem of updating multiple copies disappears when the data changes. For DSS the main problem is that the business entities will be distributed among a number of tables, which will then need to be joined together in order to form the things that you will want to use in your analysis. Each of these joins can require a significant amount of processing by the database system.
Cryptic naming conventions and data formats: one of the primary reasons why business users cannot read a normalized schema is that it encourages cryptic, shorthand naming conventions for columns and tables.
Hence data warehouses use highly denormalized schemas to provide instant access without having to perform a large number of joins, which are the major cause of poor performance.
Dimension tables: These tables contain the descriptive attributes of the business and are virtually always the source of the row headers in the SQL answer set.
Fact tables: These tables contain one or more numerical measures, or facts, that occur for the combination of a multi-part primary key made up of two or more foreign keys from the dimension tables.
To design a dimensional schema the following design model is used:
IDENTIFY THE DIMENSIONS
Dimensions are the foundation of the fact table, and are where the data for the fact table is collected. Typically dimensions are nouns like date, store, inventory, etc. These dimensions are where all the data is stored. For example, the date dimension could contain data such as year, month and weekday.
IDENTIFY THE FACTS
After defining the dimensions, the next step in the process is to make keys for the fact table and to identify the numeric facts that will populate each fact table row. This step is closely related to the business users of the system, since this is where they get access to data stored in the data warehouse. Therefore most of the fact table columns are numerical, additive figures such as quantity or cost per unit.
The two types of dimensional schemas used in data warehouses are:
Star Schema
Snowflake Schema
The star or the snowflake schema design represents data as an array where each
dimension is a subject around which analysis is performed.
It provides a direct and intuitive mapping between the business entities being analyzed by end users.
Maintenance required for the data warehouse is low if we use integers and
surrogate keys for joining the tables.
Star schemas are used for both simple data marts and very large data warehouses.
Many consistency problems due to updates are also solved, because normalization lowers the granularity of the dimensions. However, since the snowflake schema has more tables, even browsing through the data requires joining tables to get complete information; hence performance is poor.
It can also improve the performance of aggregation queries, because the data storage has very low redundancy.
The main advantage of a snowflake schema is that it may improve performance when smaller tables are joined; it is also easier to maintain and increases flexibility.
The main disadvantage of the snowflake schema is that it increases the number of tables an end user must work with, making queries much more difficult to create and execute because more tables need to be joined.
Chapter 3
INTRODUCTION TO TECHNOLOGIES USED
Creating Tables With SQL:
CREATE TABLE name (
    col1 datatype,
    col2 datatype,
    ...,
    constraint_declare1
);
where constraint_declare ::=
    [ CONSTRAINT constraint_name ]
    PRIMARY KEY ( col1, col2, ... ) |
    FOREIGN KEY ( col1, col2, ... ) REFERENCES f_table [ ( col1, col2, ... ) ] |
    UNIQUE ( col1, col2, ... ) |
    CHECK ( expression )
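As an illustration of this syntax, a minimal sketch of a dimension table resembling the STORE table used later in Chapter 4 (column sizes follow the data models given there):

CREATE TABLE store (
    store_id     NUMBER(8),
    store_name   VARCHAR2(10),
    store_desc   VARCHAR2(100),
    CONSTRAINT store_pk PRIMARY KEY ( store_id )
);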
Creating Sequences With SQL:
CREATE SEQUENCE name
    [ START WITH start_value ]
    [ INCREMENT BY increment_value ]
    [ MINVALUE min_value | NOMINVALUE ]
    [ MAXVALUE max_value | NOMAXVALUE ]
    [ CACHE cache_value ]
    [ CYCLE ]
There is a lot more functionality available in SQL, but the above commands serve as the basis for writing complex SQL queries.
Chapter 4
PERFORMANCE EVALUATION OF SCHEMAS
Region (Geo_3). The Cal_Master dimension contains date, week, month, quarter and year details for three years. The Store dimension contains the store and its manager's details.
We will create a star and a snowflake schema for the sales data mart and then compare
the performance of both.
TABLE: SALES
COLUMN NAME          DATA TYPE      REFERENCES
Fact_Id              Number(8)
Fact_type            Varchar2(10)
Transaction_date     Date           Cal_master(date) disable novalidate rely
Day_Id               Number         Cal_master(day_id) disable novalidate rely
Load_date            Date           Cal_master(date) disable novalidate rely
Billing_Type         Varchar2(10)
Cust_Id              Number(80)     Customer(Cust_id) disable novalidate rely
Prod_Id              Number(80)     Product(Prod_id) disable novalidate rely
Geo_Id               Number(80)     Geography(geo_id) disable novalidate rely
Store_ID             Number(80)     Store(Store_id) disable novalidate rely
Base_Unit_Qty        Number(80)
Net_Invoice_value    Number(80)
Gross_Invoice_Value  Number(80)
Net_Invoice_USD      Number(80)
Gross_Invoice_USD    Number(80)

TABLE: SALES_AGG
COLUMN NAME          DATA TYPE      REFERENCES
Fact_type            Varchar2(10)
Transaction_date     Date           Cal_master(date) disable novalidate rely
Day_Id               Number         Cal_master(day_id) disable novalidate rely
Load_date            Date           Cal_master(date) disable novalidate rely
Billing_Type         Varchar2(10)
Cust_Id              Number(80)     Customer(Cust_id) disable novalidate rely
Prod_Id              Number(80)     Product(Prod_id) disable novalidate rely
Geo_Id               Number(80)     Geography(geo_id) disable novalidate rely
Store_ID             Number(80)     Store(Store_id) disable novalidate rely
Base_Unit_Qty        Number(80)
Net_Invoice_value    Number(80)
Gross_Invoice_Value  Number(80)
Net_Invoice_USD      Number(80)
Gross_Invoice_USD    Number(80)
TABLE: PRODUCT
COLUMN NAME            DATA TYPE
Prod_Id                Number(8)
Prod_Name              Varchar2(10)
Prod_desc              Varchar2(100)
Prod_1_Id              Number(8)
Prod_1_Name            Varchar2(10)
Prod_1_desc            Varchar2(100)
Prod_2_Id              Number(8)
Prod_2_Name            Varchar2(10)
Prod_2_desc            Varchar2(100)
Prod_3_Id              Number(8)
Prod_3_Name            Varchar2(10)
Prod_3_desc            Varchar2(100)
Prod_4_Id              Number(8)
Prod_4_Name            Varchar2(10)
Prod_4_desc            Varchar2(100)
Prod_level             Number
Brand_desc             Varchar2(100)
Prod_size_desc         Varchar2(100)
Last_updt_date         Date

TABLE: CUSTOMER
COLUMN NAME            DATA TYPE
Cust_Id                Number(8)
Cust_Name              Varchar2(10)
Cust_desc              Varchar2(100)
Cust_1_Id              Number(8)
Cust_1_Name            Varchar2(10)
Cust_1_desc            Varchar2(100)
Cust_2_Id              Number(8)
Cust_2_Name            Varchar2(10)
Cust_2_desc            Varchar2(100)
Cust_3_Id              Number(8)
Cust_3_Name            Varchar2(10)
Cust_3_desc            Varchar2(100)
Cust_4_Id              Number(8)
Cust_4_Name            Varchar2(10)
Cust_4_desc            Varchar2(100)
Cust_level             Number
Cust_grade             Varchar2(100)
Cust_peer_rank         Varchar2(100)
trade_chanel_curr_id   Number
trade_chanel_hist_id   Number
Last_updt_date         Date

TABLE: STORE
COLUMN NAME            DATA TYPE
Store_Id               Number(8)
Store_Name             Varchar2(10)
Store_desc             Varchar2(100)
Store_Location         Number(8)
Store_Manager_Id       Number(8)
Store_Manager_Name     Varchar2(10)

TABLE: GEOGRAPHY
COLUMN NAME            DATA TYPE
Geo_Id                 Number(8)
Geo_Name               Varchar2(10)
Geo_desc               Varchar2(100)
Geo_1_Id               Number(8)
Geo_1_Name             Varchar2(10)
Geo_1_desc             Varchar2(100)
Geo_2_Id               Number(8)
Geo_2_Name             Varchar2(10)
Geo_2_desc             Varchar2(100)
Geo_3_Id               Number(8)
Geo_3_Name             Varchar2(10)
Geo_3_desc             Varchar2(100)
Geo_level              Number
Last_updt_date         Date

TABLE: CAL_MASTER
COLUMN NAME            DATA TYPE
Day_Id                 Number(8)
Date                   Date
Day_Name               Varchar2(10)
Week_Id                Number(1)
Mth_Id                 Number(6)
Mth_name               Varchar2(10)
Qtr_Id                 Varchar2(10)
Qtr_Name               Varchar2(10)
Year_Id                Number(4)
Year_name              Varchar2(10)
Holiday_Indicator      Varchar2(1)
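The "disable novalidate rely" entries in the References columns denote foreign key constraints that are declared for the optimizer but not enforced on load. A sketch of how one such constraint might be declared (the constraint name sales_cust_fk is hypothetical):

ALTER TABLE sales ADD CONSTRAINT sales_cust_fk
    FOREIGN KEY ( cust_id ) REFERENCES customer ( cust_id )
    RELY DISABLE NOVALIDATE;

Declaring the constraint with RELY lets features such as query rewrite trust the relationship without paying the cost of constraint checking during data loads.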
TABLE: PROD
COLUMN NAME            DATA TYPE      REFERENCES
Prod_Id                Number(8)
Prod_Name              Varchar2(10)
Prod_desc              Varchar2(100)
Prod_1_Id              Number(8)      Prod_1(Prod_1_id) disable novalidate rely
Prod_2_Id              Number(8)      Prod_2(Prod_2_id) disable novalidate rely
Prod_level             Number
Brand_desc             Varchar2(100)
Prod_size_desc         Varchar2(100)
Last_updt_date         Date

TABLE: PROD_1
COLUMN NAME            DATA TYPE
Prod_1_Id              Number(8)
Prod_1_Name            Varchar2(10)
Prod_1_desc            Varchar2(100)

TABLE: PROD_2
COLUMN NAME            DATA TYPE      REFERENCES
Prod_2_Id              Number(8)
Prod_2_Name            Varchar2(10)
Prod_2_desc            Varchar2(100)
Prod_3_Id              Number(8)      Prod_3(Prod_3_id) disable novalidate rely

TABLE: PROD_3
COLUMN NAME            DATA TYPE      REFERENCES
Prod_3_Id              Number(8)
Prod_3_Name            Varchar2(10)
Prod_3_desc            Varchar2(100)
Prod_4_Id              Number(8)      Prod_4(Prod_4_id) disable novalidate rely

TABLE: PROD_4
COLUMN NAME            DATA TYPE
Prod_4_Id              Number(8)
Prod_4_Name            Varchar2(10)
Prod_4_desc            Varchar2(100)

TABLE: CUST
COLUMN NAME            DATA TYPE      REFERENCES
Cust_Id                Number(8)
Cust_Name              Varchar2(10)
Cust_desc              Varchar2(100)
Cust_1_Id              Number(8)      Cust_1(cust_1_id) disable novalidate rely
Cust_2_Id              Number(8)      Cust_2(cust_2_id) disable novalidate rely
Cust_level             Number
Cust_grade             Varchar2(100)
Cust_peer_rank         Varchar2(100)
trade_chanel_curr_id   Number
trade_chanel_hist_id   Number
Last_updt_date         Date

TABLE: CUST_1
COLUMN NAME            DATA TYPE
Cust_1_Id              Number(8)
Cust_1_Name            Varchar2(10)
Cust_1_desc            Varchar2(100)

TABLE: CUST_2
COLUMN NAME            DATA TYPE      REFERENCES
Cust_2_Id              Number(8)
Cust_2_Name            Varchar2(10)
Cust_2_desc            Varchar2(100)
Cust_3_Id              Number(8)      Cust_3(cust_3_id) disable novalidate rely

TABLE: CUST_3
COLUMN NAME            DATA TYPE      REFERENCES
Cust_3_Id              Number(8)
Cust_3_Name            Varchar2(10)
Cust_3_desc            Varchar2(100)
Cust_4_Id              Number(8)      Cust_4(cust_4_id) disable novalidate rely

TABLE: CUST_4
COLUMN NAME            DATA TYPE
Cust_4_Id              Number(8)
Cust_4_Name            Varchar2(10)
Cust_4_desc            Varchar2(100)

TABLE: GEO
COLUMN NAME            DATA TYPE      REFERENCES
Geo_Id                 Number(8)
Geo_Name               Varchar2(10)
Geo_desc               Varchar2(100)
Geo_1_Id               Number(8)      geo_1(geo_1_id) disable novalidate rely
Geo_level              Number
Last_updt_date         Date

TABLE: GEO_1
COLUMN NAME            DATA TYPE      REFERENCES
Geo_1_Id               Number(8)
Geo_1_Name             Varchar2(10)
Geo_1_desc             Varchar2(100)
Geo_2_Id               Number(8)      geo_2(geo_2_id) disable novalidate rely

TABLE: GEO_2
COLUMN NAME            DATA TYPE      REFERENCES
Geo_2_Id               Number(8)
Geo_2_Name             Varchar2(10)
Geo_2_desc             Varchar2(100)
Geo_3_Id               Number(8)      geo_3(geo_3_id) disable novalidate rely

TABLE: GEO_3
COLUMN NAME            DATA TYPE
Geo_3_Id               Number(8)
Geo_3_Name             Varchar2(10)
Geo_3_desc             Varchar2(100)
loaded into the tables. These screenshots provide an overview of the data. To view the full data to be loaded into the tables, please see the attached Data For Tables.zip.
4.8 Populating the data into warehouse tables using SQL*Loader
To load the data created for the tables as given in 4.7, we will use the SQL*Loader utility. SQL*Loader is an Oracle utility used to load data into a table, given a datafile which has the records that need to be loaded. SQL*Loader takes a data file, as well as a control file, to insert data into the table. The SQL*Loader control file contains information that describes how the data will be loaded: the table name, column datatypes, field delimiters, etc.
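As an illustration, a control file for the STORE dimension might look like the following (file names and delimiter here are hypothetical; the actual control files used are in appendix 4):

LOAD DATA
INFILE 'store.dat'
APPEND
INTO TABLE store
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
( store_id, store_name, store_desc, store_location,
  store_manager_id, store_manager_name )

It would then be run with a command such as:
sqlldr userid=<user>/<password> control=store.ctl log=store.log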
To avoid errors, it is better to generate the control file dynamically. The SQL code for generating the control file dynamically is given in appendix 3: GENERATE CONTROL FILES DYNAMICALLY FOR SQL*LOADER.
Also given in appendix 4: LOADING DATA USING SQL*LOADER are the sample control files, the commands to run SQL*Loader and the data files that need to be loaded into the sales data mart. To view all the control files and data files, please see the attached Data loading using SQL Loader.zip.
3. To select names of all the shampoos that were sold till date.
4. To count the products in each region that were bought by the customers belonging to the same region.
STAR:
select sum(base_unit_qty), store_location
from sales s, geography g, customer c, store st
where s.cust_id = c.cust_id
and s.geo_id = g.geo_id
and s.store_id = st.store_id
and upper(st.store_location) = upper(c.cust_3_desc)
and upper(st.store_location) = upper(g.geo_2_name)
group by store_location;
5. To select toothpaste products that were regular shipments and were sold by
manual billing in year 2012.
The whole experiment was done with different numbers of tuples and different types of queries, and we show the query results to support our conclusions. We start with approximately 2 lakh (200,000) records and keep doubling the number of records used for analysis till about 6 million records.
Given below are the screenshots of the query results for 2 lakh records. For the other cases the logs are given in appendix 5: LOGS FOR SCHEMA PERFORMANCE EVALUATION.
The results of the analysis are shown in Table 22.
Query4: Products in each region that were bought by the customers belonging to the same region
Query5: To select toothpaste products that were regular shipments and were
sold by manual billing in year 2012.
For the other cases the logs are given in appendix 5: LOGS FOR SCHEMA PERFORMANCE EVALUATION.
The timings for the queries are given in Table 22. A graph representing the performance comparison is given in Figure 11.
Table 22: Query timings (in milliseconds) for the star and snowflake schemas

Query #   No. of records   STAR (ms)   SNOWFLAKE (ms)
Q1        2,14,248         39          98
Q2        2,14,248         32          66
Q3        2,14,248         61          132
Q4        2,14,248         71          131
Q5        2,14,248         12          25
Q1        4,28,496         10          12
Q2        4,28,496         10
Q3        4,28,496         12          15
Q4        4,28,496         10          32
Q5        4,28,496         1           2
Q1        8,56,992         20          64
Q2        8,56,992         20          28
Q3        8,56,992         15          17
Q4        8,56,992         21          59
Q5        8,56,992         1           1
Q1        34,27,968        932         628
Q2        34,27,968        321         631
Q3        34,27,968        168         240
Q4        34,27,968        246         269
Q5        34,27,968        3           3
Q1        68,55,936        1208        1308
Q2        68,55,936        334         482
Q3        68,55,936        331         300
Q4        68,55,936        306         494
Q5        68,55,936        3           4
Figure 11: Schema Performance Comparison. (Bar chart, not reproduced: query timings in milliseconds for Q1-Q5 on the STAR and SNOWFLAKE schemas, for 2 lakh, 4 lakh, 8 lakh, 3 million and 6 million records.)
As we have seen, the star schema performs better than the snowflake schema in most of the cases. In general, the snowflake schema takes more time than the star schema because it requires more joins of the dimension tables to get the same information.
Figure 12: Representation of typical non-repeating data in star and snowflake schema.
Figure 13 shows the star and snowflake schema for another set of data. This set of data has repetitions, therefore the snowflake representation is more compact, as it avoids duplicating the long character attributes in all the records. In this case the snowflake schema will perform better than the star schema.
Figure 13: Representation of typical repeating data in star and snowflake schema.
Chapter 5
Estimator: The main aim of the estimator is to determine the overall cost of the plan, using statistics if available to improve the degree of accuracy.
Plan Generator: The main function of the plan generator is to try out different possible plans for a given query and pick the one that has the lowest cost.
However, this is not always an accurate estimate: we may get incorrect cardinality estimates even when the basic table and column statistics are up to date. This may be because of data skew, i.e. the data values in a column may not be evenly distributed.
TECHNIQUE #1: Create a histogram manually for the best cardinality estimates.
SQL> exec DBMS_STATS.GATHER_TABLE_STATS('SCHEMA', 'TABLE_NAME', method_opt => 'FOR COLUMNS SIZE 254 COLUMN_NAME');
The presence of a histogram changes the formula used by the optimizer to determine the cardinality estimate.
Oracle does I/O by blocks. Therefore, the optimizer's decision to use full table scans is
influenced by the percentage of blocks accessed, not rows. This is called the index
clustering factor. If blocks contain single rows, then rows accessed and blocks accessed
are the same.
Access by ROWID:
The rowid of a row specifies the datafile and data block containing the row and the location of the row in that block. Locating a row by specifying its rowid is the fastest way to retrieve a single row, because the exact location of the row in the database is specified.
The optimizer uses the rowid after retrieving it from an index. Access by rowid does
not need to follow every index scan. If the index contains all the columns needed for the
statement, then table access by rowid might not occur.
Index Unique scan:
This scan returns, at most, a single rowid. Oracle performs a unique scan if a statement
contains a UNIQUE or a PRIMARY KEY constraint that guarantees that only a single row is
accessed.
The join method describes how data from two data-producing operators will be joined together. For example, the PRODUCT and SALES tables are joined using a hash join in the example given below.
Hash Joins
Build a hash table (buckets) based on the join keys of the outer table.
Join the rows of the two tables through the values of the hash buckets.
When to use: a full table scan is preferred on both the inner table and the outer table for fewer logical reads, and the number of rows from the outer table is not huge.
TECHNIQUE #5: Use a hash join when one of the sources to be joined is small and the other is large, and there is a lot of duplicate data in the join key column.
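A small illustration on the sales data mart (the USE_HASH hint requests a hash join between the two aliases; the query itself is chosen for the example):

select /*+ use_hash(p s) */ p.prod_name, sum(s.net_invoice_value)
from product p, sales s
where s.prod_id = p.prod_id
group by p.prod_name;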
Nested Loops joins
When to use: there is a good driving-driven relationship between the outer and inner table and there is an efficient way of accessing the second table (for example, an index lookup).
Sort Merge joins
It needs to sort both tables, so a hash join is usually preferable. This method is more often used when at least one of the joined parties is an inline view with a sort operation.
The join order is determined based on cost, which is strongly influenced by the
cardinality estimates and the access paths available. The Optimizer will also always
adhere to some basic rules:
The Optimizer can determine the joins that result in at most one row, based on UNIQUE and PRIMARY KEY constraints on the tables, and evaluates these first.
When outer joins are used, the table without the outer join operator must come after the table with the outer join operator, to ensure all of the additional rows that don't satisfy the join condition can be added to the result set correctly.
When a subquery has been converted into an antijoin or semijoin, the tables from the subquery must come after those tables in the outer query block to which they were connected or correlated.
If view merging is not possible, all tables in the view will be joined before joining to the tables outside the view.
TECHNIQUE #8: If using a nested-loop join, always start from the larger table with a full table scan and join to the smaller table with an index, unless the smaller table is very small, or very few rows of the larger table will be needed.
TECHNIQUE #9: When using a hash join in parallel, use the pq_distribute hint to make sure the row distribution among the parallel query servers is right.
5.1.6 Partition Pruning
Oracle partitioning is a method of breaking up a very large table and its associated indexes into smaller pieces. The primary purpose of partitioning is faster query access and improved performance. This is accomplished via partition pruning (elimination). Partition pruning is visible in an execution plan in the PSTART and PSTOP columns. The PSTART column contains the number of the first partition that will be accessed and the PSTOP column contains the number of the last partition that will be accessed.
Other benefits of partitioning would be :
The ability of Oracle to use parallelism to resolve queries since multiple partitions
can be queried simultaneously.
Partitioning gives the ability to quickly remove large amounts of data without
fragmenting a table
For example:
Consider the SALES table that contains records of all orders for the last 3 years, where the table has been partitioned on a monthly basis, as sketched below.
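A minimal sketch of such a table, assuming monthly range partitioning on the transaction date (the table name and the shortened column list are illustrative):

CREATE TABLE sales_part (
    transaction_date   DATE,
    net_invoice_value  NUMBER
)
PARTITION BY RANGE ( transaction_date ) (
    PARTITION p_2012_01 VALUES LESS THAN ( DATE '2012-02-01' ),
    PARTITION p_2012_02 VALUES LESS THAN ( DATE '2012-03-01' ),
    PARTITION p_max     VALUES LESS THAN ( MAXVALUE )
);

A query restricted to January 2012 would then show PSTART = 1 and PSTOP = 1 in its execution plan, touching only the first partition.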
There are cases when a word or letters appear in the PSTART and PSTOP columns instead of a number. For example, you may see the word KEY appear in these columns. This indicates that it was not possible at parse time to identify which partitions would be accessed by the query, but the Optimizer believes that partition pruning will occur at execution time (dynamic pruning). This happens, for example, when there is an equality predicate on the partitioning key column that contains a function, such as DAY_ID = SYSDATE.
Choose as the partitioning key a column that is used the maximum number of times in the WHERE clauses of the queries. This helps partition pruning and thus enhances the speed of the queries.
The first phase retrieves exactly the necessary rows from the fact table (the result
set). Because this retrieval utilizes bitmap indexes, it is very efficient.
The second phase joins this result set to the dimension tables.
Oracle processes this query in two phases. In the first phase, Oracle uses the bitmap
indexes on the foreign key columns of the fact table to identify and retrieve only the
necessary rows from the fact table. That is, Oracle will retrieve the result set from the
fact table using essentially the following query:
SELECT ... FROM sales
WHERE day_id IN
(SELECT day_id FROM cal_master
WHERE calendar_quarter_desc IN ('1st quarter 12','2nd quarter 12'))
AND cust_id IN
(SELECT cust_id FROM customers WHERE cust_4_name = 'APAC')
AND store_id IN
(SELECT store_id FROM store WHERE store_desc in ('Small Sized Store'));
As given above, the original star query has been transformed into this subquery
representation. This method of accessing the fact table leverages the strengths of
Oracle's bitmap indexes.
In this star query, a bitmap index on day_id is used to identify the set of all rows in the fact table corresponding to sales in '1st quarter 12'. This set is represented as a bitmap. A similar bitmap is retrieved for the fact table rows corresponding to sales from '2nd quarter 12'. The bitmap OR operation is used to combine the set of 1st quarter 12 sales with the set of 2nd quarter 12 sales.
Similar set operations are done for the customer dimension and the store dimension. At this point in the star query processing, there are three bitmaps. Each bitmap corresponds to a separate dimension table, and each bitmap represents the set of rows of the fact table that satisfy that individual dimension's constraints.
These three bitmaps are combined into a single bitmap using the bitmap AND operation. Once the result set is identified, the bitmap is used to access the actual data from the
sales table. Only those rows that are required for the end user's query are retrieved from
the fact table. At this point, Oracle has effectively joined all of the dimension tables to
the fact table using bitmap indexes. This technique provides excellent performance
because Oracle is joining all of the dimension tables to the fact table with one logical join
operation, rather than joining each dimension table to the fact table independently.
The second phase of this query is to join these rows from the fact table (the result set) to
the dimension tables. Oracle will use the most efficient method for accessing and joining
the dimension tables.
A hash join is often the most efficient algorithm for joining the dimension tables. The final
answer is returned to the user once all of the dimension tables have been joined.
Execution Plan for a Star Transformation with a Bitmap Index
The following typical execution plan might result from "Star Transformation with a Bitmap Index":

SELECT STATEMENT
  SORT GROUP BY
    HASH JOIN
      TABLE ACCESS FULL                 STORE
      HASH JOIN
        TABLE ACCESS FULL               CUSTOMER
        HASH JOIN
          TABLE ACCESS FULL             CAL_MASTER
          TABLE ACCESS BY INDEX ROWID   SALES
            BITMAP CONVERSION TO ROWIDS
              BITMAP AND
                BITMAP MERGE
                  BITMAP KEY ITERATION
                    BUFFER SORT
                      TABLE ACCESS FULL         CUSTOMER
                    BITMAP INDEX RANGE SCAN     SALES_CUST_BIX
                BITMAP MERGE
                  BITMAP KEY ITERATION
                    BUFFER SORT
                      TABLE ACCESS FULL         STORE
                    BITMAP INDEX RANGE SCAN     SALES_STORE_BIX
                BITMAP MERGE
                  BITMAP KEY ITERATION
                    BUFFER SORT
                      TABLE ACCESS FULL         CAL_MASTER
                    BITMAP INDEX RANGE SCAN     SALES_CAL_BIX
In this plan, the fact table is accessed through a bitmap access path based on a bitmap AND of three merged bitmaps. The three bitmaps are generated by the BITMAP MERGE
row source being fed bitmaps from row source trees underneath it. Each such row source
tree consists of a BITMAP KEY ITERATION row source which fetches values from the
subquery row source tree, which in this example is a full table access. For each such
value, the BITMAP KEY ITERATION row source retrieves the bitmap from the bitmap
index. After the relevant fact table rows have been retrieved using this access path, they
are joined with the dimension tables and temporary tables to produce the answer to the
query.
When to use:
Only with a star schema model, with bitmap indexes built on the fact table for the foreign key columns.
The dimension tables for the predicated column search should not be too big.
There are not too many rows (less than 500K) which will be pulled from the fact table.
There are few tables which need to be joined back by the selected rows for other attributes.
Star transformation is not supported for tables with any of the following characteristics:
Queries with a table hint that is incompatible with a bitmap access path
Queries that contain bind variables
Tables with too few bitmap indexes. There must be a bitmap index on a fact table
column for the optimizer to generate a subquery for it.
Remote fact tables. However, remote dimension tables are allowed in the
subqueries that are generated.
Anti-joined tables
Tables that are already used as a dimension table in a subquery
Tables that are really unmerged views, which are not view partitions
The star transformation may not be chosen by the optimizer for the following cases:
Tables that have a good single-table access path
Tables that are too small for the transformation to be worthwhile
In addition, temporary tables will not be used by star transformation under the following
conditions:
The database is in read-only mode
The star query is part of a transaction that is in serializable mode
As the application designer, you might know that a certain index is more selective for certain queries. Based on this information, you might be able to choose a more efficient execution plan than the optimizer. In such a case, use hints to force the optimizer to use the optimal execution plan.
You can use hints to specify the following:
The optimization approach for a SQL statement
The goal of the cost-based optimizer for a SQL statement
The access path for a table accessed by the statement
The join order for a join statement
A join operation in a join statement
The following syntax shows hints contained in both styles of comments that Oracle
supports within a statement block.
{DELETE|INSERT|SELECT|UPDATE} /*+ hint [text] [hint[text]]... */
no_merge(vw): do not merge the in-line view vw into the main query.
cardinality(a,1000): tell the optimizer that 1000 rows of table a will be selected.
index_ffs(a,indx): instead of accessing table a, fast-full scan its index, indx.
append: used after INSERT; forces the data to be loaded above the HWM (direct load).
no_append: used after INSERT; forces the data to be loaded to the empty space below the HWM first.
rewrite: the rewrite hint forces the cost-based optimizer to rewrite a query in terms of materialized views, when possible, without cost consideration.
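As a small illustration of one of these hints on the sales schema (the query and view alias are chosen for the example):

SELECT /*+ no_merge(vw) */ g.geo_name, vw.total_sales
FROM geography g,
     (SELECT geo_id, SUM(net_invoice_value) AS total_sales
        FROM sales
       GROUP BY geo_id) vw
WHERE g.geo_id = vw.geo_id;

Here the no_merge hint keeps the aggregation inside the in-line view vw instead of letting the optimizer merge it into the outer query.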
Example: As given in the figure (not reproduced here), after specifying the hint /*+ ...
The OVER analytic_clause is used to indicate that the function operates on a query result set; that is, it is computed after the FROM, WHERE, GROUP BY and HAVING clauses.
The PARTITION BY clause is used to partition the query result set into groups based on one or more value_expr.
The order_by_clause is used to specify how data is ordered within a partition.
Analytical functions enable:
Ranking and percentile analysis: calculating ranks, percentiles, and n-tiles of the values in a result set.
Lag/Lead analysis: finding a value in a row a specified number of rows away from the current row.
Given below are some sample queries that can be rewritten using Analytical functions
rather than normal SQL to improve performance.
Analytical Functions:
SELECT * FROM
(SELECT geo_id, cust_id,
SUM(sales) sum_sales,
RANK() OVER
(PARTITION BY geo_id
ORDER BY sum(sales) DESC) rank
FROM sales
GROUP BY geo_id, cust_id )
WHERE rank <= 3;
Analytical Functions:
SELECT prod_id, transaction_date,
       SUM(sales) sum_sales,
       SUM(SUM(sales)) OVER
         (PARTITION BY prod_id ORDER BY transaction_date
          RANGE INTERVAL '1' YEAR PRECEDING) m_avg
FROM sales
GROUP BY prod_id, transaction_date;
Analytical Functions:
SELECT *
Example 4: Give the Top-10 products for states which contribute more than 25% of regional sales.
Basic SQL or PL/SQL:
CREATE VIEW V1
AS SELECT g.geo_3_name, g.geo_1_name, SUM(sales) sum_sales
FROM sales s, geography g WHERE s.geo_id = g.geo_id
GROUP BY g.geo_3_name, g.geo_1_name;
CREATE VIEW V2
AS SELECT g.geo_3_name, SUM(sales) sum_sales
FROM sales s, geography g WHERE s.geo_id = g.geo_id
GROUP BY g.geo_3_name;
SELECT g.geo_3_name, g.geo_1_name, prod_id,
RANK(g.geo_3_name, g.geo_1_name, X.s_sales) rank
FROM (
SELECT g.geo_3_name, g.geo_1_name, prod_id,
SUM(sales) sum_sales
FROM sales s, geography g, V1, V2
WHERE s.geo_id = g.geo_id
AND g.geo_3_name = V1.geo_3_name
Analytical Functions:
SELECT *
FROM (SELECT geo_3_name, geo_1_name, prod_id,
SUM(sales) sum_sales,
SUM(SUM(sales)) over
(PARTITION BY geo_3_name) a,
SUM(SUM(sales)) OVER
(PARTITION BY geo_3_name, geo_1_name) b,
RANK() OVER
(PARTITION BY geo_3_name, geo_1_name
ORDER BY SUM(sales) DESC) rank
FROM sales
GROUP BY geo_3_name, geo_1_name, prod_id)
WHERE b >= 0.25 * a AND rank <= 10;
Example 5: For each product, compare the sales of a year to those of its previous year.
Basic SQL or PL/SQL:
CREATE VIEW V1
AS SELECT prod_id, EXTRACT(YEAR FROM transaction_date) yr,
SUM(sales) sum_sales
FROM sales
GROUP BY prod_id, EXTRACT(YEAR FROM transaction_date);
Analytical Functions:
SELECT prod_id, EXTRACT(YEAR FROM transaction_date) yr,
SUM(sales) sum_sales,
SUM(sales) - LAG(SUM(sales), 1) OVER
(PARTITION BY prod_id
ORDER BY EXTRACT(YEAR FROM transaction_date)) dif
FROM sales
GROUP BY prod_id, EXTRACT(YEAR FROM transaction_date);
Once suitable aggregates are precomputed, the database will automatically use them to improve query performance. For example, we may create a materialized view that holds aggregated sales data for every month:

create materialized view sales_mv
build immediate
refresh fast on demand
enable query rewrite
as
select cal_master.mth_id, sum(sales.net_invoice_value)
from sales, cal_master
where sales.transaction_date = cal_master.date
group by cal_master.mth_id;
A materialized view definition can include any number of aggregates, as well as any number of joins. The existence of a materialized view is transparent to SQL applications, so we can create or drop materialized views at any time without affecting the validity of SQL applications. A materialized view consumes storage space, and its contents must be maintained when the underlying detail tables are modified.
The types of materialized views are:
Materialized Views with Aggregates: materialized views that preaggregate the data present in the fact tables.
Materialized Views Containing Only Joins: some materialized views contain only joins and no aggregates. The advantage of creating this type of materialized view is that expensive joins will be precalculated. Queries can reference other tables in the database in addition to referencing materialized views.
Build Methods:
BUILD IMMEDIATE: create the materialized view and then populate it with data.
BUILD DEFERRED: create the materialized view definition but do not populate it with data.
ENABLE QUERY REWRITE: you must also specify the ENABLE QUERY REWRITE clause if the materialized view is to be considered available for rewriting queries.
Refresh Mode:
ON DEMAND: refresh occurs when a user manually executes one of the available refresh procedures contained in the DBMS_MVIEW package (REFRESH, REFRESH_ALL_MVIEWS, REFRESH_DEPENDENT).
Refresh Options:
FAST: applies only the incremental changes, using the information logged in the materialized view logs. This requires materialized view logs to be created as given below.
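A minimal sketch of such a log, assuming it is created on the SALES fact table for the columns aggregated in sales_mv:

create materialized view log on sales
with rowid, sequence (transaction_date, net_invoice_value)
including new values;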
NEVER: indicates that the materialized view will not be refreshed with the Oracle refresh mechanisms.
Query Rewrite
When base fact tables like SALES contain large amounts of data, it is an expensive and time-consuming process to compute the joins and calculate the aggregates required for analysis. In such cases, queries can take minutes or even hours to return results. For query rewrite to take place:
Either all or part of the results requested by the query must be obtainable from the precomputed results stored in the materialized view or views.
To determine this, the optimizer may depend on some of the data relationships declared
by the user using constraints and dimensions. Such data relationships include
hierarchies, referential integrity, and uniqueness of key data, and so on.
To enable query rewrite for a session, set:
QUERY_REWRITE_ENABLED = TRUE: turns on query rewrite.
QUERY_REWRITE_INTEGRITY = enforced (the default mode), trusted, or stale_tolerated.
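For example, to enable rewrite in trusted mode for the current session:

alter session set query_rewrite_enabled = true;
alter session set query_rewrite_integrity = trusted;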
Dimension tables and fact tables must have primary and foreign key relationships, with the foreign keys created in NOVALIDATE RELY mode.
Database dimensions must be created for all the dimensions on which the aggregation or rollup needs to take place.
Fact tables and dimension tables should similarly guarantee that each fact table row joins with one and only one dimension table row.
For each table, create a bitmap index for each key column, and create one local index that includes all the key columns.
Partition and index the materialized view like the fact tables.
Dimensions
Dimensions are another method by which we can give even more information to Oracle. Consider the example of the SALES table. In SALES there is a load date and a customer id. The load date points to another table, CAL_MASTER, that gives full details of what month the load date was in, and what quarter and what fiscal year the load date is in.
Now, suppose we create a materialized view that stores aggregated sales information at the quarterly level. We know that load date implies month, month implies quarter and quarter implies year, but Oracle doesn't know this. Using a database object called a dimension, we can alert Oracle to these facts and it will use them to rewrite queries.
A dimension declares a parent/child relationship between columns of a table. We can use it to inform Oracle that, within a row of a table, the mth_id column implies the value you'll find in the qtr_id column, the qtr_id column implies the value you'll find in the year_id column, and so on.
Given below is the database dimension for the Time dimension:
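The original DDL falls on a page not included here; based on the CAL_DB_DIM dimension referenced below and the CAL_MASTER columns, it plausibly looks like:

CREATE DIMENSION CAL_DB_DIM
LEVEL Day_Id is (cal_master.Day_id)
LEVEL Mth_Id is (cal_master.Mth_id)
LEVEL Qtr_Id is (cal_master.Qtr_id)
LEVEL Year_Id is (cal_master.Year_id)
HIERARCHY cal_rollup (
Day_Id child of
Mth_Id child of
Qtr_Id child of
Year_Id
);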
Example 1
We have our base fact table, the SALES table, which contains data based on the transaction date. First gather statistics for the tables SALES and CAL_MASTER so that Oracle has an estimate of the data present in the tables.
exec dbms_stats.gather_table_stats( user, 'SALES', cascade=>true );
exec dbms_stats.gather_table_stats( user, 'CAL_MASTER', cascade=>true );
Consider a query that needs sales data aggregated on a quarterly basis. We have created a materialized view sales_c_mth_mv as given below, i.e. we already have data aggregated at the monthly level, so Oracle should use this materialized view to group the sales data on a quarterly basis for better performance.
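The definition of sales_c_mth_mv is on a page not reproduced here; a plausible sketch, assuming (as the name suggests) aggregation by customer and month:

create materialized view sales_c_mth_mv
build immediate
enable query rewrite
as
select sales.cust_id, cal_master.mth_id,
       sum(sales.net_invoice_value) as net_invoice_value
from sales, cal_master
where sales.transaction_date = cal_master.date
group by sales.cust_id, cal_master.mth_id;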
As seen from the explain plan, the materialized view containing the aggregated data is accessed rather than the base fact table SALES.
As seen above, rather than the materialized view SALES_C_MTH_MV, the base fact table SALES is being used. This happened because Oracle does not know that months can be grouped into quarters. To provide Oracle with this information, we create database dimensions.
Now we create a time hierarchy, i.e. the CAL_DB_DIM database dimension on the cal_master table, as shown below:
We again rerun the same query to get data grouped on a quarterly basis. The explain plan shows that after creation of the dimension, SALES_C_MTH_MV is used in place of the SALES base fact table.
select cal_master.qtr_id, sum(sales.net_invoice_value)
from sales, cal_master
where sales.transaction_date = cal_master.date
group by cal_master.qtr_id;
We also noticed that logical reads reduced from 650,250 to 12. This is a very large performance gain.
Example 2
Suppose we want to get aggregated sales by year, customer location and product sub-category. This type of aggregation is very commonly used in data warehouses.
When we aggregate by month, customer location and product brand, query rewrite will work and the materialized view SALES_C2_P2_MTH_MV will be used; but, as given in the example above, if we try to aggregate by year, customer location and product sub-category, query rewrite will not work until dimensions have been created.
This is illustrated below:
PROD_DB_DIM
CREATE DIMENSION PROD_DB_DIM
LEVEL Prod_Id is (Product.Prod_id)
LEVEL Prod_2_Id is (Product.Prod_2_id)
LEVEL Prod_3_Id is (Product.Prod_3_id)
LEVEL Prod_4_Id is (Product.Prod_4_id)
HIERARCHY prod_rollup (
Prod_id child of
Prod_2_id child of
Prod_3_id child of
Prod_4_id
);
As shown above, the performance using materialized views and dimensions has increased many fold: we have reduced logical reads from 854,215 to 14.
Scenario for Sales Data Mart
Consider a scenario in the sales data mart where most of the queries are fired to select aggregated sales by customer location, product brand, or customer country and product category, etc. In this kind of scenario we can pre-create some materialized views containing the aggregated data that select the data from the other materialized views for our sales data mart. For example, in Figure 14 the materialized view SALES_C2_G2_MV contains data aggregated on cust_2_id and geo_2_id. The materialized view SALES_C3_G3_MV selects data directly from
PROD_2_MDIM
insert into PROD_2_DIM select Prod_Id,Prod_name , prod_desc , Prod_1_id,
Prod_1_name, prod_1_desc, Prod_2_id, Prod_2_name , prod_2_desc
from
PROD_3_MDIM;
CUST_3_MDIM
insert into CUST_3_MDIM
select cust_Id, cust_name, cust_desc,
       cust_1_id, cust_1_name, cust_1_desc,
       cust_2_id, cust_2_name, cust_2_desc,
       cust_3_id, cust_3_name, cust_3_desc
from customer;
CUST_2_MDIM
insert into CUST_2_MDIM
select cust_Id, cust_name, cust_desc,
       cust_1_id, cust_1_name, cust_1_desc,
       cust_2_id, cust_2_name, cust_2_desc
from CUST_3_MDIM;
GEO_2_MDIM
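Following the pattern of the other mini dimension loads, the GEO_2_MDIM load would be (a sketch):

insert into GEO_2_MDIM
select geo_Id, geo_name, geo_desc,
       geo_1_id, geo_1_name, geo_1_desc,
       geo_2_id, geo_2_name, geo_2_desc
from geography;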
SALES_P3_MV
create or replace materialized view SALES_P3_MV
on prebuilt table
enable query rewrite
as select
Fact_type, Transaction_date, Day_Id, Load_date, Billing_Type,
Cust_Id, Prod_3_Id, Geo_Id, Trade_channel_id,
sum(Base_Unit_Qty) as Base_Unit_Qty,
sum(Net_Invoice_value) as Net_Invoice_value,
sum(Gross_Invoice_Value) as Gross_Invoice_Value,
sum(Net_Invoice_USD) as Net_Invoice_USD,
sum(Gross_Invoice_USD) as Gross_Invoice_USD
from sales_agg, PROD_3_MDIM
where sales_agg.prod_id = PROD_3_MDIM.prod_id
group by
Fact_type, Transaction_date, Day_Id, Load_date,
Billing_Type, Cust_Id, Prod_3_Id, Geo_Id, Trade_channel_id;
SALES_P2_C2_MV
create or replace materialized view SALES_P2_C2_MV
on prebuilt table
enable query rewrite
as select
Fact_type, Transaction_date, Day_Id, Load_date, Billing_Type,
Cust_2_Id, Prod_2_Id, Geo_Id, Trade_channel_id,
sum(Base_Unit_Qty) as Base_Unit_Qty,
sum(Net_Invoice_value) as Net_Invoice_value,
sum(Gross_Invoice_Value) as Gross_Invoice_Value,
sum(Net_Invoice_USD) as Net_Invoice_USD,
sum(Gross_Invoice_USD) as Gross_Invoice_USD
from SALES_P3_MV, CUST_2_MDIM, PROD_3_MDIM
where SALES_P3_MV.prod_3_id = PROD_3_MDIM.prod_3_id
and SALES_P3_MV.cust_id = CUST_2_MDIM.cust_id
group by
Fact_type, Transaction_date, Day_Id, Load_date, Billing_Type,
Cust_2_Id, Prod_2_Id, Geo_Id, Trade_channel_id;
SALES_C3_G2_MV
create or replace materialized view SALES_C3_G2_MV
on prebuilt table
enable query rewrite
as select
Fact_type, Transaction_date, Day_Id, Load_date, Billing_Type,
Cust_3_Id, Prod_Id, Geo_2_ID, Trade_channel_id,
sum(Base_Unit_Qty) as Base_Unit_Qty,
sum(Net_Invoice_value) as Net_Invoice_value,
sum(Gross_Invoice_Value) as Gross_Invoice_Value,
sum(Net_Invoice_USD) as Net_Invoice_USD,
sum(Gross_Invoice_USD) as Gross_Invoice_USD
from sales_agg, CUST_3_MDIM, GEO_2_MDIM
where sales_agg.cust_id = CUST_3_MDIM.cust_id
and sales_agg.geo_id = GEO_2_MDIM.geo_id
group by
Fact_type, Transaction_date, Day_Id, Load_date, Billing_Type,
Cust_3_Id, Prod_Id, Geo_2_ID, Trade_channel_id;
SALES_C2_G2_MV
create or replace materialized view SALES_C2_G2_MV
on prebuilt table
enable query rewrite
as select
Fact_type, Transaction_date, Day_Id, Load_date, Billing_Type,
Cust_2_Id, Prod_Id, Geo_2_ID, Trade_channel_id,
sum(Base_Unit_Qty) as Base_Unit_Qty,
sum(Net_Invoice_value) as Net_Invoice_value,
sum(Gross_Invoice_Value) as Gross_Invoice_Value,
sum(Net_Invoice_USD) as Net_Invoice_USD,
sum(Gross_Invoice_USD) as Gross_Invoice_USD
from SALES_C3_G2_MV, CUST_3_MDIM
where SALES_C3_G2_MV.cust_3_id = CUST_3_MDIM.cust_3_id
group by
Fact_type, Transaction_date, Day_Id, Load_date, Billing_Type,
Cust_2_Id, Prod_Id, Geo_2_ID, Trade_channel_id;
GEO_DB_DIM
CREATE DIMENSION GEO_DB_DIM
LEVEL Geo_Id   is (Geography.Geo_id)
LEVEL Geo_1_Id is (Geography.Geo_1_id)
LEVEL Geo_2_Id is (Geography.Geo_2_id)
LEVEL Geo_3_Id is (Geography.Geo_3_id)
HIERARCHY geo_rollup (
  Geo_id child of
  Geo_1_id child of
  Geo_2_id child of
  Geo_3_id);
CUST_DB_DIM
CREATE DIMENSION CUST_DB_DIM
LEVEL Cust_Id   is (customer.cust_id)
LEVEL Cust_2_Id is (customer.cust_2_id)
LEVEL Cust_3_Id is (customer.cust_3_id)
LEVEL Cust_4_Id is (customer.cust_4_id)
HIERARCHY cust_rollup (
  cust_id child of
  cust_2_id child of
  cust_3_id child of
  cust_4_id);
PROD_DB_DIM
CREATE DIMENSION PROD_DB_DIM
LEVEL Prod_Id   is (Product.Prod_id)
LEVEL Prod_2_Id is (Product.Prod_2_id)
LEVEL Prod_3_Id is (Product.Prod_3_id)
LEVEL Prod_4_Id is (Product.Prod_4_id)
HIERARCHY prod_rollup (
  Prod_id child of
  Prod_2_id child of
  Prod_3_id child of
  Prod_4_id);
Thus, using these mini dimensions and materialized views, we can perform the aggregations and, by setting the appropriate optimizer parameters for query rewrite (shown below), improve the performance.
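Query rewrite depends on two optimizer parameters. Rewrite must be enabled, and because dimension relationships are declared rather than enforced, the integrity level must be relaxed from its default. For example:

ALTER SESSION SET query_rewrite_enabled = TRUE;
ALTER SESSION SET query_rewrite_integrity = TRUSTED;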
Chapter 6
Conclusion
Future Work
There are several ways in which the work presented in this dissertation can be extended.
We assumed that the data is already in a clean format, but operational databases often have major inconsistencies in data representation.
Hence additional work is needed on the data extraction component, which extracts the data from the various legacy sources that feed the data warehouse.
The data cleaning component enforces consistency in the format of the data collected from the various sources so that it can be merged under a unified schema.
These components are an integral part of data warehouse systems, hence future work can be focused on them.
APPENDIX
1. SAMPLE DATA FOR TABLES
Given below is a sample of the data in the PRODUCT table.
2. CREATE TABLES
Given below is the DDL for all the tables that have been created for the Star and Snowflake schemas.
------------------------ SALES ------------------------
create table SALES (
Sales_Id             Number(8),       -- surrogate key; column name assumed (not legible in the source)
Fact_type            Varchar2(10)  not null,
Transaction_date     Date          not null,
Day_Id               Number(8)     not null,
Load_date            Date          not null,
Billing_Type         Varchar2(10)  not null,
Cust_Id              Number(8)     not null,
Prod_Id              Number(8)     not null,
Geo_Id               Number(8)     not null,
Store_ID             Number(8)     not null,
Base_Unit_Qty        Number        not null,
Net_Invoice_value    Number,
Gross_Invoice_Value  Number,
Net_Invoice_USD      Number,
Gross_Invoice_USD    Number
)
Partition by Range (day_id)
Subpartition by List (fact_type)
SUBPARTITION TEMPLATE (
  SUBPARTITION BR VALUES ('BR'),   -- list values assumed from the fact_type codes used in the queries
  SUBPARTITION B5 VALUES ('B5'),
  SUBPARTITION BK VALUES ('BK'))
(PARTITION Z_LAST_PARTITION VALUES LESS THAN (MAXVALUE));
------------------------ SALES_SNOW ------------------------
create table SALES_SNOW (            -- table name inferred from the queries against the snowflake schema
Sales_Id             Number(8),       -- surrogate key; column name assumed (not legible in the source)
Fact_type            Varchar2(10)  not null,
Transaction_date     Date          not null,
Day_Id               Number(8)     not null,
Load_date            Date          not null,
Billing_Type         Varchar2(10)  not null,
Cust_Id              Number(8)     not null,
Prod_Id              Number(8)     not null,
Geo_Id               Number(8)     not null,
Store_ID             Number(8)     not null,
Base_Unit_Qty        Number        not null,
Net_Invoice_value    Number,
Gross_Invoice_Value  Number,
Net_Invoice_USD      Number,
Gross_Invoice_USD    Number
)
Partition by Range (day_id)
Subpartition by List (fact_type)
SUBPARTITION TEMPLATE (
  SUBPARTITION BR VALUES ('BR'),
  SUBPARTITION B5 VALUES ('B5'),
  SUBPARTITION BK VALUES ('BK'))
(PARTITION Z_LAST_PARTITION VALUES LESS THAN (MAXVALUE));
---------------------PRODUCT---------------------------
create table PRODUCT (
Prod_Id        Number(8),
Prod_Name      Varchar2(50)  not null,
Prod_desc      Varchar2(100),
Prod_1_Id      Number(8)     not null,
Prod_1_Name    Varchar2(50)  not null,
Prod_1_desc    Varchar2(100),
Prod_2_Id      Number(8)     not null,
Prod_2_Name    Varchar2(50)  not null,
Prod_2_desc    Varchar2(100),
Prod_3_Id      Number(8)     not null,
Prod_3_Name    Varchar2(50)  not null,
Prod_3_desc    Varchar2(100),
Prod_4_Id      Number(8)     not null,
Prod_4_Name    Varchar2(50)  not null,
Prod_4_desc    Varchar2(100),
Prod_level     Number        not null,
Brand_desc     Varchar2(100),
Prod_size_desc Varchar2(100),
Last_updt_date Date          not null
);
---------------------CUSTOMER---------------------------
create table CUSTOMER (
cust_Id              Number(8),
cust_Name            Varchar2(50)  not null,
cust_desc            Varchar2(100),
cust_1_Id            Number(8)     not null,
cust_1_Name          Varchar2(50)  not null,
cust_1_desc          Varchar2(100),
cust_2_Id            Number(8)     not null,
cust_2_Name          Varchar2(50)  not null,
cust_2_desc          Varchar2(100),
cust_3_Id            Number(8)     not null,
cust_3_Name          Varchar2(50)  not null,
cust_3_desc          Varchar2(100),
cust_4_Id            Number(8)     not null,
cust_4_Name          Varchar2(50)  not null,
cust_4_desc          Varchar2(100),
cust_level           Number        not null,
Cust_grade           Varchar2(100) not null,
Cust_peer_rank       Varchar2(100),
trade_chanel_curr_id Number        not null,
trade_chanel_hist_id Number,
Last_updt_date       Date          not null
);
---------------------STORE---------------------------
create table STORE (                 -- table name inferred from the queries (store st)
Store_Id             Number(8)     not null,   -- key column name assumed
Store_Name           Varchar2(50)  not null,
Store_desc           Varchar2(100),
Store_Location       varchar2(10),
Store_Manager_Id     Number(8),                -- type not legible in the source
Store_Manager_Name   Varchar2(50)              -- type not legible in the source
);
---------------------GEOGRAPHY---------------------------
create table GEOGRAPHY (
geo_Id         Number(8),
geo_Name       Varchar2(50)  not null,
geo_desc       Varchar2(100),
geo_1_Id       Number(8)     not null,
geo_1_Name     Varchar2(50)  not null,
geo_1_desc     Varchar2(100),
geo_2_Id       Number(8)     not null,
geo_2_Name     Varchar2(50)  not null,
geo_2_desc     Varchar2(100),
geo_3_Id       Number(8)     not null,
geo_3_Name     Varchar2(50)  not null,
geo_3_desc     Varchar2(100),
geo_level      Number        not null,
Last_updt_date Date          not null
);
---------------------CAL_MASTER---------------------------
create table CAL_MASTER (            -- table name inferred from the examples in chapter 5
"DATE"            Date          not null,   -- DATE is a reserved word, so the column must be quoted
Day_Name          Varchar2(20)  not null,
Week_Id           Number(1)     not null,
Mth_Id            Number(6)     not null,
Mth_name          Varchar2(10)  not null,
Qtr_Id            Varchar2(10)  not null,
Qtr_Name          Varchar2(20)  not null,
Year_Id           Number(4)     not null,
Year_name         Varchar2(10)  not null,
Holiday_Indicator Varchar2(1)   not null
);
-------------------PROD-----------------------------
create table PROD (
Prod_Id        Number(8),
Prod_Name      Varchar2(50)  not null,
Prod_desc      Varchar2(100),
Prod_1_Id      Number(8)     not null,
Prod_2_Id      Number(8)     not null,
Prod_level     Number        not null,
Brand_desc     Varchar2(100),
Prod_size_desc Varchar2(100),
Last_updt_date Date          not null
);
------------------PROD_1------------------------------
create table PROD_1 (
Prod_1_Id   Number(8)     not null,
Prod_1_Name Varchar2(50)  not null,
Prod_1_desc Varchar2(100)
);
------------------PROD_2------------------------------
create table PROD_2 (
Prod_2_Id   Number(8)     not null,
Prod_2_Name Varchar2(50)  not null,
Prod_2_desc Varchar2(100),
Prod_3_Id   Number(8)     not null
);
------------------PROD_3------------------------------
create table PROD_3 (
Prod_3_Id   Number(8)     not null,
Prod_3_Name Varchar2(50)  not null,
Prod_3_desc Varchar2(100),
Prod_4_Id   Number(8)     not null
);
------------------PROD_4------------------------------
create table PROD_4 (
Prod_4_Id   Number(8)     not null,
Prod_4_Name Varchar2(50)  not null,
Prod_4_desc Varchar2(100)
);
--------------------CUST---------------------------
create table CUST (
cust_Id              Number(8),
cust_Name            Varchar2(50)  not null,
cust_desc            Varchar2(100),
cust_1_Id            Number(8)     not null,
cust_2_Id            Number(8)     not null,
cust_level           Number        not null,
Cust_grade           Varchar2(100) not null,
Cust_peer_rank       Varchar2(100),
trade_chanel_curr_id Number        not null,
trade_chanel_hist_id Number,
Last_updt_date       Date          not null
);
---------------------CUST_1---------------------------
create table CUST_1 (
cust_1_Id   Number(8)     not null,
cust_1_Name Varchar2(50)  not null,
cust_1_desc Varchar2(100)
);
---------------------CUST_2---------------------------
create table CUST_2 (
cust_2_Id   Number(8)     not null,
cust_2_Name Varchar2(50)  not null,
cust_2_desc Varchar2(100),
cust_3_Id   Number(8)     not null
);
---------------------CUST_3---------------------------
create table CUST_3 (
cust_3_Id   Number(8)     not null,
cust_3_Name Varchar2(50)  not null,
cust_3_desc Varchar2(100),
cust_4_Id   Number(8)     not null
);
---------------------CUST_4---------------------------
create table CUST_4 (
cust_4_Id   Number(8)     not null,
cust_4_Name Varchar2(50)  not null,
cust_4_desc Varchar2(100)
);
---------------------GEO---------------------------
create table GEO (
geo_Id         Number(8),
geo_Name       Varchar2(50)  not null,
geo_desc       Varchar2(100),
geo_1_Id       Number(8)     not null,
geo_level      Number        not null,
Last_updt_date Date          not null
);
---------------------GEO_1---------------------------
create table GEO_1 (
geo_1_Id   Number(8)     not null,
geo_1_Name Varchar2(50)  not null,
geo_1_desc Varchar2(100),
geo_2_Id   Number(8)     not null
);
---------------------GEO_2---------------------------
create table GEO_2 (
geo_2_Id   Number(8)     not null,
geo_2_Name Varchar2(50)  not null,
geo_2_desc Varchar2(100),
geo_3_Id   Number(8)     not null
);
---------------------GEO_3---------------------------
create table GEO_3 (
geo_3_Id   Number(8)     not null,
geo_3_Name Varchar2(50)  not null,
geo_3_desc Varchar2(100),
geo_4_Id   Number(8)     not null
);
alter table sales split partition Z_LAST_PARTITION at (20110201) into (partition P_20110201, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20110301) into (partition P_20110301, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20110401) into (partition P_20110401, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20110501) into (partition P_20110501, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20110601) into (partition P_20110601, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20110701) into (partition P_20110701, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20110801) into (partition P_20110801, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20110901) into (partition P_20110901, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20111001) into (partition P_20111001, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20111101) into (partition P_20111101, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20111201) into (partition P_20111201, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20120101) into (partition P_20120101, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20120201) into (partition P_20120201, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20120301) into (partition P_20120301, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20120401) into (partition P_20120401, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20120501) into (partition P_20120501, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20120601) into (partition P_20120601, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20120701) into (partition P_20120701, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20120801) into (partition P_20120801, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20120901) into (partition P_20120901, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20121001) into (partition P_20121001, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20121101) into (partition P_20121101, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20121201) into (partition P_20121201, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20130101) into (partition P_20130101, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20130201) into (partition P_20130201, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20130301) into (partition P_20130301, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20130401) into (partition P_20130401, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20130501) into (partition P_20130501, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20130601) into (partition P_20130601, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20130701) into (partition P_20130701, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20130801) into (partition P_20130801, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20130901) into (partition P_20130901, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20131001) into (partition P_20131001, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20131101) into (partition P_20131101, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20131201) into (partition P_20131201, partition Z_LAST_PARTITION);
alter table sales split partition Z_LAST_PARTITION at (20140101) into (partition P_20140101, partition Z_LAST_PARTITION);
3. CONTROL FILE GENERATION
-- the table-name prompt below is assumed; the script references &tname throughout
accept tname prompt 'Enter Name of Table: '
accept dformat prompt 'Enter Format to Use for Date Columns: '
spool &tname..ctl
select 'LOAD DATA'|| chr (10) ||
'INFILE ''' || lower (table_name) || '.csv''' || chr (10) ||
'INTO TABLE '|| table_name || chr (10)||
'FIELDS TERMINATED BY '','''||chr (10)||
'TRAILING NULLCOLS' || chr (10) || '('
from user_tables
where table_name = upper ('&tname');
select decode (rownum, 1, ' ', ' , ') ||
rpad (column_name, 33, ' ') ||
decode (data_type,
'VARCHAR2', 'CHAR NULLIF ('||column_name||'=BLANKS)',
'FLOAT', 'DECIMAL EXTERNAL NULLIF('||column_name||'=BLANKS)',
'NUMBER', decode (data_precision, 0,
'INTEGER EXTERNAL NULLIF ('||column_name||
'=BLANKS)', decode (data_scale, 0,
'INTEGER EXTERNAL NULLIF ('||
column_name||'=BLANKS)',
'DECIMAL EXTERNAL NULLIF ('||
column_name||'=BLANKS)')),
'DATE', 'DATE "&dformat" NULLIF ('||column_name||'=BLANKS)', null)
from user_tab_columns
where table_name = upper ('&tname')
order by column_id;
select ')'
from dual;
spool off
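Assuming the script above is saved as gen_ctl.sql (the file name is an assumption), a run for the CUSTOMER table looks like:

SQL> @gen_ctl
Enter Name of Table: customer
Enter Format to Use for Date Columns: MM/DD/YY

The spooled customer.ctl is then passed to SQL*Loader, as shown after the sample control file below.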
Sample control file for the CUSTOMER table, generated using the process given in appendix 3:
LOAD DATA
INFILE 'D:\Gunjan\ctl\files\customer.csv'
INTO TABLE CUSTOMER
FIELDS TERMINATED BY ','
TRAILING NULLCOLS
( CUST_ID
, CUST_NAME
, CUST_DESC
, CUST_1_ID
, CUST_1_NAME
, CUST_1_DESC
, CUST_2_ID
, CUST_2_NAME
, CUST_2_DESC
, CUST_3_ID
, CUST_3_NAME
, CUST_3_DESC
, CUST_4_ID
, CUST_4_NAME
, CUST_4_DESC
, CUST_LEVEL
, CUST_GRADE
, CUST_PEER_RANK
, TRADE_CHANEL_CURR_ID
, TRADE_CHANEL_HIST_ID
, LAST_UPDT_DATE
)
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
6,"Deirdre","Female ",1,"Cakewalk","IT company",1,"Accounting","Accounting
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
7,"Zeph","Male",1,"Cakewalk","IT company",1,"Accounting","Accounting
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
8,"Kylee","Male ",1,"Cakewalk","IT company",1,"Accounting","Accounting
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,3,01/25/12
9,"Drew","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
10,"Fay","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
11,"Lacota","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
12,"Larissa","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
13,"Griffith","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
14,"James","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
15,"Salvador","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
16,"Indira","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
17,"Kelsey","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
18,"Stephanie","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
19,"Dexter","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
20,"Otto","Female ",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
21,"Dawn","Male",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
22,"Alec","Male ",1,"Cakewalk","IT company",2,"Advertising","Advertising
Department",1,"IN","INDIA",1,"APAC","Asia Pacific",4,"C",4,1,1,01/25/12
23,"Piper","Male",1,"Cakewalk","IT company",5,"Customer Service","Customer Service
Department",2,"CH","CHINA",1,"APAC","Asia Pacific",4,"C",4,2,2,01/25/12
24,"Karly","Female ",1,"Cakewalk","IT company",5,"Customer Service","Customer
Service Department",2,"CH","CHINA",1,"APAC","Asia Pacific",4,"C",4,2,2,01/25/12
25,"Briar","Male",1,"Cakewalk","IT company",5,"Customer Service","Customer Service
Department",2,"CH","CHINA",1,"APAC","Asia Pacific",4,"C",4,2,2,01/25/12
26,"Laura","Female ",1,"Cakewalk","IT company",5,"Customer Service","Customer
Service Department",2,"CH","CHINA",1,"APAC","Asia Pacific",4,"C",4,2,2,01/25/12
D:\oracle\product\10.2.0\db_1\BIN\sqlldr scott/*****@ORCL
control=D:\gunjan\ctl\CUSTOMER.ctl log=D:\gunjan\ctl\customer.log
SQL> create sequence seq start with 214250 increment by 1 nocache nocycle;
Sequence created.
Elapsed: 00:00:00.00
SQL> insert into sales select seq.nextval ,
  2  FACT_TYPE ,
  3  TRANSACTION_DATE ,
  4  DAY_ID ,
  5  LOAD_DATE ,
  6  BILLING_TYPE ,
  7  CUST_ID ,
  8  PROD_ID ,
  9  GEO_ID ,
 10  STORE_ID ,
 11  BASE_UNIT_QTY ,
 12  NET_INVOICE_VALUE ,
 13  GROSS_INVOICE_VALUE ,
 14  NET_INVOICE_USD ,
 15
SQL> insert into sales_snow select seq_snow.nextval ,
  2  FACT_TYPE ,
  3  TRANSACTION_DATE ,
  4  DAY_ID ,
  5  LOAD_DATE ,
  6  BILLING_TYPE ,
  7  CUST_ID ,
  8  PROD_ID ,
  9  GEO_ID ,
 10  STORE_ID ,
 11  BASE_UNIT_QTY ,
 12  NET_INVOICE_VALUE ,
 13  GROSS_INVOICE_VALUE ,
 14  NET_INVOICE_USD ,
 15
SQL> select sum(base_unit_qty) from sales_snow s,geo g , geo_1 g1 , geo_2 g2 , geo_3 g3 where
  2  g.geo_1_id=g1.geo_1_id
  3  and g1.geo_2_id=g2.geo_2_id
  4  and g2.geo_3_id=g3.geo_3_id
  5  and g3.geo_3_name='APAC';

SUM(BASE_UNIT_QTY)
------------------
           4618978

Elapsed: 00:00:00.12
SQL>
SQL> select sum(net_invoice_value) from sales s , product p , customer c where s.prod_id=p.prod_id and
  2  c.cust_id=s.cust_id and c.cust_3_name='IN' and p.prod_3_name='Hair Care';

SUM(NET_INVOICE_VALUE)
----------------------
              21470160

Elapsed: 00:00:00.06
SQL> select sum(net_invoice_value) from sales_snow s , prod p , cust c , prod_2 p2, prod_3 p3, cust_2 c2 , cust_3 c3
  2  where s.prod_id=p.prod_id and c.cust_id=s.cust_id
  3  and p.prod_2_id=p2.prod_2_id
  4  and p2.prod_3_id=p3.prod_3_id
  5  and c.cust_2_id=c2.cust_2_id
  6  and c2.cust_3_id=c3.cust_3_id
  7  and p3.prod_3_name='Hair Care' and c3.cust_3_name='India';

Elapsed: 00:00:00.12
SQL> select distinct prod_name from sales s , prod p , prod_1 p1 where s.prod_id=p.prod_id and
  2  p1.prod_1_id=p.prod_1_id and p1.prod_1_name='Shampoo';

PROD_NAME
--------------------------------------------------
Moist
SQL> insert into sales select seq.nextval ,
  2  FACT_TYPE ,
  3  TRANSACTION_DATE ,
  4  DAY_ID ,
  5  LOAD_DATE ,
  6  BILLING_TYPE ,
  7  CUST_ID ,
  8  PROD_ID ,
  9  GEO_ID ,
 10  STORE_ID ,
 11  BASE_UNIT_QTY ,
 12  NET_INVOICE_VALUE ,
 13  GROSS_INVOICE_VALUE ,
 14  NET_INVOICE_USD ,
 15
SQL> insert into sales_snow select seq_snow.nextval ,
  2  FACT_TYPE ,
  3  TRANSACTION_DATE ,
  4  DAY_ID ,
  5  LOAD_DATE ,
  6  BILLING_TYPE ,
  7  CUST_ID ,
  8  PROD_ID ,
  9  GEO_ID ,
 10  STORE_ID ,
 11  BASE_UNIT_QTY ,
 12  NET_INVOICE_VALUE ,
 13  GROSS_INVOICE_VALUE ,
 14  NET_INVOICE_USD ,
 15
SQL> select sum(base_unit_qty) from sales_snow s,geo g , geo_1 g1 , geo_2 g2 , geo_3 g3 where
  2  g.geo_1_id=g1.geo_1_id
  3  and g1.geo_2_id=g2.geo_2_id
  4  and g2.geo_3_id=g3.geo_3_id
  5  and g3.geo_3_name='APAC';

SUM(BASE_UNIT_QTY)
------------------
           9237956

Elapsed: 00:00:01.04
SQL> select sum(base_unit_qty) from sales_snow s,geo g , geo_1 g1 , geo_2 g2 , geo_3 g3 where
  2  g.geo_1_id=g1.geo_1_id
  3  and g1.geo_2_id=g2.geo_2_id
  4  and g2.geo_3_id=g3.geo_3_id
  5  and g3.geo_3_name='APAC';

SUM(BASE_UNIT_QTY)
------------------
           9237956

Elapsed: 00:00:00.28
SQL> select sum(base_unit_qty) from sales, geography where sales.geo_id=geography.geo_id and geo_3_name='APAC';

SUM(BASE_UNIT_QTY)
------------------
           9237956

Elapsed: 00:00:00.20
SQL> select sum(net_invoice_value) from sales_snow s , prod p , cust c , prod_2 p2, prod_3 p3, cust_2 c2 , cust_3 c3
  2  where s.prod_id=p.prod_id and c.cust_id=s.cust_id
  3  and p.prod_2_id=p2.prod_2_id
  4  and p2.prod_3_id=p3.prod_3_id
  5  and c.cust_2_id=c2.cust_2_id
  6  and c2.cust_3_id=c3.cust_3_id
  7  and p3.prod_3_name='Hair Care' and c3.cust_3_name='India';

SUM(NET_INVOICE_VALUE)
----------------------
              42940320

Elapsed: 00:00:00.15
SQL> select sum(net_invoice_value) from sales s , product p , customer c where s.prod_id=p.prod_id and
  2  c.cust_id=s.cust_id and c.cust_3_name='IN' and p.prod_3_name='Hair Care';

SUM(NET_INVOICE_VALUE)
----------------------
              42940320

Elapsed: 00:00:00.14
SQL> select distinct prod_name from sales_snow s , prod p , prod_1 p1 where s.prod_id=p.prod_id and
  2  p1.prod_1_id=p.prod_1_id and p1.prod_1_name='Shampoo';

PROD_NAME
--------------------------------------------------
Moist
NO MORE TANGLES Shampoo
NATURAL Baby Shampoo
NO MORE TANGLES Leave-in Conditioner
SQL> select sum(base_unit_qty) ,
  2  store_location from sales_snow s , geo g , cust c, store st , cust_2 c2 , cust_3 c3 , geo_1 g1 , geo_2 g2

SUM(BASE_UNIT_QTY) STORE_LOCA
------------------ ----------
            204000 India
            113440 Canada
             17120 Europe

Elapsed: 00:00:00.59
SQL> select sum(base_unit_qty) ,
  2

SUM(BASE_UNIT_QTY) STORE_LOCA
------------------ ----------
            204000 India
            113440 Canada
             17120 Europe

Elapsed: 00:00:00.21
SQL>
SQL>
SQL> select count(*) from sales_snow s , prod p , prod_1 p1 where billing_type='MN' and to_char(load_date,'YYYY')='2012' and fact_type='BR' and
  2  s.prod_id = p.prod_id and p1.prod_1_id = p.prod_1_id and p1.prod_1_name='Toothpaste';

  COUNT(*)
----------
       160

Elapsed: 00:00:00.01
SQL> select count(*) from sales s , product p where billing_type='MN' and to_char(load_date,'YYYY')='2012' and fact_type='BR' and
  2  s.prod_id = p.prod_id and p.prod_1_name='Toothpaste';

  COUNT(*)
----------
       160

Elapsed: 00:00:00.01
SQL> insert into sales select seq.nextval ,
  2  FACT_TYPE ,
  3  TRANSACTION_DATE ,
  4  DAY_ID ,
  5  LOAD_DATE ,
  6  BILLING_TYPE ,
  7  CUST_ID ,
  8  PROD_ID ,
  9  GEO_ID ,
 10  STORE_ID ,
 11  BASE_UNIT_QTY ,
 12  NET_INVOICE_VALUE ,
 13  GROSS_INVOICE_VALUE ,
 14  NET_INVOICE_USD ,
 15
SQL> /
SQL> /
SQL>
SQL> insert into sales_snow select seq_snow.nextval ,
2 FACT_TYPE ,
3 TRANSACTION_DATE
4 DAY_ID
5 LOAD_DATE ,
6 BILLING_TYPE ,
7 CUST_ID
8 PROD_ID
9 GEO_ID
10 STORE_ID
11 BASE_UNIT_QTY
12 NET_INVOICE_VALUE ,
13 GROSS_INVOICE_VALUE ,
14 NET_INVOICE_USD
SQL> /
SQL> /
SQL> commit;
Commit complete.
  COUNT(*)
----------
   3427968

Elapsed: 00:00:01.01

  COUNT(*)
----------
   3427968

Elapsed: 00:00:03.54
SQL> select sum(base_unit_qty) from sales_snow s,geo g , geo_1 g1 , geo_2 g2 , geo_3 g3 where
  2  g.geo_1_id=g1.geo_1_id
  3  and g1.geo_2_id=g2.geo_2_id
  4  and g2.geo_3_id=g3.geo_3_id
  5  and g3.geo_3_name='APAC';

SUM(BASE_UNIT_QTY)
------------------
          36951824

Elapsed: 00:00:14.92
SQL>
SQL> select sum(base_unit_qty) from sales, geography where sales.geo_id=geography.geo_id and geo_3_name='APAC';

SUM(BASE_UNIT_QTY)
------------------
          36951824

Elapsed: 00:00:10.28
SQL> select sum(net_invoice_value) from sales s , product p , customer c where s.prod_id=p.prod_id and
  2  c.cust_id=s.cust_id and c.cust_3_name='IN' and p.prod_3_name='Hair Care';

SUM(NET_INVOICE_VALUE)
----------------------
             171761280

Elapsed: 00:00:04.81
SQL>
SQL> select sum(net_invoice_value) from sales_snow s , prod p , cust c , prod_2 p2, prod_3 p3, cust_2 c2 , cust_3 c3
  2  where s.prod_id=p.prod_id and c.cust_id=s.cust_id
  3  and p.prod_2_id=p2.prod_2_id
  4  and p2.prod_3_id=p3.prod_3_id
  5  and c.cust_2_id=c2.cust_2_id
  6  and c2.cust_3_id=c3.cust_3_id
  7  and p3.prod_3_name='Hair Care' and c3.cust_3_name='India' ;

SUM(NET_INVOICE_VALUE)
----------------------
             171761280

Elapsed: 00:00:10.31
SQL> select distinct prod_name from sales s , product p where s.prod_id = p.prod_id and p.prod_1_name ='Shampoo';

PROD_NAME
--------------------------------------------------
Moist
NO MORE TANGLES Shampoo
NATURAL Baby Shampoo
NO MORE TANGLES Leave-in Conditioner
NO MORE TANGLES Detangling Spray
Baby Shampoo with Natural Lavender
Instant Freeze
Baby Shampoo
Aussome Volume
Sprunch
Hair Insurance

PROD_NAME
--------------------------------------------------
Sydney Smooth
KIDS NO MORE TANGLES
Cleanse & Mend
Baby Daily Face & Body Lotion SPF 40
Opposites Attract
Mega
KIDS HEAD-TO-TOE Body Wash Tropical Blast
NO MORE TANGLES Extra Conditioning Shampoo
Sun-Touched Shine
KIDS E-Z GRIP SOAPBerry Breeze

21 rows selected.

Elapsed: 00:00:02.48
PROD_NAME
--------------------------------------------------
Moist
NO MORE TANGLES Shampoo
NATURAL Baby Shampoo
NO MORE TANGLES Leave-in Conditioner
NO MORE TANGLES Detangling Spray
Baby Shampoo with Natural Lavender
Instant Freeze
Baby Shampoo
Hair Insurance
Sprunch
Aussome Volume

PROD_NAME
--------------------------------------------------
Sydney Smooth
KIDS NO MORE TANGLES
Cleanse & Mend
Baby Daily Face & Body Lotion SPF 40
Opposites Attract
Mega
KIDS HEAD-TO-TOE Body Wash Tropical Blast
NO MORE TANGLES Extra Conditioning Shampoo
Sun-Touched Shine
KIDS E-Z GRIP SOAPBerry Breeze

21 rows selected.

Elapsed: 00:00:04.00
SQL>
SQL> select sum(base_unit_qty) ,
  2

SUM(BASE_UNIT_QTY) STORE_LOCA
------------------ ----------
            816000 India
            453760 Canada
             68480 Europe

Elapsed: 00:00:04.06
SQL> select sum(base_unit_qty) ,
  2  store_location from sales_snow s , geo g , cust c, store st , cust_2 c2 , cust_3 c3 , geo_1 g1 , geo_2 g2

SUM(BASE_UNIT_QTY) STORE_LOCA
------------------ ----------
            816000 India
            453760 Canada
             68480 Europe

Elapsed: 00:00:04.29
SQL>
SQL> select count(*) from sales s , product p where billing_type='MN' and to_char(load_date,'YYYY')='2012' and fact_type='BR' and
  2  s.prod_id = p.prod_id and p.prod_1_name='Toothpaste';

  COUNT(*)
----------
       640

Elapsed: 00:00:00.03
SQL> select count(*) from sales_snow s , prod p , prod_1 p1 where billing_type='MN' and to_char(load_date,'YYYY')='2012' and fact_type='BR' and
  2  s.prod_id = p.prod_id and p1.prod_1_id = p.prod_1_id and p1.prod_1_name='Toothpaste';

  COUNT(*)
----------
       640

Elapsed: 00:00:00.03
SQL>
SQL> insert into sales select seq.nextval ,
  2  FACT_TYPE ,
  3  TRANSACTION_DATE ,
  4  DAY_ID ,
  5  LOAD_DATE ,
  6  BILLING_TYPE ,
  7  CUST_ID ,
  8  PROD_ID ,
  9  GEO_ID ,
 10  STORE_ID ,
 11  BASE_UNIT_QTY ,
 12  NET_INVOICE_VALUE ,
 13  GROSS_INVOICE_VALUE ,
 14  NET_INVOICE_USD ,
 15

Elapsed: 00:02:40.41
SQL> insert into sales_snow select seq_snow.nextval ,
2 FACT_TYPE ,
3 TRANSACTION_DATE
4 DAY_ID
5 LOAD_DATE ,
6 BILLING_TYPE ,
7 CUST_ID
8 PROD_ID
9 GEO_ID
10 STORE_ID
11 BASE_UNIT_QTY
12 NET_INVOICE_VALUE ,
13 GROSS_INVOICE_VALUE ,
14 NET_INVOICE_USD
Elapsed: 00:02:42.42
SQL> commit;
Commit complete.
Elapsed: 00:00:00.00
SQL> select sum(base_unit_qty) from sales, geography where sales.geo_id=geography.geo_id and geo_3_name='APAC';

SUM(BASE_UNIT_QTY)
------------------
          73903648

Elapsed: 00:00:19.68
SQL> select sum(base_unit_qty) from sales_snow s,geo g , geo_1 g1 , geo_2 g2 , geo_3 g3 where
  2  g.geo_1_id=g1.geo_1_id
  3  and g1.geo_2_id=g2.geo_2_id
  4  and g2.geo_3_id=g3.geo_3_id
  5  and g3.geo_3_name='APAC';

SUM(BASE_UNIT_QTY)
------------------
          73903648

Elapsed: 00:00:21.48
SQL> select sum(net_invoice_value) from sales s , product p , customer c where s.prod_id=p.prod_id and
  2  c.cust_id=s.cust_id and c.cust_3_name='IN' and p.prod_3_name='Hair Care';

SUM(NET_INVOICE_VALUE)
----------------------
             343522560

Elapsed: 00:00:05.34
SQL> select sum(net_invoice_value) from sales_snow s , prod p , cust c , prod_2 p2, prod_3 p3, cust_2 c2 , cust_3 c3
  2  where s.prod_id=p.prod_id and c.cust_id=s.cust_id
  3  and p.prod_2_id=p2.prod_2_id
  4  and p2.prod_3_id=p3.prod_3_id
  5  and c.cust_2_id=c2.cust_2_id
  6  and c2.cust_3_id=c3.cust_3_id
  7  and p3.prod_3_name='Hair Care' and c3.cust_3_name='India' ;

SUM(NET_INVOICE_VALUE)
----------------------
             343522560

Elapsed: 00:00:07.62
SQL> select distinct prod_name from sales s , product p where s.prod_id = p.prod_id and p.prod_1_name ='Shampoo';

PROD_NAME
--------------------------------------------------
Moist
NO MORE TANGLES Shampoo
NATURAL Baby Shampoo
NO MORE TANGLES Leave-in Conditioner
NO MORE TANGLES Detangling Spray
Baby Shampoo with Natural Lavender
Instant Freeze
Baby Shampoo
Aussome Volume
Sprunch
Hair Insurance

PROD_NAME
--------------------------------------------------
Sydney Smooth
KIDS NO MORE TANGLES
Cleanse & Mend
Baby Daily Face & Body Lotion SPF 40
Opposites Attract
Mega
KIDS HEAD-TO-TOE Body Wash Tropical Blast
NO MORE TANGLES Extra Conditioning Shampoo
Sun-Touched Shine
KIDS E-Z GRIP SOAPBerry Breeze

21 rows selected.

Elapsed: 00:00:05.31
PROD_NAME
--------------------------------------------------
Moist
NO MORE TANGLES Shampoo
NATURAL Baby Shampoo
NO MORE TANGLES Leave-in Conditioner
NO MORE TANGLES Detangling Spray
Baby Shampoo with Natural Lavender
Instant Freeze
Baby Shampoo
Hair Insurance
Sprunch
Aussome Volume

PROD_NAME
--------------------------------------------------
Sydney Smooth
KIDS NO MORE TANGLES
Cleanse & Mend
Baby Daily Face & Body Lotion SPF 40
Opposites Attract
Mega
KIDS HEAD-TO-TOE Body Wash Tropical Blast
NO MORE TANGLES Extra Conditioning Shampoo
Sun-Touched Shine
KIDS E-Z GRIP SOAPBerry Breeze

21 rows selected.

Elapsed: 00:00:06.00
SQL> select sum(base_unit_qty) ,
  2

SUM(BASE_UNIT_QTY) STORE_LOCA
------------------ ----------
           1632000 India
            907520 Canada
            136960 Europe

Elapsed: 00:00:06.06
SQL> select sum(base_unit_qty) ,
  2  store_location from sales_snow s , geo g , cust c, store st , cust_2 c2 , cust_3 c3 , geo_1 g1 , geo_2 g2

SUM(BASE_UNIT_QTY) STORE_LOCA
------------------ ----------
           1632000 India
            907520 Canada
            136960 Europe

Elapsed: 00:00:08.14
SQL> select count(*) from sales s , product p where billing_type='MN' and to_char(load_date,'YYYY')='2012' and fact_type='BR' and
  2  s.prod_id = p.prod_id and p.prod_1_name='Toothpaste';

  COUNT(*)
----------
      1280

Elapsed: 00:00:00.03
SQL> select count(*) from sales_snow s , prod p , prod_1 p1 where billing_type='MN' and to_char(load_date,'YYYY')='2012' and fact_type='BR' and
  2  s.prod_id = p.prod_id and p1.prod_1_id = p.prod_1_id and p1.prod_1_name='Toothpaste';

  COUNT(*)
----------
      1280

Elapsed: 00:00:00.04
SQL> spool off