
Transaction:

A transaction is a business operation.

Technical point of view:

It is a set of DML operations (Insert, Update, Delete) executed as a single unit of work, as in the sketch below.
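For example, a funds transfer groups several DML statements into one unit of work. A minimal Oracle-style sketch; the ACCOUNTS table, its columns, and the amounts are hypothetical:

-- Debit one account and credit another as a single transaction
UPDATE ACCOUNTS SET BALANCE = BALANCE - 500 WHERE ACCT_ID = 1001;
UPDATE ACCOUNTS SET BALANCE = BALANCE + 500 WHERE ACCT_ID = 1002;
COMMIT;   -- both changes become permanent together; ROLLBACK would undo both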

OLTP System = OLTP applications (front end) + Database (back end)

Data Warehousing = ETL Development + BI Development

Enterprise Data Warehouse:

An Enterprise Data Warehouse is a relational DB that is specially designed for analyzing the business, making decisions to achieve the business goals, and responding to business problems, but not designed for business transaction processing.

A Data Warehouse is a concept of consolidating the data from multiple OLTP databases.

From a storage capacity point of view, relational DBs are categorized into three types:

1. Low range

2. Mid range

3. High range

1. Low range DB:

Can organize and manage megabytes of information.

Example: MS Access

2. Mid range DB:

Can organize and manage gigabytes of information.

Example: Oracle, Microsoft SQL Server, Sybase, DB2, Informix, PostgreSQL

3. High range DB:

Can organize and manage terabytes and petabytes of information.

Example: Teradata, Netezza, Greenplum, Hadoop

Data Storage Patterns:

From a storage point of view, there are two types of data storage patterns supported by relational DBs:

1. NFS - Normal File Storage

2. DFS - Distributed File Storage

NFS - Normal File Storage:

1. Single disk for storing the data

2. Shared-everything architecture (data is shared on a single disk)

3. Data is read sequentially

4. All mid range DBs are developed on the NFS platform

5. Limited scalability or expansion

6. Strongly recommended for OLTP applications

7. Recommended for data warehousing for small and medium scale enterprises with storage capacity in gigabytes

8. The default number of processors in NFS is only one

9. Disks cannot be scaled in NFS

Example: Oracle, Sybase, SQL Server, DB2, Red Brick, Informix, PostgreSQL


Note: A processor is a S/W component that runs as an .exe

DFS - Distributed File Storage:

1. Multiple disks for storing the data

2. Shared-nothing architecture (every processor has dedicated memory & disk that is not shared with another processor)

3. Data is read in parallel (supports parallelism)

4. Unlimited scalability

5. Designed only for building an Enterprise Data Warehouse, not for OLTP

Example: Teradata, Netezza, Hadoop, Greenplum

Enterprise DWH Database Evaluation:

1. Database that supports enormous storage capacity (billions of rows and terabytes)

2. DB that supports the distributed file storage pattern

3. DB that supports shared-nothing architecture

4. Database that supports unlimited scalability (expansion)

5. DB that supports massively parallel processing

6. DB that supports mature optimizers to handle complex SQL queries (runs the queries faster with less system resource usage)

7. DB that supports high availability (users can access 100% of the data without data loss even when S/W or H/W components are down)

8. Database that supports parallel loading

9. DB that supports low TCO (total cost of ownership): easy to set up, administer & manage

10. Single DB server that can provide access to hundreds of users concurrently

Data Acquisition:

It is the process of extracting the data from multiple source systems, transforming the data into a consistent format, and loading it into a target system. To implement this ETL process we need ETL tools.

Types of ETL Tools:

There are two types of ETL tools to build Data Acquisition:

1. GUI based ETL tool

2. Code (program) based ETL tool

Code Based ETL:

ETL applications are developed using programming languages such as SQL, PL/SQL, SAS, and Teradata ETL utilities.

GUI Based ETL:

ETL applications are developed using a simple graphical user interface with point & click features.

Example: Informatica, DataStage, Ab Initio, SSIS

MSBI is a package (ETL + Reporting = SSIS + SSRS)

Data Cleansing:

It is the process of filtering or rejecting unwanted source data or records.

Data Scrubbing: It is the process of deriving new attributes or columns. A sketch of both steps follows.
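A minimal SQL sketch of both steps (the SRC_CUSTOMERS table and its columns are hypothetical):

-- Data scrubbing: derive a new FULL_NAME column from existing columns
-- Data cleansing: reject records with a missing phone number
SELECT FIRST_NAME || ' ' || LAST_NAME AS FULL_NAME,   -- derived attribute (scrubbing)
       PHONE
FROM   SRC_CUSTOMERS
WHERE  PHONE IS NOT NULL;                             -- filter unwanted records (cleansing)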

Data Merging:

It is the process of combining the data from multiple source systems.

Data merging is of two types, as shown in the sketch below:

1. Join

2. Union
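A minimal sketch of both merge types, assuming two hypothetical source tables CUST_US and CUST_EU with the same layout, plus an ORDERS table:

-- Union: stack rows coming from two source systems with the same structure
SELECT CUST_ID, CUST_NAME FROM CUST_US
UNION
SELECT CUST_ID, CUST_NAME FROM CUST_EU;

-- Join: combine columns from two sources on a common key
SELECT C.CUST_NAME, O.ORDER_ID
FROM   CUST_US C
JOIN   ORDERS O ON O.CUST_ID = C.CUST_ID;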
Data Warehouse:

1. A Data Warehouse is a relational DB that is used to store historical data for query & analysis.

2. Data in a Data Warehouse is derived from source systems (OLTP/SOR).

SOR --> Source of Records

OLTP (Online Transaction Processing):

A computer system that stores time-sensitive, transaction-related data that is processed immediately, available for analysis, and always kept current.

Difference Between OLTP and Data Warehouse:

Tables in a Data Warehouse:

There are two types of tables in a Data Warehouse:

1. Dimension Table

2. Fact Table

1. Dimension Table:

A dimension table stores textual or descriptive information about a business process.

Dimension table examples in the Retail Domain:

Customer, Product, Stores, Employees, Promotions, Time

Dimension table examples in the Banking Domain:

Applications, Customers, Products, Branches, Promotions, Time, Billing Cycle

Fact Table:

A fact table stores measurements or metrics of a business process.

Fact table examples in the Retail Domain:

Sales, Purchase, Inventory

Fact table examples in the Banking Domain:

1. SA_LoanTransaction Fact

2. CC_Transaction Fact

3. CC_Statement Fact

A fact table consists of keys and measures, and it has a composite primary key, as shown below.

Composite Primary Key:

Store Key(X) | Prod Key(X) | Date Key(X) | Revenue
S1           | P1          | D1          | 3000
S1           | P2          | D1          | 2000
S2           | P1          | D1          | 2000

(X marks the columns that together form the composite primary key)
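A sketch of the DDL for such a fact table; the three key columns together form the composite primary key (the table name SALES_FACT is assumed for illustration):

CREATE TABLE SALES_FACT (
  STORE_KEY NUMBER,
  PROD_KEY  NUMBER,
  DATE_KEY  NUMBER,
  REVENUE   NUMBER,
  CONSTRAINT PK_SALES_FACT PRIMARY KEY (STORE_KEY, PROD_KEY, DATE_KEY)
);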

Types of Fact Tables:

There are three types of fact tables:

1. Factless Fact table

2. Cumulative Fact table

3. Snapshot Fact table

1. Factless Fact table:

1. A factless fact table consists of only keys and no measures

2. A factless fact table is used to record events

3. A factless fact table acts as a bridge between the dimension tables

Example of a factless fact table: Employee Attendance Factless Fact

Dimension Tables:

Auditorium: Aud Id, Aud Name, Aud Type, Aud Mgr, Aud Address

Sponsors: Sponsor Id, Sponsor Name, Contribution, Address

Time: Date Key, Month Key, Qtr, Year

Participant: Participant Id, Participant Name, Gender, Address

Events: Event Id, Event Name, Event Type, Event Desc

Fact Table:

Aud Id | Sponsor Id | Participant Id | Event Id
A1     | S1         | P1             | E1
A1     | S1         | P2             | E1
A2     | S1         | P3             | E1
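Since a factless fact table has no measures, reports against it simply count key rows; a sketch (the table name EVENT_ATTENDANCE_FACT is assumed for the fact table above):

-- How many participants attended each event
SELECT EVENT_ID, COUNT(*) AS ATTENDANCE
FROM   EVENT_ATTENDANCE_FACT
GROUP  BY EVENT_ID;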

2. Cumulative Fact table:

It consists of additive facts; it describes what happened over a period of time.

Ex: Sales fact table, Order fact table

3. Snapshot Fact table:

It consists of semi-additive facts and non-additive facts; it describes the state of things at a particular instant of time.

Ex: Bank fact table, Inventory fact table

Degenerate Dimension Key:

A key in a fact table that is not associated with any dimension.

Example: Order Id, Sale Id, Bill No, Invoice No, etc.

Types of Facts:

There are 3 types of facts in fact tables:

1. Additive Facts

2. Semi-Additive Facts

3. Non-Additive Facts

1. Additive Fact: Business measurements in a fact table that can be summed up across all of the dimension keys.

Fact Table:

Store Key | Prod Key | Date Key  | Revenue
S1        | P1       | 12-Jan-15 | 600
S1        | P2       | 12-Jan-15 | 400
S2        | P2       | 12-Jan-15 | 800
S2        | P3       | 13-Jan-15 | 500
S3        | P1       | 13-Jan-15 | 700
S3        | P3       | 14-Jan-15 | 900

Reports generated using the keys in the above fact table:

Revenue Report By Store:
Store Key | Revenue
S1        | 1000
S2        | 1300
S3        | 1600

Revenue Report By Product:
Product Key | Revenue
P1          | 1300
P2          | 1200
P3          | 1400

Revenue Report By Date:
Date Key  | Revenue
12-Jan-15 | 1800
13-Jan-15 | 1200
14-Jan-15 | 900
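Each report above is just the additive measure summed over one dimension key; for example, the revenue-by-store report (assuming the fact table is named SALES_FACT):

-- Revenue is additive, so summing across the Product and Date keys is valid
SELECT STORE_KEY, SUM(REVENUE) AS REVENUE
FROM   SALES_FACT
GROUP  BY STORE_KEY;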
Semi-Additive Fact: Business measurements in a fact table that can be summed up across only a few of the dimension keys.

Bank Fact table:

Acct Id | Transaction Date | Balance | Profit Margin
21653   | 12-Jan-15        | 700000  | -
21654   | 12-Jan-15        | 400000  | -
21653   | 13-Jan-15        | 900000  | -
21654   | 13-Jan-15        | 600000  | -
Reports:

Balance By Acct Id:
Acct Id | Balance (summed across dates - wrong) | Balance (latest - correct)
21653   | 1600000                               | 900000
21654   | 1000000                               | 600000

Balance By Date:
Date Key  | Balance
12-Jan-15 | 1100000
13-Jan-15 | 1500000

The above example is for a semi-additive fact: Balance can be summed across accounts for a given date, but not across dates.
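In SQL terms, Balance may be summed across accounts for one date, but across dates each account's latest balance must be picked instead of a SUM. A sketch, assuming the fact table is named BANK_FACT with a TRANS_DATE column:

-- Valid: sum balances across accounts for each date
SELECT TRANS_DATE, SUM(BALANCE) AS BALANCE
FROM   BANK_FACT
GROUP  BY TRANS_DATE;

-- Across dates: take each account's latest balance, do not sum
SELECT ACCT_ID, BALANCE
FROM   BANK_FACT B
WHERE  TRANS_DATE = (SELECT MAX(X.TRANS_DATE)
                     FROM   BANK_FACT X
                     WHERE  X.ACCT_ID = B.ACCT_ID);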


3. Non-Additive Fact: Business measurements in a fact table that cannot be summed up across any dimension keys.

Note: In a fact table, percentages are always non-additive:

SEM1    80%
SEM2    60%
TOTAL  140%   (wrong)

Note: Another example of a non-additive fact is Unit Price.

Types of Dimensions:

The following are the different types of dimensions in a DW:

1. Conformed Dimension

2. Degenerated Dimension

3. Shrunken Dimension

4. Junk Dimension

5. Dirty Dimension

Conformed Dimension: A dimension that is shared across multiple fact tables is called a conformed dimension; or, a dimension that is used to join data marts.

Banking Domain:
Degenerated Dimension:

If a fact table acts as a dimension and is shared with another fact table (or maintains a foreign key in another fact table), such a table is called a degenerated dimension.

Shrunken Dimension:

A dimension that is a subset of another dimension

Or

A dimension that is not directly linked to the fact table


Junk Dimension:

A dimension that is organized based on low-cardinality indicator or flag values.

Cardinality is the number of unique values in a column. Alternatively, cardinality expresses the minimum and maximum number of instances of an entity 'B' that can be associated with an instance of entity 'A'; the minimum and maximum can be 0, 1, or "n".
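The cardinality of a column can be checked with a count of distinct values; a sketch against the Orders table shown further below:

-- A low count (here 2: Cash, Credit) marks a junk-dimension candidate
SELECT COUNT(DISTINCT PAYMENT_MODE) AS CARDINALITY
FROM   ORDERS;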

Dirty Dimension:

If a record occurs more than once in a table, differing only in non-key attributes, such a table is called a dirty dimension.

Example (the low-cardinality payment columns of the Orders table are separated into their own dimension):

Orders:

Order Id | Order Date | Payment Mode | Payment Type | Comm/Non Comm | Amount
111      | -          | Cash         | Cash         | No            | -
112      | -          | Cash         | Cash         | No            | -
113      | -          | Credit       | Master       | No            | -
114      | -          | Cash         | Cash         | No            | -
115      | -          | Cash         | Cash         | No            | -
116      | -          | Credit       | Visa         | Yes           | -
117      | -          | Cash         | Cash         | No            | -

Payment Dimension:

Ord Ind Id | Payment Mode | Payment Type | Comm/Non Comm
1          | Cash         | Cash         | No
2          | Credit       | Master       | No
3          | Credit       | Visa         | Yes

Fact Table:

Order Id | Order Date | Ord Ind Id | Amount
111      | -          | 1          | -
112      | -          | 1          | -
113      | -          | 2          | -
114      | -          | 1          | -
115      | -          | 1          | -
116      | -          | 3          | -
117      | -          | 1          | -

Slowly Changing Dimension:

A dimension that changes slowly and irregularly

Or

A dimension that changes across time

There are three choices to handle slowly changing dimensions:

1. SCD Type-I

2. SCD Type-II

3. SCD Type-III

1. SCD Type-I:

Only the most recent change is maintained

Type-I gives the current status

Type-I is used for error correction, as in the sketch after the tables below

Source (OLTP):

CID | CNAME | DOB
11  | BEN   | 12-JAN-1967
12  | ALEN  | 15-FEB-1966

Target (DW):

CKEY | CID | CNAME | DOB
101  | 11  | BEN   | 12-JAN-1967
102  | 12  | ALEN  | 15-FEB-1966
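A minimal Type-I sketch: a change (say, correcting BEN's DOB to the hypothetical value 12-JAN-1968) simply overwrites the existing row, so no history survives (the dimension table name CUSTOMER_DIM is assumed):

-- Error correction: overwrite in place, keep only the most recent value
UPDATE CUSTOMER_DIM
SET    DOB = TO_DATE('12-JAN-1968', 'DD-MON-YYYY')
WHERE  CID = 11;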

SCD Type-II:

The change is inserted as a new record

Type-II is used to maintain historical status

PRODUCTS:

PID | PNAME | PRICE | EFF_DATE
11  | ABC   | 300   | 12-JAN-10
12  | PQR   | 270   | 15-JAN-10

The price of product 12 changed to 199 on 27-AUG-11.

Target after the change:

PKEY | PID | PNAME | PRICE | EFF_DATE  | END_DATE
100  | 11  | ABC   | 300   | 12-JAN-10 | -
101  | 12  | PQR   | 270   | 15-JAN-10 | 26-AUG-11
102  | 12  | PQR   | 199   | 27-AUG-11 | -

A Type-II dimension is also referred to as a dirty dimension

A Type-II dimension has redundant data

A sketch of how the change above is applied follows.
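A minimal Type-II sketch for the price change above: the current row is expired, then the change is inserted as a new record (the table name PRODUCT_DIM is assumed; in practice PKEY would come from a sequence):

-- 1. Close out the current version of product 12
UPDATE PRODUCT_DIM
SET    END_DATE = TO_DATE('26-AUG-11', 'DD-MON-YY')
WHERE  PID = 12
AND    END_DATE IS NULL;

-- 2. Insert the new version with a new surrogate key
INSERT INTO PRODUCT_DIM (PKEY, PID, PNAME, PRICE, EFF_DATE, END_DATE)
VALUES (102, 12, 'PQR', 199, TO_DATE('27-AUG-11', 'DD-MON-YY'), NULL);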

SCD Type-III: The change is appended as a new column

Type-III is used to maintain partial history status

Source:

CID | CNAME | LOC
11  | BEN   | HYD
12  | TOM   | CHE

Target:

CKEY | CID | CNAME | CURRLOC | PREVLOC
101  | 11  | BEN   | HYD     | -
102  | 12  | TOM   | CHE     | -

Source after TOM moves from CHE to BNG:

CID | CNAME | LOC
11  | BEN   | HYD
12  | TOM   | BNG

Target:

CKEY | CID | CNAME | CURRLOC | PREVLOC
101  | 11  | BEN   | HYD     | -
102  | 12  | TOM   | BNG     | CHE

Source after BEN moves from HYD to KER:

CID | CNAME | LOC
11  | BEN   | KER
12  | TOM   | BNG

Target:

CKEY | CID | CNAME | CURRLOC | PREVLOC
101  | 11  | BEN   | KER     | HYD
102  | 12  | TOM   | BNG     | CHE
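A minimal Type-III sketch for TOM's move from CHE to BNG above: the current value shifts into the PREVLOC column and the new value overwrites CURRLOC (the table name CUSTOMER_DIM is assumed):

-- Only one previous value (partial history) is retained
UPDATE CUSTOMER_DIM
SET    PREVLOC = CURRLOC,
       CURRLOC = 'BNG'
WHERE  CID = 12;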

Role Play Dimension: A dimension that is recycled in multiple applications within the DB (for example, a single Date dimension used as both order date and ship date)

Data Modeling:

Model: A business representation of the structure of the data in one or more databases

OLTP: the ER model is used

The model is normalized

The model is efficient for transactions

Data Warehouse: the dimensional model is used

The model is designed based on facts & dimensions

The model is efficient in query processing

Schema: A schema is a collection of a user's objects; an object can be a table, view, or synonym


Types of Schemas:

1. Star Schema

2. Snowflake Schema

3. Galaxy Schema

1. Star Schema: In a star schema, the centre of the star is a fact table and the corners are dimension tables

A simple star schema consists of only one fact table

Star schema dimensions do not have parent tables

Star schema dimensions are denormalized

A star schema is denormalized (everything in one table) and efficient in query processing

2. Snowflake Schema:

Snowflake schema dimensions have one or more parent tables

A snowflake schema is normalized

A snowflake schema is efficient in transaction processing

Snowflake (normalized) tables:

Customer:

Cid | Cname | Gender | Geoid
11  | C1    | 1      | 111
12  | C2    | 1      | 111
13  | C3    | 0      | 112
14  | C4    | 1      | 111

Geography:

Geoid | City | State | Country | Region
111   | Hyd  | Ts    | India   | Asia
112   | VSP  | Ap    | India   | Asia

Star (denormalized) equivalent:

Cid | Cname | Gender | Geoid | City | State | Country | Region
11  | C1    | 1      | 111   | Hyd  | Ts    | India   | Asia
12  | C2    | 1      | 111   | Hyd  | Ts    | India   | Asia
13  | C3    | 0      | 112   | VSP  | Ap    | India   | Asia
14  | C4    | 1      | 111   | Hyd  | Ts    | India   | Asia

A star schema uses more space than a snowflake schema (the denormalized table above repeats the geography columns).


Galaxy Schema:

Multiple fact tables are connected to multiple dimension tables.

Index (fast access path):

1. B*Tree Index

2. Bitmap Index

1. B*Tree Index:

It is used on high-cardinality columns.

Example column for a B*Tree index: EMPNO

2. Bitmap Index:

It is used on low-cardinality columns.

Example column for a bitmap index: GENDER
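A sketch of creating both index types on a hypothetical EMP table (Oracle syntax):

-- B*Tree index on a high-cardinality column (mostly unique values)
CREATE INDEX IDX_EMP_EMPNO ON EMP (EMPNO);

-- Bitmap index on a low-cardinality column (few distinct values)
CREATE BITMAP INDEX IDX_EMP_GENDER ON EMP (GENDER);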


1. Flat File to Oracle:

Use SQL*Loader (SQLLDR) to load the flat file data into an Oracle table.

STEP1:

Create Data file with few sample records

=============================================================================

STATE_ID,STATE_NAME,COUNTRY_ID

250.00,Rio Negro,111

251.00,Buenos Aires,111

252.00,Victoria,115

253.00,South Australia,115

254.00,Queensland,115

255.00,Northern Territory,115

256.00,New South Wales,115

257.00,Australian Capital Territory,115

258.00,Sao Paulo,110

259.00,Santa Catarina,110

260.00,Rio de Janeiro,110

=============================================================================

Save the file in the following directory

C:\SOURCE\States.txt

STEP2:

Create the table in the Oracle database using the script given below:

CREATE TABLE STATES (STATE_ID NUMBER(5,2), STATE_NAME VARCHAR2(25), COUNTRY_ID NUMBER(3));

STEP3: Create a control file using Notepad:

LOAD DATA
INFILE 'C:\SOURCE\States.txt'
APPEND INTO TABLE STATES
FIELDS TERMINATED BY ","
(STATE_ID, STATE_NAME, COUNTRY_ID)

Save the control file in the following directory: C:\SOURCE\States.ctl

STEP4:

Connect to SQL*Plus and use the following commands:

SQL> HOST CMD

C:\oracle\product\10.2.0\db_1\BIN>SQLLDR scott/tiger@oracle direct=true skip=1 log=c:\SOURCE\States.log control=c:\SOURCE\States.ctl

C:\oracle\product\10.2.0\db_1\BIN>exit

(skip=1 skips the header record in the data file)

STEP5:

SQL> SELECT * FROM STATES;

Project Development Life Cycle:

Kick-off Meeting 1

Kick-off Meeting 2

Analysis Phase

Design Phase

Coding Phase

Reviews

Testing Phase

Go-Live Phase

Support

1. Analysis Phase:

Business Analyst:

Gathers the business requirements (BRS). The BRS consists of the business process, organization structure, target users' requirement details, and source systems.

The senior team, based on the BRS, provides the hardware and software requirements.

Outcome:

SRS (System Requirement Specification), which consists of the following details:

1. Operating system to be used

2. DB tool to be used

3. ETL, OLAP, and modeling tools to be used

2. Design Phase:

The Data Warehouse Architect / ETL Architect provides the solution to build the DW or data marts, based on the requirements gathered in the analysis phase.

Outcome:

HLD (High Level Design Document), which consists of the following details:

1.Summary Information

2.Project Architecture

3.System Architecture

4.Source

5.DB

6.ETL tool Details

7.Data Flow Diagram


8.Data Model

9.Source Object Details

10.Target Object Details

11.Staging Object Details

12.Mapping Details

Senior Technical Team:

Provides detailed technical specifications for each mapping.

Outcome: LLD (Low Level Design Document). It consists of source and target object details (field names, data types, length, description), the entire mapping flow, and a detailed technical design for each mapping: block diagram, business logic, pre and post dependencies, schedule options, and error handling.

ETL Team:

A Mapping Design Document is prepared for each mapping.

Outcome: Mapping Design Document

3. Coding Phase:

Mappings are created based on the design document.

Code Review: The code review checks the business logic and whether naming standards are followed or not.

Peer Review:

A team member reviews the same points mentioned above; if everything is OK, testing begins.

4. Testing Phase:

1. Unit testing: Mappings are tested by individual developers using the debugger, or by enabling test load to test the mapping with limited test data

2. SIT (System Integration Testing): Mappings are tested according to their dependencies

3. UAT (User Acceptance Testing): Mappings are tested in the presence of onsite users

5. Production Phase (Go Live):

Jobs are scheduled and monitored. Scheduling tools: UC4, DAC, Autosys, Control-M, Tivoli Workload Scheduler

Project Architecture:
