
Transaction:

A transaction is a business operation.

Technical point of view:

It is a set of DML operations (Insert, Update, Delete) executed as a single unit of work, as in the sketch below.
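For example, a funds transfer groups several DML statements into one unit of work. A minimal Oracle-style sketch; the ACCOUNTS table, its columns, and the amounts are hypothetical:

-- Debit one account and credit another as a single transaction
UPDATE ACCOUNTS SET BALANCE = BALANCE - 500 WHERE ACCT_ID = 1001;
UPDATE ACCOUNTS SET BALANCE = BALANCE + 500 WHERE ACCT_ID = 1002;
COMMIT;   -- both changes become permanent together; ROLLBACK would undo both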

OLTP System = OLTP applications (front end) + Database (back end)

Data Warehousing = ETL Development + BI Development

Enterprise Data Warehouse:

An Enterprise Data Warehouse is a relational DB that is specially designed for analyzing the business, making decisions to achieve the business goals, and responding to business problems, but not designed for business transaction processing.

A Data Warehouse is a concept of consolidating the data from multiple OLTP databases.

From a storage capacity point of view, relational DBs are categorized into three types:

1. Low range

2. Mid range

3. High range

1. Low range DB:

Can organize and manage megabytes of information.

Example: MS Access

2. Mid range DB:

Can organize and manage gigabytes of information.

Example: Oracle, Microsoft SQL Server, Sybase, DB2, Informix, PostgreSQL

3. High range DB:

Can organize and manage terabytes and petabytes of information.

Example: Teradata, Netezza, Greenplum, Hadoop

Data Storage Patterns:

From a storage point of view, there are two types of data storage patterns supported by relational DBs:

1. NFS - Normal File Storage

2. DFS - Distributed File Storage

NFS - Normal File Storage:

1. Single disk for storing the data

2. Shared-everything architecture (data is shared on a single disk)

3. Data is read sequentially

4. All mid range DBs are developed on the NFS platform

5. Limited scalability or expansion

6. Strongly recommended for OLTP applications

7. Recommended for data warehousing for small and medium scale enterprises with storage capacity in gigabytes

8. The default number of processors in NFS is only one

9. Disks cannot be scaled in NFS

Example: Oracle, Sybase, SQL Server, DB2, Red Brick, Informix, PostgreSQL


Note: A processor is a S/W component that runs as an .exe

DFS - Distributed File Storage:

1. Multiple disks for storing the data

2. Shared-nothing architecture (every processor has dedicated memory & disk that is not shared with another processor)

3. Data is read in parallel (supports parallelism)

4. Unlimited scalability

5. Designed only for building an Enterprise Data Warehouse, not for OLTP

Example: Teradata, Netezza, Hadoop, Greenplum

Enterprise DWH Database Evaluation:

1. Database that supports enormous storage capacity (billions of rows and terabytes)

2. DB that supports the distributed file storage pattern

3. DB that supports shared-nothing architecture

4. Database that supports unlimited scalability (expansion)

5. DB that supports massively parallel processing

6. DB that supports mature optimizers to handle complex SQL queries (runs the queries faster with less system resource usage)

7. DB that supports high availability (users can access 100% of the data without data loss even when S/W or H/W components are down)

8. Database that supports parallel loading

9. DB that supports low TCO (total cost of ownership): easy to set up, administer & manage

10. Single DB server that can provide access to hundreds of users concurrently

Data Acquisition:

It is the process of extracting the data from multiple source systems, transforming the data into a consistent format, and loading it into a target system. To implement this ETL process we need ETL tools.

Types of ETL Tools:

There are two types of ETL tools to build Data Acquisition:

1. GUI based ETL tool

2. Code (program) based ETL tool

Code Based ETL:

ETL applications are developed using programming languages such as SQL, PL/SQL, SAS, and Teradata ETL utilities.

GUI Based ETL:

ETL applications are developed using a simple graphical user interface with point & click features.

Example: Informatica, DataStage, Ab Initio, SSIS

MSBI is a package (ETL + Reporting = SSIS + SSRS)

Data Cleansing:

It is the process of filtering or rejecting unwanted source data or records.

Data Scrubbing: It is the process of deriving new attributes or columns. A sketch of both steps follows.
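A minimal SQL sketch of both steps (the SRC_CUSTOMERS table and its columns are hypothetical):

-- Data scrubbing: derive a new FULL_NAME column from existing columns
-- Data cleansing: reject records with a missing phone number
SELECT FIRST_NAME || ' ' || LAST_NAME AS FULL_NAME,   -- derived attribute (scrubbing)
       PHONE
FROM   SRC_CUSTOMERS
WHERE  PHONE IS NOT NULL;                             -- filter unwanted records (cleansing)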

Data Merging:

It is the process of combining the data from multiple source systems.

Data merging is of two types, as shown in the sketch below:

1. Join

2. Union
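A minimal sketch of both merge types, assuming two hypothetical source tables CUST_US and CUST_EU with the same layout, plus an ORDERS table:

-- Union: stack rows coming from two source systems with the same structure
SELECT CUST_ID, CUST_NAME FROM CUST_US
UNION
SELECT CUST_ID, CUST_NAME FROM CUST_EU;

-- Join: combine columns from two sources on a common key
SELECT C.CUST_NAME, O.ORDER_ID
FROM   CUST_US C
JOIN   ORDERS O ON O.CUST_ID = C.CUST_ID;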
Data Warehouse:

1. A Data Warehouse is a relational DB that is used to store historical data for query & analysis.

2. Data in a Data Warehouse is derived from source systems (OLTP/SOR).

SOR --> Source of Records

OLTP (Online Transaction Processing):

A computer system that stores time-sensitive, transaction-related data that is processed immediately, available for analysis, and always kept current.

Difference Between OLTP and Data Warehouse:

Tables in a Data Warehouse:

There are two types of tables in a Data Warehouse:

1. Dimension Table

2. Fact Table

1. Dimension Table:

A dimension table stores textual or descriptive information about a business process.

Dimension table examples in the Retail Domain:

Customer, Product, Stores, Employees, Promotions, Time

Dimension table examples in the Banking Domain:

Applications, Customers, Products, Branches, Promotions, Time, Billing Cycle

Fact Table:

A fact table stores measurements or metrics of a business process.

Fact table examples in the Retail Domain:

Sales, Purchase, Inventory

Fact table examples in the Banking Domain:

1. SA_LoanTransaction Fact

2. CC_Transaction Fact

3. CC_Statement Fact

A fact table consists of keys and measures, and it has a composite primary key, as shown below.

Composite Primary Key:

Store Key(X) | Prod Key(X) | Date Key(X) | Revenue
S1           | P1          | D1          | 3000
S1           | P2          | D1          | 2000
S2           | P1          | D1          | 2000

(X marks the columns that together form the composite primary key)
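A sketch of the DDL for such a fact table; the three key columns together form the composite primary key (the table name SALES_FACT is assumed for illustration):

CREATE TABLE SALES_FACT (
  STORE_KEY NUMBER,
  PROD_KEY  NUMBER,
  DATE_KEY  NUMBER,
  REVENUE   NUMBER,
  CONSTRAINT PK_SALES_FACT PRIMARY KEY (STORE_KEY, PROD_KEY, DATE_KEY)
);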

Types of Fact Tables:

There are three types of fact tables:

1. Factless Fact table

2. Cumulative Fact table

3. Snapshot Fact table

1. Factless Fact table:

1. A factless fact table consists of only keys and no measures

2. A factless fact table is used to record events

3. A factless fact table acts as a bridge between the dimension tables

Example of a factless fact table: Employee Attendance Factless Fact

Dimension Tables:

Auditorium: Aud Id, Aud Name, Aud Type, Aud Mgr, Aud Address

Sponsors: Sponsor Id, Sponsor Name, Contribution, Address

Time: Date Key, Month Key, Qtr, Year

Participant: Participant Id, Participant Name, Gender, Address

Events: Event Id, Event Name, Event Type, Event Desc

Fact Table:

Aud Id | Sponsor Id | Participant Id | Event Id
A1     | S1         | P1             | E1
A1     | S1         | P2             | E1
A2     | S1         | P3             | E1
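Since a factless fact table has no measures, reports against it simply count key rows; a sketch (the table name EVENT_ATTENDANCE_FACT is assumed for the fact table above):

-- How many participants attended each event
SELECT EVENT_ID, COUNT(*) AS ATTENDANCE
FROM   EVENT_ATTENDANCE_FACT
GROUP  BY EVENT_ID;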

2. Cumulative Fact table:

It consists of additive facts; it describes what happened over a period of time.

Ex: Sales fact table, Order fact table

3. Snapshot Fact table:

It consists of semi-additive facts and non-additive facts; it describes the state of things at a particular instant of time.

Ex: Bank fact table, Inventory fact table

Degenerate Dimension Key:

A key in a fact table that is not associated with any dimension.

Example: Order Id, Sale Id, Bill No, Invoice No, etc.

Types of Facts:

There are 3 types of facts in fact tables:

1. Additive Facts

2. Semi-Additive Facts

3. Non-Additive Facts

1. Additive Fact: Business measurements in a fact table that can be summed up across all of the dimension keys.

Fact Table:

Store Key | Prod Key | Date Key  | Revenue
S1        | P1       | 12-Jan-15 | 600
S1        | P2       | 12-Jan-15 | 400
S2        | P2       | 12-Jan-15 | 800
S2        | P3       | 13-Jan-15 | 500
S3        | P1       | 13-Jan-15 | 700
S3        | P3       | 14-Jan-15 | 900

Reports generated using the keys in the above fact table:

Revenue Report By Store:
Store Key | Revenue
S1        | 1000
S2        | 1300
S3        | 1600

Revenue Report By Product:
Product Key | Revenue
P1          | 1300
P2          | 1200
P3          | 1400

Revenue Report By Date:
Date Key  | Revenue
12-Jan-15 | 1800
13-Jan-15 | 1200
14-Jan-15 | 900
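Each report above is just the additive measure summed over one dimension key; for example, the revenue-by-store report (assuming the fact table is named SALES_FACT):

-- Revenue is additive, so summing across the Product and Date keys is valid
SELECT STORE_KEY, SUM(REVENUE) AS REVENUE
FROM   SALES_FACT
GROUP  BY STORE_KEY;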
Semi-Additive Fact: Business measurements in a fact table that can be summed up across only a few of the dimension keys.

Bank Fact table:

Acct Id | Transaction Date | Balance | Profit Margin
21653   | 12-Jan-15        | 700000  | -
21654   | 12-Jan-15        | 400000  | -
21653   | 13-Jan-15        | 900000  | -
21654   | 13-Jan-15        | 600000  | -
Reports:

Balance By Acct Id:
Acct Id | Balance (summed across dates - wrong) | Balance (latest - correct)
21653   | 1600000                               | 900000
21654   | 1000000                               | 600000

Balance By Date:
Date Key  | Balance
12-Jan-15 | 1100000
13-Jan-15 | 1500000

The above example is for a semi-additive fact: Balance can be summed across accounts for a given date, but not across dates.
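In SQL terms, Balance may be summed across accounts for one date, but across dates each account's latest balance must be picked instead of a SUM. A sketch, assuming the fact table is named BANK_FACT with a TRANS_DATE column:

-- Valid: sum balances across accounts for each date
SELECT TRANS_DATE, SUM(BALANCE) AS BALANCE
FROM   BANK_FACT
GROUP  BY TRANS_DATE;

-- Across dates: take each account's latest balance, do not sum
SELECT ACCT_ID, BALANCE
FROM   BANK_FACT B
WHERE  TRANS_DATE = (SELECT MAX(X.TRANS_DATE)
                     FROM   BANK_FACT X
                     WHERE  X.ACCT_ID = B.ACCT_ID);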


3. Non-Additive Fact: Business measurements in a fact table that cannot be summed up across any dimension keys.

Note: In a fact table, percentages are always non-additive:

SEM1    80%
SEM2    60%
TOTAL  140%   (wrong)

Note: Another example of a non-additive fact is Unit Price.

Types of Dimensions:

The following are the different types of dimensions in a DW:

1. Conformed Dimension

2. Degenerated Dimension

3. Shrunken Dimension

4. Junk Dimension

5. Dirty Dimension

Conformed Dimension: A dimension that is shared across multiple fact tables is called a conformed dimension; or, a dimension that is used to join data marts.

Banking Domain:
Degenerated Dimension:

If a fact table acts as a dimension and is shared with another fact table (or maintains a foreign key in another fact table), such a table is called a degenerated dimension.

Shrunken Dimension:

A dimension that is a subset of another dimension

Or

A dimension that is not directly linked to the fact table


Junk Dimension:

A dimension that is organized based on low-cardinality indicator or flag values.

Cardinality is the number of unique values in a column. Alternatively, cardinality expresses the minimum and maximum number of instances of an entity 'B' that can be associated with an instance of entity 'A'; the minimum and maximum can be 0, 1, or "n".
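The cardinality of a column can be checked with a count of distinct values; a sketch against the Orders table shown further below:

-- A low count (here 2: Cash, Credit) marks a junk-dimension candidate
SELECT COUNT(DISTINCT PAYMENT_MODE) AS CARDINALITY
FROM   ORDERS;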

Dirty Dimension:

If a record occurs more than once in a table, differing only in non-key attributes, such a table is called a dirty dimension.

Example (the low-cardinality payment columns of the Orders table are separated into their own dimension):

Orders:

Order Id | Order Date | Payment Mode | Payment Type | Comm/Non Comm | Amount
111      | -          | Cash         | Cash         | No            | -
112      | -          | Cash         | Cash         | No            | -
113      | -          | Credit       | Master       | No            | -
114      | -          | Cash         | Cash         | No            | -
115      | -          | Cash         | Cash         | No            | -
116      | -          | Credit       | Visa         | Yes           | -
117      | -          | Cash         | Cash         | No            | -

Payment Dimension:

Ord Ind Id | Payment Mode | Payment Type | Comm/Non Comm
1          | Cash         | Cash         | No
2          | Credit       | Master       | No
3          | Credit       | Visa         | Yes

Fact Table:

Order Id | Order Date | Ord Ind Id | Amount
111      | -          | 1          | -
112      | -          | 1          | -
113      | -          | 2          | -
114      | -          | 1          | -
115      | -          | 1          | -
116      | -          | 3          | -
117      | -          | 1          | -

Slowly Changing Dimension:

A dimension that changes slowly and irregularly

Or

A dimension that changes across time

There are three choices to handle slowly changing dimensions:

1. SCD Type-I

2. SCD Type-II

3. SCD Type-III

1. SCD Type-I:

Only the most recent change is maintained

Type-I gives the current status

Type-I is used for error correction, as in the sketch after the tables below

Source (OLTP):

CID | CNAME | DOB
11  | BEN   | 12-JAN-1967
12  | ALEN  | 15-FEB-1966

Target (DW):

CKEY | CID | CNAME | DOB
101  | 11  | BEN   | 12-JAN-1967
102  | 12  | ALEN  | 15-FEB-1966
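A minimal Type-I sketch: a change (say, correcting BEN's DOB to the hypothetical value 12-JAN-1968) simply overwrites the existing row, so no history survives (the dimension table name CUSTOMER_DIM is assumed):

-- Error correction: overwrite in place, keep only the most recent value
UPDATE CUSTOMER_DIM
SET    DOB = TO_DATE('12-JAN-1968', 'DD-MON-YYYY')
WHERE  CID = 11;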

SCD Type-II:

The change is inserted as a new record

Type-II is used to maintain historical status

PRODUCTS:

PID | PNAME | PRICE | EFF_DATE
11  | ABC   | 300   | 12-JAN-10
12  | PQR   | 270   | 15-JAN-10

The price of product 12 changed to 199 on 27-AUG-11.

Target after the change:

PKEY | PID | PNAME | PRICE | EFF_DATE  | END_DATE
100  | 11  | ABC   | 300   | 12-JAN-10 | -
101  | 12  | PQR   | 270   | 15-JAN-10 | 26-AUG-11
102  | 12  | PQR   | 199   | 27-AUG-11 | -

A Type-II dimension is also referred to as a dirty dimension

A Type-II dimension has redundant data

A sketch of how the change above is applied follows.
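A minimal Type-II sketch for the price change above: the current row is expired, then the change is inserted as a new record (the table name PRODUCT_DIM is assumed; in practice PKEY would come from a sequence):

-- 1. Close out the current version of product 12
UPDATE PRODUCT_DIM
SET    END_DATE = TO_DATE('26-AUG-11', 'DD-MON-YY')
WHERE  PID = 12
AND    END_DATE IS NULL;

-- 2. Insert the new version with a new surrogate key
INSERT INTO PRODUCT_DIM (PKEY, PID, PNAME, PRICE, EFF_DATE, END_DATE)
VALUES (102, 12, 'PQR', 199, TO_DATE('27-AUG-11', 'DD-MON-YY'), NULL);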

SCD Type-III: The change is appended as a new column

Type-III is used to maintain partial history status

Source:

CID | CNAME | LOC
11  | BEN   | HYD
12  | TOM   | CHE

Target:

CKEY | CID | CNAME | CURRLOC | PREVLOC
101  | 11  | BEN   | HYD     | -
102  | 12  | TOM   | CHE     | -

Source after TOM moves from CHE to BNG:

CID | CNAME | LOC
11  | BEN   | HYD
12  | TOM   | BNG

Target:

CKEY | CID | CNAME | CURRLOC | PREVLOC
101  | 11  | BEN   | HYD     | -
102  | 12  | TOM   | BNG     | CHE

Source after BEN moves from HYD to KER:

CID | CNAME | LOC
11  | BEN   | KER
12  | TOM   | BNG

Target:

CKEY | CID | CNAME | CURRLOC | PREVLOC
101  | 11  | BEN   | KER     | HYD
102  | 12  | TOM   | BNG     | CHE
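A minimal Type-III sketch for TOM's move from CHE to BNG above: the current value shifts into the PREVLOC column and the new value overwrites CURRLOC (the table name CUSTOMER_DIM is assumed):

-- Only one previous value (partial history) is retained
UPDATE CUSTOMER_DIM
SET    PREVLOC = CURRLOC,
       CURRLOC = 'BNG'
WHERE  CID = 12;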

Role Play Dimension: A dimension that is recycled in multiple applications within the DB (for example, a single Date dimension used as both order date and ship date)

Data Modeling:

Model: A business representation of the structure of the data in one or more databases

OLTP: the ER model is used

The model is normalized

The model is efficient for transactions

Data Warehouse: the dimensional model is used

The model is designed based on facts & dimensions

The model is efficient in query processing

Schema: A schema is a collection of a user's objects; an object can be a table, view, or synonym


Types of Schemas:

1. Star Schema

2. Snowflake Schema

3. Galaxy Schema

1. Star Schema: In a star schema, the centre of the star is a fact table and the corners are dimension tables

A simple star schema consists of only one fact table

Star schema dimensions do not have parent tables

Star schema dimensions are denormalized

A star schema is denormalized (everything in one table) and efficient in query processing

2. Snowflake Schema:

Snowflake schema dimensions have one or more parent tables

A snowflake schema is normalized

A snowflake schema is efficient in transaction processing

Snowflake (normalized) tables:

Customer:

Cid | Cname | Gender | Geoid
11  | C1    | 1      | 111
12  | C2    | 1      | 111
13  | C3    | 0      | 112
14  | C4    | 1      | 111

Geography:

Geoid | City | State | Country | Region
111   | Hyd  | Ts    | India   | Asia
112   | VSP  | Ap    | India   | Asia

Star (denormalized) equivalent:

Cid | Cname | Gender | Geoid | City | State | Country | Region
11  | C1    | 1      | 111   | Hyd  | Ts    | India   | Asia
12  | C2    | 1      | 111   | Hyd  | Ts    | India   | Asia
13  | C3    | 0      | 112   | VSP  | Ap    | India   | Asia
14  | C4    | 1      | 111   | Hyd  | Ts    | India   | Asia

A star schema uses more space than a snowflake schema (the denormalized table above repeats the geography columns).


Galaxy Schema:

Multiple fact tables are connected to multiple dimension tables.

Index (fast access path):

1. B*Tree Index

2. Bitmap Index

1. B*Tree Index:

It is used on high-cardinality columns.

Example column for a B*Tree index: EMPNO

2. Bitmap Index:

It is used on low-cardinality columns.

Example column for a bitmap index: GENDER
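A sketch of creating both index types on a hypothetical EMP table (Oracle syntax):

-- B*Tree index on a high-cardinality column (mostly unique values)
CREATE INDEX IDX_EMP_EMPNO ON EMP (EMPNO);

-- Bitmap index on a low-cardinality column (few distinct values)
CREATE BITMAP INDEX IDX_EMP_GENDER ON EMP (GENDER);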


1. Flat File to Oracle:

Use SQL*Loader (SQLLDR) to load the flat file data into an Oracle table.

STEP1:

Create Data file with few sample records

=============================================================================

STATE_ID,STATE_NAME,COUNTRY_ID

250.00,Rio Negro,111

251.00,Buenos Aires,111

252.00,Victoria,115

253.00,South Australia,115

254.00,Queensland,115

255.00,Northern Territory,115

256.00,New South Wales,115

257.00,Australian Capital Territory,115

258.00,Sao Paulo,110

259.00,Santa Catarina,110

260.00,Rio de Janeiro,110

=============================================================================

Save the file in the following directory

C:\SOURCE\States.txt

STEP2:

Create the table in the Oracle database using the script given below:

CREATE TABLE STATES (STATE_ID NUMBER(5,2), STATE_NAME VARCHAR2(25), COUNTRY_ID NUMBER(3));

STEP3: Create a control file using Notepad:

LOAD DATA
INFILE 'C:\SOURCE\States.txt'
APPEND INTO TABLE STATES
FIELDS TERMINATED BY ","
(STATE_ID, STATE_NAME, COUNTRY_ID)

Save the control file in the following directory: C:\SOURCE\States.ctl

STEP4:

Connect to SQL*Plus and use the following commands:

SQL> HOST CMD

C:\oracle\product\10.2.0\db_1\BIN>SQLLDR scott/tiger@oracle direct=true skip=1 log=c:\SOURCE\States.log control=c:\SOURCE\States.ctl

C:\oracle\product\10.2.0\db_1\BIN>exit

(skip=1 skips the header record in the data file)

STEP5:

SQL> SELECT * FROM STATES;

Project Development Life Cycle:

Kick-off Meeting 1

Kick-off Meeting 2

Analysis Phase

Design Phase

Coding Phase

Reviews

Testing Phase

Go-Live Phase

Support

1. Analysis Phase:

Business Analyst:

Gathers the business requirements (BRS). The BRS consists of the business process, organization structure, target users' requirement details, and source systems.

The senior team, based on the BRS, provides the hardware and software requirements.

Outcome:

SRS (System Requirement Specification), which consists of the following details:

1. Operating system to be used

2. DB tool to be used

3. ETL, OLAP, and modeling tools to be used

2. Design Phase:

The Data Warehouse Architect / ETL Architect provides the solution to build the DW or data marts, based on the requirements gathered in the analysis phase.

Outcome:

HLD (High Level Design Document), which consists of the following details:

1.Summary Information

2.Project Architecture

3.System Architecture

4.Source

5.DB

6.ETL tool Details

7.Data Flow Diagram


8.Data Model

9.Source Object Details

10.Target Object Details

11.Staging Object Details

12.Mapping Details

Senior Technical Team:

Provides detailed technical specifications for each mapping.

Outcome: LLD (Low Level Design Document). It consists of source and target object details (field names, data types, length, description), the entire mapping flow, and a detailed technical design for each mapping: block diagram, business logic, pre and post dependencies, schedule options, and error handling.

ETL Team:

A Mapping Design Document is prepared for each mapping.

Outcome: Mapping Design Document

3. Coding Phase:

Mappings are created based on the design document.

Code Review: The code review checks the business logic and whether naming standards are followed or not.

Peer Review:

A team member reviews the same points mentioned above; if everything is OK, testing begins.

4. Testing Phase:

1. Unit testing: Mappings are tested by individual developers using the debugger, or by enabling test load to test the mapping with limited test data

2. SIT (System Integration Testing): Mappings are tested according to their dependencies

3. UAT (User Acceptance Testing): Mappings are tested in the presence of onsite users

5. Production Phase (Go Live):

Jobs are scheduled and monitored. Scheduling tools: UC4, DAC, Autosys, Control-M, Tivoli Workload Scheduler

Project Architecture:
