
INFORMATICA
Data Warehousing Basics

1. What is a data warehouse?

• A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data
in support of management's decision-making process.

2. How many stages are there in a data warehouse?

A data warehouse generally involves two stages:

• ETL
• Report generation

ETL

ETL is short for extract, transform, load: three database functions that are combined into one tool.

• Extract -- the process of reading data from a source database.
• Transform -- the process of converting the extracted data from its previous form into the required form.
• Load -- the process of writing the data into the target database.

ETL is used to migrate data from one database to another, to build data marts and data warehouses, and to
convert databases from one format to another.

It retrieves data from various operational databases, transforms it into useful information, and
finally loads it into the data warehousing system.

Commonly used ETL tools:

1. Informatica
2. Ab Initio
3. DataStage
4. BODI (BusinessObjects Data Integrator)
5. Oracle Warehouse Builder
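
The extract, transform, and load steps described above can be sketched in Python with the standard sqlite3 module. This is only a minimal illustration and does not show how Informatica or any other listed tool works internally; the table names (orders, dw_orders), the columns, and the sample rows are hypothetical.

```python
import sqlite3


def extract(source_conn):
    """Extract: read raw rows from the (hypothetical) source table."""
    return source_conn.execute(
        "SELECT order_id, customer, amount FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: tidy customer names and convert amounts from text to float."""
    return [
        (order_id, customer.strip().title(), float(amount))
        for order_id, customer, amount in rows
    ]


def load(target_conn, rows):
    """Load: write the transformed rows into the target table."""
    target_conn.executemany(
        "INSERT INTO dw_orders (order_id, customer, amount) VALUES (?, ?, ?)",
        rows,
    )
    target_conn.commit()


# Hypothetical operational source holding untidy data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "  alice smith ", "10.50"), (2, "BOB JONES", "20")],
)

# Hypothetical warehouse target.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE dw_orders (order_id INTEGER, customer TEXT, amount REAL)")

load(target, transform(extract(source)))
loaded = target.execute("SELECT * FROM dw_orders ORDER BY order_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the description: the source is only read, the cleanup happens in between, and the target only receives data in its required form.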

Report generation

Report generation uses OLAP (online analytical processing), a set of specifications
that allows client applications to retrieve data for analytical processing.

An OLAP tool is a specialized tool that sits between a database and the user in order to provide various analyses of the data
stored in the database. It is a reporting tool that generates reports useful for
decision support by top-level management.

1 sairavi.informatica@gmail.com
99520 29030

Commonly used OLAP and reporting tools:

1. Business Objects
2. Cognos
3. MicroStrategy
4. Hyperion
5. Oracle Express
6. Microsoft Analysis Services

3. Difference between OLTP and OLAP

No. OLTP                                          OLAP
1   Application oriented (e.g., a purchase        Subject oriented (subjects such as
    order is a functionality of an application)   customer, product, item, time)
2   Used to run the business                      Used to analyze the business
3   Detailed data                                 Summarized data
4   Repetitive access                             Ad hoc access
5   Few records accessed at a time (tens),        Large volumes accessed at a time (millions),
    simple queries                                complex queries
6   Small database                                Large database
7   Current data                                  Historical data
8   Clerical user                                 Knowledge user
9   Row-by-row loading                            Bulk loading
10  Time invariant                                Time variant
11  Normalized data                               De-normalized data
12  E-R schema                                    Star schema
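
The contrast between detailed, keyed access (OLTP) and summarized access over many records (OLAP) can be illustrated with two queries against the same data. This is a minimal sketch using Python's sqlite3 module; the sales table and its rows are hypothetical, and a real OLAP system would normally query a separate warehouse rather than the transactional table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "sale_id INTEGER PRIMARY KEY, product TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "pen", 2022, 10.0),
        (2, "pen", 2023, 15.0),
        (3, "book", 2022, 40.0),
        (4, "book", 2023, 60.0),
    ],
)

# OLTP-style access: a simple query that fetches one detailed record by key.
oltp_row = conn.execute(
    "SELECT product, amount FROM sales WHERE sale_id = ?", (3,)
).fetchone()

# OLAP-style access: a query that scans many records and returns a summary.
olap_rows = conn.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()

print(oltp_row)
print(olap_rows)
```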

4. What are the types of data warehouse?

EDW (Enterprise Data Warehouse)

• It provides a central database for decision support throughout the enterprise.
• It is a collection of data marts.

Data mart

• It is a subset of the data warehouse.
• It is a subject-oriented database that supports the needs of individual departments in an
organization.
• It is called a high-performance query structure.
• It supports a particular line of business, such as sales or marketing.

ODS (Operational Data Store)

• It is defined as an integrated view of operational databases designed to support operational
monitoring.
• It is a collection of operational data sources designed to support transaction processing.
• Data is refreshed in near real time and used for business activity.
• It is an intermediate layer between OLTP and OLAP that helps to create instant reports.

5. What modeling is involved in data warehouse architecture?

6. What are the types of approach in DWH?

Bottom-up approach: first develop the data marts, then integrate these data
marts into the EDW.

Top-down approach: first develop the EDW, then from that EDW develop the data
marts.

Bottom up (data flow)

OLTP -> ETL -> Data mart -> DWH -> OLAP

Top down (data flow)

OLTP -> ETL -> DWH -> Data mart -> OLAP

Top down (characteristics)

• The cost of initial planning and design is high.
• It takes longer to implement, typically more than a year.

Bottom up (characteristics)

• Data marts are planned and designed without waiting for the global warehouse design.
• The data marts deliver immediate results.
• It tends to take less time to implement.
• Errors in critical modules are detected earlier.
• Benefits are realized in the early phases.
• For these reasons it is often considered the better approach.

Data Modeling Types:

• Conceptual data modeling
• Logical data modeling
• Physical data modeling
• Dimensional data modeling

Conceptual Data Modeling

• A conceptual data model includes all major entities and relationships. It does not contain much
detailed information about attributes and is often used in the initial planning phase.
• A conceptual data model is created by gathering business requirements from various sources such as
business documents, discussions with functional teams, business analysts, smart management
experts, and end users who do the reporting on the database. Data modelers create the conceptual
data model and forward it to the functional team for review.
• Conceptual data modeling gives the functional and technical teams an idea of how business
requirements will be projected in the logical data model.

Logical Data Modeling

• This is the actual implementation and extension of a conceptual data model. A logical data model
includes all required entities, attributes, key groups, and relationships that represent business
information and define business rules.


Physical Data Modeling

• A physical data model includes all required tables, columns, relationships, and database properties
for the physical implementation of databases. Database performance, indexing strategy, physical
storage, and denormalization are important parameters of a physical model.


Logical vs. Physical Data Modeling

Logical Data Model                      Physical Data Model
Represents business information and     Represents the physical implementation
defines business rules                  of the model in a database
Entity                                  Table
Attribute                               Column
Primary Key                             Primary Key Constraint
Alternate Key                           Unique Constraint or Unique Index
Inversion Key Entry                     Non-Unique Index
Rule                                    Check Constraint, Default Value
Relationship                            Foreign Key
Definition                              Comment
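
The mapping in the table above can be made concrete with a small physical schema. The following is a hedged sketch using Python's sqlite3 module; the department and employee tables, their columns, and the sample values are hypothetical. SQLite has no COMMENT clause, so the "Definition -> Comment" row is shown as an SQL comment only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL
);
-- Definition -> comment: one row per employee of the organization
CREATE TABLE employee (                                   -- entity -> table
    emp_id    INTEGER PRIMARY KEY,                        -- primary key -> primary key constraint
    email     TEXT NOT NULL UNIQUE,                       -- alternate key -> unique constraint
    last_name TEXT NOT NULL,                              -- attribute -> column
    salary    REAL DEFAULT 0 CHECK (salary >= 0),         -- rule -> default value, check constraint
    dept_id   INTEGER REFERENCES department (dept_id)     -- relationship -> foreign key
);
-- inversion entry -> non-unique index
CREATE INDEX idx_employee_last_name ON employee (last_name);
""")

conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute(
    "INSERT INTO employee (emp_id, email, last_name, dept_id) "
    "VALUES (1, 'a@example.com', 'Rao', 1)"
)


def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


duplicate_email_rejected = violates(
    "INSERT INTO employee (emp_id, email, last_name, dept_id) "
    "VALUES (2, 'a@example.com', 'Iyer', 1)"
)
negative_salary_rejected = violates(
    "INSERT INTO employee (emp_id, email, last_name, salary, dept_id) "
    "VALUES (3, 'b@example.com', 'Iyer', -5, 1)"
)
unknown_dept_rejected = violates(
    "INSERT INTO employee (emp_id, email, last_name, dept_id) "
    "VALUES (4, 'c@example.com', 'Iyer', 99)"
)
default_salary = conn.execute(
    "SELECT salary FROM employee WHERE emp_id = 1"
).fetchone()[0]
```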

Dimensional Data Modeling

• A dimensional model consists of fact and dimension tables.
• It is an approach to designing database schemas for data warehouses.

Types of Dimensional modeling

• Star schema
• Snowflake schema
• Starflake schema (or) hybrid schema
• Multi-star schema

What is Star Schema?

• A star schema is a logical database design that contains a centrally located fact table surrounded
by one or more dimension tables.
• Because the database design looks like a star, it is called a star schema.
• A dimension table contains primary keys and textual descriptions.
• It contains de-normalized business information.
• A fact table contains a composite key and measures.
• The measures are key performance indicators used to evaluate
enterprise performance in terms of success and failure.
• E.g., total revenue, product sales, discount given, number of customers.
• To generate a meaningful report, the report should draw on at least one dimension table and one fact
table.

Advantages of star schema:

• Fewer joins
• Improved query performance
• Supports slicing and drilling down
• Easy understanding of data

Disadvantage:

• Requires more storage space

Example of Star Schema:
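
As an illustration, a small star schema can be sketched in Python with the sqlite3 module. The table names (product_dim, time_dim, sales_fact), the columns, and the rows are hypothetical and are not taken from the original diagram; only two dimensions are used to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: a primary key plus de-normalized descriptive text
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT      -- hierarchy level kept inside the dimension
);
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY,
    day      TEXT,
    month    TEXT          -- hierarchy level kept inside the dimension
);
-- Fact table: composite key built from the dimension keys, plus a measure
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim (product_key),
    time_key    INTEGER REFERENCES time_dim (time_key),
    revenue     REAL,
    PRIMARY KEY (product_key, time_key)
);
INSERT INTO product_dim VALUES
    (1, 'Pen', 'Stationery'), (2, 'Notebook', 'Stationery'), (3, 'Lamp', 'Furniture');
INSERT INTO time_dim VALUES
    (1, '2024-01-15', '2024-01'), (2, '2024-02-10', '2024-02');
INSERT INTO sales_fact VALUES
    (1, 1, 100.0), (2, 1, 50.0), (3, 1, 200.0), (1, 2, 80.0);
""")

# One join per dimension is enough to report the measure by category and month.
revenue_by_category = conn.execute("""
    SELECT p.category, t.month, SUM(f.revenue)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN time_dim t ON t.time_key = f.time_key
    GROUP BY p.category, t.month
    ORDER BY p.category, t.month
""").fetchall()
print(revenue_by_category)
```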

Snowflake Schema

• A snowflake schema results when the dimension tables of a star schema are split into one or more further dimension tables.
• The de-normalized dimension tables are split into normalized dimension tables.

Example of Snowflake Schema:


• In the snowflake schema example diagram there are 4 dimension tables, 4 lookup tables,
and 1 fact table. The reason is that the hierarchies (category, branch, state, and month) are
broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME),
respectively, into separate tables.
• It increases the number of joins, which leads to poorer performance when retrieving data.
• A few organizations normalize the dimension tables to save space.
• Since dimension tables take up relatively little space, the snowflake schema approach may be avoided.
• Bitmap indexes cannot be utilized effectively.
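
The extra join introduced by snowflaking can be seen in a small sketch. Here the category hierarchy is broken out of the product dimension into a lookup table, following the PRODUCT and category example above; the table and column names (category_lookup, product_dim, sales_fact) and the rows are hypothetical. Python's sqlite3 module is used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Lookup table holding the hierarchy level broken out of the dimension
CREATE TABLE category_lookup (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT
);
-- Normalized dimension: it now points to a parent table
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES category_lookup (category_key)
);
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim (product_key),
    revenue     REAL
);
INSERT INTO category_lookup VALUES (1, 'Stationery'), (2, 'Furniture');
INSERT INTO product_dim VALUES (1, 'Pen', 1), (2, 'Notebook', 1), (3, 'Lamp', 2);
INSERT INTO sales_fact VALUES (1, 100.0), (2, 50.0), (3, 200.0);
""")

# Reporting by category needs two joins: fact -> dimension -> lookup.
# With the category stored directly in the dimension, one join would do.
revenue_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN category_lookup c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(revenue_by_category)
```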

Important aspects of Star Schema & Snowflake Schema

• In a star schema, every dimension has a primary key.
• In a star schema, a dimension table does not have any parent table.
• In a snowflake schema, by contrast, a dimension table has one or more parent tables.
• In a star schema, the hierarchies for the dimensions are stored in the dimension table itself.
• In a snowflake schema, the hierarchies are broken out into separate tables. These hierarchies
help to drill down the data from the topmost level to the lowermost level.

Hybrid Schema

• A hybrid schema is a combination of the star and snowflake schemas.

Multi Star schema

• Multiple fact tables share a set of dimension tables.
• Conformed dimensions are reusable dimensions.
• They are the dimensions that are used multiple times or in multiple data marts.
• They are common to different data marts.

Measure Types

• Additive - Measures that can be summed up across all dimensions.
  o Ex: Sales revenue
• Semi-additive - Measures that can be summed up across some dimensions but not others.
  o Ex: Current balance
• Non-additive - Measures that cannot be summed up across any of the dimensions.
  o Ex: Student attendance
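
The three measure types can be illustrated with a few lines of plain Python. The sample rows are hypothetical, and the non-additive case assumes that "student attendance" is recorded as an attendance rate (a ratio), which is an interpretation rather than something the notes state.

```python
# Hypothetical daily snapshot rows: (day, account, deposits, balance).
rows = [
    ("Mon", "A", 100.0, 500.0),
    ("Mon", "B", 50.0, 300.0),
    ("Tue", "A", 20.0, 520.0),
    ("Tue", "B", 0.0, 300.0),
]

# Additive: deposits can be summed across every dimension (day and account).
total_deposits = sum(r[2] for r in rows)

# Semi-additive: balances can be summed across accounts for a single day ...
balance_tuesday = sum(r[3] for r in rows if r[0] == "Tue")
# ... but summing one account's balance across days counts the same money twice,
# so the latest snapshot is used instead.
misleading_sum_a = sum(r[3] for r in rows if r[1] == "A")
closing_balance_a = [r[3] for r in rows if r[1] == "A"][-1]

# Non-additive: rates cannot be summed; they are recomputed from their parts.
attendance = [("X", 9, 10), ("Y", 15, 30)]  # (class, present, enrolled)
rates = [present / enrolled for _, present, enrolled in attendance]
overall_rate = (
    sum(present for _, present, _ in attendance)
    / sum(enrolled for _, _, enrolled in attendance)
)

print(total_deposits, balance_tuesday, closing_balance_a, overall_rate)
```

Note that the overall attendance rate is neither the sum nor the simple average of the per-class rates, which is what makes the measure non-additive.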

Surrogate Key

• Joins between fact and dimension tables should be based on surrogate keys.
• Users should not be able to obtain any information by looking at these keys.
• These keys should be simple integers.
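
A minimal sketch of surrogate key assignment follows. The class name SurrogateKeyGenerator, the customer codes, and the amounts are hypothetical; real ETL tools typically provide their own sequence mechanism for this purpose.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hands out simple integer keys that carry no business meaning."""

    def __init__(self):
        self._next_key = count(1)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        """Return the existing surrogate key, or assign the next integer."""
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next_key)
        return self._by_natural_key[natural_key]


customer_keys = SurrogateKeyGenerator()

# Source rows carry a natural key that reveals information (here, a country code).
source_rows = [("CUST-IN-0042", 250.0), ("CUST-US-0007", 90.0), ("CUST-IN-0042", 30.0)]

# Fact rows carry only the meaningless integer, which is used for joins.
fact_rows = [(customer_keys.key_for(cust), amount) for cust, amount in source_rows]
print(fact_rows)
```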

A sample data warehouse schema

Why is a staging area needed for a DWH?

• A staging area is needed to clean operational data before it is loaded into the data warehouse.
• Cleaning here includes merging data that comes from different sources.
• It is the area where most of the ETL work is done.

Data Cleansing

• It is used to remove duplicates.
• It is used to correct wrong email addresses.
• It is used to identify missing data.
• It is used to convert data types.
• It is used to capitalize names and addresses.
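
The cleansing steps listed above can be sketched as a single Python function. The record layout (name, email, age) and the sample data are hypothetical, the email check is deliberately simple, and malformed addresses are only flagged here rather than corrected.

```python
import re

# Deliberately simple pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def cleanse(records):
    """Return (cleaned_records, issues) for a list of raw record dicts."""
    cleaned, seen_emails, issues = [], set(), []
    for rec in records:
        name = (rec.get("name") or "").strip().title()   # capitalize names
        email = (rec.get("email") or "").strip().lower()
        if not name or not email:                         # identify missing data
            issues.append(("missing", rec))
            continue
        if not EMAIL_PATTERN.match(email):                # flag wrong email addresses
            issues.append(("bad_email", rec))
            continue
        if email in seen_emails:                          # remove duplicates
            continue
        seen_emails.add(email)
        cleaned.append(
            {"name": name, "email": email, "age": int(rec["age"])}  # convert data types
        )
    return cleaned, issues


raw = [
    {"name": "ravi kumar", "email": "Ravi@Example.com", "age": "31"},
    {"name": "RAVI KUMAR", "email": "ravi@example.com ", "age": "31"},  # duplicate
    {"name": "meena", "email": "meena(at)example.com", "age": "28"},    # malformed email
    {"name": "", "email": "x@example.com", "age": "40"},                # missing name
]
cleaned, issues = cleanse(raw)
print(cleaned)
```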

Types of Dimensions:

There are three types of dimensions:

• Conformed dimensions
• Junk dimensions (also called garbage dimensions)

• Degenerate dimensions

Garbage Dimension or Junk Dimension

• A conformed dimension is one that can be shared by multiple fact tables or multiple data marts.
• A junk dimension groups flag values together.
• A degenerate dimension is something dimensional in nature that exists in the fact table (e.g., invoice number).
It is neither a fact nor strictly a dimension attribute, but it is useful for some kinds of analysis.
Such values are kept as attributes in the fact table and are called degenerate dimensions.

Degenerate dimension:

A column in the key section of the fact table that has no associated dimension table but is used for
reporting and analysis is called a degenerate dimension or line item dimension.

For example, suppose a fact table has customer_id, product_id, branch_id, employee_id, bill_no, and date in its key
section and price, quantity, and amount in its measure section. In this fact table, bill_no in the key section is a single
value; it has no associated dimension table. Instead of creating a separate dimension table for that single
value, we can include it in the fact table to improve performance. So here the column bill_no is a degenerate
dimension or line item dimension.
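
The bill_no example can be sketched with Python's sqlite3 module. The fact table below uses a subset of the columns named above; the sample rows and bill numbers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (
    customer_id INTEGER,
    product_id  INTEGER,
    bill_no     TEXT,      -- degenerate dimension: no separate bill table exists
    quantity    INTEGER,
    amount      REAL
);
INSERT INTO sales_fact VALUES
    (1, 10, 'B-1001', 2, 40.0),
    (1, 11, 'B-1001', 1, 15.0),
    (2, 10, 'B-1002', 3, 60.0);
""")

# bill_no is still usable for reporting: it groups line items without any join.
bill_totals = conn.execute("""
    SELECT bill_no, COUNT(*), SUM(amount)
    FROM sales_fact
    GROUP BY bill_no
    ORDER BY bill_no
""").fetchall()
print(bill_totals)
```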
