
Data Warehousing

Data Warehousing ?
• Data warehousing is the process of constructing and using a data warehouse.
• Data warehousing is the secure electronic storage of information by a
business or other organization.
• The goal of data warehousing is to create a trove of historical data that can
be retrieved and analyzed to provide useful insight into the organization's
operations.
• Data warehousing is a vital component of business intelligence.
BI ?
• Business intelligence (BI) refers to the procedural and technical
infrastructure that collects, stores, and analyzes the data produced by a
company’s activities.
• BI is a broad term that encompasses data mining, process analysis,
performance benchmarking, and descriptive analytics.
• BI parses all the data generated by a business and presents easy-to-digest
reports, performance measures, and trends that inform management
decisions.
More on BI
• BI is an umbrella term that combines architectures, tools, databases,
analytical tools, applications, and methodologies.
• BI is a content-free expression, so it means different things to different
people.
• BI helps transform data into information (and knowledge), then into
decisions, and finally into action.
DIKW Pyramid
Descriptive analytics ?
• Descriptive analytics is the interpretation of historical data to better
understand changes that have occurred in a business.
• Descriptive analytics describes the use of a range of historic data to draw
comparisons.
• Most commonly reported financial metrics are a product of descriptive
analytics, for example, month-over-month sales growth, the number of users,
or the total revenue per subscriber.
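As a rough illustration of a descriptive metric, month-over-month sales growth can be computed from historical totals. The figures and the function name `month_over_month_growth` below are made up for the example.

```python
# Hypothetical monthly sales totals (illustrative numbers, not from the slides).
monthly_sales = {"2024-01": 1000.0, "2024-02": 1100.0, "2024-03": 990.0}


def month_over_month_growth(sales):
    """Return {month: growth relative to the previous month} as fractions."""
    months = sorted(sales)  # "YYYY-MM" strings sort chronologically
    growth = {}
    for prev, curr in zip(months, months[1:]):
        growth[curr] = (sales[curr] - sales[prev]) / sales[prev]
    return growth


growth = month_over_month_growth(monthly_sales)
for month, rate in growth.items():
    print(f"{month}: {rate:+.1%}")
```

The first month has no entry because there is no earlier month to compare it with.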
Data Mining
• Data mining is a process used by companies to turn raw data into useful
information.
• By using software to look for patterns in large batches of data, businesses
can learn more about their customers to develop more effective marketing
strategies, increase sales and decrease costs.
• Data mining depends on effective data collection, warehousing, and
computer processing.
Data Warehousing Kick-off
• The need to warehouse data evolved as businesses began relying on
computer systems to create, file, and retrieve important business documents.
• The concept of data warehousing was introduced in 1988 by IBM
researchers Barry Devlin and Paul Murphy.
• Data warehousing is designed to enable the analysis of historical data.
• Comparing data consolidated from multiple heterogeneous sources can
provide insight into the performance of a company.
Data Warehouse
• A physical repository where relational data are specially organized to provide
enterprise-wide, cleansed data in a standardized format.
• “The data warehouse is a collection of integrated, subject-oriented databases
designed to support DSS functions, where each unit of data is non-volatile
and relevant to some moment in time”
DWH Properties
1.Subject-Oriented
2. Integrated
3. Time-Variant
4. Non-Volatile
DWH Components
Data Integration & ETL Process
• Data integration is a process where data from many sources goes to a single
centralized location, which is often a data warehouse. The end location needs to be
flexible enough to handle lots of different kinds of data at potentially large
volumes. Data integration is ideal for powering analytical use cases.
• ETL stands for extract, transform, and load.
• ETL refers to the process of extracting data from source systems, transforming it
into a different structure or format, and loading it into a destination.
• ETL is one common approach to data integration; application integration is a
related practice that connects operational applications rather than feeding analytics.
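The three ETL steps can be sketched in a few lines. This is a minimal sketch, assuming an in-memory list as the source system and an in-memory SQLite database standing in for the warehouse; the table name `orders`, the sample rows, and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical source rows, as if read from an operational system.
SOURCE_ROWS = [
    {"order_id": "1", "amount": " 19.99 ", "country": "eg"},
    {"order_id": "2", "amount": "5.00", "country": "US"},
]


def extract():
    """Extract: read raw records from the source (here, an in-memory list)."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: cast types and standardize formats before loading."""
    return [
        (int(r["order_id"]), float(r["amount"].strip()), r["country"].upper())
        for r in rows
    ]


def load(conn, rows):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse
load(warehouse, transform(extract()))
loaded = warehouse.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```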
ETL vs ELT
• ETL transforms data on a separate processing server, while ELT transforms
data within the data warehouse itself.
• ETL does not transfer raw data into the data warehouse, while ELT sends
raw data directly to the data warehouse.
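For contrast, an ELT flow might look like the sketch below: the raw text is loaded first and the transformation runs afterwards inside the warehouse's own SQL engine. SQLite stands in for the warehouse, and the table names `raw_orders` and `orders` are assumptions made for the example.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse

# Load: raw, untransformed text lands in the warehouse first.
warehouse.execute(
    "CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)"
)
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", " 19.99 ", "eg"), ("2", "5.00", "US")],
)

# Transform: done afterwards, inside the warehouse, with plain SQL.
warehouse.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER)  AS order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(country)             AS country
    FROM raw_orders
    """
)

elt_rows = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(elt_rows, raw_count)
```

The raw table is still available after the transformation, which is the practical difference from ETL.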
OLTP vs OLAP
• OLTP (online transaction processing) systems manage transaction-oriented
applications that support day-to-day business functions, such as ATM, ERP,
CRM, and POS systems.
• OLAP (online analytical processing) systems answer multidimensional
analytical queries (ad hoc analysis of organizational data), such as financial
reporting and forecasting.
• The basic difference between OLTP and OLAP is that OLTP is an online
database modifying system, whereas OLAP is an online database query
answering system.
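The difference in workload can be sketched with one small table: an OLTP-style statement changes a single row inside a short transaction, while an OLAP-style query reads and aggregates many rows. The `accounts` table and its values are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, branch TEXT, balance REAL)"
)
db.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "Cairo", 500.0), (2, "Cairo", 300.0), (3, "Giza", 200.0)],
)

# OLTP-style work: a short transaction that modifies one row (an ATM withdrawal).
with db:  # commits on success, rolls back if an exception is raised
    db.execute("UPDATE accounts SET balance = balance - 100 WHERE account_id = 1")

# OLAP-style work: a read-only query that aggregates rows for analysis.
balance_by_branch = db.execute(
    "SELECT branch, SUM(balance) FROM accounts GROUP BY branch ORDER BY branch"
).fetchall()
print(balance_by_branch)
```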
Some Definitions
1.ODS
• An operational data store (ODS) is a type of database that's often used as an interim
logical area for a data warehouse.
• ODSes are designed to integrate data from multiple sources for lightweight data
processing activities such as operational reporting and real-time analysis.
• While in the ODS, data can be scrubbed, resolved for redundancy and checked for
compliance with the corresponding business rules. An ODS can be used for
integrating disparate data from multiple sources so that business operations, analysis
and reporting can be carried out while business operations are occurring.
• This is where most of the data used in current operations is housed before it's
transferred to the data warehouse for longer-term storage or archiving.
2.Data Stage
• The Data Staging Area is a temporary storage area for data copied from Source
Systems.
• In a Data Warehousing Architecture, a Data Staging Area is needed mainly
because of timing considerations.
• Before data can be incorporated into the Data Warehouse, all essential data must be
readily available. It is not possible to retrieve all data from all operational databases
at the same time because of varying business cycles, data processing cycles,
hardware and network resource restrictions, and geographical factors.
3.Data Pipeline
• A data pipeline is a series of data processing steps. If the data is not currently
loaded into the data platform, then it is ingested at the beginning of the pipeline.
Then there are a series of steps in which each step delivers an output that is the
input to the next step. This continues until the pipeline is complete. In some cases,
independent steps may be run in parallel.
• Data pipelines consist of three key elements: a source, a processing step or steps,
and a destination. In some data pipelines, the destination may be called a sink.
• Common steps in data pipelines include data transformation, augmentation,
enrichment, filtering, grouping, aggregating, and the running of algorithms against
that data.
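A pipeline in this sense (source, processing steps, sink) can be sketched as a chain of functions where each step's output is the next step's input. The step names and sample readings are illustrative only.

```python
from functools import reduce


def ingest():
    """Source: hypothetical raw readings."""
    return [
        {"sensor": "a", "value": 3},
        {"sensor": "b", "value": -1},
        {"sensor": "a", "value": 5},
    ]


def drop_negative(rows):
    """Filtering step: discard rows with negative values."""
    return [r for r in rows if r["value"] >= 0]


def total_by_sensor(rows):
    """Grouping and aggregating step: sum values per sensor."""
    totals = {}
    for r in rows:
        totals[r["sensor"]] = totals.get(r["sensor"], 0) + r["value"]
    return totals


def run_pipeline(source, steps):
    """Pass the source output through each step in order; return what reaches the sink."""
    return reduce(lambda data, step: step(data), steps, source())


sink = run_pipeline(ingest, [drop_negative, total_by_sensor])
print(sink)
```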
4. Data Mart
• A departmental data warehouse that stores only relevant data.
• Dependent data mart: a subset that is created directly from a data warehouse.
• Independent data mart: a small data warehouse designed for a strategic
business unit or a department.
Data Lake vs Data Warehouse
• Data lakes and data warehouses are both widely used for storing big data, but
they are not interchangeable terms. A data lake is a vast pool of raw data whose
purpose is not yet defined. A data warehouse is a repository for
structured, filtered data that has already been processed for a specific
purpose.
• There is even an emerging data management architecture trend of the data
lakehouse, which combines the flexibility of a data lake with the data
management capabilities of a data warehouse.
Dimensional Modeling
• Dimensional Modeling (DM) is a data structure technique optimized for
data storage in a Data warehouse.
• The purpose of dimensional modeling is to optimize the database for faster
retrieval of data.
• The concept of Dimensional Modeling was developed by Ralph Kimball
and consists of “fact” and “dimension” tables.
DM vs RM
• A dimensional model in a data warehouse is designed to read, summarize, and analyze
numeric information such as values, balances, counts, and weights.
In contrast, relational models are optimized for adding, updating, and deleting
data in a real-time online transaction system.
• These dimensional and relational models have their unique way of data storage that
has specific advantages.
• In the relational model, normalization and ER models reduce redundancy in data. In
contrast, the dimensional model in a data warehouse arranges data in such a way that
it is easier to retrieve information and generate reports.
DM Elements
• Facts are the measurements or metrics from your business process. For a
Sales business process, a measurement would be the quarterly sales number.
• Dimensions provide the context surrounding a business process event. In simple
terms, they give the who, what, and where of a fact. In the Sales business process, for the
fact quarterly sales number, the dimensions would be:
Who – Customer Name
Where – Location
What – Product Name
• The Attributes are the various characteristics of the dimension in
dimensional data modeling.
• In the Location dimension, the attributes can be
1.State
2.Country
3.Zipcode
Star Schema
• The most commonly used and the simplest style of dimensional modeling.
• Contains one or more fact tables surrounded by and connected to several
dimension tables.
• Fact tables contain the numeric measures needed to perform decision
analysis and query reporting, along with foreign keys to the dimension tables.
• Dimension tables contain classification and aggregation information about
the values in the fact table.
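A star schema can be sketched with one fact table and two dimension tables. The table and column names (`fact_sales`, `dim_product`, `dim_location`) and the rows are assumptions for illustration; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, state TEXT, country TEXT);
    -- The fact table holds foreign keys to each dimension plus a numeric measure.
    CREATE TABLE fact_sales (
        product_key  INTEGER REFERENCES dim_product(product_key),
        location_key INTEGER REFERENCES dim_location(location_key),
        sales_amount REAL
    );
    INSERT INTO dim_product  VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO dim_location VALUES (1, 'Cairo', 'Egypt'), (2, 'Texas', 'USA');
    INSERT INTO fact_sales   VALUES (1, 1, 1000.0), (2, 1, 400.0), (1, 2, 1200.0);
    """
)

# The dimension supplies the grouping attribute; the fact supplies the measure.
sales_by_country = conn.execute(
    """
    SELECT l.country, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_location l ON l.location_key = f.location_key
    GROUP BY l.country
    ORDER BY l.country
    """
).fetchall()
print(sales_by_country)
```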
Dim Keys
• A surrogate key is a system generated (could be GUID, sequence, unique
identifier, etc.) value with no business meaning that is used to uniquely
identify a record in a table.

• A natural key is a column or set of columns that already exist in the table
(e.g. they are attributes of the entity within the data model) and uniquely
identify a record in the table. Since these columns are attributes of the entity
they obviously have business meaning.
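The two kinds of key can sit side by side in one dimension table. In this sketch the surrogate key is a system-generated integer and the natural key is a hypothetical `national_id` column; both names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT, -- surrogate key: system generated
        national_id  TEXT NOT NULL UNIQUE,              -- natural key: has business meaning
        name         TEXT
    )
    """
)
# Only business attributes are supplied; the database assigns customer_key.
conn.executemany(
    "INSERT INTO dim_customer (national_id, name) VALUES (?, ?)",
    [("A-100", "Mona"), ("B-200", "Omar")],
)
rows = conn.execute(
    "SELECT customer_key, national_id FROM dim_customer ORDER BY customer_key"
).fetchall()
print(rows)
```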
Base Facts vs Derived Facts
• The types of fact columns are base fact columns and derived fact columns:
1. Base fact columns are represented by a single column in a fact table.
2. Derived fact columns are created through a mathematical combination of
other existing fact columns.
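One way to picture the distinction: base facts are stored values, while derived facts are computed from them. The class and field names below (`SalesFact`, `quantity`, `unit_price`, `unit_cost`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SalesFact:
    quantity: int      # base fact: stored as a single column
    unit_price: float  # base fact
    unit_cost: float   # base fact

    @property
    def revenue(self) -> float:
        """Derived fact: computed from two base facts."""
        return self.quantity * self.unit_price

    @property
    def profit(self) -> float:
        """Derived fact: combines another derived fact with base facts."""
        return self.revenue - self.quantity * self.unit_cost


row = SalesFact(quantity=3, unit_price=10.0, unit_cost=6.0)
print(row.revenue, row.profit)
```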
Additive Facts ?
• Assume that we are a retailer, and we have a fact table with the following
columns:
• The purpose of this table is to record the sales amount for each product in
each store on a daily basis.
• Say we are a bank with the following fact table:
• The purpose of this table is to record the current balance for each account at
the end of each day, as well as the profit margin for each account for each
day.
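The column lists for these two tables are not preserved here, so the sketch below assumes columns named `date`, `account`, and `current_balance` for the bank example. It illustrates why a balance behaves differently from a sales amount: it can be summed across accounts on one day, but summing it across days for one account gives a number with no business meaning. A ratio such as profit margin would not be summed along any dimension.

```python
# Hypothetical end-of-day balance rows; column names are assumed.
balances = [
    {"date": "2024-01-01", "account": "A", "current_balance": 100.0},
    {"date": "2024-01-01", "account": "B", "current_balance": 50.0},
    {"date": "2024-01-02", "account": "A", "current_balance": 120.0},
    {"date": "2024-01-02", "account": "B", "current_balance": 50.0},
]

# Meaningful: total held across all accounts on a single day.
total_on_jan_2 = sum(
    r["current_balance"] for r in balances if r["date"] == "2024-01-02"
)

# Not meaningful: adding one account's balance across days double counts money.
misleading_sum_for_a = sum(
    r["current_balance"] for r in balances if r["account"] == "A"
)

# Across time, take the latest (or an average) instead of a sum.
rows_for_a = [r for r in balances if r["account"] == "A"]
latest_for_a = max(rows_for_a, key=lambda r: r["date"])["current_balance"]

print(total_on_jan_2, misleading_sum_for_a, latest_for_a)
```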
Factless Fact Table
• A factless fact table contains only keys; no measures are available.
Factless fact tables are used to establish relationships between elements of
different dimensions. They are also useful for describing events and
coverage, meaning a table can hold the information needed to show that
something did not happen. They often represent many-to-many relationships.
• Example: which products were on promotion but did not sell? The sales fact
table records only the SKUs actually sold, so on its own it cannot show what
did not happen.
Factless Fact Tables
• A factless fact table is a fact table that does not have any measures. It is
essentially an intersection of dimensions (it contains nothing but
dimensional keys).
• There are two types of factless tables: One is for capturing an event, and
one is for describing conditions.
1.Event Tracking
• An event establishes the relationship among the dimension members from various
dimensions, but there is no measured value.
• The existence of the relationship itself is the fact.
• You can count the number of occurrences with various criteria (capture student attendance,
creating a row each time a student attends a class).
• The following questions can be answered:
1. Which class has the least attendance?
2. Which teachers taught the most students?
3. What is the average attendance for a given course?
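An event-tracking factless fact table can be sketched as rows of dimension keys with no measure, where the questions above are answered by counting rows. The attendance rows and key values below are invented for the example.

```python
from collections import Counter

# Each tuple is one row of a hypothetical attendance factless fact table:
# (date_key, student_key, class_key, teacher_key). There is no measure column.
attendance = [
    ("2024-03-01", "s1", "math", "t1"),
    ("2024-03-01", "s2", "math", "t1"),
    ("2024-03-01", "s1", "art", "t2"),
    ("2024-03-02", "s1", "math", "t1"),
]

# Which class has the least attendance? Count rows per class.
attendance_by_class = Counter(class_key for _, _, class_key, _ in attendance)
least_attended = min(attendance_by_class, key=attendance_by_class.get)

# Which teacher taught the most students? Count distinct students per teacher.
students_by_teacher = {}
for _, student_key, _, teacher_key in attendance:
    students_by_teacher.setdefault(teacher_key, set()).add(student_key)
busiest_teacher = max(students_by_teacher, key=lambda t: len(students_by_teacher[t]))

print(least_attended, busiest_teacher)
```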
2. Coverage
• It is used to support negative analysis reports.
• For example, a store that did not sell a product for a given period. To
produce such a report, you need to have a factless fact table to capture all the
possible combinations of products that were on sale. You can then figure
out what is missing by comparing the factless table with the sales table for
the list of products that did sell.
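The comparison described above amounts to a set difference between the coverage table and the sales table. The product and store keys below are hypothetical.

```python
# Hypothetical coverage factless fact table: every (product, store) pair on promotion.
coverage = {("p1", "store1"), ("p2", "store1"), ("p3", "store1")}

# Hypothetical sales fact table keys: only the pairs that actually sold.
sales = {("p1", "store1"), ("p3", "store1")}

# Pairs that were covered but never appear in sales: what did not happen.
promoted_but_unsold = coverage - sales
print(promoted_but_unsold)
```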
Types of Dimensions
• 1. Slowly Changing Dimensions (SCD)
• 2. Rapidly Changing Dimensions (RCD)
1.SCD Types
• Dimension table attributes are relatively static, but they are not fixed forever.
• Dimension attributes might change slowly over time.
• Business users should determine whether they want to track changes to
dimension attributes.
• Some dimension attribute values never change. This applies to any attribute
labeled original or initial, such as a customer's initial balance
(Type 0: Retain Original).
2. RCD
• A dimension is a fast changing or rapidly changing dimension if one or more
of its attributes change very frequently and in many rows.
• Handling a rapidly changing dimension in a data warehouse is very difficult
because of its many performance implications.
RCD & Junk Dim
• Attributes like patient_id, Name, Gender, and Marital_status will not change,
or will change very rarely.
• Attributes like weight and BMI (body mass index) change every month,
based on the patient's visits to the hospital.
• So we need to separate the weight column out of the patient table; otherwise,
applying SCD Type 2 to the PATIENT dimension adds a new row for every
weight change and the table fills up quickly.
• We can put the rapidly changing weight column into a junk dimension
table.
Junk Dim Linking
• We cannot simply reference the junk dimension table by adding its primary key to
the patient table as a foreign key, because any change made to the junk dimension would
have to be reflected in the patient table, increasing the data in the patient dimension.
• Instead, we create one more table, called a mini dimension, that acts as a
bridge between the Patient and Junk dimensions, and possibly add columns such
as start and end dates to track the change history.
• The mini dimension table is just a bridge between two tables and does not require
a surrogate key of its own.
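The bridge idea can be sketched as a list of link rows with start and end dates. This is only an illustration under assumed names (`PatientWeightLink`, `weight_key`, `record_change`); the point is that a weight change closes one link row and opens another, while the patient dimension row itself is never touched.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PatientWeightLink:
    """Hypothetical bridge row between a patient and a rapidly changing attribute row."""
    patient_id: int
    weight_key: int                 # key of a row in the rapidly changing attribute table
    start_date: str
    end_date: Optional[str] = None  # None marks the current link


def record_change(history: List[PatientWeightLink], patient_id: int,
                  weight_key: int, as_of: str) -> None:
    """Close the patient's current link, if any, and open a new one."""
    for row in history:
        if row.patient_id == patient_id and row.end_date is None:
            row.end_date = as_of
    history.append(PatientWeightLink(patient_id, weight_key, as_of))


history: List[PatientWeightLink] = []
record_change(history, patient_id=1, weight_key=10, as_of="2024-01-05")
record_change(history, patient_id=1, weight_key=11, as_of="2024-02-05")

current = [r for r in history if r.end_date is None]
print(current)
```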
RCD Implementation
Homework!
• In data warehousing there are 6 types of dimension:
1. Normal dimension.
2. Junk dimension.
3. Split dimension.
4. Text dimension.
5. Stacked dimension.
6. Distinct Attribute dimension.
