Data Warehouse
Introduction:
Advancements in IT and Database technology increased the
This explosive growth in stored data has generated an urgent need for new
dimensions.
application systems.
Data Warehouses are designed to facilitate reporting & analysis.
Definition:
A Data Warehouse is a repository of subjectively selected and adapted operational data which can answer any ad hoc, complex, statistical or analytical queries.
A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
Features:
1. Subject Oriented:
DW is organized around major subjects like customer, supplier, product and sales.
2. Integrated:
DW is constructed by integrating multiple, heterogeneous data sources such as relational databases, flat files and on-line transaction records.
3. Time Variant:
The time horizon for the data warehouse is significantly longer than that of operational systems.
4. Nonvolatile:
Nonvolatile means that, once entered into the warehouse, data should not change.
5. Accessible:
The primary purpose of a data warehouse is to provide readily accessible information to end users.
Characteristics:
Smaller number of (concurrent) users. Instant response is less important (only needed for interactively composing reports).
Read-only access by users. Most data access will be targeted at a small partition of the data: the last month or quarter.
Database access is less frequent but executes large and complicated queries that access many rows per table.
Irregular load of primarily long-running, complex read-only transactions instead of a high, constant transaction rate.
Load from the operational data store only inserts new records; existing ones do not get changed (updated).
Bulk load from the operational data store, no single-record inserts (at most once daily).
Large storage capacity for historical data. May also contain aggregate data.
Benefits of data warehousing:
Some of the benefits that a data warehouse provides are as follows:
Common data model. DW makes it easier to report and analyze information.
Prior to loading data into the data warehouse, inconsistencies are identified and resolved.
Because they are separate from operational systems, data warehouses
trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.
Data Warehousing:
Data warehousing is a process of constructing and using data warehouses. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.
Process in Warehousing:
Extract, Transform, and Load (ETL) is a process in data warehousing that involves:
extracting data from outside sources,
transforming it to fit business needs,
loading it into the end target, i.e. the data warehouse.
1) Extract:
The first part of an ETL process is to extract the data from the source
systems.
An intrinsic part of the extraction is the parsing of extracted data,
2) Transform:
data.
In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target:
Selecting only certain columns to load (or selecting null columns not to load).
Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female).
Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to "M").
Deriving a new calculated value (e.g., sale_amount = qty * unit_price).
Filtering.
Sorting.
vice versa)
Splitting a column into multiple columns (e.g., putting a comma-separated
3) Load:
The load phase loads the data into the end target, usually being
ranges widely.
Some data warehouses might overwrite existing information weekly with cumulative, updated data, while other DWs (or even other parts of the same DW) might add new data in a historized form, e.g. hourly.
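The three ETL steps can be illustrated with a small sketch. It is a minimal example under stated assumptions, not a reference implementation: the inline CSV text stands in for a source system, an in-memory SQLite database stands in for the warehouse, and the names SOURCE_CSV, GENDER_CODES and the sales table are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract: gender is coded 1/2, as in the example above.
SOURCE_CSV = """customer,gender,qty,unit_price,notes
alice,2,3,10.0,gift
bob,1,2,4.5,
"""

# Translation of coded values (1 -> M, 2 -> F).
GENDER_CODES = {"1": "M", "2": "F"}


def extract(text):
    """Extract: parse the raw text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: select columns, translate codes, derive sale_amount."""
    records = []
    for row in rows:
        qty = int(row["qty"])
        unit_price = float(row["unit_price"])
        # The 'notes' column is deliberately not selected for loading.
        records.append(
            (row["customer"], GENDER_CODES[row["gender"]], qty * unit_price)
        )
    return records


def load(conn, records):
    """Load: bulk-insert new records; existing rows are never updated."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer TEXT, gender TEXT, sale_amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT customer, gender, sale_amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Calling load again with a later extract appends rows, which corresponds to adding data in a historized form; an overwriting warehouse would clear the table first.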
database?
Why spend additional time and resources to construct a separate data warehouse?
1) A major reason for such separation is to promote high performance of both systems.
two systems.
Since the two systems provide quite different functionalities and require
Meta Data:
Meta data is information about the data. It is the data that describes the structure and business meaning of the
Types of Metadata:
Technical Metadata:
Detailed picture of the warehouse.
Flow of data. Extraction sources & processes. Business rules. Data transformation rules. Load rules and procedures. Physical data structures.
Business Metadata:
Connects the business user to the data warehouse data that is relevant to them.
Data context. Understandability. Accessibility. Validity.
OLTP:
Online transaction processing.
OLTP refers to a class of systems that facilitate and manage:
transaction-oriented applications,
OLTP systems respond immediately to user requests. The major task of OLTP is to perform online transaction and query processing. They cover day-to-day operations of an organization such as purchasing, inventory,
Benefits:
Online Transaction Processing has two key benefits:
1. Simplicity and
2. Efficiency.
Disadvantages:
It is a great tool for any organization, but in using OLTP, there are a
2. Economic costs.
OLAP:
Online Analytical Processing.
OLAP is an approach to quickly provide answers to analytical queries
decision making.
OLTP: is customer-oriented and is used for transaction processing.
OLAP: is market-oriented and is used for data analysis.
Data contents:
OLTP: manages current data that are too detailed.
OLAP: manages large amounts of historical data and provides facilities for summarization and aggregation.
View:
OLTP: focuses mainly on the current data within an enterprise or department.
OLAP: focuses on historical data.
Access Patterns:
OLTP: consists of short, atomic transactions. Requires concurrency control and recovery mechanisms.
OLAP: consists mostly of read-only operations.
Feature            OLTP                               OLAP
Characteristics    operational processing             information processing
Orientation        transaction                        analysis
Users              clerk, IT professional             knowledge worker
Function           day-to-day operations              decision support
DB design          application-oriented, ER based     subject-oriented, star/snowflake
Data               current, up-to-date                historical
View               detailed                           summarized
Focus              data in                            information out
Usage              repetitive                         ad hoc
Access             read/write                         lots of scans
Unit of work       short, simple transaction          complex query
Records accessed   tens                               millions
Number of users    thousands                          hundreds
DB size            100MB-GB                           100GB-TB
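The difference in access patterns can be sketched in a few lines. This is only an illustration: an in-memory SQLite database plays both roles, and the orders table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)"
)

# OLTP-style work: a short, atomic read/write transaction on one record.
# Used as a context manager, the connection commits on success and
# rolls back if an exception is raised.
with conn:
    conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", ("phone", 2))

# A few more single-record transactions, as many users would issue them.
for item, qty in [("computer", 1), ("phone", 5), ("computer", 4)]:
    with conn:
        conn.execute(
            "INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty)
        )

# OLAP-style work: a read-only query that scans all rows and summarizes.
totals = conn.execute(
    "SELECT item, SUM(qty) FROM orders GROUP BY item ORDER BY item"
).fetchall()
print(totals)
```

In practice the analytical query would run against the separate warehouse rather than the operational database, so that long scans do not slow down the short transactions.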
Data Cube:
A data cube allows data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts.
Dimensions:
Dimensions are perspectives or entities with respect to which an organization wants to keep records.
Facts:
A multidimensional data model is organized around a central theme called facts.
E.g. sales.
This theme is represented by a fact table. Facts are numerical measures. They are the quantities by which we want to analyze relationships between dimensions.
E.g. facts for a sales DW include dollars_sold, units_sold, amt_budgeted. The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
3-D View
location = Chicago
item (type)
Time (quarter)   home ent.   computer   phone
Q1               605         825        14
Q2               680         925        31
Q3               812         1023       30
Q4               927         1038       38

Time (quarter)   home ent.   computer   phone
Q1               1087        968        38
Q2               1130        1024       41
Q3               1034        1048       45
Q4               1142        1091       54

phone (Q1 to Q4): 43 52 58 59

Time (quarter)   home ent.   computer   phone
Q1               854         822        89
Q2               680         925        31
Q3               812         1023       30
Q4               927         1038       38
A 3-D data cube representation of the data according to the dimensions time, item and location. The measure displayed is dollars_sold.
Figure: 3-D data cube with the dimensions time (Q1 to Q4), location (Chicago, New York, Toronto, Vancouver) and items (Home Ent, Comp, Phone, Security).
is referred to as a cuboid.
The cuboid that holds the lowest level of summarization is called the base cuboid.
The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
OLAP Operations
- Multidimensional data model allows data to be stored in
multiple dimensions.
- Each dimension contains multiple levels of abstraction
different views.
Figure: central data cube (location: Chicago, New York, Toronto, Vancouver; time: Q1 to Q4; items: Home Ent, Comp, Phone, Security).
Roll Up:
It is also called the drill-up operation. It performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
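Both forms of roll-up can be sketched on a tiny cube stored as a dictionary. The sales figures are placeholders chosen for easy checking, and the function names are hypothetical; only the city-to-country mapping follows the location hierarchy used in these notes.

```python
from collections import defaultdict

# Cells keyed by (city, item); the values are placeholder sales figures.
sales_by_city = {
    ("Chicago", "Home Ent"): 100,
    ("New York", "Home Ent"): 200,
    ("Toronto", "Home Ent"): 300,
    ("Vancouver", "Home Ent"): 400,
    ("Chicago", "Comp"): 10,
    ("New York", "Comp"): 20,
    ("Toronto", "Comp"): 30,
    ("Vancouver", "Comp"): 40,
}

CITY_TO_COUNTRY = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}


def roll_up_to_country(cube):
    """Climb the location hierarchy from city to country."""
    result = defaultdict(int)
    for (city, item), value in cube.items():
        result[(CITY_TO_COUNTRY[city], item)] += value
    return dict(result)


def roll_up_drop_location(cube):
    """Dimension reduction: remove the location dimension entirely."""
    result = defaultdict(int)
    for (_city, item), value in cube.items():
        result[item] += value
    return dict(result)


by_country = roll_up_to_country(sales_by_city)
by_item = roll_up_drop_location(sales_by_city)
print(by_country)
print(by_item)
```

Drill-down is the reverse direction: it needs the more detailed cells (here, the per-city figures) to still be available.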
Figure: roll-up on location, from cities to countries (USA, Canada).
by country.
When roll up is performed by dimension reduction, one or more
Drill Down:
It is the reverse of roll up. It navigates from less detailed data to more detailed data. It can be done either by stepping down concept hierarchy or
Figure: drill-down on time, from quarters to months.
Drill down occurs by descending the time hierarchy from the level of
summarized by quarters.
Since drill down adds more detail to the given data it can also be
Slice:
The slice operation performs a selection on one dimension of the given cube, resulting in a sub cube. The following figure shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = Q1.
Figure: slice for time = Q1 (location: Chicago, New York, Toronto, Vancouver).
Dice:
The dice operation defines a sub cube by performing a selection on two or more dimensions. The following figure shows a dice operation on the central cube based on a selection criterion that involves 3 dimensions: (location = Toronto or Vancouver) and (time = Q1 or Q2) and (item = home entertainment or computer).
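Slice and dice can both be expressed as filters over the cells of a cube. The sketch below assumes a cube stored as a dictionary keyed by (location, time, item); the cell values are placeholders and the function names are hypothetical.

```python
# Cells keyed by (location, time, item); the values are placeholders.
cube = {
    ("Toronto", "Q1", "home entertainment"): 1,
    ("Toronto", "Q2", "computer"): 2,
    ("Vancouver", "Q1", "computer"): 3,
    ("Vancouver", "Q3", "home entertainment"): 4,
    ("Chicago", "Q1", "phone"): 5,
    ("New York", "Q2", "security"): 6,
}


def slice_time(cells, quarter):
    """Slice: fix one value of the time dimension and drop that dimension."""
    return {
        (location, item): value
        for (location, time, item), value in cells.items()
        if time == quarter
    }


def dice(cells, locations, quarters, items):
    """Dice: select on several dimensions at once, keeping all dimensions."""
    return {
        key: value
        for key, value in cells.items()
        if key[0] in locations and key[1] in quarters and key[2] in items
    }


q1_slice = slice_time(cube, "Q1")
sub_cube = dice(
    cube,
    locations={"Toronto", "Vancouver"},
    quarters={"Q1", "Q2"},
    items={"home entertainment", "computer"},
)
print(q1_slice)
print(sub_cube)
```

The slice result has one dimension fewer than the cube, while the dice result keeps all three dimensions but with fewer values on each.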
Figure: result of the dice operation (location: Toronto, Vancouver; time: Q1, Q2; item: Home Ent, Comp).
Pivot:
Pivot is also called rotate. It is a visualization operation that rotates the data axes in view to provide an alternative presentation of the data.
Lattice of cuboids:
0-D (apex) cuboid
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
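The lattice above can be enumerated mechanically: every subset of the dimensions is one cuboid, so n dimensions without concept hierarchies give 2^n cuboids. A minimal sketch, in which the function name cuboids is hypothetical:

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboids(dimensions):
    """Return every subset of the dimensions, from the apex to the base."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


lattice = cuboids(DIMENSIONS)
for cuboid in lattice:
    # The empty tuple is the 0-D apex cuboid; the full tuple is the base.
    print(f"{len(cuboid)}-D", cuboid)
```

The first entry is the empty tuple (the apex cuboid) and the last is the full set of four dimensions (the base cuboid).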
them.
The entity relationship data model is commonly used in the design of relational databases.
Such a data model is appropriate for on-line transaction processing. A data warehouse requires a concise, subject-oriented schema that facilitates
model.
Such a model can exist in the form of a star schema, snowflake schema or a
Star Schema:
no redundancy. Facts are numerical measures. A set of smaller attendant tables (dimension tables), one for each dimension.
Dimensions are perspectives or entities with respect to which an organization wants to keep records.
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Measures: units_sold, dollars_sold, avg_sales
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
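A star schema of this shape can be tried out with an in-memory SQLite database. The sketch below is an illustration under assumptions: the column types and sample rows are invented, time_key is kept as a plain column, and the avg_sales measure is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (
        item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
        type TEXT, supplier_type TEXT);
    CREATE TABLE branch (
        branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
    CREATE TABLE location (
        location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
        province_or_state TEXT, country TEXT);
    -- Central fact table: keys to the dimension tables plus measures.
    CREATE TABLE sales (
        time_key INTEGER,
        item_key INTEGER REFERENCES item(item_key),
        branch_key INTEGER REFERENCES branch(branch_key),
        location_key INTEGER REFERENCES location(location_key),
        units_sold INTEGER,
        dollars_sold REAL);
    """
)

# Placeholder rows, invented for the example.
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?, ?, ?)",
    [(1, "laptop", "acme", "computer", "wholesale"),
     (2, "handset", "acme", "phone", "retail")],
)
conn.execute("INSERT INTO branch VALUES (1, 'b1', 'store')")
conn.executemany(
    "INSERT INTO location VALUES (?, ?, ?, ?, ?)",
    [(1, "1 Main St", "Chicago", "Illinois", "USA"),
     (2, "2 King St", "Toronto", "Ontario", "Canada")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, 1, 1, 2, 2000.0), (1, 2, 1, 1, 1, 300.0),
     (1, 1, 1, 2, 1, 1000.0), (2, 1, 1, 1, 1, 1000.0)],
)

# One join per dimension used in the analysis.
by_country_type = conn.execute(
    """
    SELECT l.country, i.type, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN location AS l ON l.location_key = s.location_key
    GROUP BY l.country, i.type
    ORDER BY l.country, i.type
    """
).fetchall()
print(by_country_type)
```

Each dimension is reached with a single join from the fact table, which is the property that a normalized (snowflake) layout gives up in exchange for less redundancy.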
Snowflake schema:
It is a variant of the star schema model. Here dimension tables are normalized, thereby further splitting the data into additional tables.
The resulting schema graph forms a shape similar to a snowflake.
The dimension tables of the snowflake model may be kept in normalized form to reduce redundancies. Such tables are easy to maintain.
The snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query. System performance may be adversely impacted.
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Measures: units_sold, dollars_sold, avg_sales
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city
Concept Hierarchies
Illinois.
The states can in turn be mapped to the country to which they belong, such as Canada or USA.
These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e. cities) to higher-level, more general concepts (i.e. countries).
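A concept hierarchy of this kind can be written down as a chain of lookup tables. The sketch below uses the cities, provinces or states and countries of the location hierarchy in these notes; the names LEVELS and generalize are hypothetical.

```python
CITY_TO_STATE = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "Ottawa": "Ontario",
    "New York": "New York",
    "Buffalo": "New York",
    "Chicago": "Illinois",
}

STATE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}

# Levels ordered from the most detailed to the most general.
LEVELS = ("city", "province_or_state", "country", "all")


def generalize(city, level):
    """Map a city to its value at the requested level of the hierarchy."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "province_or_state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    return "all"


print(generalize("Chicago", "country"))
print(generalize("Victoria", "province_or_state"))
```

Roll-up along the location dimension amounts to applying such a mapping to every cell and summing the cells that land on the same higher-level value.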
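Location hierarchy:
all: all
country: Canada, USA
province or state: British Columbia, Ontario (Canada); New York, Illinois (USA)
city: Vancouver, Victoria (British Columbia); Toronto, Ottawa (Ontario); New York, Buffalo (New York); Chicago (Illinois)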
Country
Year
Quarter
Week
Street
One Tier:
A one-Tier D.W. is suitable for small to medium sized businesses. The relevant source data is collected and transformed into the
warehouse.
This is then accessed by all users.
Two Tier:
A Two-Tier D.W. is suitable for medium to large sized businesses. There is a high degree of variance in types of users and analysis
requirements.
All data is kept in the warehouse. Specific subsets of data are moved to a second data storage called marts. This data is accessed by the users.
Virtual Warehouse:
In a virtual warehouse, a user is given direct access to the source data
Three Tier:
In a three tier data warehouse an OLAP server is placed
Figure: three-tier architecture (data sources: external sources, operational databases; data warehouse with metadata repository and data marts; OLAP server).
OLAP Server:
Relational OLAP (ROLAP) :
Use relational or extended-relational DBMS to store and manage
multidimensional storage engine (sparse matrix techniques)
fast indexing to pre-computed summarized data
Data Marts:
Definition: An implementation of a data warehouse with restricted scope of content and support for analytical processing, serving a single department of an organization or a particular data analysis problem domain.
smaller database and is designed to fulfill a specific purpose or data subject rather than for enterprise-wide planning.
Data marts can exist without being connected to a data warehouse. Data marts can be classified as:
Dependent data marts.
Independent data marts.
Dependent Data Marts:
These data marts use the same extract and transformation processes and have the same content for all the shared data.
data warehousing.
Business intelligence usually refers to the information that is
the infrastructural component for achieving business intelligence.