Data Warehouse

Why a Data Warehouse

The most common issue companies face when looking at data mining is that the information is not in one place. The biggest challenge business analysts face in using data mining is how to extract, integrate, cleanse, and prepare data to solve their most pressing business problems.

What Is a Data Warehouse

The idea of a data warehouse is to put a wide range of operational data from internal and external sources into one place so it can be better utilized by executives, line-of-business managers, and other business analysts. Once the information is gathered, OLAP (on-line analytical processing) software comes into play by providing the desktop analysis tools for querying, manipulating, and reporting the data from the data warehouse.

Data Warehouse environment

The source systems from which data is extracted
The tools used to extract data for loading the data warehouse
The data warehouse database itself, where the data is stored
The desktop query and reporting tools used for decision support

Data Warehousing Process Overview

Operational Vs. Multidimensional View Of Sales

Creating A Data Warehouse

The Data Warehouse

The Data Warehouse is an integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making.

Integrated

The Data Warehouse is a centralized, consolidated database that integrates data retrieved from the entire organization.

Subject-Oriented

The Data Warehouse data are arranged and optimized to provide answers to questions coming from diverse functional areas within a company.

Time-Variant

The Warehouse data represent the flow of data through time. It can even contain projected data.

Non-Volatile

Once data enter the Data Warehouse, they are never removed. The Data Warehouse is always growing.

Operational Database vs. Data warehouse


Data Warehouse | Operational DB
Unified view of all data elements | Similar data can have different representations or meanings
Subject orientation for decision support | Functional or process orientation
Historical information with time dimension | Current transaction
Data are added without change | Frequent updating

Data Mart

A data mart is a small, single-subject data warehouse subset that provides decision support to a small group of people.

Data Mart

Data Marts can serve as a test vehicle for companies exploring the potential benefits of Data Warehouses. Data Marts address local or departmental problems, while a Data Warehouse involves a company-wide effort to support decision making at all levels in the organization.

Enterprise Data Warehouse (EDW)

A large-scale data warehouse that is used across the enterprise for decision support.
EDWs are used to provide data for many types of DSS, including CRM, SCM, BPM, BAM, PLM, and KMS.

CRM: Customer relationship management
SCM: Supply chain management
BPM: Business performance management
BAM: Business activity monitoring
PLM: Product lifecycle management
KMS: Knowledge management systems

Metadata

Metadata is data about data. In a data warehouse, metadata describe the contents of the data warehouse and the manner of its use. Good metadata is essential to the effective operation of a data warehouse, and it is used in data acquisition/collection, data transformation, and data access.

The Need for Technical Metadata

The use of data warehousing and decision processing often involves a wide range of different products, and creating and maintaining the metadata for these products is time-consuming and error-prone. Automating the metadata management process and enabling the sharing of this so-called technical metadata between products can reduce both costs and errors.

The Need for Business Metadata

Business users need to have a good understanding of what information exists in a data warehouse. They need to understand what the information means from a business viewpoint, how it was derived, from what source systems it comes, when it was created, what pre-built reports and analyses exist for manipulating the information, and so forth.

Metadata in a Data Warehouse

Kimball lists the following types of metadata in a data warehouse:

Source system metadata
Data staging metadata
DBMS metadata

Ralph Kimball, The Data Warehouse Lifecycle Toolkit, Wiley, 1998, ISBN 0471-25547-5

Source System Metadata

Source specifications, such as repositories and source logical schemas
Source descriptive information, such as ownership descriptions, update frequencies, and access methods
Process information, such as job schedules and extraction code

Data Staging Metadata

Data acquisition information, such as data transmission scheduling and results, and file usage
Dimension table management, such as definitions of dimensions and surrogate key assignments
Transformation and aggregation, such as data enhancement and mapping, DBMS load scripts, and aggregate definitions
Audit, job logs, and documentation, such as data lineage records and data transform logs

Star Schema

The star schema is a data modeling technique used to map multidimensional decision support into a relational database. Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.

Star Schema

Four Components:
Facts
Dimensions
Attributes
Attribute hierarchies

Figure 13.14 A Three-Dimensional View of Sales

Figure 13.17 Attribute Hierarchies in Multidimensional Analysis

Facts

Numeric measurements that represent a specific business aspect or activity
Normally stored in a fact table that is the center of the star schema
The fact table contains facts linked through their dimensions
Metrics are facts computed at run time

Dimensions

Qualifying characteristics that provide additional perspectives on a given fact
Decision support data are almost always viewed in relation to other data
Facts are studied via dimensions
Dimensions are stored in dimension tables

Attributes

Dimensions provide descriptions of facts through their attributes
No mathematical limit to the number of dimensions
Used to search, filter, and classify facts
Slice and dice: focus on slices of the data cube for more detailed analysis

Attribute Hierarchies

Provide top-down data organization
Two purposes:
Aggregation
Drill-down/roll-up data analysis

Determine how the data are extracted and represented
Stored in the DBMS's data dictionary
Used by the OLAP tool to access the warehouse properly
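
A minimal Python sketch of aggregation along an attribute hierarchy follows. The region > state > city hierarchy, the sample rows, and the function name are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

# Hypothetical location hierarchy, ordered from most to least aggregated.
HIERARCHY = ("region", "state", "city")

# Hypothetical sales facts, each tagged with every level of the hierarchy.
SALES = [
    {"region": "East", "state": "MA", "city": "Boston", "units": 10},
    {"region": "East", "state": "MA", "city": "Salem", "units": 5},
    {"region": "East", "state": "NY", "city": "Albany", "units": 7},
    {"region": "West", "state": "CO", "city": "Denver", "units": 4},
]


def aggregate(rows, level):
    """Sum units at the given hierarchy level.

    Rolling up means choosing a level closer to the start of HIERARCHY;
    drilling down means choosing a level closer to the end.
    """
    depth = HIERARCHY.index(level) + 1
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[name] for name in HIERARCHY[:depth])
        totals[key] += row["units"]
    return dict(totals)


if __name__ == "__main__":
    print(aggregate(SALES, "region"))  # most aggregated view
    print(aggregate(SALES, "city"))    # most detailed view
```

The same fact rows answer both questions; only the hierarchy level used as the grouping key changes.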

Star Schema

A star schema consists of fact tables and dimension tables. Fact tables contain the quantitative or factual data about a business--the information being queried. This information is often numerical, additive measurements and can consist of many columns and millions or billions of rows. Dimension tables are usually smaller and hold descriptive data that reflects the dimensions, or attributes, of a business.

Figure 13.17 Star Schema For Sales

Star Schema Representation

Facts and dimensions are normally represented by physical tables in the data warehouse database. The fact table is related to each dimension table in a many-to-one (M:1) relationship. Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.
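
The relationships described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (fact_sales, dim_product, dim_location), the columns, and the sample rows are assumptions and do not reproduce the figures in this deck.

```python
import sqlite3


def build_sales_star(conn: sqlite3.Connection) -> None:
    """Create a tiny star schema: one fact table and two dimension tables."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(
        """
        CREATE TABLE dim_product (
            product_id   INTEGER PRIMARY KEY,
            product_name TEXT NOT NULL,
            category     TEXT NOT NULL
        );
        CREATE TABLE dim_location (
            location_id INTEGER PRIMARY KEY,
            city        TEXT NOT NULL,
            region      TEXT NOT NULL
        );
        -- Many fact rows point at one row in each dimension table (M:1).
        CREATE TABLE fact_sales (
            product_id  INTEGER NOT NULL REFERENCES dim_product(product_id),
            location_id INTEGER NOT NULL REFERENCES dim_location(location_id),
            units       INTEGER NOT NULL,
            revenue     REAL NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Cola", "Beverage"), (2, "Chips", "Snack")],
    )
    conn.executemany(
        "INSERT INTO dim_location VALUES (?, ?, ?)",
        [(1, "Boston", "East"), (2, "Denver", "West")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(1, 1, 10, 15.0), (2, 1, 4, 8.0), (1, 2, 6, 9.0)],
    )
    conn.commit()


def revenue_by_region(conn: sqlite3.Connection) -> dict:
    """Study the revenue fact through the location dimension."""
    rows = conn.execute(
        """
        SELECT l.region, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_location AS l ON l.location_id = f.location_id
        GROUP BY l.region
        ORDER BY l.region
        """
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_sales_star(connection)
    print(revenue_by_region(connection))
```

With foreign keys enabled, inserting a fact row that references a missing dimension key is rejected, which is the primary/foreign key constraint at work.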

Figure 13.18 Orders Star Schema

Star Schema

Performance-Improving Techniques

Normalization of dimensional tables
Multiple fact tables representing different aggregation levels
Denormalization of fact tables
Table partitioning and replication

Figure 13.19 Normalized Dimension Tables

Multiple Fact Tables

Practice

How would you design a star schema for an auto insurance company to do risk analysis?
What is the objective?
What are the facts?
What are the dimensions?
What are the attributes?
What are the attribute hierarchies?

Auto insurance DW star schema

Data Warehouse Design

Grain: A definition of the highest level of detail that is supported in a data warehouse
Drill-down: The process of probing beyond a summarized value to investigate each of the detail transactions that comprise the summary
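
As a rough illustration of grain and drill-down, the Python sketch below keeps facts at a hypothetical grain of one row per order line, summarizes them by month, and then drills down from one monthly total to the rows behind it. The data and function names are assumptions made for the example.

```python
# Hypothetical fact rows stored at the finest grain: one row per order line.
ORDER_LINES = [
    {"order_id": 101, "month": "2024-01", "product": "Cola", "amount": 15},
    {"order_id": 102, "month": "2024-01", "product": "Chips", "amount": 8},
    {"order_id": 103, "month": "2024-02", "product": "Cola", "amount": 9},
]


def monthly_summary(rows):
    """Summarize amounts by month (a level above the stored grain)."""
    summary = {}
    for row in rows:
        summary[row["month"]] = summary.get(row["month"], 0) + row["amount"]
    return summary


def drill_down(rows, month):
    """Return the detail transactions that make up one monthly total."""
    return [row for row in rows if row["month"] == month]


if __name__ == "__main__":
    print(monthly_summary(ORDER_LINES))
    for line in drill_down(ORDER_LINES, "2024-01"):
        print(line)
```

Drill-down can go no deeper than the grain: if only monthly totals had been loaded, the individual order lines could not be recovered.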

Data Warehouse Implementation

The Data Warehouse as an Active Decision Support Network
A Company-Wide Effort that Requires User Involvement and Commitment at All Levels
Satisfy the Trilogy: Data, Analysis, and Users
Apply Database Design Procedures

Data Warehouse Implementation

Implementing a data warehouse is generally a massive effort that must be planned and executed according to established methods.
There are many facets to the project lifecycle, and no single person can be an expert in each area.

Data Warehouse Implementation Road Map

Data Integration and the Extraction, Transformation, and Load (ETL) Process

Data integration comprises three major processes:

Data access (the ability to access and extract data from any data source)
Data federation (the integration of business views across multiple data stores)
Change capture (the identification, capture, and delivery of the changes made to enterprise data sources)
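
Change capture can be implemented in several ways, such as reading database logs, using triggers, or comparing timestamps. The Python sketch below shows only the timestamp approach, assuming every source row carries a hypothetical updated_at column in ISO 8601 format; the rows and the function name are invented for illustration.

```python
# Hypothetical source rows; updated_at is an ISO 8601 string, so plain
# string comparison orders the values chronologically.
SOURCE = [
    {"id": 1, "status": "shipped", "updated_at": "2024-03-15T09:00:00"},
    {"id": 2, "status": "open", "updated_at": "2024-03-16T12:30:00"},
    {"id": 3, "status": "open", "updated_at": "2024-03-17T08:15:00"},
]


def capture_changes(source_rows, last_sync):
    """Identify rows changed after last_sync and return them with a new watermark."""
    changed = [row for row in source_rows if row["updated_at"] > last_sync]
    # If nothing changed, keep the old watermark so no change is skipped later.
    new_watermark = max((row["updated_at"] for row in changed), default=last_sync)
    return changed, new_watermark


if __name__ == "__main__":
    rows, watermark = capture_changes(SOURCE, "2024-03-16T00:00:00")
    print(rows, watermark)
```

Only the changed rows need to be delivered to the warehouse, and the returned watermark becomes the starting point for the next run.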

Data Integration and the Extraction, Transformation, and Load (ETL) Process

Extraction, transformation, and load (ETL)


Extraction - reading data from a database
Transformation - converting the extracted data from its previous form into the form that can be placed into a data warehouse
Load - putting the data into the data warehouse
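
The three steps can be sketched in a few lines of Python. This is a simplified illustration, not a real ETL tool: the source here is a hypothetical CSV export rather than a database, and the table name fact_orders, the column names, and the conversion rules (US-style dates to ISO dates, dollars to integer cents) are all assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system.
SOURCE_CSV = """order_id,order_date,amount_usd
1,03/15/2024,19.99
2,03/16/2024,5.00
"""


def extract(text):
    """Extraction: read the raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: convert each record into the warehouse's form."""
    out = []
    for row in rows:
        month, day, year = row["order_date"].split("/")
        iso_date = f"{year}-{month}-{day}"
        cents = round(float(row["amount_usd"]) * 100)
        out.append((int(row["order_id"]), iso_date, cents))
    return out


def load(conn, rows):
    """Load: put the transformed records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id INTEGER PRIMARY KEY, "
        "order_date TEXT NOT NULL, "
        "amount_cents INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    print(load(warehouse, transform(extract(SOURCE_CSV))), "rows loaded")
```

Keeping the three steps as separate functions mirrors the definition above and lets each step be tested or replaced independently.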

Data Integration and the Extraction, Transformation, and Load (ETL) Process

Data Cleanse

Data cleansing, or data scrubbing, is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.
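
A minimal Python sketch of these ideas follows. The rules it applies (a name is required, an email must contain "@", duplicates are detected by email) are illustrative assumptions; real cleansing rules depend on the data and the business.

```python
# Hypothetical dirty customer records.
RAW = [
    {"name": "  Ada   Lovelace ", "email": "ADA@Example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},  # duplicate
    {"name": "", "email": "x@example.com"},                 # incomplete
    {"name": "Bob", "email": "not-an-email"},               # incorrect
]


def cleanse(records):
    """Return (clean, rejected) lists after correcting or removing records."""
    clean, rejected, seen = [], [], set()
    for rec in records:
        # Correct what can be corrected: stray whitespace and letter case.
        name = " ".join((rec.get("name") or "").split())
        email = (rec.get("email") or "").strip().lower()
        # Remove what cannot be corrected: incomplete or invalid records.
        if not name or "@" not in email:
            rejected.append(rec)
            continue
        # Remove duplicates, keeping the first occurrence.
        if email in seen:
            rejected.append(rec)
            continue
        seen.add(email)
        clean.append({"name": name, "email": email})
    return clean, rejected


if __name__ == "__main__":
    good, bad = cleanse(RAW)
    print(good)
    print(len(bad), "records rejected")
```

Returning the rejected records instead of silently dropping them leaves room for an audit trail or manual review.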

ETL tools

A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.

On-Line Analytical Processing

On-Line Analytical Processing (OLAP) is an advanced data analysis environment that supports decision making, business modeling, and operations research activities.

Four Main Characteristics of OLAP

Use multidimensional data analysis techniques.
Provide advanced database support.
Provide easy-to-use end user interfaces.
Support client/server architecture.

On-Line Analytical Processing

Additional Functions of Multidimensional Data Analysis Techniques


Advanced data presentation functions
Advanced data aggregation, consolidation, and classification functions
Advanced computational functions
Advanced data modeling functions

Integration Of OLAP With A Spreadsheet Program

Figure 13.7 OLAP Server Arrangement

SAP's Business Information Warehouse: an Enterprise-Wide Information Hub

An end-to-end enterprise-wide information hub to support planning and decision-making.
A central data repository of SAP, non-SAP, current, and historical business transactions and metadata.
Timely information to all levels and roles, from analyst to executive.
Years of SAP financial, logistic, and human resource information systems experience wedded with modern data warehouse methodologies.

BW Architecture details

[Figure: SAP Business Information Warehouse architecture. Front ends: third-party OLAP clients and the Business Explorer (Analyzer hosted by MS Excel, Browser), connected through ODBO BAPI and the OLE DB for OLAP Provider. Business Information Warehouse Server: Administrator Workbench (Administration, Scheduling, Monitor), Meta Data Repository, OLAP Processor, Meta Data Manager, Data Manager, InfoCubes, Operational Data Store, PSA, RemoteCube BAPI, Staging Engine. Sources: Staging BAPI with Production Data Extractors for non-R/3 and R/3 OLTP applications, OLTP Reporting, Data Provider Server. Source: SAP AG 1999.]

A Sample Of Current Data Warehousing And Data Mining Vendors

Table 13.10

Success Stories at Pepsi

"Using the data warehouse, we've been able to identify important items, find national suppliers for them, and leverage those relationships to reduce costs." Thanks to the warehouse, Pepsi can monitor purchasing compliance at the user level, an ability that has boosted price and product compliance well over 90 percent. The warehouse also helps ensure 100 percent sales tax compliance, says Bridgman. Since going online in 1995, the warehouse has helped generate procurement savings in excess of $100 million.

Levels of DW Support for Enterprise Decision Making

The need for real-time data

A business often cannot afford to wait a whole day for its operational data to load into the data warehouse for analysis
Provides incremental real-time data showing every state change and almost analogous patterns over time
Maintaining metadata in sync is possible
Less costly to develop, maintain, and secure one huge data warehouse so that data are centralized for BI/BA tools
An EAI with real-time data collection can reduce or eliminate the nightly batch processes

Real-Time / Active Data Warehouse (RDW/ADW)

Loading and providing data via the data warehouse as they become available.
Expand traditional data warehouse functions into the realm of tactical decision making.
Empower decision making when interacting directly with customers and suppliers.

Real-Time Data Warehousing

http://www.teradata.com/resources/demos

Data Warehouse Administration

Due to its huge size and its intrinsic nature, a data warehouse requires especially strong monitoring in order to sustain satisfactory efficiency and productivity.
A new job title: Data Warehouse Administrator

Data warehouse administration functions

Data Warehouse Administration involves the overall management of a data warehouse. Administration tasks include archiving, consistency checks, developing/maintaining indexing and retrieval functionality, tracking data changes, migration, monitoring, performance issues, replication issues, data quality, and sizing/space management. All data warehouses should also have a backup and recovery plan in place so that data can be recovered after an emergency.

Security and Privacy Issues

Private intelligence-gathering gives some people the creeps
Targeted marketing efforts are intrusive and annoying
The collection, manipulation, and combination of lists of personal information amount to an ominous invasion of privacy

Data Warehouse Security Issues

Effective security in a data warehouse should focus on four main areas:

Establishing effective corporate and security policies and procedures
Implementing logical security procedures and techniques to restrict access
Limiting physical access to the data center environment
Establishing an effective internal control review process with an emphasis on security and privacy

http://www.dwinfocenter.org/

http://www.irmac.ca/
