2 Datawarehouse
The most common issue companies face when looking at data mining is that the information is not in one place. The biggest challenge business analysts face in using data mining is how to extract, integrate, cleanse, and prepare data to solve their most pressing business problems.
The idea of a data warehouse is to put a wide range of operational data from internal and external sources into one place so it can be better utilized by executives, line-of-business managers, and other business analysts. Once the information is gathered, OLAP (on-line analytical processing) software comes into play by providing the desktop analysis tools for querying, manipulating, and reporting the data from the data warehouse.
the source systems from which data is extracted
the tools used to extract data for loading the data warehouse
the data warehouse database itself, where the data is stored
the desktop query and reporting tools used for decision support
The Data Warehouse is an integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making.
Integrated
The Data Warehouse is a centralized, consolidated database that integrates data retrieved from the entire organization.
Subject-Oriented
The Data Warehouse data is arranged and optimized to provide answers to questions coming from diverse functional areas within a company.
Time Variant
The Warehouse data represent the flow of data through time. It can even contain projected data.
Non-Volatile
Once data enter the Data Warehouse, they are never removed. The Data Warehouse is always growing.
Data Mart
A data mart is a small, single-subject data warehouse subset that provides decision support to a small group of people.
Data Marts can serve as a test vehicle for companies exploring the potential benefits of Data Warehouses. Data Marts address local or departmental problems, while a Data Warehouse involves a company-wide effort to support decision making at all levels in the organization.
An enterprise data warehouse (EDW) is a large-scale data warehouse that is used across the enterprise for decision support. EDWs are used to provide data for many types of DSS, including CRM, SCM, BPM, BAM, PLM, and KMS.
BPM: Business performance management
BAM: Business activity monitoring
PLM: Product lifecycle management
KMS: Knowledge management systems
Metadata
Metadata is data about data. In a data warehouse, metadata describes the contents of the data warehouse and the manner of its use. Good metadata is essential to the effective operation of a data warehouse, and it is used in data acquisition/collection, data transformation, and data access.
The use of data warehousing and decision processing often involves a wide range of different products, and creating and maintaining the metadata for these products is time-consuming and error prone. Automating the metadata management process and enabling the sharing of this so-called technical metadata between products can reduce both costs and errors.
Business users need to have a good understanding of what information exists in a data warehouse. They need to understand what the information means from a business viewpoint, how it was derived, from what source systems it comes, when it was created, what pre-built reports and analyses exist for manipulating the information, and so forth.
Ralph Kimball, The Data Warehouse Lifecycle Toolkit, Wiley, 1998, ISBN 0471-25547-5
source specifications, such as repositories and source logical schemas
source descriptive information, such as ownership descriptions, update frequencies, and access methods
process information, such as job schedules and extraction code
data acquisition information, such as data transmission scheduling and results, and file usage
dimension table management, such as definitions of dimensions and surrogate key assignments
transformation and aggregation, such as data enhancement and mapping, DBMS load scripts, and aggregate definitions
audit, job logs, and documentation, such as data lineage records and data transform logs
Star Schema
The star schema is a data modeling technique used to map multidimensional decision support into a relational database. Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.
Four Components:
Facts
Dimensions
Attributes
Attribute hierarchies
Facts
Numeric measurements that represent a specific business aspect or activity
Normally stored in a fact table at the center of the star schema
The fact table contains facts linked through their dimensions
Metrics are facts computed at run time
Dimensions
Qualifying characteristics that provide additional perspectives on a given fact
Decision support data are almost always viewed in relation to other data
Facts are studied via their dimensions
Dimensions are stored in dimension tables
Attributes
Dimensions provide descriptions of facts through their attributes
No mathematical limit to the number of dimensions
Used to search, filter, and classify facts
Slice and dice: focus on slices of the data cube for more detailed analysis
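Slicing and dicing can be illustrated with a very small in-memory cube. This is only a sketch: the cube contents, the dimension names (product, region, quarter), and the helper slice_cube are hypothetical and chosen for illustration. A slice fixes one dimension; a dice fixes several.

```python
# Hypothetical sales cube: each cell is keyed by (product, region, quarter)
# and holds one numeric fact (units sold). All names are illustrative.
DIMENSIONS = ("product", "region", "quarter")

cube = {
    ("widget", "east", "Q1"): 100,
    ("widget", "west", "Q1"): 80,
    ("widget", "east", "Q2"): 120,
    ("gadget", "east", "Q1"): 50,
    ("gadget", "west", "Q2"): 70,
}


def slice_cube(cube, **fixed):
    """Keep only the cells whose dimension values match every fixed value."""
    positions = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {
        key: measure
        for key, measure in cube.items()
        if all(key[i] == value for i, value in positions.items())
    }


# Slice: fix a single dimension (region = east).
east_slice = slice_cube(cube, region="east")

# Dice: fix more than one dimension (region = east, quarter = Q1).
east_q1 = slice_cube(cube, region="east", quarter="Q1")

print(sum(east_slice.values()))
print(east_q1)
```

An OLAP tool performs the same kind of selection, but against the fact and dimension tables rather than a Python dictionary.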
Attribute Hierarchies
Determine how the data are extracted and represented
Stored in the DBMS's data dictionary
Used by the OLAP tool to access the warehouse properly
A star schema consists of fact tables and dimension tables. Fact tables contain the quantitative or factual data about a business--the information being queried. This information is often numerical, additive measurements and can consist of many columns and millions or billions of rows. Dimension tables are usually smaller and hold descriptive data that reflects the dimensions, or attributes, of a business.
Facts and dimensions are normally represented by physical tables in the data warehouse database. The fact table is related to each dimension table in a many-to-one (M:1) relationship. Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.
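The fact/dimension layout described above can be sketched with Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_date), columns, and sample rows are assumptions made for this example, not part of the original material. The fact table references each dimension table through a foreign key, giving the many-to-one relationship, and a typical query joins the fact table to its dimensions and aggregates an additive measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical star schema: one fact table, two dimension tables.
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    year    INTEGER,
    quarter TEXT
);
CREATE TABLE fact_sales (
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    date_id    INTEGER NOT NULL REFERENCES dim_date(date_id),
    units      INTEGER,
    revenue    REAL
);
""")

conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "widget", "hardware"),
    (2, "gadget", "hardware"),
    (3, "manual", "books"),
])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (1, 2023, "Q4"),
    (2, 2024, "Q1"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (1, 1, 10, 100.0),
    (2, 1, 5, 75.0),
    (3, 2, 2, 30.0),
    (1, 2, 4, 40.0),
])
conn.commit()

# Study the facts via their dimensions: revenue by category and year.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id    = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(rows)

# A fact row that points at a missing dimension row violates the constraint.
try:
    conn.execute("INSERT INTO fact_sales VALUES (99, 1, 1, 1.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

A production warehouse would use a full DBMS and far larger tables, but the shape of the schema and of the join is the same.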
Performance-Improving Techniques
Normalization of dimensional tables
Multiple fact tables representing different aggregation levels
Denormalization of fact tables
Table partitioning and replication
Practice
How would you design a star schema for an auto insurance company to do risk analysis?
What is the objective?
What are the facts?
What are the dimensions?
What are the attributes?
What are the attribute hierarchies?
Grain: A definition of the highest level of detail that is supported in a data warehouse
Drill-down: The process of probing beyond a summarized value to investigate each of the detail transactions that comprise the summary
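A drill-down can be sketched in a few lines of Python. The example assumes a grain of one row per transaction; the field names (id, region, amount) and the sample data are hypothetical. The summary shows one total per region, and drilling down returns the detail transactions behind one of those totals.

```python
from collections import defaultdict

# Assumed grain: one row per individual transaction (illustrative data).
transactions = [
    {"id": 1, "region": "east", "amount": 120.0},
    {"id": 2, "region": "west", "amount": 80.0},
    {"id": 3, "region": "east", "amount": 45.5},
]


def summarize(rows):
    """Roll the detail rows up to one total per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def drill_down(rows, region):
    """Return the detail transactions that make up one summarized value."""
    return [row for row in rows if row["region"] == region]


summary = summarize(transactions)
east_detail = drill_down(transactions, "east")

print(summary)
print(east_detail)
```

Drill-down is only possible down to the grain: if the warehouse stored daily totals instead of transactions, the individual rows could not be recovered.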
The Data Warehouse as an Active Decision Support Network
A Company-Wide Effort that Requires User Involvement and Commitment at All Levels
Satisfy the Trilogy: Data, Analysis, and Users
Apply Database Design Procedures
Implementing a data warehouse is generally a massive effort that must be planned and executed according to established methods.
There are many facets to the project lifecycle, and no single person can be an expert in each area.
Data Integration and the Extraction, Transformation, and Load (ETL) Process
data access (the ability to access and extract data from any data source)
data federation (the integration of business views across multiple data stores)
change capture (the identification, capture, and delivery of the changes made to enterprise data sources)
Extraction - reading data from a database
Transformation - converting the extracted data from its previous form into the form that can be placed into a data warehouse
Load - putting the data into the data warehouse
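The three ETL steps can be sketched as three small functions. This is a minimal illustration under stated assumptions: a CSV string stands in for the source system, an in-memory SQLite database stands in for the warehouse, and the table name orders and its columns are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a source system export (assumption for this sketch).
SOURCE_CSV = """order_id,customer,amount_cents
1, alice ,1250
2,BOB,300
"""


def extract(text):
    """Extraction: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: convert types, tidy names, and change cents to currency units."""
    return [
        (
            int(row["order_id"]),
            row["customer"].strip().lower(),
            int(row["amount_cents"]) / 100,
        )
        for row in rows
    ]


def load(conn, rows):
    """Load: put the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))

loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors how ETL tools let each stage be scheduled, logged, and rerun independently.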
Data Cleanse
Data cleansing, or data scrubbing, is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.
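A cleansing pass can be sketched as a function that modifies what it can repair, rejects what it cannot, and removes duplicates. The record layout (email, age) and the validation rules below are assumptions chosen for illustration; real rules depend on the data and the business.

```python
def cleanse(records):
    """Return (cleaned, rejected) lists from raw customer records.

    Illustrative rules (assumptions, not a standard):
    - email is trimmed and lower-cased; records without an "@" are rejected
    - an implausible age is replaced with None rather than dropping the record
    - a repeated email is treated as a duplicate and removed
    """
    cleaned, rejected, seen = [], [], set()
    for record in records:
        email = (record.get("email") or "").strip().lower()
        if "@" not in email:
            rejected.append(record)  # cannot be repaired: set aside for review
            continue
        age = record.get("age")
        if not isinstance(age, int) or not 0 <= age <= 120:
            age = None  # modify: blank out the inaccurate value
        if email in seen:
            continue  # delete: duplicate of a record already kept
        seen.add(email)
        cleaned.append({"email": email, "age": age})
    return cleaned, rejected


raw = [
    {"email": " Ann@Example.com ", "age": 34},
    {"email": "ann@example.com", "age": 34},   # duplicate after normalization
    {"email": "bob@example.com", "age": 340},  # implausible age
    {"email": "not-an-email", "age": 20},      # invalid email
    {"email": None, "age": 51},                # missing email
]

cleaned, rejected = cleanse(raw)
print(cleaned)
print(len(rejected))
```

Keeping the rejected records instead of silently discarding them makes it possible to audit the cleansing rules later.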
ETL tools
A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.
On-Line Analytical Processing (OLAP) is an advanced data analysis environment that supports decision making, business modeling, and operations research activities.
Four Main Characteristics of OLAP
Use multidimensional data analysis techniques.
Provide advanced database support.
Provide easy-to-use end user interfaces.
Support client/server architecture.
Advanced data presentation functions
Advanced data aggregation, consolidation, and classification functions
Advanced computational functions
Advanced data modeling functions
An end-to-end enterprise-wide information hub to support planning and decision-making.
A central data repository of SAP, non-SAP, current, and historical business transactions and metadata.
Timely information to all levels and roles, from analyst to executive.
Years of SAP financial, logistic, and human resource information systems experience wedded with modern data warehouse methodologies.
BW Architecture details
[Figure: BW architecture. Components shown: 3rd party OLAP clients; Business Explorer (Analyzer hosted by MS Excel, Browser); ODBO and BAPI interfaces; Administrator Workbench (Administration, Meta Data Repository); OLAP Processor; Meta Data Manager; Data Manager; Non R/3 Production Data Extractor; Non R/3 OLTP Applications. Source: SAP AG 1999]
Table 13.10
"Using the data warehouse, we've been able to identify important items, find national suppliers for them, and leverage those relationships to reduce costs." Thanks to the warehouse, Pepsi can monitor purchasing compliance at the user level, an ability that has boosted price and product compliance well over 90 percent. The warehouse also helps ensure 100 percent sales tax compliance, says Bridgman. Since going online in 1995, the warehouse has helped generate procurement savings in excess of $100 million.
A business often cannot afford to wait a whole day for its operational data to load into the data warehouse for analysis
Provides incremental real-time data showing every state change and almost analogous patterns over time
Maintaining metadata in sync is possible
Less costly to develop, maintain, and secure one huge data warehouse so that data are centralized for BI/BA tools
An EAI with real-time data collection can reduce or eliminate the nightly batch processes
Loading and providing data via the data warehouse as they become available
Expand traditional data warehouse functions into the realm of tactical decision making
Empower decision making when interacting directly with customers and suppliers
http://www.teradata.com/resources/demos
Due to its huge size and its intrinsic nature, a data warehouse requires especially strong monitoring in order to sustain satisfactory efficiency and productivity.
A new job title: Data Warehouse Administrator
Data Warehouse Administration involves the overall management of a data warehouse. Administration tasks include archiving, consistency checks, developing/maintaining indexing and retrieval functionality, tracking data changes, migration, monitoring, performance issues, replication issues, data quality, and sizing/space management. All data warehouses should also have a backup and recovery plan in place so that data can be recovered after an emergency.
Private intelligence-gathering gives some people the creeps
Targeted marketing efforts are intrusive and annoying
The collection, manipulation, and combination of lists of personal information amount to an ominous invasion of privacy
Establishing effective corporate and security policies and procedures
Implementing logical security procedures and techniques to restrict access
Limiting physical access to the data center environment
Establishing an effective internal control review process with an emphasis on security and privacy
http://www.dwinfocenter.org/
http://www.irmac.ca/