Download as pdf or txt
Download as pdf or txt
You are on page 1of 16

Data Warehousing

Dr. L. Rajya Lakshmi


Multiple Hierarchies
Extraction, transformation and loading
• What is the factor that differentiates between the data in operational
systems and the data in a data warehouse?
• The set of functions that fall under extraction, transformation and loading
• Extraction, transformation and loading (ETL) reshape the data
• In which area(s) of DW environment do these ETL functions fall?
• ETL encompasses data acquisition and data storage
• ETL has all backend processes
• Functions and procedures for extraction, transformation, and loading
• Which data has to be extracted?
• The user requirements drive the data extraction process
• Requirement based selection process
Extraction transformation loading
• The extent and complexity of backend processes differ from one data
warehouse to another
• What decides the complexity?
• The disparateness of operational systems, their platforms, network protocols,
operating systems, etc.
• The quality of the data
• Feasibility of loading methods
ETL: important and challenging
• Data to strategic information
• Extract the data
• Not just dump it, transform it into the format suitable for conversion into
information
• Move the data to the repository
• Now, we can provide strategic information to the users
ETL: important and challenging
• Users want to analyse product sales by store by month
• Assume that the sales data is gathered by several applications in the
organization
• There exists a product master list
• Each sales transaction happen at a store, then?
• Sales units and sales cost in fact table;
• The product dimension, the store dimension, and the time dimension
• Extract the data, reconcile representations, and transform all product sales,
load the data into fact and dimension tables, and provide the information to
users
• All three functions should be performed in the sequence
ETL: important and challenging
• What makes ETL challenging?
• Disparities among the source systems
• Reasons for difficulties in ETL functions:
• Source systems are diverse and disparate
• Multiple platforms and different operating systems
• Older legacy applications on obsolete database technologies
• History is important for data warehouse, but changes in values are not
preserved in operational systems
• Lack of consistency (salary data in payroll systems)
• Lack of means to resolve mismatches
• Data representations are cryptic, ambiguous and not meaningful for data
warehouse users
• Data quality is dubious
ETL: time consuming and arduous
• 50 to 70% of effort on ETL functions
• Why time consuming and arduous?
• Extract big amounts of data; complex source systems makes this task difficult
• Metadata must contain information on every database and data structures
(very detailed information like database size and volatility)
• Time window for data extraction
• Mechanisms for capturing changes
• Large number of transformation methods (reformat internal data, re-
sequence data, apply various forms of conversion techniques, missing value
handling, aggregates, etc.) on big quantities of data
• Initial load populates millions of rows; creating and managing load images for
such large numbers is not an easy task (two or more weeks to complete initial
loading)
• Incremental changes: Extraction methods for individual sources and time
windows for incremental loads
ETL: requirements and steps
• Functional steps for initial and incremental data loads:
• Triggering for incremental changes
• Filtering for refreshes and incremental loads
• Extraction
• Transformation
• Integration
• Cleansing
• Loading/applying data
ETL: requirements and steps
ETL: requirements and steps
• The activities and tasks that compose the ETL process are:
• Aggregates planning
• Transformation and cleansing rules
• Data extraction rules establishment
• Source to target data element mapping
• Integrate internal and external data sources
• Decide the target data to be in the warehouse
• Combine several source data structures into a single row
• Split a single data structure into multiple data structures
• Read dictionaries and catalogues
• Read data from a variety of files
• Populating aggregates or summary fact tables with aggregates
• Transform data, change cryptic values, etc.
ETL: key factors
• Primary reason for complexity is diversity of data sources
• Begin with a complete inventory of source systems
• Mass refreshes (initial or periodic) take too long (find a proper time)
• Incremental refreshes (the best method for each source, impact less
extracts and loads)
Data extraction
• Two factors differentiate data extraction of a data warehouse and
operational systems
• Can use third party tools (expensive and records metadata), in-house
tools (increases cost of maintenance and hard to maintain)
• Data extraction issues:
• Identify source applications and structures
• Define extraction process (manual or tool based)
• Define data extraction frequency (daily, weekly, monthly, etc.)
• Time window for data sources
• Job sequencing
• Exception handling
Source identification
• Identify all proper data sources
• Examination and verification that the identified sources will add the
necessary value
• Consider a data mart that has to provide strategic information on
fulfilment of orders
• Historical data about fulfilled and pending orders, channels, order statuses
• Fact table: total order amount, discounts, commissions, expected delivery
time, actual delivery time, dates associated with different phases
• Dimensions: product, order disposition, delivery channel, and customers
• Determine whether you have source systems to provide the data
• Establish correct data elements for each target data element
• Verify the identified sources are the right ones
Source identification
Data extraction techniques
• Source data is temporal or time dependent (Current values, subjected
to constant changes)
• How do we capture the history?
• Data in operational systems:
• Two ways (current values and periodic status)
• Current values:
• Most attributes fall under this category
• Transient and current values
• Remains constant until a transaction changes it
• Periodic status
• The value is preserved as status every time its value changes (store reference effective
dates), events
• Ex: policy, claims
• Extraction of history is easier (preserved in the operational systems)

You might also like