
Lecture#5: Components of Data Warehouse

Components of Data Warehouse


Data Sources
 Source data coming into the warehouse.
 Can be divided into four categories:
 Production Data: Data coming from operational
databases.
 Internal Data: Data held in private files of employees
and departments (not in the operational database).
 Archived Data: Data available in backups of operational
databases.
 External Data: Data that is not stored within the
organization but comes from external sources and is
useful to the organization.
Example of Production Data
 Data related to doctors, patients, and treatments in a
hospital system.
 This system will be an operational database or an
online transaction processing (OLTP) system.
 Users enter information into this system on a regular
basis.
 Data coming from this information system into the data
warehouse is called production data.
Example of Internal Data
 In a hospital, there may be some data that is not
stored in the operational database but in Excel
sheets and Word files.
 Manual patient registration slips, filled in when the
operational database was not available.
 Standard operating procedure (SOP) documents
that cannot be stored in the operational system.
 Notes taken by a doctor about patients in a
Word document.
 A list of patients who visited a doctor for a
consultation but were not registered patients of
the hospital.
Example of Archived Data
 Backups of databases are maintained on a regular basis.
 When the amount of data stored in an operational
database grows, older data is moved to backup files.
 Backup files are normally kept on off-line
storage such as magnetic tape.
 For example, a hospital's database is backed up
on a regular basis.
 This archived data is useful to a data warehouse
because it provides historical information.
Example of External Data
 A car rental company has a system that stores data about
the vehicles it offers for rent.
 The company needs to maintain information from different
manufacturers about new car models.
 This information is external to the car rental
company and not part of its system.
Data Staging Component
 Data should be converted into a format in which it can
be stored in the warehouse for query and analysis.
 Why data staging?
 Data comes from multiple applications in different
formats.
 Subject-oriented data may be split across multiple
applications (e.g. account information held in multiple
operational systems).
 There are three phases of data staging.
 Extraction.
 Transformation.
 Loading.
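The three phases can be sketched as three small functions chained together. This is only an illustrative sketch: the two source systems, their field names, and the in-memory "warehouse" list are all hypothetical and stand in for real operational systems and a real warehouse.

```python
# Hypothetical source records: the same customer account held by two
# operational systems that use different field names and formats.
ATM_SYSTEM = [{"acct": "A-100", "amt": "250.00", "kind": "W"}]
ONLINE_SYSTEM = [{"account_no": "A-100", "amount_cents": 40000, "type": "deposit"}]


def extract():
    """Extraction: pull raw records from every source, tagged by origin."""
    return [("atm", r) for r in ATM_SYSTEM] + [("online", r) for r in ONLINE_SYSTEM]


def transform(tagged_records):
    """Transformation: convert each record to one common format."""
    common = []
    for source, rec in tagged_records:
        if source == "atm":
            amount = float(rec["amt"])
            kind = "withdraw" if rec["kind"] == "W" else "deposit"
            account = rec["acct"]
        else:
            amount = rec["amount_cents"] / 100
            kind = rec["type"]
            account = rec["account_no"]
        # Withdrawals are stored as negative amounts in the common format.
        signed = -amount if kind == "withdraw" else amount
        common.append({"account": account, "source": source, "amount": signed})
    return common


def load(rows, warehouse):
    """Loading: append the standardized rows to the warehouse store."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(extract()), [])
balance_change = sum(row["amount"] for row in warehouse if row["account"] == "A-100")
print(balance_change)  # 150.0
```

Keeping the phases as separate functions means a new source only needs a new branch in the transformation step; the loading step does not change.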
Data may be in different formats
[Figure: A relational application (relation/table), a network-based application (record), a hierarchical application (record), an object-oriented application (object), and a file-based application (text) all feed the data warehouse. Some standard data format is required.]
Subject-oriented data from multiple applications
[Figure: A bank deposit/withdraw system, an online deposit/withdraw system, an ATM card processing system, and a credit card processing system all contribute to the bank account of a customer, i.e. subject-oriented data coming from multiple applications.]
Extraction
 Data coming from different sources will be in different
formats.
 Data from sources may be in:
 A relational data model.
 A network or hierarchical data model.
 Flat files.
 Tools for data extraction:
 Third-party tools.
 In-house tools.
Extraction
 For data extraction, the data warehouse development team
may set up an environment in which data is
extracted from the data sources into a common data
repository.
 This data repository may be based on:
 Flat files.
 A relational database system.
 A combination of both.
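As a minimal sketch of extraction into a common repository, the example below reads one flat-file source and one relational source and writes both into a single relational staging table. The CSV content, the table and column names, and the use of in-memory SQLite are assumptions made for illustration, not part of the lecture.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source (CSV text standing in for a file on disk).
FLAT_FILE = "patient_id,name\n1,Ali\n2,Sara\n"

# Hypothetical relational source, built in memory for the sketch.
source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE patients (id INTEGER, full_name TEXT)")
source_db.execute("INSERT INTO patients VALUES (3, 'Omar')")

# Common data repository: here a relational staging table.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_patient (source TEXT, source_id TEXT, name TEXT)")


def extract_flat_file(text):
    """Yield (source, id, name) tuples from the CSV source."""
    for row in csv.DictReader(io.StringIO(text)):
        yield ("flat_file", row["patient_id"], row["name"])


def extract_relational(conn):
    """Yield (source, id, name) tuples from the relational source."""
    for pid, name in conn.execute("SELECT id, full_name FROM patients"):
        yield ("relational", str(pid), name)


rows = list(extract_flat_file(FLAT_FILE)) + list(extract_relational(source_db))
staging.executemany("INSERT INTO stg_patient VALUES (?, ?, ?)", rows)
staging.commit()

staged_count = staging.execute("SELECT COUNT(*) FROM stg_patient").fetchone()[0]
print(staged_count)  # 3
```

Recording the source name next to each staged row keeps the origin of every record visible to the later transformation steps.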
Common data repository
[Figure: The relational application (relation/table), network-based application (record), hierarchical application (record), object-oriented application (object), and file-based application (text) feed a common data repository (relational, file based, or both), which in turn feeds the data warehouse.]
Transformation
 Clean the data extracted from each source.
 Correction of spelling.
 Resolution of conflicts between postal codes, NIC
numbers, etc.
 Providing default values for missing data elements.
 Standardization of data elements.
 Standardize data types and lengths for the same elements
extracted from different data sources.
 Semantic standardization: resolving synonyms and
homonyms.
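The cleaning and standardization steps above can be sketched for a single record as follows. The spelling table, the default values, and the five-character postal code length are hypothetical rules chosen for the example; in practice they would come from profiling the actual source data.

```python
# Hypothetical rules; real projects would derive these from data profiling.
SPELLING_FIXES = {"lahor": "Lahore", "karachee": "Karachi"}
DEFAULTS = {"city": "UNKNOWN", "postal_code": "00000"}
POSTAL_CODE_LENGTH = 5


def clean_record(record):
    """Return a cleaned copy of one extracted record."""
    cleaned = dict(record)

    # Provide default values for missing data elements.
    for field, default in DEFAULTS.items():
        if cleaned.get(field) in (None, ""):
            cleaned[field] = default

    # Correct known misspellings (case-insensitive lookup).
    city = str(cleaned["city"]).strip()
    cleaned["city"] = SPELLING_FIXES.get(city.lower(), city)

    # Standardize data type and length: postal code as a 5-character string.
    code = str(cleaned["postal_code"]).strip()
    cleaned["postal_code"] = code.zfill(POSTAL_CODE_LENGTH)

    return cleaned


raw = [
    {"city": "lahor", "postal_code": 54000},  # misspelled city, integer code
    {"city": "", "postal_code": "4400"},      # missing city, short code
]
cleaned = [clean_record(r) for r in raw]
print(cleaned)
```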
Transformation
 Resolving synonyms:
 Two or more terms from different sources mean the same
thing.
 For example: "reluctance" in source S1 and "unwillingness"
in S2 (both terms mean lack of enthusiasm).
 Resolving homonyms:
 A single term means different things in different
source systems.
 For example: "bank" in source S1 means a river bank, while in
S2 it means a place where people manage their money.
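One way to sketch semantic standardization is with two lookup tables: synonyms are resolved by term alone, while homonyms need the source system as part of the key. The standard terms chosen here (`unwillingness`, `river_bank`, `financial_institution`) are hypothetical; the lecture does not prescribe a target vocabulary.

```python
# Hypothetical mappings agreed on with the business users.
# Synonyms: different source terms that map to one standard term.
SYNONYMS = {"reluctance": "unwillingness", "unwillingness": "unwillingness"}

# Homonyms: the same term means different things depending on the source,
# so the lookup key includes the source system.
HOMONYMS = {
    ("S1", "bank"): "river_bank",
    ("S2", "bank"): "financial_institution",
}


def standardize_term(source, term):
    """Map a source-specific term to the warehouse's standard vocabulary."""
    key = term.strip().lower()
    if (source, key) in HOMONYMS:
        return HOMONYMS[(source, key)]
    # Terms with no synonym entry pass through unchanged (lower-cased).
    return SYNONYMS.get(key, key)


print(standardize_term("S1", "reluctance"))     # unwillingness
print(standardize_term("S2", "Unwillingness"))  # unwillingness
print(standardize_term("S1", "bank"))           # river_bank
print(standardize_term("S2", "bank"))           # financial_institution
```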
Transformation
 Combining pieces of data from multiple sources.
 Purging of data: removal of data that is not useful for
the data warehouse.
 Sorting and merging of data are also performed.
 Primary keys in operational databases are often fields with
built-in meanings.
 For example, a product key may contain the product
number, the product category, and the number of the store in
which the product is stocked.
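Combining, purging, and sorting can be sketched on two small extracts that share a key. The record layouts, the `session_token` field treated as not useful, and the assumption of at most one treatment row per patient are all hypothetical simplifications.

```python
# Hypothetical extracts from two source systems, keyed by patient id.
registrations = [
    {"patient_id": 2, "name": "Sara", "session_token": "x9"},
    {"patient_id": 1, "name": "Ali", "session_token": "k3"},
]
treatments = [
    {"patient_id": 1, "treatment": "X-ray"},
    {"patient_id": 2, "treatment": "Blood test"},
]

# Fields assumed to have no analytical value, so they are purged.
PURGE_FIELDS = {"session_token"}


def purge(record):
    """Drop fields that are not useful for the warehouse."""
    return {k: v for k, v in record.items() if k not in PURGE_FIELDS}


def combine(left_rows, right_rows, key):
    """Merge rows from two sources that share the same key value."""
    right_by_key = {row[key]: row for row in right_rows}
    return [{**row, **right_by_key.get(row[key], {})} for row in left_rows]


combined = combine([purge(r) for r in registrations], treatments, "patient_id")
combined.sort(key=lambda row: row["patient_id"])
print(combined)
```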
Transformation
 Data warehouses use keys that do not have built-in meanings.
 Surrogate keys: generated by the system for each row of a table.
 Oracle provides sequences.
 SQL Server provides the IDENTITY property for generating these keys.
 A data warehouse may not store data at the detailed level.
 Example:
 A grocery store may keep unit sales and revenue information for each
transaction performed in a day in its operational database.
 For a data warehouse, it is more suitable to calculate daily summary
totals of sales and the number of products sold from the operational
database.
 This summarization is performed in the data
transformation phase.
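Surrogate key generation and daily summarization can be sketched together. Here a simple counter plays the role that a database sequence or identity column would play, and the transactions, product keys, and dates are invented sample data.

```python
from collections import defaultdict
from itertools import count

# Hypothetical operational rows: (smart product key, sale date, units, revenue).
transactions = [
    ("P100-DAIRY-S07", "2024-01-01", 2, 5.00),
    ("P100-DAIRY-S07", "2024-01-01", 1, 2.50),
    ("P200-BAKERY-S07", "2024-01-01", 4, 8.00),
    ("P100-DAIRY-S07", "2024-01-02", 3, 7.50),
]

# Surrogate keys: meaningless integers generated per distinct product,
# standing in for a database sequence or identity column.
next_key = count(start=1)
surrogate_keys = {}
for product_key, *_ in transactions:
    if product_key not in surrogate_keys:
        surrogate_keys[product_key] = next(next_key)

# Summarization: total units and revenue per (product, day).
daily_summary = defaultdict(lambda: {"units": 0, "revenue": 0.0})
for product_key, sale_date, units, revenue in transactions:
    bucket = daily_summary[(surrogate_keys[product_key], sale_date)]
    bucket["units"] += units
    bucket["revenue"] += revenue

for (sk, day), totals in sorted(daily_summary.items()):
    print(sk, day, totals["units"], totals["revenue"])
```

The four transaction rows collapse into three summary rows, and the warehouse rows carry only the generated integer key, not the product number, category, or store encoded in the operational key.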
