Data Warehousing
Data Warehousing (DW) is a process for collecting and managing data from different
sources to provide meaningful business insights. A data warehouse is typically used to connect
and analyze business data from heterogeneous sources. It is the core of a Business Intelligence
(BI) system built for data analysis and reporting. It is a blend of technologies and components
that aids the strategic use of data: an electronic store of a large amount of business
information, designed for query and analysis rather than transaction processing. Data
warehousing transforms data into information and makes it available to users in a timely
manner.
A report on current inventory information can include more than 12 join conditions,
which can quickly slow down the response time of the query and the report. A data warehouse
provides a new design that helps reduce response time and improves the performance of
queries for reports and analytics.
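One common form of such a design is a star schema, in which a central fact table is joined to a few small dimension tables. The sketch below is only an illustration using Python's built-in sqlite3 module; the table and column names (fact_inventory, dim_product, dim_store) are assumptions made for this example, not part of the original notes.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by small dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_inventory (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        quantity   INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_store   VALUES (10, 'Pune'), (20, 'Delhi');
    INSERT INTO fact_inventory VALUES (1, 10, 5), (1, 20, 7), (2, 10, 3);
""")

# The inventory report needs only one join per dimension it actually uses.
inventory_report = conn.execute("""
    SELECT p.product_name, SUM(f.quantity) AS total_quantity
    FROM fact_inventory AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()

print(inventory_report)
```

The report query touches the fact table once and joins only the dimension it needs, which is the kind of simplification the paragraph above refers to.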
1960- Dartmouth and General Mills, in a joint research project, develop the terms
dimensions and facts.
1970- ACNielsen and IRI introduce dimensional data marts for retail sales.
1983- Teradata Corporation introduces a database management system which is
specifically designed for decision support.
Data warehousing started in the late 1980s, when IBM workers Paul Murphy and Barry
Devlin developed the Business Data Warehouse.
However, the real concept was given by Bill Inmon, who is considered the father of the
data warehouse. He has written on a variety of topics covering the building, usage, and
maintenance of the warehouse and the Corporate Information Factory.
Data entering the warehouse may be:
1. Structured
2. Semi-structured
3. Unstructured
The data is processed, transformed, and ingested so that users can access the processed data in
the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data
warehouse merges information coming from different sources into one comprehensive database.
By merging all of this information in one place, an organization can analyze its customers more
holistically. This helps to ensure that it has considered all the information available. Data
warehousing makes data mining possible. Data mining is looking for patterns in the data that
may lead to higher sales and profits.
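As a small illustration of merging sources into one customer view, the sketch below combines records from two systems by customer id. The record layouts and names (crm_records, billing_records, build_customer_view) are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical records from two separate source systems.
crm_records = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
]
billing_records = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 50.0},
]

def build_customer_view(crm, billing):
    """Combine both sources into one record per customer."""
    totals = defaultdict(float)
    for row in billing:
        totals[row["customer_id"]] += row["amount"]
    return {
        row["customer_id"]: {
            "name": row["name"],
            "total_billed": totals[row["customer_id"]],
        }
        for row in crm
    }

customer_view = build_customer_view(crm_records, billing_records)
print(customer_view)
```

With both sources in one structure, a question such as "how much has each named customer been billed" no longer requires visiting two systems.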
An Operational Data Store (ODS) is a data store required when neither the data warehouse
nor the OLTP systems support an organization's reporting needs. In an ODS, the data is
refreshed in real time. Hence, it is widely preferred for routine activities such as
storing employee records.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of
business, such as sales or finance. In an independent data mart, data can be collected
directly from the sources.
The following are general stages of use of the data warehouse (DWH):
Offline Operational Database
In this stage, data is just copied from an operational system to another server. In this way,
loading, processing, and reporting of the copied data do not impact the operational system's
performance.
Offline Data Warehouse
Data in the data warehouse is regularly updated from the operational database. The data in
the data warehouse is mapped and transformed to meet the data warehouse objectives.
Real-time Data Warehouse
In this stage, data warehouses are updated whenever a transaction takes place in the operational
database, for example in an airline or railway booking system.
Integrated Data Warehouse
In this stage, data warehouses are updated continuously when the operational system performs a
transaction. The data warehouse then generates transactions which are passed back to the
operational system.
Bottom Tier
The bottom tier or data warehouse server usually represents a relational database system.
Back-end tools are used to cleanse, transform and feed data into this layer.
Middle Tier
The middle tier in a data warehouse is an OLAP server, implemented using either the
ROLAP or the MOLAP model. For a user, this application tier presents an abstracted view of the
database. This layer also acts as a mediator between the end user and the database.
Top Tier
This is the front-end client interface (tools and APIs) that gets data out of the
data warehouse. It holds query tools, analysis tools, reporting tools, and data
mining tools.
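The three tiers can be sketched in a few lines of code. This is only a toy illustration: an in-memory SQLite database stands in for the bottom tier, a plain function stands in for the OLAP server, and a print routine stands in for a front-end tool. The names sales, total_by, and print_report are assumptions for this example.

```python
import sqlite3

# Bottom tier: a relational database holding the warehouse data.
database = sqlite3.connect(":memory:")
database.executescript("""
    CREATE TABLE sales (region TEXT, year INTEGER, amount REAL);
    INSERT INTO sales VALUES ('North', 2023, 100.0), ('North', 2024, 150.0),
                             ('South', 2024, 90.0);
""")

# Middle tier: a stand-in for an OLAP server. It exposes an abstracted view
# (totals by a chosen dimension) instead of the raw tables.
ALLOWED_DIMENSIONS = {"region", "year"}

def total_by(dimension):
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    query = (
        f"SELECT {dimension}, SUM(amount) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return database.execute(query).fetchall()

# Top tier: a front-end tool that talks only to the middle tier.
def print_report(dimension):
    for key, total in total_by(dimension):
        print(f"{key}: {total}")

print_report("region")
```

The front end never sees table names or SQL; it asks the middle tier for totals by a dimension, which mirrors the mediator role described above.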
The figure below shows the basic components of a typical warehouse. The Source Data
component is shown on the left. The Data Staging component serves as the next building block.
In the middle is the Data Storage component, which manages the data warehouse data. This
component not only stores and manages the data, it also keeps track of the data by means of the
metadata repository. The Information Delivery component, shown on the right, consists of all the
different ways of making the information from the data warehouse available to the users.
Whether you build a data warehouse for a large manufacturing company or a global banking
institution, the basic components are the same. The main difference for each organization is in
the way these building blocks are arranged. The variation is in the manner in which some of the
blocks are made stronger than others in the architecture.
Fig: Data Warehouse Architecture (External source data, Data Staging, Data Marts, Metadata,
Management & Control, Data Mining, Information Delivery)
Source Data Component:
Source data coming into the data warehouse may be grouped into four broad
categories:
i) Production Data: There is no specific standard of data among the various operational
systems of an enterprise. A term like account may have different meanings in different
systems.
The significant and troubling characteristic of production data is disparity. Your great
challenge is to standardize and transform the disparate data from the various production
systems, convert the data, and integrate the pieces into useful data for storage in the data
warehouse.
ii) Internal Data: In every organization, users keep their "private" spreadsheets, documents,
customer profiles, and sometimes even departmental databases. This is the internal data, parts
of which could be useful in a data warehouse. Internal data adds complexity because it must be
transformed and integrated before it can be stored in the data warehouse. You need strategies
for collecting data from spreadsheets, extracting data from textual documents, and tying into
departmental databases to gather relevant data from those sources.
iii) Archived Data: In every operational system, you periodically take the old data and store it in
archived files. Depending on your data warehouse requirements, you have to include
sufficient historical data. This type of data is useful for discerning patterns and analyzing
trends.
iv) External Data: Most executives depend on data from external sources for a high percentage of
the information they use. They use statistics relating to their industry produced by external
agencies. They use competitors' market share data, for example, to check on their own
performance. Data from outside sources comes in different formats, which need conversion into
internal formats and data types.
Data Staging Component:
After you have extracted data from various operational systems and from external
sources, you have to prepare the data for storing in the data warehouse. The three major
functions of extraction, transformation, and preparation for loading take place in a staging area.
Data staging provides an area with a set of functions to clean, change, combine,
convert, deduplicate, and prepare source data for storage and use in the data warehouse.
Data Extraction: Source data may come from different source machines in diverse data formats.
It may be held in relational database systems, in legacy network and hierarchical data
models, or in flat files. You may also want to include data from spreadsheets and local
departmental data sets. Data extraction may become quite complex, so tools are available on the
market for it. You need to consider using outside tools suitable for certain data sources, or
develop in-house programs to do the data extraction.
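A minimal sketch of an in-house extraction program is shown below, assuming three kinds of sources: a flat file (CSV text), a JSON export, and a relational table. The function names and sample data are hypothetical; only Python standard-library modules are used.

```python
import csv
import io
import json
import sqlite3

def extract_from_csv(text):
    """Read rows from a flat file (CSV text) as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def extract_from_json(text):
    """Read rows from a JSON export."""
    return json.loads(text)

def extract_from_database(conn, query):
    """Read rows from a relational source as dictionaries."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Hypothetical sample sources.
flat_file = "id,name\n1,Asha\n2,Ravi\n"
json_export = '[{"id": 3, "name": "Meera"}]'
source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
source_db.execute("INSERT INTO customers VALUES (4, 'John')")

extracted = (
    extract_from_csv(flat_file)
    + extract_from_json(json_export)
    + extract_from_database(source_db, "SELECT id, name FROM customers")
)
print(len(extracted))
```

Note that the CSV rows carry the id as text while the other sources carry it as a number. Differences of this kind are what the transformation step has to resolve.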
Data Transformation: Data transformation involves many forms of combining pieces of
data from the different sources. A number of individual tasks are involved as part of data
transformation. First, you clean the data extracted from each source. Cleaning may just be
correction of misspellings, or may include resolution of conflicts between state codes and zip
codes in the source data, or may deal with providing default values for missing data elements,
or elimination of duplicates when you bring in the same data from multiple source systems.
Standardization of data elements forms a large part of data transformation. You
standardize the data types and field lengths for the same data elements retrieved from the
various sources. Semantic standardization is another task: when two or more terms from
different source systems mean the same thing, you resolve these synonyms. When the data
transformation function ends, you have a collection of integrated data that is cleaned,
standardized, and summarized. You now have data ready to load into each data set in your data
warehouse.
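The cleaning and standardization tasks described above can be sketched as a single function. The lookup tables, field names, and sample rows below are assumptions made for illustration; real projects would keep such rules as maintained reference data.

```python
# Hypothetical lookup tables used only for this example.
SPELLING_FIXES = {"Calfornia": "California"}
STATE_SYNONYMS = {"California": "CA", "Calif.": "CA", "CA": "CA"}
DEFAULT_STATE = "UNKNOWN"

def transform(rows):
    """Clean, standardize, and de-duplicate extracted customer rows."""
    seen_ids = set()
    result = []
    for row in rows:
        customer_id = int(row["id"])      # standardize the data type
        if customer_id in seen_ids:       # eliminate duplicates across sources
            continue
        seen_ids.add(customer_id)
        state = (row.get("state") or "").strip()
        state = SPELLING_FIXES.get(state, state)   # correct misspellings
        state = STATE_SYNONYMS.get(state, state)   # resolve synonyms
        if not state:                              # default for missing values
            state = DEFAULT_STATE
        result.append(
            {"id": customer_id, "name": row["name"].strip().title(), "state": state}
        )
    return result

raw_rows = [
    {"id": "1", "name": " asha ", "state": "Calfornia"},
    {"id": 1, "name": "Asha", "state": "CA"},      # same customer, second source
    {"id": "2", "name": "ravi", "state": None},    # missing value
    {"id": "3", "name": "Meera", "state": "Calif."},
]
clean_rows = transform(raw_rows)
print(clean_rows)
```

This sketch keeps the first occurrence of a duplicated customer; choosing which source wins is a design decision that depends on which system is trusted for each field.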
Data Loading: Two distinct groups of tasks form the data loading function. When you
complete the design and construction of the data warehouse and go live for the first time, you
do the initial loading of the data into the data warehouse storage. The initial load moves large
volumes of data using up substantial amounts of time. As the data warehouse starts
functioning, you continue to extract the changes to the source data, transform the data
revisions, and feed the incremental data revisions on an ongoing basis.
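The two groups of loading tasks can be sketched as follows, assuming each source row carries an updated_at date that identifies changes since the previous load. The table layout and the helper names load and changes_since are hypothetical; INSERT OR REPLACE is SQLite's way of overwriting a row that has the same primary key.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)

def load(conn, rows):
    """Insert new rows and overwrite rows whose key already exists."""
    conn.executemany(
        "INSERT OR REPLACE INTO customers (id, name, updated_at) VALUES (?, ?, ?)",
        [(r["id"], r["name"], r["updated_at"]) for r in rows],
    )
    conn.commit()

def changes_since(source_rows, last_load_time):
    """Select source rows changed after the previous load (ISO dates compare as text)."""
    return [r for r in source_rows if r["updated_at"] > last_load_time]

# Initial load: move everything once.
source = [
    {"id": 1, "name": "Asha", "updated_at": "2024-01-01"},
    {"id": 2, "name": "Ravi", "updated_at": "2024-01-01"},
]
load(warehouse, source)
last_load_time = "2024-01-01"

# Later the source changes; the incremental load picks up only the revisions.
source[1] = {"id": 2, "name": "Ravi K", "updated_at": "2024-02-01"}
source.append({"id": 3, "name": "Meera", "updated_at": "2024-02-03"})
delta = changes_since(source, last_load_time)
load(warehouse, delta)

loaded = warehouse.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(len(delta), loaded)
```

Only two rows travel in the second load even though the source holds three, which is why incremental loads take far less time than the initial one.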
Fig: Information Delivery Component (Data Warehouse and Data Marts; Online, Intranet,
Internet, and E-Mail delivery; Complex queries, MD Analysis, Statistical Analysis, EIS feed,
Data Mining)
Metadata Component
The metadata component is the data about the data in the data warehouse. The metadata
component serves as a directory of the contents of your data warehouse.
Types of Metadata
Metadata in a data warehouse fall into three major categories:
• Operational Metadata
• Extraction and Transformation Metadata
• End-User Metadata
• Operational Metadata. Data for the data warehouse comes from several
operational systems of the enterprise. These source systems contain different data
structures. The data elements selected for the data warehouse have various field
lengths and data types. Operational metadata contain all of this information about
the operational data sources.
• End-User Metadata. The end-user metadata is the navigational map of the data
warehouse. It enables the end-users to find information from the data warehouse.
The end-user metadata allows the end-users to use their own business terminology
and look for information in those ways in which they normally think of the
business.
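End-user metadata can be pictured as a lookup from business terms to physical locations in the warehouse. The sketch below is a minimal illustration; the terms, table names, and column names are invented for this example.

```python
# Hypothetical end-user metadata: business terms mapped to physical locations.
END_USER_METADATA = {
    "customer name": {"table": "dim_customer", "column": "cust_nm"},
    "monthly revenue": {"table": "fact_sales", "column": "rev_amt_mth"},
}

def locate(business_term):
    """Return 'table.column' for a business term, or None if it is not catalogued."""
    entry = END_USER_METADATA.get(business_term.strip().lower())
    if entry is None:
        return None
    return f"{entry['table']}.{entry['column']}"

print(locate("Monthly Revenue"))
```

A user can ask for "Monthly Revenue" in business language without knowing that the value is stored in a column with a cryptic name.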
Who needs a Data Warehouse?
Airline:
In the airline system, it is used for operational purposes such as crew assignment, analysis of
route profitability, and frequent flyer program promotions.
Banking:
It is widely used in the banking sector to manage available resources effectively.
Some banks also use it for market research and for performance analysis of products and
operations.
Healthcare:
The healthcare sector uses a data warehouse to strategize and predict outcomes, generate
patients' treatment reports, and share data with tie-in insurance companies, medical aid
services, etc.
Public sector:
In the public sector, a data warehouse is used for intelligence gathering. It helps government
agencies to maintain and analyze tax records and health policy records for every individual.
Retail chain:
In retail chains, a data warehouse is widely used for distribution and marketing. It also helps to
track items, customer buying patterns, and promotions, and is used for determining pricing policy.
Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions, and
distribution decisions.
Hospitality Industry:
This industry uses warehouse services to design and estimate its advertising and
promotion campaigns, targeting clients based on their feedback and travel
patterns.
The best way to address the business risk associated with a data warehouse implementation is to
employ a three-prong strategy, as below:
1. Enterprise strategy: Here we identify the technical requirements, including the current
architecture and tools. We also identify facts, dimensions, and attributes. Data mapping and
transformation are also covered.
2. Phased delivery: Data warehouse implementation should be phased based on subject
areas. Related business entities like booking and billing should first be implemented and
then integrated with each other.
3. Iterative prototyping: Rather than a big bang approach to implementation, the
data warehouse should be developed and tested iteratively.
Here are the key steps in data warehouse implementation, along with their deliverables.
No.  Tasks                                          Deliverables
1    Need to define project scope                   Scope Definition
2    Need to determine business needs               Logical Data Model
3    Define Operational Datastore requirements      Operational Data Store Model
4    Acquire or develop Extraction tools            Extract tools and Software
5    Define Data Warehouse Data requirements        Transition Data Model
6    Document missing data                          To Do Project List
7    Maps Operational Data Store to Data Warehouse  D/W Data Integration Map
8    Develop Data Warehouse Database design         D/W Database Design
9    Extract Data from Operational Data Store       Integrated D/W Data Extracts
10   Load Data Warehouse                            Initial Data Load
11   Maintain Data Warehouse                        On-going Data Access and Subsequent Loads
1. MarkLogic:
MarkLogic is a data warehousing solution that makes data integration easier and faster
using an array of enterprise features. This tool helps to perform very complex search operations.
It can query different types of data like documents, relationships, and metadata.
https://www.marklogic.com/product/getting-started/
2. Oracle:
Oracle is a widely used database. It offers a wide range of data warehouse
solutions, both on-premises and in the cloud. It helps to optimize customer experiences by
increasing operational efficiency.
https://www.oracle.com/index.html
3. Amazon Redshift:
Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool for analyzing all
types of data using standard SQL and existing BI tools. It also allows running complex queries
against petabytes of structured data, using the technique of query optimization.
https://aws.amazon.com/redshift/?nc2=h_m1
Data Marts
A data mart is an access layer which is used to get data out to the users. It is presented as an
alternative to a large data warehouse because it takes less time and money to build. However,
there is no standard definition of a data mart; it differs from person to person.
Simply put, a data mart is a subset of a data warehouse. The data mart is a
partition of the data created for a specific group of users.
Data marts can be created in the same database as the data warehouse or in a physically
separate database.
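A data mart kept in the same database as the warehouse can be as simple as a view that exposes only the partition one group needs. The sketch below assumes a hypothetical sales table and a sales_mart view; it uses an in-memory SQLite database purely for illustration.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE sales (region TEXT, department TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', 'sales', 100.0),
        ('North', 'finance', 40.0),
        ('South', 'sales', 60.0);
    -- Hypothetical data mart: only the rows the sales group needs.
    CREATE VIEW sales_mart AS
        SELECT region, amount FROM sales WHERE department = 'sales';
""")

mart_rows = warehouse.execute(
    "SELECT region, amount FROM sales_mart ORDER BY region"
).fetchall()
print(mart_rows)
```

A physically separate data mart would instead copy these rows into its own database, trading extra storage and load jobs for isolation from the main warehouse.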