Professional Documents
Culture Documents
Data Warehouse
Data Warehouse
Principles
In computing, a data warehouse (DW) is often considered a business intelligence system used for
data analysis and for the day-to-day reporting needs of a business unit.
A data warehouse is a central repository of integrated data from one or more unequal sources.
The data warehouse contains all the data related to the business. Information systems can be
categorized in many aspects that can be long or limited. These include classification by purpose
of such systems in enterprises. The focus here will be on both operational information systems
and decision support systems. Operational information systems are concerned with the
management of the day-to-day operations of organizations and are the cornerstone of modern
enterprises.
Examples of such systems are Accounting systems, hospital patient management systems,
application management systems on a commercial website and student registration systems at
colleges and institutes. All these systems are concerned with the day-to-day operations of
enterprises and indispensable in modern institutions.
Obviously, operational information systems store and process a large amount of data every day,
and that data increases overtime to reach in some institutions to millions of gigabytes and
terabytes. Sometimes such data exist in insurance companies, universities, hospitals, and public
government sectors. Such data is a real asset and is invaluable if it is exploited by enterprises to
study and analyze and then presented to the decision-makers in the establishments in a way that
enables them to identify the performance of these facilities and formulate long-term strategies for
them. Hence, there was a need only for Decision Support Systems (DSS). One of the most
important applications of these systems, which is the main pillar of these systems, is Data
Warehousing.
Characteristics of Data warehouse
Data warehouses are systems that are concerned with studying, analyzing and presenting
enterprise data in a way that enables senior management to make decisions.
The data warehouses have some characteristics that distinguish them from any other data and
these characteristics are as follows:
Building a data warehouse in an organization is not an easy task, and requires a professional
team in business information systems analysis; it is also important to have professional technical
skills in that team.
Now, we will present the technical steps to build data warehouses, far from the analysis and
design phase that is very close to the business view. These steps are arranged as follows:
1. Creating Data Preparation Area: a database with very high storage capacity; It stores all
data coming from different Operational Systems so that data is purified and modified before
being loaded into the data warehouse. At this stage, the design of the database should be
compatible with the design of the data warehouse.
2. Creating Data Warehouse: A data warehouse is created where data will be uploaded after it
is extracted and purified. Data warehouses are always designed to allow relationships of different
dimensions, such as the relationship between the admissions of students to a specialization in a
given year.
3. Data Marts: Some large organizations divide the data warehouse into a set of data marts.
There is a Data Mart for financial management data and another for human resources
management or division based on the branches of the institution.
4. Extract, Transform, and Load (ETL) Process: At this stage, we extract data from different
sources to the Data Preparation Area and we convert it from one image to another if required.
We also often integrate some data with each other or define new data that did not exist before, in
addition to clearing the wrong data and deleting unimportant ones. At the stage of Loading Data,
The data is loaded from the Data Preparation Area to Data Warehouse. During this phase, the
data is tested.
5. Creating Decision Support Systems (DSS) Applications: At this stage, implement DSS
applications for displaying and analyzing the data warehouse, called Online Analytical
Processing (OLAP) Systems. These applications display data in several dimensions and use
complex algorithms to analyze data. At this stage, data mining tools are also used to recognize
patterns in the data sets and analyze big data.
In this Architecture, the data warehouse system is divided into three tiers (levels); Bottom Tier,
Middle Tier, and Top-Tier.
1. Bottom Tier: It is the data warehouse database server and an RDBMS (Relational Database
Management System). This may include many specific data marts and metadata repositories.
You can use back end tools and application program interfaces called a gateway to feed data at
the bottom level. The gateways and back end tools perform the extraction, cleaning, loading,
and updating functions. Examples of gateways contain Open-Linking and Embedding for
Databases (OLE-DB), Open Database Connection (ODBC) and Java Database Connection
(JDBC).
2. Middle Tier: A middle-tier consists of an online analytical processing (OLAP) server which
can be executed for a quick query of the data warehouse. The OLAP server is implemented in the
following ways:
I. Using the ROLAP (Relational online analytical processing) model, i.e., an extended relational
database management system (DBMS) that assigns tasks to multidimensional data for standard
relational operations.
II. Using the MOLAP (multidimensional online analytics processing) model, ie a special-
purpose server that directly implements multidimensional databases and operations.
3. Top-Tier: A top-tier consists of front-end tools that contain reporting tools analysis tools and
query tools to display the results provided by online analytical processing systems as well as
additional tools for data mining.
Load Processing
In a data warehouse, several phases and steps must be taken to load new data and process it
including filtering, reformatting, data conversion, indexing, and metadata updates.
Load Performance
Data warehouses need to increase the loading of new data from time to time within narrow time
windows. Load performance should be measured in gigabytes per hour and in hundreds of
millions of rows and should not artificially disrupt the data business volume.
Strategic Adaptability
Query Performance
Reliable database management and the fact-based direction should not be decelerated by the
performance of RDBMS (relational database management system). Massive and complex
queries should be completed in seconds, not hours or days.
Terabyte Scalability
A Terabyte contains more than a trillion bytes of digital information - 1015 bytes. Today, the
size of the data warehouses is evolving at staggering rates. This ranges from a few bytes to
hundreds of terabytes and gigabytes sized data warehouses. So, scalability should be taken into
account in measurement.