Presentation Report: Data Warehouse-Star Schema: Punjab University College of Information and Technology
Group Members:
ALI RAZA-BITF20M010
M.WAQAS UD DIN-BITF20M013
NUMEER QADRI-BITF20M021
M.HAMZA FAROOQI-BITF20M031
Instructor:
Definition:
A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
management’s decision making process.
• Subject-oriented, as the warehouse is organized around the major subjects of the enterprise
(such as customers, products, and sales) rather than the major application areas (such as
customer billing, order processing, and product sales). This is reflected in the need to store
decision-support data rather than application-oriented data.
• Integrated, because of the coming together of source data from different enterprise-wide
application systems. The integrated data source must be made consistent to present a unified
view of the data to the users. For example, source A and source B may have different ways of
identifying a product, but in a data warehouse, there will be only a single way of identifying a
product.
• Time-variant, because data in the warehouse is accurate and valid only at some point in time
or over some time interval. Historical data is kept in a data warehouse; for example, one can
retrieve data from 3, 6, or 12 months ago, or even older data. This historical perspective allows
trends and changes to be analyzed over time.
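The "integrated" property above — source A and source B identifying the same product differently, but the warehouse holding only a single identification — can be sketched in Python. The source systems, codes, and surrogate keys here are all hypothetical:

```python
# Two hypothetical source systems identify the same product differently:
# source A uses numeric codes, source B uses alphanumeric SKUs.
source_a_sales = [{"product": 1001, "amount": 250.0}]
source_b_sales = [{"product": "WID-RED", "amount": 175.0}]

# The integration step maps every source-specific code onto a single
# warehouse-wide surrogate key, so all loaded rows identify the
# product the same way.
product_key_map = {
    ("A", 1001): "P-0001",
    ("B", "WID-RED"): "P-0001",
}

def conform(rows, source):
    """Rewrite each row's product code as the shared surrogate key."""
    return [
        {"product_key": product_key_map[(source, r["product"])],
         "amount": r["amount"]}
        for r in rows
    ]

integrated = conform(source_a_sales, "A") + conform(source_b_sales, "B")
```

Once both feeds carry the same surrogate key, queries can aggregate across sources without caring how each application originally labeled the product.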
Goal:
The ultimate goal of data warehousing is to integrate enterprise-wide corporate data into a
single repository from which users can easily run queries, produce reports, and perform
analysis.
Benefits:
• Potential high returns on investment. An organization must commit a huge amount of
resources to ensure the successful implementation of a data warehouse, and the cost can vary
enormously from tens of thousands to millions of dollars due to the variety of technical
solutions available. However, a study by the International Data Corporation (IDC) reported that
data warehouse projects delivered an average three-year return on investment (ROI) of 401%
(IDC, 1996).
• No business can survive without a large and accurate storehouse of historical data, from sales
and inventory data to personnel and intellectual property records. If a business executive
suddenly needs to know the sales of a key product 24 months ago, the rich historical data
provided by a data warehouse makes this possible.
OLTP vs. OLAP:
• Data processing: an OLTP system handles a predictable pattern of data insertions, deletions,
updates, and queries, whereas an OLAP system mainly uses SELECT queries to fetch data.
• Users: an OLTP system is used by end users such as clerks, cashiers, and DBAs, whereas an
OLAP system is used by a large number of users, mostly knowledge workers and business
analysts.
• Storage: an OLTP system requires relatively little space, whereas an OLAP system requires
significantly more.
• Normalization: tables in an OLTP database are normalized (3NF), whereas tables in an OLAP
database are not normalized.
Problems:
Underestimation of Resources for Data
Many developers underestimate the time required to extract, transform, and load the data
into the warehouse. This process may account for a significant proportion of the total
development time.
Data homogenization
Large-scale data warehousing can become an exercise in data homogenization that lessens the
value of the data. For example, when producing a consolidated and integrated view of the
organization’s data, the warehouse designer emphasizes similarities rather than differences in
the data used by different application areas, such as property sales and property renting.
Data Ownership
Data warehousing may change the attitude of end-users to the ownership of data. Sensitive
data that was originally viewed and used only by a particular department or business area, such
as sales or marketing, may now be made accessible to others in the organization.
Long-duration Projects
A data warehouse represents a single data resource for the organization. However, the building of a
warehouse can take several years, which is why some organizations are building data marts. Data marts
support only the requirements of a particular department or functional area and can therefore be built
more rapidly.
Data Warehouse Architecture
Operational Data:
The source of data for the data warehouse is supplied from:
• Mainframe operational data
• Departmental data
• Private data
• External systems (e.g., the Internet)
Building an ODS can be a helpful step toward building a data warehouse, because an ODS can
supply data that has been already extracted from the source systems and cleaned. This means
that the remaining work of integrating and restructuring the data for the data warehouse is
simplified.
ETL Manager:
The ETL manager performs all the operations associated with the ETL of data into the
warehouse.
Warehouse Manager:
The warehouse manager performs all the operations associated with the management of the
data in the warehouse, such as analyzing data to ensure consistency, creating indexes and
views, generating aggregations, and archiving and backing up data.
Query Manager:
The query manager performs all the operations associated with the management of user
queries. The complexity of the query manager is determined by the facilities provided by the
end-user access tools and the database.
Detailed Data:
This area of the warehouse stores all the detailed data in the database schema.
Archive/Backup Data:
This area of the warehouse stores the detailed and summarized data for the purposes of
archiving and backup. Even though summary data is generated from detailed data, it may be
necessary to back up online summary data if this data is kept beyond the retention period for
detailed data.
Metadata:
This area of the warehouse stores all the metadata (data about data) definitions used by all the
processes in the warehouse. Metadata is used for a variety of purposes, including the
extraction and loading processes, warehouse management, and query management.
The structure of metadata differs between each process, because the purpose is different.
This means that multiple copies of metadata describing the same data item are held within
the data warehouse.
Although the definitions of end-user access tools can overlap, for the purpose of this
discussion, we categorize these tools into four main groups: data reporting and query tools,
application development tools, online analytical processing (OLAP) tools, and data mining
tools.
These tools assume that the data is organized in a multidimensional model supported by a
special multidimensional database (MDDB) or by a relational database designed to enable
multidimensional queries.
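As a rough sketch of what a multidimensional query does, the following Python snippet aggregates a measure along two hypothetical dimensions (product and quarter) and then rolls one dimension up; real OLAP tools perform the same operations against an MDDB or a relational schema:

```python
from collections import defaultdict

# Hypothetical fact rows with two dimensions (product, quarter) and one
# measure (amount); an OLAP tool would hold such data in a cube.
facts = [
    ("widget", "Q1", 100.0),
    ("widget", "Q2", 150.0),
    ("gadget", "Q1", 80.0),
]

# Aggregate the measure along both dimensions -- the core operation
# behind a multidimensional (cross-tab) query.
cube = defaultdict(float)
for product, quarter, amount in facts:
    cube[(product, quarter)] += amount

# "Roll up" the quarter dimension by summing it away, giving totals
# per product.
by_product = defaultdict(float)
for (product, _quarter), amount in cube.items():
    by_product[product] += amount
```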
Data mining has the potential to supersede the capabilities of OLAP tools, as the major
attraction of data mining is its ability to build predictive rather than retrospective models.
Transformation:
This step applies a series of rules or functions to the extracted data to prepare it for analysis.
The output from the transformations is data that is clean and consistent with the data already
held in the warehouse, and furthermore, is in a form that is ready for analysis by users of the
warehouse. It also enables users to perform queries on the EDW data that drill down to the
most detailed data.
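A minimal sketch of such transformation rules, assuming hypothetical field names and source formats (trimming codes, normalizing dates, coercing types):

```python
# A minimal sketch of transformation rules; the field names and rules
# are illustrative, not taken from any particular ETL tool.
def clean_record(raw):
    """Apply a series of rules so the row is clean and consistent with
    data already held in the warehouse."""
    return {
        # Rule 1: trim and upper-case codes so joins match reliably.
        "branch_code": raw["branch_code"].strip().upper(),
        # Rule 2: standardize DD/MM/YYYY source dates to ISO format.
        "sale_date": "-".join(reversed(raw["sale_date"].split("/"))),
        # Rule 3: coerce amounts to a numeric type, defaulting to 0.
        "amount": float(raw.get("amount") or 0),
    }

extracted = [{"branch_code": " b005 ", "sale_date": "01/06/2024",
              "amount": "350"}]
transformed = [clean_record(r) for r in extracted]
```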
Loading:
Loading of the data into the warehouse can occur after all the transformations have taken place.
As the data loads into the warehouse, additional constraints defined in the database schema as
well as in triggers activated upon data loading will be applied (such as uniqueness, referential
integrity, and mandatory fields), which also contribute to the overall data quality performance
of the ETL process. Important issues relating to the loading step are determining the frequency
of loading and establishing how loading is going to affect the data warehouse availability.
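The role of schema constraints at load time can be sketched with an in-memory SQLite table; the table, rows, and per-row transaction policy are illustrative assumptions:

```python
import sqlite3

# Hypothetical fact table whose constraints (uniqueness, mandatory
# fields, value checks) filter bad rows during loading.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        sale_id INTEGER PRIMARY KEY,               -- uniqueness
        product TEXT NOT NULL,                     -- mandatory field
        amount  REAL NOT NULL CHECK (amount >= 0)  -- validity check
    )
""")

rows = [
    (1, "widget", 250.0),
    (1, "gadget", 100.0),   # duplicate key -> rejected
    (2, None, 50.0),        # missing mandatory field -> rejected
]

loaded, rejected = 0, 0
for row in rows:
    try:
        with conn:  # each row loads in its own transaction
            conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)", row)
        loaded += 1
    except sqlite3.IntegrityError:
        rejected += 1
```

Rejected rows would normally be logged for correction and reloading rather than silently dropped.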
Metadata management:
Metadata repository is managed by the ETL tool and retains information on warehouse data
regarding the details of the source system, details of any transformations on the data, and
details of any merging or splitting of data. This full data history (also called data lineage) is
available to users of the warehouse data and can facilitate the validation of query results or
provide an explanation for some anomaly shown in the result set that was caused by the ETL
process.
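A minimal sketch of what a data-lineage record might capture; the field names and entries are assumptions for illustration, not any real ETL tool's metadata schema:

```python
# Each entry records where a warehouse item came from and what was
# done to it -- the "full data history" (data lineage) of the item.
lineage = []

def record_step(item, source, operation, detail):
    lineage.append({"item": item, "source": source,
                    "operation": operation, "detail": detail})

record_step("sales_fact.amount", "orders_db.order_line.price",
            "transform", "converted currency to USD")
record_step("sales_fact", ["orders_db", "returns_db"],
            "merge", "union of order and return events")

# A warehouse user validating a query result can walk this history to
# explain an anomaly caused by the ETL process.
history = [e for e in lineage if "sales_fact" in str(e["item"])]
```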
Query performance:
Fact-based management and ad hoc analysis must not be slowed or inhibited by the performance of the
data warehouse DBMS. Large, complex queries for key business operations must complete in reasonable
time periods.
Highly scalable:
The DBMS must support mass storage devices such as optical disk and hierarchical storage
management devices. Query performance should not depend on the size of the database, but
rather on the complexity of the query.
Parallel DBMSs:
Data warehousing requires the processing of enormous amounts of data, and parallel database
technology offers a solution by providing the necessary growth in performance. The parallel
DBMS performs many database operations simultaneously, splitting individual tasks into
smaller parts so that tasks can be spread across multiple processors. They must be able to
decompose large complex queries into sub queries, run the separate sub queries
simultaneously, and reassemble the results at the end.
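The decompose / run-simultaneously / reassemble pattern described above can be sketched in Python, with in-memory partitions standing in for data spread across multiple processors:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data partitions, one per processor in a parallel DBMS.
partitions = [
    [120.0, 80.0],
    [200.0],
    [40.0, 60.0, 100.0],
]

def subquery(rows):
    """One sub-query: aggregate a single partition independently."""
    return sum(rows)

# Run the sub-queries simultaneously, then reassemble the partial
# results into the final answer.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(subquery, partitions))
total = sum(partials)
```

A real parallel DBMS applies the same idea to joins and sorts as well as aggregates, and handles data skew between partitions.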
Synchronizing metadata:
The major integration issue is how to synchronize the various types of metadata used
throughout the data warehouse. The various tools of a data warehouse generate and use their
own metadata, and to achieve integration, we require that these tools are capable of sharing
their metadata. The challenge is to synchronize metadata between different products from
different vendors using different metadata stores. For example, it is necessary to identify the
correct item of metadata at the right level of detail from one product and map it to the
appropriate item of metadata at the right level of detail in another product, then sort out any
coding differences between them. The task of synchronizing two products is highly complex,
and therefore repeating this process for all the products that make up the data warehouse can
be resource-intensive. However, integration of the metadata must be achieved.
Data Mart
Introduction:
A database that contains a subset of corporate data to support the analytical
requirements of a particular business unit (such as the sales department) or to support users who
share the same requirements to analyze a particular business process (such as property sales).
A data mart is a subset of the data warehouse, used to store the data of a specific department
such as sales, marketing, or personnel.
As data warehouses have grown in popularity, so has the related concept of data marts. There is general
agreement that a data mart is built to support the analytical requirements of a particular group
of users, and in providing this support, the data mart stores only a subset of corporate data. In
Kimball’s methodology, a data mart is the physical implementation of a single star schema
(dimensional model) modeled on a particular business process (such as property sales). The users
of Kimball’s data mart can be spread throughout an enterprise but share the same requirement
to analyze a particular business process.
Reasons for Creating a Data Mart:
There are many reasons for creating a data mart, including:
• To give users access to the data they need to analyze most often.
• To improve end-user response time, since a data mart holds less data than the full warehouse.
• Because building a data mart is simpler, faster, and less costly than building a data warehouse.
Temporal Databases:
Data warehousing (DW) is the process of collecting and managing data from varied
sources to provide meaningful business insights. The data warehouse is the core of the BI
system, which is built for data analysis and reporting.
A temporal database is a database that contains time-varying historical data, with the
possible inclusion of current and future data, and has the ability to manipulate this data.
Examples of transactional data that change over time in the Dream Home case study include the
position and salary of staff; the monthly rental (rent) and owner (ownerNo) of properties; and the
preferred type of property (prefType) and maximum rent set by clients seeking to rent properties.
However, the key difference between Dream Home’s transactional database and data warehouse is that
the transactional database commonly presents the data as being non-temporal and only holds the
current value of the data, while the data warehouse presents the data as being temporal and must
hold all past, present, and future versions of the data. To illustrate the complexity of dealing with
temporal data, consider the following two scenarios concerning the temporal monthly rent
values for Dream Home’s properties.
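One common way to hold past, present, and future versions of a value is to attach a pair of validity timestamps to each row. The sketch below uses SQLite with illustrative names loosely echoing the Dream Home example; the table design and sentinel date are assumptions:

```python
import sqlite3

# Each row carries the period over which its rent value is valid, so
# the table holds every version of the value, not just the current one.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE property_rent (
        property_no TEXT,
        rent        REAL,
        valid_from  TEXT,   -- inclusive
        valid_to    TEXT    -- exclusive; '9999-12-31' = still current
    )
""")
conn.executemany("INSERT INTO property_rent VALUES (?, ?, ?, ?)", [
    ("PG4", 350.0, "2023-01-01", "2024-01-01"),   # past value
    ("PG4", 375.0, "2024-01-01", "9999-12-31"),   # current value
])

def rent_as_of(property_no, date):
    """Return the rent that was (or will be) valid on a given date."""
    row = conn.execute(
        """SELECT rent FROM property_rent
           WHERE property_no = ? AND valid_from <= ? AND ? < valid_to""",
        (property_no, date, date)).fetchone()
    return row[0] if row else None

past = rent_as_of("PG4", "2023-06-15")     # historical version
current = rent_as_of("PG4", "2024-06-15")  # current version
```

SQL:2011 standardizes this idea as application-time period tables, so the validity period is managed declaratively rather than by hand-written predicates.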
Temporal Extensions to the SQL Standard:
In this section, we examine the temporal extensions presented in the latest SQL standard.
The star schema exploits the characteristics of factual data such that facts are generated
by events that occurred in the past, and are unlikely to change, regardless of how they are
analyzed. As the bulk of data in a data warehouse is represented as facts, the fact tables can be
extremely large relative to the dimension tables. Dimension tables, by contrast, generally
contain descriptive textual information. Dimension attributes are used as the constraints in
data warehouse queries. Star schemas can be used to speed up
query performance by denormalizing reference data into a single dimension table.
Denormalization is appropriate when there are a number of entities related to the dimension
table that are often accessed, avoiding the overhead of having to join additional tables to
access those attributes. Denormalization is not appropriate where the additional data is not
accessed very often, because the overhead of scanning the expanded dimension table may not
be offset by any gain in the query performance.
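A minimal star schema along these lines, with a denormalized dimension table (city and region folded into one branch dimension), can be sketched with SQLite; all table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized dimension: city and region are folded into the
    -- branch dimension rather than held in separate joined tables.
    CREATE TABLE dim_branch (
        branch_key INTEGER PRIMARY KEY,
        city       TEXT,
        region     TEXT
    );
    -- Large fact table: one row per sales event, keyed to the dimension.
    CREATE TABLE fact_sales (
        branch_key INTEGER REFERENCES dim_branch(branch_key),
        amount     REAL
    );
""")
conn.executemany("INSERT INTO dim_branch VALUES (?, ?, ?)",
                 [(1, "Glasgow", "North"), (2, "London", "South")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 200.0)])

# Dimension attributes act as the query constraints: a single join
# reaches both city and region because the dimension is denormalized.
rows = conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_branch d USING (branch_key)
    GROUP BY d.region ORDER BY d.region
""").fetchall()
```

Had city and region been normalized into separate tables, the same query would need two joins instead of one, which is exactly the overhead the denormalized dimension avoids.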