

PUNJAB UNIVERSITY COLLEGE OF

INFORMATION AND TECHNOLOGY

Presentation Report: Data Warehouse-Star Schema

Group Members:

ALI RAZA-BITF20M010
M.WAQAS UD DIN-BITF20M013
NUMEER QADRI-BITF20M021
M.HAMZA FAROOQI-BITF20M031

Instructor:

Sir Asif Sohail


DATA WAREHOUSE
History:
The original concept of a data warehouse was devised by IBM in the 1980s as the "information warehouse" and presented as a solution for accessing data held in nonrelational systems. The information warehouse was proposed to allow organizations to use their data archives to help them gain a business advantage. However, due to the sheer complexity and performance problems associated with implementing such solutions, the early attempts at creating an information warehouse were mostly rejected. Since then, the concept of data warehousing has been raised several times, but only in recent years has data warehousing been seen as a valuable and viable solution. One of the earliest promoters of data warehousing is Bill Inmon, who has earned the title of "father of data warehousing."

Definition:
A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
management’s decision making process.

In this early definition by Inmon (1993), the data is:

• Subject-oriented, as the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as customer billing, order processing, and product sales). This is reflected in the need to store decision-support data rather than application-oriented data.

• Integrated, because of the coming together of source data from different enterprise-wide application systems. The integrated data must be made consistent to present a unified view of the data to the users. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.

• Time-variant, because data in the warehouse is accurate and valid only at some point in time or over some time interval. Historical data is kept in a data warehouse; for example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This feature:

 Allows the analysis of the past.
 Relates information to the present.
 Enables forecasting and prediction of the future.

• Nonvolatile, as the data is not updated in real time. Once data is in the data warehouse, it will not change, so historical data in a data warehouse should never be altered. New data is always added as a supplement to the database, rather than as a replacement.
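To make the time-variant and nonvolatile properties concrete, the following is a minimal SQL sketch; the table and column names (sales_fact, date_key, and so on) are illustrative assumptions, not taken from any particular system:

    -- Illustrative warehouse table keyed by a point in time.
    CREATE TABLE sales_fact (
        date_key     INTEGER       NOT NULL,  -- e.g. 20240131: the point in time the row describes
        product_key  INTEGER       NOT NULL,
        store_key    INTEGER       NOT NULL,
        quantity     INTEGER       NOT NULL,
        amount       DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (date_key, product_key, store_key)
    );

    -- Time-variant: each load appends rows for a new point in time.
    -- Nonvolatile: existing rows are never updated or deleted.
    INSERT INTO sales_fact VALUES (20240131, 101, 7, 25, 1250.00);
    INSERT INTO sales_fact VALUES (20240229, 101, 7, 30, 1500.00);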

Goal:
The ultimate goal of data warehousing is to integrate enterprise-wide corporate data into a single repository from which users can easily run queries, produce reports, and perform analysis. In particular, the warehouse should provide:

 Secure and easy access to information for users
 Provision of consistent information
 Clean and authentic data for analysis
 Accurate and verified data collected from multiple sources

Benefits:
• Potential high returns on investment. An organization must commit a huge amount of
resources to ensure the successful implementation of a data warehouse, and the cost can vary
enormously from tens of thousands to millions of dollars due to the variety of technical
solutions available. However, a study by the International Data Corporation (IDC) reported that
data warehouse projects delivered an average three-year return on investment (ROI) of 401%
(IDC, 1996).

• No business can survive without a large and accurate storehouse of historical data, from sales and inventory data to personnel and intellectual property records. If a business executive suddenly needs to know the sales of a key product 24 months ago, the rich historical data provided by a data warehouse makes this possible.

• Increased productivity of corporate decision makers. Data warehousing improves the productivity of corporate decision makers by creating an integrated database of consistent, subject-oriented, historical data. It integrates data from multiple incompatible systems into a form that provides one consistent view of the organization. By transforming data into meaningful information, a data warehouse allows corporate decision makers to perform more substantive, accurate, and consistent analysis.
Comparison of OLTP systems and data warehousing systems

Characteristic    | OLTP systems                                                        | Data warehousing systems
Main purpose      | Support operational processing                                      | Support analytical processing
Focus             | Inserting, updating, and deleting information in the database      | Extracting data for analysis to support decision making
Data age          | Current                                                             | Historical
Data processing   | Predictable pattern of insertions, deletions, updates, and queries | Largely read-only; SELECT queries fetch data for analysis
Users             | Operational end-users such as clerks, cashiers, and DBAs           | Knowledge workers such as business analysts and managers
Storage           | Requires relatively small space                                     | Requires significantly larger space
Processing time   | Transaction processing time is comparatively short                 | Analytical (OLAP) queries take comparatively longer
Queries           | Simple queries                                                      | Complex queries
Normalization     | Tables are normalized (3NF)                                         | Tables are typically denormalized
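As a rough illustration of the contrast in the table above, consider the following hypothetical pair of statements (the table and column names are assumptions, not drawn from a real system): the OLTP statement is a short, predictable write against current data, while the warehouse query is a read-only, multi-table aggregation over history.

    -- OLTP: short transaction touching one current row.
    UPDATE account
    SET    balance = balance - 150.00
    WHERE  account_no = 12345;

    -- Data warehouse: complex, read-only query over historical data.
    SELECT d.year, d.quarter, p.category, SUM(f.amount) AS total_sales
    FROM   sales_fact f
           JOIN date_dim    d ON f.date_key    = d.date_key
           JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY d.year, d.quarter, p.category
    ORDER BY d.year, d.quarter, p.category;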

Problems:
 Underestimation of Resources for Data Loading
Many developers underestimate the time required to extract, transform, and load the data into the warehouse. This process may account for a significant proportion of the total development time.

 Hidden Problems with Source Systems
Hidden problems associated with the source systems feeding the data warehouse may be identified, possibly after years of remaining undetected. For example, when entering the details of a new property, certain fields may allow nulls, which may result in staff entering incomplete property data even when the data is available and applicable.
 Increased Demand from Users
After success with the initial few queries, users of the facility may pose more complicated queries, which increases the workload on the system and server. There may also be an increase in the number of queries posed by staff, which further increases the server load.

 Data Homogenization
Large-scale data warehousing can become an exercise in data homogenization that lessens the value of the data. For example, when producing a consolidated and integrated view of the organization's data, the warehouse designer may emphasize similarities rather than differences in the data used by different application areas, such as property sales and property renting.

 High Demand for Resources
The data warehouse can use large amounts of disk space. Many relational databases used for decision support are designed around star, snowflake, and starflake schemas. If there are many dimensions to the factual data, the combination of aggregate tables and indexes on the fact tables can use up more space than the raw data.

 Data Ownership
Data warehousing may change the attitude of end-users to the ownership of data. Sensitive
data that was originally viewed and used only by a particular department or business area, such
as sales or marketing, may now be made accessible to others in the organization.

 Long-duration Projects
A data warehouse represents a single data resource for the organization. However, the building of a
warehouse can take several years, which is why some organizations are building data marts. Data marts
support only the requirements of a particular department or functional area and can therefore be built
more rapidly.
Data Warehouse Architecture
Operational Data:
The source of data for the data warehouse is supplied from:
 Mainframe operational data
 Departmental data
 Private data
 External systems, e.g., the Internet

Operational Data Store:
An operational data store (ODS) is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.

Building an ODS can be a helpful step toward building a data warehouse, because an ODS can supply data that has already been extracted from the source systems and cleaned. This means that the remaining work of integrating and restructuring the data for the data warehouse is simplified.
ETL Manager:
The ETL manager performs all the operations associated with the ETL of data into the
warehouse.

Warehouse Manager:
The warehouse manager performs all the operations associated with the management of the
data in the warehouse. The operations performed by the warehouse manager include:

 Checking the consistency of data
 Transforming and merging data from temporary storage into the warehouse tables
 Creating indexes and views on base tables
 Backing up and archiving data

Query Manager:
The query manager performs all the operations associated with the management of user
queries. The complexity of the query manager is determined by the facilities provided by the
end-user access tools and the database.

Detailed Data:
This area of the warehouse stores all the detailed data in the database schema.

Lightly and Highly Summarized Data:
This area of the warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

The purpose of summary information is to speed up the performance of queries. Although there are increased operational costs associated with initially summarizing the data, this is offset by removing the requirement to continually perform summary operations.
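One common way to hold lightly summarized data is a pre-aggregated table (or a materialized view, where the DBMS supports one) that the warehouse manager refreshes periodically. The sketch below reuses the illustrative sales_fact and date_dim tables introduced earlier:

    -- Pre-compute monthly sales per product once, so that user queries
    -- do not have to re-aggregate the detailed fact rows each time.
    CREATE TABLE monthly_sales_summary AS
    SELECT d.year,
           d.month,
           f.product_key,
           SUM(f.amount)   AS total_amount,
           SUM(f.quantity) AS total_quantity
    FROM   sales_fact f
           JOIN date_dim d ON f.date_key = d.date_key
    GROUP BY d.year, d.month, f.product_key;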

Archive/Backup Data:
This area of the warehouse stores the detailed and summarized data for the purposes of
archiving and backup. Even though summary data is generated from detailed data, it may be
necessary to back up online summary data if this data is kept beyond the retention period for
detailed data.
Metadata:
This area of the warehouse stores all the metadata (data about data) definitions used by all the
processes in the warehouse. Metadata is used for a variety of purposes, including:

 The extraction and loading processes—metadata is used to map data sources to a common view of the data within the warehouse.
 The warehouse management process—metadata is used to automate the production of summary tables.
 As part of the query management process—metadata is used to direct a query to the most appropriate data source.

The structure of metadata differs between each process, because the purpose is different.
This means that multiple copies of metadata describing the same data item are held within
the data warehouse.

End-User Access Tools:
The principal purpose of data warehousing is to support decision makers. These users interact with the warehouse using end-user access tools. High performance is achieved by preplanning the requirements for joins, summations, and periodic reports by end-users.

Although the definitions of end-user access tools can overlap, for the purpose of this discussion, we categorize these tools into four main groups:

 reporting and query tools
 application development tools
 OLAP tools
 data mining tools

Reporting and query tools:
Reporting tools include production reporting tools and report writers. Production reporting tools are used to generate regular operational reports or support high-volume batch jobs, such as customer orders/invoices and staff paychecks. Query tools for relational data warehouses are designed to accept SQL or generate SQL statements to query data stored in the warehouse. These tools shield end-users from the complexities of SQL and database structures by including a meta layer between users and the database. The meta layer is the software that provides subject-oriented views of a database and supports "point-and-click" creation of SQL. An example of a query tool is Query-By-Example (QBE).
Application development tools:
The built-in capabilities of reporting and query tools may be inadequate for end-users' requirements, either because the required analysis cannot be performed or because the user interaction requires an unreasonably high level of expertise from the user. In this situation, user access may require the development of in-house applications. Some of these application development tools integrate with popular OLAP tools.

Online Analytical Processing (OLAP) tools:
OLAP tools are based on the concept of multidimensional databases and allow a sophisticated user to analyze the data using complex, multidimensional views.

These tools assume that the data is organized in a multidimensional model supported by a special multidimensional database (MDDB) or by a relational database designed to enable multidimensional queries.
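Many relational DBMSs expose this kind of multidimensional aggregation through the standard GROUP BY CUBE and ROLLUP operators. The following sketch, using the same illustrative tables as above, computes totals for every combination of year and region in a single statement, a relational approximation of an OLAP cube:

    -- Subtotals for every combination of year and region, plus a grand total.
    SELECT d.year, s.region, SUM(f.amount) AS total_sales
    FROM   sales_fact f
           JOIN date_dim  d ON f.date_key  = d.date_key
           JOIN store_dim s ON f.store_key = s.store_key
    GROUP BY CUBE (d.year, s.region);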

Data mining tools:
Data mining is the process of discovering meaningful new correlations, patterns, and trends by mining large amounts of data using statistical, mathematical, and AI techniques.

Data mining has the potential to supersede the capabilities of OLAP tools, as the major
attraction of data mining is its ability to build predictive rather than retrospective models.

Extraction, Transformation, and Loading (ETL)

Extraction:
The extraction step targets one or more data sources for the EDW; these sources typically include OLTP databases but can also include sources such as personal databases and spreadsheets, enterprise resource planning (ERP) files, and web usage log files. The complexity of the extraction step depends on how similar or different the source systems are for the EDW. The extraction step normally copies the extracted data to temporary storage referred to as the operational data store (ODS) or staging area.

Transformation:
The transformation step applies a series of rules or functions to the extracted data to prepare it for analysis. The output from the transformations is data that is clean and consistent with the data already held in the warehouse and, furthermore, is in a form that is ready for analysis by users of the warehouse. This allows users to perform queries on the EDW data that are capable of drilling down to the most detailed data.
Loading:
Loading of the data into the warehouse can occur after all the transformations have taken place. As the data loads into the warehouse, additional constraints defined in the database schema, as well as triggers activated upon data loading, will be applied (such as uniqueness, referential integrity, and mandatory fields); these also contribute to the overall data quality of the ETL process. Important issues relating to the loading step are determining the frequency of loading and establishing how loading will affect data warehouse availability.
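The short sketch below shows one way a transformation-and-load step might look in SQL; the staging and warehouse table names and the cleaning rules are assumptions used only for illustration:

    -- Warehouse dimension table whose constraints enforce data quality at load time.
    CREATE TABLE property_dim (
        property_key  INTEGER      PRIMARY KEY,                 -- uniqueness
        property_no   VARCHAR(10)  NOT NULL UNIQUE,             -- mandatory field
        city          VARCHAR(30)  NOT NULL,
        property_type VARCHAR(10)  NOT NULL
                      CHECK (property_type IN ('FLAT', 'HOUSE'))
    );

    -- Transform rows held in the staging area (ODS) and load them into the warehouse.
    INSERT INTO property_dim (property_key, property_no, city, property_type)
    SELECT s.surrogate_key,
           UPPER(TRIM(s.property_no)),           -- cleaning
           COALESCE(s.city, 'UNKNOWN'),          -- handle missing values
           CASE WHEN s.type IN ('F', 'FLAT')  THEN 'FLAT'
                WHEN s.type IN ('H', 'HOUSE') THEN 'HOUSE'
           END                                   -- standardize codes
    FROM   staging_property s;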

Data profiling and data quality control:
Data profiling provides important information about the quantity and quality of the data coming from the source systems. For example, data profiling can indicate how many rows have missing, incorrect, or incomplete data entries and the distribution of values in each column. This information can help to identify the transformation steps required to clean the data and/or change the data into a form suitable for loading into the warehouse.
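A few simple aggregate queries over a staging table can already provide useful profiling figures; the sketch below assumes the hypothetical staging_property table used above:

    -- Row count, missing/invalid values, and the spread of values in key columns.
    SELECT COUNT(*)                                          AS total_rows,
           SUM(CASE WHEN city IS NULL THEN 1 ELSE 0 END)     AS missing_city,
           SUM(CASE WHEN rent IS NULL OR rent <= 0
                    THEN 1 ELSE 0 END)                       AS invalid_rent,
           COUNT(DISTINCT type)                              AS distinct_types,
           MIN(rent)                                         AS min_rent,
           MAX(rent)                                         AS max_rent
    FROM   staging_property;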

Metadata management:
The metadata repository is managed by the ETL tool and retains information on warehouse data, including the details of the source system, details of any transformations on the data, and details of any merging or splitting of data. This full data history (also called data lineage) is available to users of the warehouse data and can facilitate the validation of query results or provide an explanation for any anomaly in the result set that was caused by the ETL process.

Data Warehouse DBMS:
There are few integration issues associated with the data warehouse database. Due to the maturity of such products, most relational databases will integrate predictably with other types of software. Parallelism in the database becomes an important issue, as well as the usual issues such as performance, scalability, availability, and manageability, which must all be taken into consideration when choosing a DBMS.

Load performance:
Performance of the load process should be measured in hundreds of millions of rows or gigabytes of data per hour, and there should be no maximum limit that constrains the business. Many steps must be taken to load new or updated data into the data warehouse, including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update. Although each step may in practice be atomic, the load process should appear to execute as a single, seamless unit of work.

Data quality management:
The shift to fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database sizes. While loading and preparation are necessary steps, they are not sufficient: the ability to answer end-users' queries is the measure of success for a data warehouse application. As more questions are answered, analysts tend to ask more creative and complex questions.

Query performance:
Fact-based management and ad hoc analysis must not be slowed or inhibited by the performance of the
data warehouse DBMS. Large, complex queries for key business operations must complete in reasonable
time periods.

Highly scalable:
The DBMS must support mass storage devices such as optical disks and hierarchical storage management devices. Query performance should not depend on the size of the database, but rather on the complexity of the query.

Networked data warehouse:
Data warehouse systems should be capable of cooperating in a larger network of data warehouses. The data warehouse must include tools that coordinate the movement of subsets of data between warehouses.

Parallel DBMSs:
Data warehousing requires the processing of enormous amounts of data, and parallel database technology offers a solution to providing the necessary growth in performance. A parallel DBMS performs many database operations simultaneously, splitting individual tasks into smaller parts so that the work can be spread across multiple processors. It must be able to decompose large complex queries into subqueries, run the separate subqueries simultaneously, and reassemble the results at the end.

Data Warehouse Metadata:
The major purpose of metadata is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse. The metadata associated with data transformation and loading must describe the source data and any changes that were made to the data. For example, for each source field there should be a unique identifier, the original field name, the source data type, and the original location, along with the destination data type and destination table name. If the field is subject to any transformations, ranging from a simple field-type change to a complex set of procedures and functions, this should also be recorded. There is also metadata associated with user queries that includes, for example, information describing what the term "price" or "customer" means in a particular database and whether the meaning has changed over time.
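A minimal sketch of how such source-to-destination metadata could be recorded as a table; the structure below is illustrative rather than any standard repository layout:

    -- One row per source field, recording its origin, destination, and any transformation.
    CREATE TABLE etl_field_metadata (
        field_id           INTEGER      PRIMARY KEY,   -- unique identifier
        source_field_name  VARCHAR(50)  NOT NULL,
        source_data_type   VARCHAR(30)  NOT NULL,
        source_location    VARCHAR(100) NOT NULL,      -- originating system/table
        destination_table  VARCHAR(50)  NOT NULL,
        destination_type   VARCHAR(30)  NOT NULL,
        transformation     VARCHAR(200)                -- rule or procedure applied, if any
    );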

Synchronizing metadata:
The major integration issue is how to synchronize the various types of metadata used
throughout the data warehouse. The various tools of a data warehouse generate and use their
own metadata, and to achieve integration, we require that these tools are capable of sharing
their metadata. The challenge is to synchronize metadata between different products from
different vendors using different metadata stores. For example, it is necessary to identify the
correct item of metadata at the right level of detail from one product and map it to the
appropriate item of metadata at the right level of detail in another product, then sort out any
coding differences between them. The task of synchronizing two products is highly complex,
and therefore repeating this process for all the products that make up the data warehouse can
be resource-intensive. However, integration of the metadata must be achieved.

Data Mart
Introduction:
A data mart is a database that contains a subset of corporate data to support the analytical requirements of a particular business unit (such as the sales department) or to support users who share the same requirements to analyze a particular business process (such as property sales). A data mart is thus a subset of the data warehouse and is used to store the data of a specific department, such as the sales, marketing, or personnel department. As data warehouses have grown in popularity, so has the related concept of data marts. There is general agreement that a data mart is built to support the analytical requirements of a particular group of users, and in providing this support, the data mart stores only a subset of corporate data. In Kimball's methodology, a data mart is the physical implementation of a single star schema (dimensional model) modeled on a particular business process (such as property sales). The users of Kimball's data mart can be spread throughout an enterprise but share the same requirement to analyze a particular business process.
Reasons for Creating a Data Mart:
There are many reasons for creating a data mart:

 To give users access to the data
 To provide data in a form that matches the collective view of the data by a group of users
 To improve end-user response time by reducing the volume of data accessed
 To provide appropriately structured data as dictated by the requirements of the users
 Data marts use less data, and the cost of implementing a data mart is lower
 The potential users of a data mart are more clearly defined

Temporal Databases:
Data warehousing (DW) is a process for collecting and managing data from varied sources to provide meaningful business insights. The data warehouse is the core of the BI system, which is built for data analysis and reporting.

A temporal database contains time-varying historical data, with the possible inclusion of current and future data, and has the ability to manipulate this data.

Examples of transactional data that will change over time in the DreamHome case study include the position and salary of staff, the monthly rental (rent) and owner (ownerNo) of properties, and the preferred type of property (prefType) and maximum rent set by clients seeking to rent properties. However, the key difference between DreamHome's transactional database and data warehouse is that the transactional database commonly presents the data as being non-temporal and holds only the current value of the data, while the data warehouse presents the data as being temporal and must hold all past, present, and future versions of the data. The complexity of dealing with temporal data can be illustrated by considering how the monthly rent values for DreamHome's properties change over time and must remain retrievable.
Temporal Extensions to the SQL Standard:
In this section, we examine the temporal extensions presented in the SQL:2011 standard, namely:

 T180 System-versioned tables
 T181 Application-time period tables

Databases that provide system-versioned or application-time period tables can avoid some of the major issues associated with the storage of temporal data, such as the level of complexity required of the application code.
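As a sketch of feature T180, the statements below follow the SQL:2011 syntax for a system-versioned table; exact keywords and data types vary between DBMSs (Db2, SQL Server, and MariaDB, for example, implement slight variants), and the staff table shown is illustrative:

    -- System-versioned table: the DBMS maintains the system-time period automatically.
    CREATE TABLE staff (
        staff_no   INTEGER PRIMARY KEY,
        position   VARCHAR(20),
        salary     DECIMAL(9,2),
        sys_start  TIMESTAMP(12) GENERATED ALWAYS AS ROW START,
        sys_end    TIMESTAMP(12) GENERATED ALWAYS AS ROW END,
        PERIOD FOR SYSTEM_TIME (sys_start, sys_end)
    ) WITH SYSTEM VERSIONING;

    -- Ordinary UPDATEs preserve history; past states remain queryable.
    SELECT staff_no, salary
    FROM   staff FOR SYSTEM_TIME AS OF TIMESTAMP '2024-01-01 00:00:00';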

Data Warehousing Using Oracle:
Oracle is one of the leading relational DBMSs for data warehousing. Oracle has achieved this success by focusing on the basic, core requirements for data warehousing: performance, scalability, and manageability. Data warehouses store larger volumes of data, support more users, and require faster performance, so these core requirements remain key factors in the successful implementation of data warehouses. However, Oracle goes beyond these core requirements and is the first true "data warehouse platform". In this section we provide an overview of the main features of Oracle:

 Summary management
 Analytical functions
 Bitmapped indexes
 Advanced join methods
 Sophisticated SQL optimizer
 Resource management
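Two of the features listed above can be sketched in Oracle-style SQL; the schema is the illustrative one used earlier in this report, not an Oracle sample schema:

    -- Bitmapped index: efficient for low-cardinality dimension attributes.
    CREATE BITMAP INDEX idx_product_category ON product_dim (category);

    -- Analytical (window) function: rank product categories by total sales.
    SELECT p.category,
           SUM(f.amount)                             AS total_sales,
           RANK() OVER (ORDER BY SUM(f.amount) DESC) AS sales_rank
    FROM   sales_fact f
           JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY p.category;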

Warehouse Features in Oracle 11g:
Oracle Database 11g is a comprehensive database platform for data warehousing and business intelligence that combines industry-leading scalability and performance, deeply integrated analytics, and embedded integration and data quality, all in a single platform running on a reliable, low-cost grid infrastructure.

Oracle Support for Temporal Data:
Oracle provides a product called Workspace Manager to manage temporal data. This is achieved through features that include the period data type, valid-time support, transaction-time support, support for bitemporal tables, and support for sequenced primary keys, sequenced uniqueness, sequenced referential integrity, and sequenced selection and projection, in a manner quite similar to that proposed in SQL/Temporal.
Star Schema
The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts.

The star schema exploits the characteristics of factual data: facts are generated by events that occurred in the past and are unlikely to change, regardless of how they are analyzed. As the bulk of data in a data warehouse is represented as facts, fact tables can be extremely large relative to dimension tables. Dimension tables, by contrast, generally contain descriptive textual information, and dimension attributes are used as the constraints in data warehouse queries. Star schemas can be used to speed up query performance by denormalizing reference data into a single dimension table. Denormalization is appropriate when there are a number of entities related to the dimension table that are often accessed, because it avoids the overhead of having to join additional tables to access those attributes. Denormalization is not appropriate where the additional data is not accessed very often, because the overhead of scanning the expanded dimension table may not be offset by any gain in query performance.
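The following is a hypothetical star schema for a property-sales business process, in the spirit of the discussion above; table and column names are assumptions (property_dim repeats the illustrative definition from the ETL sketch earlier). The fact table sits at the centre and references denormalized dimension tables, and a typical query constrains and groups the facts by dimension attributes:

    CREATE TABLE time_dim (
        time_key  INTEGER PRIMARY KEY,
        day       INTEGER,
        month     INTEGER,
        quarter   INTEGER,
        year      INTEGER
    );

    -- Denormalized dimension: descriptive attributes such as city are held directly
    -- in this table rather than in separate, joined reference tables.
    CREATE TABLE property_dim (
        property_key  INTEGER      PRIMARY KEY,
        property_no   VARCHAR(10)  NOT NULL UNIQUE,
        city          VARCHAR(30)  NOT NULL,
        property_type VARCHAR(10)  NOT NULL
    );

    -- Central fact table: one row per property sale event.
    CREATE TABLE property_sale_fact (
        time_key     INTEGER       NOT NULL REFERENCES time_dim,
        property_key INTEGER       NOT NULL REFERENCES property_dim,
        sale_price   DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (time_key, property_key)
    );

    -- Typical star query: dimension attributes constrain and group the facts.
    SELECT t.year, p.city, SUM(f.sale_price) AS total_sales
    FROM   property_sale_fact f
           JOIN time_dim     t ON f.time_key     = t.time_key
           JOIN property_dim p ON f.property_key = p.property_key
    WHERE  p.property_type = 'FLAT'
    GROUP BY t.year, p.city;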
