Data Warehouse: FPT University Hanoi 2010
1
Lecture 9
Data Quality: A key to success
2
CHAPTER OBJECTIVES
Clearly understand why data quality is critical in a data warehouse
Observe the challenges posed by corrupt data and learn the methods to
deal with them
Appreciate the benefits of quality data
Review the various categories of data quality tools and examine their usage
Study the implications of a data quality initiative and learn practical tips on
data quality
3
OVERVIEW
Imagine a small, seemingly inconsequential error creeping into the customer data
in one of your operational systems.
Now take that customer data to the next step and move it into the data
warehouse. What is the consequence of this error?
Dirty data is among the top reasons for the failure of a data warehouse.
Most companies overestimate the quality of the data in their operational systems.
Only when companies make an effort to ascertain the quality of their data are
they amazed at the extent of data corruption.
If your enterprise has several disparate legacy systems from which your data
warehouse must draw its data, start with the assumption that your source data is
likely to be corrupt.
4
WHY IS DATA QUALITY CRITICAL?
Improved data quality:
boosts confidence in decision making,
enables better customer service,
increases opportunity to add better value to the services,
reduces risk from disastrous decisions,
reduces costs, especially of marketing campaigns,
enhances strategic decision making,
improves productivity by streamlining processes, and
avoids compounding effects of data contamination.
5
What is Data Quality?
Consider an entity such as customer: customer name, customer
address, customer state, customer lifestyle,…
Data quality implies data accuracy, but it is much more than that.
Most cleansing operations concentrate on just data accuracy. You
need to go beyond data accuracy.
If the data is fit for the purpose for which it is intended, we can then
say such data has quality.
What is considered to be data quality in operational systems?
Data quality in a data warehouse is not just the quality of individual
data items but the quality of the full, integrated system as a whole.
6
7
What is Data Quality? (cont’d)
Survey of the characteristics of high-quality data
Accuracy. The value stored in the system for a data element is the right value
for that occurrence of the data element.
Domain Integrity. The data value of an attribute falls in the range of allowable,
defined values.
Data Type. Value for a data attribute is actually stored as the data type defined
for that attribute.
Consistency. The form and content of a data field is the same across multiple
source systems.
Redundancy. The same data must not be stored in more than one place in a
system.
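Two of the characteristics above, data type and domain integrity, lend themselves to automated checks. The following is a minimal sketch only: the field names, the allowed state codes, and the `check_record` helper are hypothetical assumptions made for illustration, not part of the lecture.

```python
from datetime import date

# Hypothetical rules for a customer record; the field names and allowed
# values are illustrative assumptions, not taken from the lecture.
ALLOWED_STATES = {"CA", "NY", "TX"}
EXPECTED_TYPES = {"customer_name": str, "customer_state": str, "birth_date": date}


def check_record(record):
    """Return a list of (field, problem) pairs found in one record."""
    problems = []
    # Data type: each value must be stored as the type defined for its attribute.
    for field, expected in EXPECTED_TYPES.items():
        if not isinstance(record.get(field), expected):
            problems.append((field, "wrong data type"))
    # Domain integrity: the value must fall within the allowable, defined values.
    state = record.get("customer_state")
    if isinstance(state, str) and state not in ALLOWED_STATES:
        problems.append(("customer_state", "outside allowed domain"))
    return problems
```

A record with a birth date stored as text and an unknown state code would be reported with two problems; a clean record yields an empty list.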
8/32
What is Data Quality? (cont’d)
9
What is Data Quality? (cont’d)
The characteristics of high-quality data (cont’d)
Data Anomaly. A field must be used only for the purpose for which it is defined.
Clarity. A data element may possess all the other characteristics of quality data
but if the users do not understand its meaning clearly, then the data element is of
no value to the users.
Timeliness. The users determine the timeliness of the data.
Usefulness. Every data element in the data warehouse must satisfy some
requirements of the collection of users.
Adherence to Data Integrity Rules. The data stored in the relational databases
of the source systems must adhere to entity integrity and referential integrity
rules.
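Entity integrity and referential integrity can be checked on extracted source rows before loading. The sketch below assumes rows are held as Python dictionaries; the function names and the key column names passed to them are hypothetical.

```python
def entity_integrity_violations(rows, key):
    """Return key values that are missing (None) or occur more than once."""
    seen, bad = set(), []
    for row in rows:
        value = row.get(key)
        if value is None or value in seen:
            bad.append(value)
        seen.add(value)
    return bad


def referential_integrity_violations(child_rows, fk, parent_rows, pk):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]
```

For example, an order row that points to a customer key absent from the customer rows would be returned by the second function.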
10
Benefits of Improved Data Quality
Analysis with Timely Information.
For example (from textbook)
Quality data produces timely information, a significant benefit for the users.
Newer Opportunities.
Quality data in a data warehouse is a great boon for marketing and opens the
doors to immense opportunities to cross-sell across product lines and
departments.
11
Benefits of Improved Data Quality (cont’d)
Reduced Costs and Risks
What are some of the risks of poor data quality?
One area where quality data reduces costs is in mailings to customers, especially in
marketing campaigns.
Improved Productivity
Users get an enterprise-wide view of information from the data warehouse. This is a primary
goal of the data warehouse. In areas where a corporate-wide view of information naturally
enables the streamlining of processes and operations, you will see productivity gains.
12
Figure 13-2 Data quality: the top challenge
14/32
Types of … (cont’d)
A list of explicit types (cont’d)
Violation of business rules. For example, the days worked in a year plus the vacation days,
holidays, and sick days cannot exceed 365 or 366. In a bank loan system, the
minimum interest rate cannot be more than the maximum rate for a variable rate
loan.
Reused primary keys. Some operational systems reuse primary key values. In the data
warehouse, where you capture both present data from the current customer file and
past data from the archived customer file, the reused primary key values show up as
duplicates.
Nonunique identifiers. Identifiers bring a different complication: an identifier that is
unique within one system does not necessarily represent the same product in another system.
Inconsistent values. Codes for policy type in different legacy systems in an
expanding insurance company could have inconsistent values.
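The two business rules above and the reused-key problem can each be expressed as a small check. This is a sketch under stated assumptions: the function names and parameters are hypothetical, and the leap-year handling is one reading of "365 or 366".

```python
import calendar


def violates_days_rule(year, worked, vacation, holidays, sick):
    """True if the day counts add up to more than the days in the year."""
    days_in_year = 366 if calendar.isleap(year) else 365
    return worked + vacation + holidays + sick > days_in_year


def violates_rate_rule(min_rate, max_rate):
    """True if the minimum interest rate exceeds the maximum rate."""
    return min_rate > max_rate


def reused_keys(current_keys, archived_keys):
    """Key values present in both the current and the archived customer file."""
    return sorted(set(current_keys) & set(archived_keys))
```

Keys returned by `reused_keys` are the ones that would collide if both files were loaded into the warehouse under the same key.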
15
Types of … (cont’d)
A list of explicit types (cont’d)
Incorrect values. Product Code: 146, Product Name: Crystal Vase, and Height: 486
inches in the same record point to some sort of data inaccuracy. The values for product
name and height are not compatible. Perhaps the product code is also incorrect.
Multipurpose fields. The same data value in a field entered by different departments may
mean different things. For example, a store code field may be redefined so that warehouse
codes go into the same field, leaving store codes and warehouse codes residing together.
This type of data pollution is hard to correct.
Erroneous integration.
For example
These integration problems arise because, typically, each legacy system had been developed
in isolation at different times in the past.
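The incorrect-values example on this slide (a crystal vase 486 inches tall) is the kind of error a cross-field plausibility check can flag. The sketch below is illustrative only: the `category` field, the height ranges, and the function name are assumptions, not values from the lecture.

```python
# Hypothetical plausible height ranges in inches per product category;
# the numbers are illustrative assumptions, not from the lecture.
PLAUSIBLE_HEIGHT_IN = {"vase": (2, 60), "bookshelf": (24, 96)}


def implausible_height(record):
    """True if the height falls outside the plausible range for the category."""
    low, high = PLAUSIBLE_HEIGHT_IN[record["category"]]
    return not (low <= record["height_in"] <= high)
```

Such a check only says that the combination of values is suspect; it cannot tell which of the fields (code, name, or height) is the wrong one.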
16
DATA QUALITY CHALLENGES
Sources of Data Pollution
System conversions. Try to understand the conversions gone through by each
of your source systems.
Data aging. If many of your source systems are old legacy systems, pay special
attention to the possibility of aged data in those systems.
Heterogeneous system integration. If the sources for one table are several
heterogeneous systems, be cautious about the quality of data coming into the
data warehouse from these systems.
Poor database design. Adhering to entity integrity and referential integrity rules
prevents some kinds of data pollution.
Incomplete information at data entry.
17
DATA QUALITY CHALLENGES (cont’d)
Sources of Data Pollution
Input errors. Erroneous entry of data is a major source of data
corruption.
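Internationalization/localization. When a company expands internationally,
the change in the company structure and the resulting revisions in the
source systems are sources of data pollution.
Fraud. Incorrect data is sometimes entered deliberately. Make sure that the
source systems are fortified with tight edits for the fields susceptible to
such falsification.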
Lack of policies. An enterprise without explicit policies on data
quality cannot be expected to have adequate levels of data quality.
18
DATA QUALITY CHALLENGES (cont’d)
Validation of Names and Addresses
For a single person, multiple records can exist among the various source systems.
A few inherent problems with entering names and addresses:
No unique key
Many names on one line
One name on two lines
Name and the address in a single line
Personal and company names mixed
Different addresses for the same person
Different names and spellings for the same customer
Before matching, go through a preliminary step:
First, recast the name and address data into a multiple-field format.
Then devise matching algorithms to match the customer records and find the duplicates.
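The recast-then-match procedure can be sketched as follows. This is a deliberately crude illustration: it assumes each input line has the layout `Name, Street, City`, and it matches only on an exact normalized key, whereas real matching algorithms usually need fuzzy comparison. All function names are hypothetical.

```python
import re


def recast(line):
    """Split 'Name, Street, City' into separate fields (assumed input layout)."""
    name, street, city = [part.strip() for part in line.split(",")]
    return {"name": name, "street": street, "city": city}


def match_key(record):
    """Build a crude matching key: lowercase, drop punctuation, collapse spaces."""
    text = " ".join(record[f] for f in ("name", "street", "city")).lower()
    text = re.sub(r"[^a-z0-9 ]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def find_duplicates(lines):
    """Group input lines whose matching keys are identical."""
    groups = {}
    for line in lines:
        groups.setdefault(match_key(recast(line)), []).append(line)
    return [group for group in groups.values() if len(group) > 1]
```

Differences in capitalization, punctuation, and spacing are absorbed by the key; different spellings of the same name are not, which is where a more elaborate matching algorithm would come in.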
19
DATA QUALITY CHALLENGES (cont’d)
Costs of Poor Data Quality
Bad decisions based on routine analysis
Lost business opportunities because of unavailable or “dirty” data
Strain and overhead on source systems because of corrupt data causing reruns
Fines from governmental agencies for noncompliance or violation of regulations
Resolution of audit problems
Redundant data unnecessarily using up resources
Inconsistent reports
Time and effort for correcting data every time data corruption is discovered
20
DATA QUALITY TOOLS
21
DATA QUALITY INITIATIVE
In spite of the enormous importance of data quality, many companies
still seem to ask whether they should pay special attention to it and
cleanse the data.
Data cleansing is tedious and time-consuming.
The metadata on many source systems may be missing or nonexistent.
The users who are asked to ensure data quality have many other business
responsibilities.
Sometimes, the data cleansing activity appears to be so gigantic and overwhelming
that companies are terrified of launching a data cleansing initiative.
Two approaches:
Load only data of 100% quality into the data warehouse
A “clean as you go” method
22
Data Cleansing Decisions
Which Data to Cleanse.
This is the root decision.
You and your users must jointly work out the answer to this question.
The cost of cleaning up all data in the data warehouse is enormous.
Where to Cleanse.
Data for your warehouse originates in the source operational systems, and so does
the data corruption.
Cleansing the data in the staging area is comparatively easy.
If you attempt to cleanse the data in the source systems, you are taking on a
complex, expensive, and difficult task.
23
Data Cleansing Decisions (cont’d)
How to Cleanse.
This decision concerns the usage of vendor tools.
How to Discover the Extent of Data Pollution.
You have to assess the extent of data pollution before applying data
cleansing techniques.
Figure 13-4
Setting Up a Data Quality Framework.
Figure 13-5
24
Who Should be Responsible?
Figure 13-6
25
Figure 13-4 Discovering the extent of data pollution
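One simple way to start assessing the extent of data pollution is to profile each field of a source extract. The sketch below is an assumption about how such profiling might look, not the method shown in the figure; the `profile` function and its report layout are hypothetical.

```python
def profile(rows, fields):
    """Per field: how many values are missing and how many distinct values occur."""
    report = {}
    for field in fields:
        values = [row.get(field) for row in rows]
        # Treat None and the empty string as missing.
        missing = sum(1 for v in values if v is None or v == "")
        distinct = len({v for v in values if v is not None and v != ""})
        report[field] = {"missing": missing, "distinct": distinct}
    return report
```

A high missing count, or more distinct values than the field's domain allows (for instance "CA" and "ca" for the same state), points to fields that deserve cleansing attention first.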
26
Figure 13-5 Data quality framework
27
Figure 13-6 Data quality: participant and roles
28
Figure 13-7 Overall data purification
29
Practical Tips on Data Quality
Identify high-impact pollution sources and begin your purification process with
these.
Do not try to do everything with in-house programs.
Tools are good and are useful. Select proper tools.
Agree on standards and reconfirm these.
Link data quality with specific business objectives. By itself, data quality work is
not attractive.
Get the senior executive sponsor of your data warehouse project to be actively
involved in backing the data cleansing initiative.
Get users totally involved and keep them constantly informed of the
developments.
Wherever needed, bring in outside experts for specific assignments.
30/32
Master data management (MDM)
MDM is an umbrella approach to provide
consistent and comprehensive core information
across the organization.
MDM comprises a set of technologies,
disciplines, and solutions to create and maintain
consistent, accurate, and complete business
data about the entities included within its ambit.
31
MDM (cont’d)
MDM categories
MDM benefits
MDM and Data warehousing
32