
Data warehouse

FPT University
Hanoi 2010

1
Lecture 9
Data Quality: A key to success

2
CHAPTER OBJECTIVES
• Clearly understand why data quality is critical in a data warehouse
• Observe the challenges posed by corrupt data and learn the methods to deal with them
• Appreciate the benefits of quality data
• Review the various categories of data quality tools and examine their usage
• Study the implications of a data quality initiative and learn practical tips on data quality

3
OVERVIEW
• Imagine a small, seemingly inconsequential error creeping into one of your operational systems.
• Now take that customer data to the next step and move it into the data warehouse. What is the consequence of this error?
• Dirty data is among the top reasons for the failure of a data warehouse.
• Most companies overestimate the quality of the data in their operational systems.
• Only when companies make an effort to ascertain the quality of their data are they amazed at the extent of data corruption.
• If your enterprise has several disparate legacy systems from which your data warehouse must draw its data, start with the assumption that your source data is likely to be corrupt.

4
WHY IS DATA QUALITY CRITICAL?
• Improved data quality:
  - boosts confidence in decision making,
  - enables better customer service,
  - increases the opportunity to add better value to services,
  - reduces the risk of disastrous decisions,
  - reduces costs, especially of marketing campaigns,
  - enhances strategic decision making,
  - improves productivity by streamlining processes, and
  - avoids the compounding effects of data contamination.

5
What is Data Quality?
• Consider an entity such as customer: customer name, customer address, customer state, customer lifestyle, and so on.
• Data quality implies data accuracy, but it is much more than that. Most cleansing operations concentrate on data accuracy alone; you need to go beyond data accuracy.
• If the data is fit for the purpose for which it is intended, we can say that the data has quality.
• What is considered to be data quality in operational systems?
• Data quality in a data warehouse is not just the quality of individual data items but the quality of the full, integrated system as a whole.

6
7
What is Data Quality? (con’t)
• Survey of the characteristics of high-quality data
  - Accuracy. The value stored in the system for a data element is the right value for that occurrence of the data element.
  - Domain Integrity. The data value of an attribute falls in the range of allowable, defined values.
  - Data Type. The value for a data attribute is actually stored as the data type defined for that attribute (see the sketch below).
  - Consistency. The form and content of a data field are the same across multiple source systems.
  - Redundancy. The same data must not be stored in more than one place in a system.
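To make checks such as domain integrity and data type concrete, here is a minimal Python sketch; the field names (customer_state, credit_limit) and the list of allowed values are illustrative assumptions, not taken from the lecture.

```python
# Minimal sketch of domain-integrity and data-type checks.
# Field names and allowed values are hypothetical examples.

ALLOWED_STATES = {"CA", "NY", "TX"}   # the defined domain of customer_state

def check_record(record):
    """Return a list of quality problems found in one customer record."""
    problems = []

    # Domain integrity: the value must fall in the defined set of allowable values.
    if record.get("customer_state") not in ALLOWED_STATES:
        problems.append("customer_state outside allowed domain")

    # Data type: credit_limit must hold a numeric value, not free text.
    try:
        float(record.get("credit_limit"))
    except (TypeError, ValueError):
        problems.append("credit_limit is not numeric")

    return problems

print(check_record({"customer_state": "XX", "credit_limit": "ten thousand"}))
# -> ['customer_state outside allowed domain', 'credit_limit is not numeric']
```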

8
What is Data Quality? (con’t)
• The characteristics of high-quality data (con’t)
  - Completeness. There are no missing values for a given attribute in the system.
  - Duplication. Duplication of records in a system is completely resolved (see the sketch below).
  - Conformance to Business Rules. The values of each data item adhere to prescribed business rules.
  - Structural Definiteness. Wherever a data item can naturally be structured into individual components, the item must contain this well-defined structure.
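A minimal sketch of how completeness and duplication might be verified over a batch of extracted records; the field names (customer_id, phone) are hypothetical.

```python
# Sketch: completeness and duplication checks over a list of customer records.
from collections import Counter

customers = [
    {"customer_id": 101, "phone": "555-0101"},
    {"customer_id": 102, "phone": None},        # missing value -> completeness problem
    {"customer_id": 101, "phone": "555-0101"},  # repeated key  -> duplication problem
]

# Completeness: no missing values allowed for the phone attribute.
missing_phone = [c["customer_id"] for c in customers if not c.get("phone")]

# Duplication: each customer_id should appear exactly once.
id_counts = Counter(c["customer_id"] for c in customers)
duplicates = [cid for cid, n in id_counts.items() if n > 1]

print("missing phone for:", missing_phone)   # -> [102]
print("duplicated ids:", duplicates)         # -> [101]
```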

9
What is Data Quality? (con’t)
• The characteristics of high-quality data (con’t)
  - Data Anomaly. A field must be used only for the purpose for which it is defined.
  - Clarity. A data element may possess all the other characteristics of quality data, but if the users do not understand its meaning clearly, then the data element is of no value to the users.
  - Timeliness. The users determine the timeliness of the data.
  - Usefulness. Every data element in the data warehouse must satisfy some requirements of the collection of users.
  - Adherence to Data Integrity Rules. The data stored in the relational databases of the source systems must adhere to entity integrity and referential integrity rules (a simple check is sketched below).
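Entity and referential integrity are normally enforced by the source DBMS, but a simple orphan check, sketched below under assumed table layouts, is one way to verify referential integrity on extracted data.

```python
# Sketch: verifying referential integrity on extracted data.
# Every order must reference an existing customer (hypothetical tables).

customers = [{"customer_id": 101}, {"customer_id": 102}]
orders = [
    {"order_id": 1, "customer_id": 101},
    {"order_id": 2, "customer_id": 999},   # orphan row: customer 999 does not exist
]

valid_ids = {c["customer_id"] for c in customers}
orphans = [o["order_id"] for o in orders if o["customer_id"] not in valid_ids]

print("orders violating referential integrity:", orphans)   # -> [2]
```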

10
Benefits of Improved Data Quality
• Analysis with Timely Information.
  - Quality data produces timely information, a significant benefit for the users.
  - For example (from textbook)
• Better Customer Service.
  - The benefit of accurate and complete information for customer service cannot be overemphasized.
  - For example (from textbook)
• Newer Opportunities.
  - Quality data in a data warehouse is a great boon for marketing and opens the doors to immense opportunities to cross-sell across product lines and departments.

11
Benefits of Improved Data Quality (con’t)
• Reduced Costs and Risks
  - What are some of the risks of poor data quality?
  - One area where quality data reduces costs is in mailings to customers, especially in marketing campaigns.
• Improved Productivity
  - Users get an enterprise-wide view of information from the data warehouse; this is a primary goal of the data warehouse. In areas where a corporate-wide view of information naturally enables the streamlining of processes and operations, you will see productivity gains.
• Reliable Strategic Decision Making
  - This point is worth repeating. If the data in the warehouse is reliable and of high quality, then decisions based on the information will be sound. No data warehouse can add value to a business until the data is clean and of high quality.

12
Types of Data Quality Problems

Figure 13-2 Data quality: the top challenge


13
Types of … (con’t)
• A list of explicit types of data quality problems
  - Dummy values in fields. Are you aware of the practice of temporarily filling the Social Security number field with all nines just to pass the numeric edits? (See the sketch below.)
  - Absence of data values. In operational systems, users are only concerned with the customer data needed to mail a billing statement, send a follow-up letter, or call about an overdue balance. The absence of data values also affects other types of data elements.
  - Unofficial use of fields. How many times have you asked your users to place their comments in the customer contact field because no field was provided for comments in the customer record?
  - Cryptic values. This is a prevalent problem in legacy systems, many of which were not designed with end users in mind. For example (textbook)
  - Contradicting values. There are related fields in the source systems whose values must be compatible. For example (textbook)
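As one illustration, a small Python sketch that flags dummy Social Security numbers and a contradicting state/ZIP pair; the field names and the tiny ZIP-to-state lookup are illustrative assumptions.

```python
# Sketch: detecting dummy values and contradicting values.

records = [
    {"ssn": "999999999", "state": "NY", "zip": "10001"},
    {"ssn": "123456789", "state": "CA", "zip": "10001"},   # ZIP belongs to NY, not CA
]

ZIP_PREFIX_TO_STATE = {"100": "NY", "900": "CA"}    # illustrative lookup table

for r in records:
    # Dummy value: an SSN filled with all nines just to pass numeric edits.
    if set(r["ssn"]) == {"9"}:
        print(r, "-> dummy SSN")

    # Contradicting values: the state must be compatible with the ZIP code.
    expected = ZIP_PREFIX_TO_STATE.get(r["zip"][:3])
    if expected and expected != r["state"]:
        print(r, "-> state and ZIP disagree")
```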

14
Types of … (con’t)
• A list of explicit types (con’t)
  - Violation of business rules. The days worked in a year plus the vacation days, holidays, and sick days cannot exceed 365 or 366. In a bank loan system, the minimum interest rate cannot be more than the maximum rate for a variable-rate loan. (See the sketch below.)
  - Reused primary keys. Reused keys are a problem both in the operational systems and in the data warehouse: when you capture present data from the current customer file and past data from the archived customer file, the reused primary key values show up as duplicates.
  - No unique identifiers. A different complication with identifiers is that a unique identifier does not represent the same product in two different systems.
  - Inconsistent values. Codes for policy type in the different legacy systems of an expanding insurance company could have inconsistent values.
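The business rules quoted in the first item above can be checked mechanically, as in the sketch below; the record layouts are assumed for illustration only.

```python
# Sketch: checking the two business rules quoted above.
# Record layouts are hypothetical.

def check_payroll(rec):
    """Days worked plus vacation, holidays, and sick days cannot exceed a year."""
    total = rec["days_worked"] + rec["vacation_days"] + rec["holidays"] + rec["sick_days"]
    return total <= 366

def check_loan(rec):
    """For a variable-rate loan the minimum rate cannot exceed the maximum rate."""
    return rec["min_rate"] <= rec["max_rate"]

print(check_payroll({"days_worked": 250, "vacation_days": 20,
                     "holidays": 10, "sick_days": 100}))       # -> False (380 days)
print(check_loan({"min_rate": 7.5, "max_rate": 6.0}))          # -> False
```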

15
Types of … (con’t)
• A list of explicit types (con’t)
  - Incorrect values. Product Code: 146, Product Name: Crystal Vase, and Height: 486 inches in the same record point to some sort of data inaccuracy: the values for product name and height are not compatible, and perhaps the product code is also incorrect. (A plausibility check is sketched below.)
  - Multipurpose fields. The same data value in a field entered by different departments may mean different things. This type of problem is perpetuated when, for example, warehouse codes go into the store code field by redefining that field, so that store codes and warehouse codes reside in the same field. This type of data pollution is hard to correct.
  - Erroneous integration. These integration problems arise because, typically, each legacy system was developed in isolation at different times in the past. For example (textbook)
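A possible cross-field plausibility check for the incorrect-values example above; the per-product height limit is an assumed figure used only for illustration.

```python
# Sketch: a simple cross-field plausibility check for the product example above.
MAX_HEIGHT_INCHES = {"Crystal Vase": 36}    # assumed limit: a vase over 3 feet is suspect

def plausible(product):
    limit = MAX_HEIGHT_INCHES.get(product["name"])
    return limit is None or product["height_in"] <= limit

print(plausible({"code": 146, "name": "Crystal Vase", "height_in": 486}))   # -> False
```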

16
DATA QUALITY CHALLENGES
• Sources of Data Pollution
  - System conversions. Try to understand the conversions each of your source systems has gone through.
  - Data aging. If many of your source systems are old legacy systems, pay special attention to the possibility of aged data in those systems.
  - Heterogeneous system integration. If the sources for one table are several heterogeneous systems, be cautious about the quality of data coming into the data warehouse from these systems.
  - Poor database design. Adhering to entity integrity and referential integrity rules prevents some kinds of data pollution.
  - Incomplete information at data entry.

17
DATA QUALITY … (con’t)
• Sources of Data Pollution (con’t)
  - Input errors. Erroneous entry of data is a major source of data corruption.
  - Internationalization/localization. Changes in the company structure and the resulting revisions to the source systems are sources of data pollution.
  - Fraud. Make sure that the source systems are fortified with tight edits on fields that are prone to deliberate falsification.
  - Lack of policies. An enterprise without explicit policies on data quality cannot be expected to have adequate levels of data quality.

18
DATA QUALITY … (con’t)
• Validation of Names and Addresses
  - For a single person, multiple records can exist among the various source systems.
  - A few inherent problems with entering names and addresses:
    - No unique key
    - Many names on one line
    - One name on two lines
    - Name and address in a single line
    - Personal and company names mixed
    - Different addresses for the same person
    - Different names and spellings for the same customer
  - Cleansing therefore goes through a preliminary step (see the sketch below):
    - First, recast the name and address data into a multiple-field format.
    - Then, devise matching algorithms to match the customer records and find the duplicates.

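A minimal sketch of that preliminary step, assuming a simplified "name, street, city" single-line format and a naive normalized match key; real name-and-address tools use far more elaborate parsing and matching.

```python
# Sketch: recast a single-line name-and-address field into components,
# then match on a normalized key to find likely duplicates.
import re

def recast(line):
    """Split 'name, street, city' held in one field into separate fields."""
    parts = [p.strip() for p in line.split(",")]
    return {"name": parts[0], "street": parts[1], "city": parts[2]}

def match_key(rec):
    """Normalized key used to match records for the same customer."""
    name_tokens = sorted(re.sub(r"[^a-z ]", "", rec["name"].lower()).split())
    street = re.sub(r"[^a-z0-9]", "", rec["street"].lower())
    return (" ".join(name_tokens), street)

raw = [
    "John A. Smith, 12 Oak St., Springfield",
    "Smith John A,   12 OAK ST, Springfield",
]
records = [recast(r) for r in raw]
keys = [match_key(r) for r in records]
print("possible duplicates:", keys[0] == keys[1])   # -> True
```

In practice, matching algorithms rely on phonetic codes, edit distance, and weighted comparisons across several fields rather than a single exact key.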
19
DATA QUALITY … (con’t)
• Costs of Poor Data Quality
  - Bad decisions based on routine analysis
  - Lost business opportunities because of unavailable or “dirty” data
  - Strain and overhead on source systems because of corrupt data causing reruns
  - Fines from governmental agencies for noncompliance or violation of regulations
  - Resolution of audit problems
  - Redundant data unnecessarily using up resources
  - Inconsistent reports
  - Time and effort for correcting data every time data corruption is discovered

20
DATA QUALITY TOOLS
• Categories of Data Cleansing Tools
  - Error Discovery Features
  - Data Correction Features
  - The DBMS for Quality Control

21
DATA QUALITY INITIATIVE
• In spite of the enormous importance of data quality, many companies still seem to question whether to pay special attention to it and cleanse the data or not:
  - Data cleansing is tedious and time-consuming.
  - The metadata on many source systems may be missing or nonexistent.
  - The users who are asked to ensure data quality have many other business responsibilities.
  - Sometimes the data cleansing activity appears so gigantic and overwhelming that companies are terrified of launching a data cleansing initiative.
• Two approaches:
  - Only data with 100% quality can be loaded into the data warehouse.
  - A “clean as you go” method.

22
Data Cleansing Decisions
• Which Data to Cleanse.
  - This is the root decision.
  - You and your users must jointly work out the answer to this question.
  - The cost of cleaning up all data in the data warehouse is enormous.
• Where to Cleanse.
  - Data for your warehouse originates in the source operational systems, and so does the data corruption.
  - Cleansing the data in the staging area is comparatively easy (a small sketch follows this slide).
  - If you attempt to cleanse the data in the source systems, you are taking on a complex, expensive, and difficult task.
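A minimal sketch of what a cleanse-in-the-staging-area step might look like, applied to extracted rows before loading; the standardization rules, default values, and field names are all assumptions made for illustration.

```python
# Sketch: a cleansing pass applied in the staging area, after extraction
# and before loading into the warehouse. Rules and field names are hypothetical.

GENDER_CODES = {"M": "M", "MALE": "M", "F": "F", "FEMALE": "F"}

def cleanse(row):
    clean = dict(row)
    clean["name"] = row["name"].strip().title()                              # standardize form
    clean["gender"] = GENDER_CODES.get(row["gender"].strip().upper(), "U")   # map to standard codes
    clean["country"] = row.get("country") or "US"                            # default for missing values
    return clean

staged = [{"name": "  joHN smith ", "gender": "male", "country": ""}]
print([cleanse(r) for r in staged])
# -> [{'name': 'John Smith', 'gender': 'M', 'country': 'US'}]
```

Because the corrections are made on staged copies, the source systems are left untouched, which is part of what makes this option comparatively easy.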

23
Data Cleansing … (con’t)
• How to Cleanse.
  - Consider the use of vendor tools.
• How to Discover the Extent of Data Pollution.
  - You have to assess the extent of data pollution before applying data cleansing techniques.
  - Figure 13-4
• Setting Up a Data Quality Framework.
  - Figure 13-5

24
Who Should Be Responsible?
• Figure 13-6

25
Figure 13-4 Discovering the extent of data pollution

26
Figure 13-5 Data quality framework

27
Figure 13-6 Data quality: participants and roles

28
Figure 13-7 Overall data purification

The Purification Process

29
Practical Tips on Data Quality
• Identify high-impact pollution sources and begin your purification process with these.
• Do not try to do everything with in-house programs.
• Tools are good and useful; select the proper tools.
• Agree on standards and reconfirm these.
• Link data quality with specific business objectives. By itself, data quality work is not attractive.
• Get the senior executive sponsor of your data warehouse project to be actively involved in backing the data cleansing initiative.
• Get users totally involved and keep them constantly informed of developments.
• Wherever needed, bring in outside experts for specific assignments.

30
Master Data Management (MDM)
• MDM is an umbrella approach to provide consistent and comprehensive core information across the organization.
• MDM comprises a set of technologies, disciplines, and solutions to create and maintain consistent, accurate, and complete business data about the entities included within its ambit.

31
MDM (con’t)
• MDM categories
• MDM benefits
• MDM and Data warehousing

32
