
Data Quality Architecture and Options

Nita Khare – Alliances & Technology Team – Solution Architect – nita.khare@tcs.com


* IBM IM Champion 2013 *

December 3, 2013

Agenda

Pain Areas / Challenges of DQ Solution

DQ Solution

DQ Architecture Options

Near Real-time/Inline DQ management Solution

Standardization process

Other Important DQ Processes

Benefits

Why is DQ important?
A Simple Pizza & Beer Order Receipt

 Date format not known – this can mislead when calculating the sales numbers.

 Suffix "Spring Ale" added to the beer description by default – this data quality issue impacts the store's inventory and procurement systems.

 Look at the age – it's a default number, not capturing the correct age range of the buyer. During analysis we will get results that depend on the current year only: if the default date entered is 11/22/1990 and we run the analysis today, i.e. 11/22/2013, we will conclude the buyers are around 23 years old, which is not at all true.

 Currency format is not mentioned – sales numbers can go wrong.
Source : http://www.information-management.com/blogs/dq-by-example-old-beer-bought-by-old-man-10024988-1.html
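The defaulted date-of-birth issue above can be sketched in a few lines. This is a hypothetical example: the default date 11/22/1990 and the analysis date come from the slide, while the function name and logic are illustrative assumptions.

```python
from datetime import date

# Hypothetical default recorded whenever the cashier skips the age prompt,
# as described on the slide.
DEFAULT_DOB = date(1990, 11, 22)

def age_on(dob, as_of):
    """Age in whole years as of a given date."""
    return as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))

# On 11/22/2013 every defaulted record looks like a 23-year-old buyer,
# regardless of the buyer's real age.
print(age_on(DEFAULT_DOB, date(2013, 11, 22)))  # → 23
```

Every defaulted record skews the age analysis toward the same spurious value, which is exactly the distortion the slide warns about.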

Pain Areas of DQ Solution

 Identification of business areas
 Managing large volumes of data
 Providing high performance for ad-hoc queries
 Resolving data quality issues & survival policy decisions
 Number and types of technologies involved
 Handling complex relationships in data
 Data availability & cleansing
 Large number of unmanaged data sources
DQ Solution – DQ Management Approach
DMAIC (Define, Measure, Analyze, Improve, Control) is a 5-step iterative approach to data quality improvement. It comprises continuous analysis, observation and improvement of the underlying data, leading to an overall improvement in the quality of information across the organization.

DQ Solution – A typical DQ Lifecycle

The ultimate goal of DQ management should be to move from a reactive mode of data quality management to proactively controlling and managing data quality, so that data imperfections in the systems are limited.

DQ Solution – Factors Influencing DQ
All of the factors mentioned below (LUCAS) need to be addressed in order to ensure that quality data is available for analysis by the end users.

DQ Solution – DQ Reference Architecture

DQ Architecture Options

DQ Architecture Options – Pros & Cons

Consideration: Data cleansing effort and cost
 Option 1 (Source System): Ensures quality data is available at the place where it is captured, hence minimal data quality impact on downstream applications.
 Option 2 (ETL Layer): Cost and effort of the DQ exercise increase as we move away from the source system.
 Option 3 (Target Layer): Most expensive.

Consideration: Data load
 Option 1: The data load may be delayed, as the DQ checks need to be applied in the source system before the data is ready to be loaded into the target layers.
 Option 2: The more DQ checks there are, the greater the impact on the data load; but if designed optimally, there might not be much impact.
 Option 3: The data load is very quick, as the DQ checks are applied after the loads into the DWH.

Consideration: Impact on source system
 Option 1: DQ processes may become an overhead on the operational system.
 Option 2: Less impact on the source operational system compared to Option 1.
 Option 3: Minimal impact on source operational system performance.

Consideration: Heterogeneous source systems
 Option 1: An additional overhead of implementing data quality processes and procedures on multiple platforms.
 Option 2: Less impact on the source operational system compared to Option 1.
 Option 3: Minimal impact on source operational system performance.

Near Real-time/Inline DQ management Solution

Option 1: Data Quality Management using Hadoop


Option 2: Data Quality Management using Database resources

Consideration: Volume
 Option 1 (Big Data technology): Best suited for very high-volume data.
 Option 2 (Database resources): Works well with low to medium data volumes.

Consideration: Update frequency
 Option 1: Efficiently handles frequently changed records.
 Option 2: Capable of handling frequently changed records.

Consideration: Source data quality / number of DQ rules
 Option 1: Very efficiently handles multiple data quality checks and controls during the data load process.
 Option 2: Performance bottlenecks are possible with volume growth and a greater number of DQ checks.

Consideration: Data load SLA
 Option 1: Capable of loading high-volume data in the stipulated time.
 Option 2: Capable of loading high-volume data in the stipulated time, but can crumble with data growth.

Consideration: Cost
 Option 1: Cost of implementation is lower compared to Option 2.
 Option 2: Cost of implementation is higher compared to Option 1.

Consideration: Maintenance / support
 Option 1: As big data technologies require coding, debugging and applying fixes are more time-consuming and costly.
 Option 2: Maintenance, debugging and fixes are quicker compared to Option 1.

Consideration: Expert availability
 Option 1: As big data technologies are still emerging, big-data-skilled associates can be difficult to find.
 Option 2: ETL experts are easily available.
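Whichever platform hosts them, the inline DQ checks discussed above can be sketched as rule functions that route failing records to a reject set during the load. This is a minimal sketch; the rule names and record fields are illustrative assumptions, not from any specific implementation.

```python
# Illustrative DQ rules: each maps a rule name to a predicate over a record.
RULES = {
    "currency_present": lambda r: bool(r.get("currency")),
    "amount_positive":  lambda r: r.get("amount", 0) > 0,
}

def load_with_dq(records):
    """Split records into loadable rows and rejects annotated with failed rules."""
    loaded, rejects = [], []
    for rec in records:
        failed = [name for name, check in RULES.items() if not check(rec)]
        if failed:
            rejects.append({"record": rec, "failed_rules": failed})
        else:
            loaded.append(rec)
    return loaded, rejects
```

Adding a rule is then a one-line change to `RULES`, which is where the trade-off in the table above shows up: each extra rule costs time per record during the load.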

DQ Solution – Data Governance Council

Standardization process
Cleansing and standardization of data is achieved by a set of transformations; the organization's data passes through each of these stages for better data cleansing and standardization.
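The staged transformations can be sketched as a simple pipeline where each stage is a function applied in order. This is an illustrative sketch: the stage names and sample value are assumptions, not the slide's actual stages.

```python
import re

def trim_whitespace(value):
    """Stage 1: drop leading/trailing whitespace."""
    return value.strip()

def collapse_spaces(value):
    """Stage 2: collapse internal whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", value)

def title_case(value):
    """Stage 3: standardize casing."""
    return value.title()

PIPELINE = [trim_whitespace, collapse_spaces, title_case]

def standardize(value, stages=PIPELINE):
    """Pass a raw value through each standardization stage in order."""
    for stage in stages:
        value = stage(value)
    return value

print(standardize("  spring   ALE  "))  # → "Spring Ale"
```

Keeping each stage as a separate function makes the pipeline easy to reorder or extend as new standardization rules are agreed.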

Standardization process – Technical Integration

Other Important DQ Processes – Error Handling

This is one of the ways of implementing an error-handling solution in any ETL architecture:

 One common error table can be created to capture and store exceptions while loading data into downstream systems.

 When records are rejected due to data quality issues (validation errors), they will be logged in the exception database.

 In case there is an agreed default value provided by the Business for source columns not holding valid data, that value will be loaded into the target table.

 Performance overhead will be minimal, as exception records will be low in volume in an incremental scenario and operations on the exceptions table will be mostly inserts.
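The flow above can be sketched as follows, with an in-memory list standing in for the common error table. The column names and the agreed default value are illustrative assumptions, not from any real system.

```python
# Business-agreed defaults for columns that may hold invalid data (assumed).
DEFAULTS = {"currency": "USD"}
# Stands in for the common error table described above.
exception_table = []

def load_record(rec, target):
    """Load a record, defaulting soft errors and rejecting hard ones."""
    errors = []
    if not rec.get("customer_id"):
        errors.append("customer_id missing")            # hard error: reject
    if not rec.get("currency"):
        rec = {**rec, "currency": DEFAULTS["currency"]}  # soft error: default
    if errors:
        exception_table.append({"record": rec, "errors": errors})
    else:
        target.append(rec)
```

The key design point is the split between soft errors, which are patched with the agreed default and still loaded, and hard errors, which go to the exception table for later review.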

Sample Error Handling Dashboard

Reconciliation
 Data reconciliation is performed to verify the integrity of the data loaded into the
warehouse.

 One of the major reasons for information loss is failures or errors during loading. Such errors can occur for several reasons:

 Inconsistent or non-coherent data from the source
 Non-integrating data among different sources
 Unclean / non-profiled data
 Un-handled exceptions
 Constraint violations
 Logical issues / inherent flaws in a program
 Technical failures such as loss of connectivity, loss over the network, space issues, etc.

 The reconciliation process will only indicate whether or not the data is correct; it will not indicate why the data is not correct. Reconciliation answers the "what" part of the question, not the "why" part.

Typically, reconciliation checks are implemented to ensure that:

SUM/COUNT (Input) = SUM/COUNT (Output) + SUM/COUNT (Captured Reject)
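The balance equation above can be sketched as two small checks, one count-based (transactional) and one sum-based (financial). The row lists and the column name are illustrative assumptions.

```python
def reconciles(input_rows, output_rows, rejected_rows):
    """COUNT variant: every input row must be accounted for as loaded or rejected."""
    return len(input_rows) == len(output_rows) + len(rejected_rows)

def amounts_reconcile(input_rows, output_rows, rejected_rows, col="amount"):
    """SUM variant: a numeric column must balance the same way."""
    total = lambda rows: sum(r.get(col, 0) for r in rows)
    return total(input_rows) == total(output_rows) + total(rejected_rows)
```

As the slide notes, a failed check only signals that something is unaccounted for; finding out why requires inspecting the exception records.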

Types Of Reconciliation

Transactional Reconciliation

 Matching the number of records in the source and in the target. If these counts are equal, it can be safely assumed that records were not left out due to an error during the ETL or simple load process. This can be further verified by the absence of errors (not necessarily warnings) in the exception reporting by the ETL tool.

Financial Reconciliation

 This checks the data content in source and target, e.g. computing the sum of the Amount column across all records at source and target and matching the two. Financial reconciliation will be performed before the data load into the EDW starts.

 Financial reconciliation, if required, can be implemented at a specific job level after identifying the columns to be reconciled for a source system batch run.

Benefits
Central repository of enterprise data and a single version of truth across the enterprise, providing a unified information delivery platform.

 Data is complete, accurate and consistent in the target system, enabling better confidence in decision making.

 Consolidation of business logic at the enterprise level removes discrepancies and standardizes the data set.

 Provides business users & data stewards a clear picture of their data quality, and lets them monitor, track and govern information over time.

 Improves data standardization through a common data governance framework at the enterprise level.

 Provides architectural options for implementing the DQ solution. The cost of DQM increases as we move from source to target; hence it is advisable to apply the DQM solution in, or near, the source in order to reduce the DQM effort.

THANK YOU

