06-Data-Integration Quality Profiling
Data Integration
Agenda
• BI – the process
• What is Data Integration?
• Challenges in Data Integration
• Technologies in Data Integration
• ETL: Extract, Transform, Load
• Various stages in ETL
• Need for Data Integration
• Advantages of using Data Integration
• Common approaches to Data Integration
• Metadata and its types
• Data Quality and Data Profiling concepts
BI – The Process
[Diagram: Data Integration → Data Analysis → Reporting]
What is Data Integration?
The process of coherently merging data from various data sources in order to present a cohesive, consolidated view to the end user.
It involves combining data that resides in various sources, thereby providing users with a unified view of the data.
Various Stages in ETL
[Diagram: Extract → Validate → Transform (clean, apply business rules) → Stage → Publish; supporting elements include data mapping, reference data, archiving, clean-up, and audit reports]
Extract, Transform and Load
• What is ETL?
Extract, transform, and load (ETL) in database usage (especially in
data warehousing) involves:
Extracting data from different sources
Transforming the data to fit operational needs, which may include enforcing quality levels
Loading it into the end target (database or data warehouse)
• Allows efficient and consistent databases to be created
• While ETL is generally discussed in the context of data warehousing, the
term in fact applies to any process that loads a database.
• ETL implementations usually store an audit trail of successful and
failed process runs.
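The three ETL steps above can be sketched as a tiny pipeline. The source data, field names, and target schema below are illustrative assumptions, not from the slides; SQLite stands in for the warehouse.

```python
# Minimal ETL sketch: extract rows from two hypothetical sources, transform
# them to a common shape, and load them into a SQLite "warehouse" table.
import sqlite3

def extract():
    # In practice these would be files, APIs, or source databases.
    crm = [{"cust": "Alice", "spend": "120.50"}, {"cust": "bob", "spend": "80"}]
    erp = [{"customer_name": "Carol", "total": "99.99"}]
    for row in crm:
        yield {"name": row["cust"], "amount": row["spend"]}
    for row in erp:
        yield {"name": row["customer_name"], "amount": row["total"]}

def transform(rows):
    # Apply simple business rules: standardize case, enforce numeric quality.
    for row in rows:
        yield (row["name"].strip().title(), round(float(row["amount"]), 2))

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT name, amount FROM customers ORDER BY name").fetchall())
```

A real implementation would also record the audit trail of each run, as the slide notes.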
Data Mapping
• The process of creating data-element mappings between two distinct data models.
• Used as the first step in a wide variety of data integration tasks.
• The main methods of data mapping are:
Hand-coded / graphical manual mapping
• Graphical tools let the user draw ‘lines’ from the fields of one data set to
the fields of another
Data-driven mapping
• Evaluates actual data values from the two data sources using heuristics and
statistics, which helps discover complex mappings automatically
Semantic mapping
• A metadata registry is consulted to look up synonyms for data elements
• If a destination column does not match a source column, mappings are made
according to the data-element synonyms in the metadata registry
• Discovers only exact matches between data fields; it will not discover transformation
logic or exceptions between columns/fields
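A hand-coded mapping can be as simple as a lookup table from source fields to target fields; the field names and the stand-in "metadata registry" below are hypothetical, used only to illustrate the manual and semantic approaches.

```python
# Hand-coded field mapping between two data models, plus a tiny synonym
# lookup in the spirit of semantic mapping. All names are illustrative.
FIELD_MAP = {"cust_nm": "customer_name", "addr1": "street_address"}
SYNONYMS = {"phone_no": "telephone", "tel": "telephone"}  # stand-in registry

def map_record(record):
    out = {}
    for field, value in record.items():
        # Prefer the explicit mapping, fall back to registry synonyms,
        # and pass unknown fields through unchanged.
        target = FIELD_MAP.get(field) or SYNONYMS.get(field) or field
        out[target] = value
    return out

print(map_record({"cust_nm": "Alice", "tel": "555-0100"}))
# {'customer_name': 'Alice', 'telephone': '555-0100'}
```

Note that, as the slide warns, this only handles exact name matches; it encodes no transformation logic.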
Data Staging
A data staging area is an intermediate storage area that sits between
the data sources and the data warehouse (DW) or data mart (DM).
• A staging area can be used for purposes such as:
Gathering data from different sources at different times
Data cleansing
Pre-calculating aggregates
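The three uses of a staging area can be sketched with a plain table; the schema, source names, and cleansing rule below are assumptions for illustration only.

```python
# Sketch of a staging area: rows arrive from different sources at different
# times, are cleansed, and aggregates are pre-calculated before the data
# moves on to the warehouse. Hypothetical schema and data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (source TEXT, region TEXT, sales REAL)")

# Loads arriving at different times from different sources.
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)",
                 [("crm", "east", 100.0), ("crm", "west", 50.0)])
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)",
                 [("erp", "east", 25.0), ("erp", None, 10.0)])

# Data cleansing: drop rows with a missing region.
conn.execute("DELETE FROM staging WHERE region IS NULL")

# Pre-calculate aggregates for the warehouse load.
totals = conn.execute(
    "SELECT region, SUM(sales) FROM staging GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('east', 125.0), ('west', 50.0)]
```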
Data Extraction
• Extraction is the operation of pulling data out of a data
source system; the extracted data is then used in a data
warehouse environment. It is the first step in the ETL
process.
• Extraction methods include:
Incremental extraction
Offline extraction
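Incremental extraction can be sketched with a watermark: only rows changed since the last successful run are pulled. The table, column names, and dates below are hypothetical.

```python
# Incremental extraction sketch: pull only rows whose update timestamp is
# newer than the high-water mark recorded by the previous run.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-02-01"), (3, "2024-03-01")])

def extract_incremental(conn, last_watermark):
    # ISO-8601 strings compare correctly as text.
    return conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (last_watermark,),
    ).fetchall()

rows = extract_incremental(conn, "2024-01-15")
print(rows)  # [(2, '2024-02-01'), (3, '2024-03-01')]
```

After a successful run, the new watermark (the maximum `updated_at` seen) would be persisted for the next extract.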
Data Transformation
[Diagram: Multistage transformation: validate customer keys against an external table, convert source keys to warehouse keys, then insert into the target warehouse table]
[Diagram: Pipelined transformation: the same steps carried out as a single pipelined operation]
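The pipelined variant can be sketched with generator stages that stream rows through validate and key-conversion steps in one pass, with no intermediate tables. The key lookup and row shapes are illustrative assumptions.

```python
# Pipelined transformation sketch: each stage is a generator, so rows flow
# through validate -> convert_keys without being materialized in between.
SOURCE_TO_WAREHOUSE_KEY = {"C-1": 1001, "C-2": 1002}  # hypothetical key lookup

def validate(rows):
    # Reject rows whose customer key is unknown to the warehouse.
    for row in rows:
        if row["key"] in SOURCE_TO_WAREHOUSE_KEY:
            yield row

def convert_keys(rows):
    # Convert source keys to warehouse surrogate keys.
    for row in rows:
        yield {"warehouse_key": SOURCE_TO_WAREHOUSE_KEY[row["key"]],
               "amount": row["amount"]}

source = [{"key": "C-1", "amount": 10}, {"key": "C-9", "amount": 5},
          {"key": "C-2", "amount": 7}]
target = list(convert_keys(validate(source)))
print(target)
# [{'warehouse_key': 1001, 'amount': 10}, {'warehouse_key': 1002, 'amount': 7}]
```

A multistage version would instead write each stage's output to its own staging table before the next stage reads it.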
Data Loading
• During the load phase, the data is ‘loaded’ into the end
target. The target usually is a data warehouse (DW). The
process varies widely, depending on organizational
requirements.
[Diagram: benefits of integration: a unified SQL view of data; reduced cost, overlap, and redundancy; reduced exposure to risk]
Types of Metadata
Process metadata (How): source/target maps, transformation rules, data cleansing rules, extract audit trail, transform audit trail, load audit trail, data quality audit, etc.
Technical metadata (Type): data locations, data formats, technical names, data sizes, data types, indexing, data structures, etc.
Application metadata (Who; When): data access history: who is accessing? frequency of access? when accessed? how accessed? etc.
Data Quality and Data Profiling
Building Blocks of Data Quality Management
• Analyze, Improve, and Control
• This methodology encompasses people, processes, and
technology.
• It is achieved through five
methodological building blocks,
namely:
Profiling
Quality
Integration
Enrichment
Monitoring
Data Profiling
• Data improvement efforts begin with knowing where to
start.
• Data profiling (also known as data discovery or data quality
analysis) helps gain a clear perspective on the current integrity
of the data. It helps:
Discover the quality, characteristics, and potential problems of the data
Reduce the time and resources spent finding problematic data
Gain more control over the maintenance and management of data
Catalog and analyze metadata
• The various steps in profiling include:
• Metadata analysis
• Outlier detection
• Data validation
• Pattern analysis
• Relationship discovery
• Statistical analysis
• Business rule validation
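Several of the steps above (frequency counts, null detection, pattern analysis, simple statistics) can be shown in a toy profiling pass; the column name, data, and ZIP-code pattern are assumptions chosen for illustration.

```python
# Toy data-profiling pass over one column: null count, distinct count,
# frequency counts, and pattern analysis against a 5-digit ZIP pattern.
import re
from collections import Counter

rows = [{"zip": "02139"}, {"zip": "10001"}, {"zip": None},
        {"zip": "ABCDE"}, {"zip": "02139"}]

values = [r["zip"] for r in rows]
profile = {
    "null_count": sum(v is None for v in values),
    "distinct": len(set(v for v in values if v is not None)),
    "frequencies": Counter(v for v in values if v is not None),
    # Pattern analysis: flag values that do not look like 5-digit ZIP codes.
    "pattern_violations": [v for v in values
                           if v is not None and not re.fullmatch(r"\d{5}", v)],
}
print(profile["null_count"], profile["distinct"], profile["pattern_violations"])
# 1 3 ['ABCDE']
```

A real profiler would run such checks per column across whole tables and feed the findings into data validation and business-rule checks.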
Data Profiling (cont…)
• Metadata profiling
Typical types of metadata profiling are:
• Domain: conformance of column data to a defined set of values or a range
• Type: alphabetic or numeric
• Pattern: conformance to the expected pattern
• Frequency counts
• Interdependencies:
• Within a table
• Between tables
• Types of Metadata