Module 4
Module four extends your background in data warehouse development from
the emphasis on schema design in module three to data integration
concepts, processes, and techniques.
Even with better tools, the initial population process remains the most difficult
part of many data warehouse projects.
Data integration is a critical success factor for data warehouse projects. Many
projects have failed due to unexpected difficulties in populating and maintaining
the data warehouse.
2. Refresh processing updates data warehouse objects with new data. It involves
three phases.
External data sources primarily involve dimension changes for entities tracked
by outside organizations.
- Valid time lag is the difference between the occurrence of an event in the real
world (the valid time) and the storage of the event in an operational database
(the transaction time).
- Load time lag is the difference between the transaction time and the storage of the
event in the data warehouse (the load time).
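These two lags can be illustrated with concrete timestamps; the values below are hypothetical, chosen only to make the arithmetic visible:

```python
from datetime import datetime

# Hypothetical timestamps for a single sale event (illustrative values).
valid_time = datetime(2024, 3, 1, 9, 0)        # event occurs in the real world
transaction_time = datetime(2024, 3, 1, 9, 5)  # stored in operational database
load_time = datetime(2024, 3, 2, 2, 0)         # stored in the data warehouse

valid_time_lag = transaction_time - valid_time  # real world -> operational DB
load_time_lag = load_time - transaction_time    # operational DB -> warehouse

print(valid_time_lag)  # 0:05:00
print(load_time_lag)   # 16:55:00
```

The load time lag typically dominates, since warehouse refresh is batched rather than immediate.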
- After propagation, notification can be sent to user groups and
administrators.
Data quality problems are usually resolved through data integration procedures.
If owners of source data cooperate, resolution can involve changes to source
systems.
- The cost to refresh a data warehouse includes both computer and human
resources.
- Computer resources are necessary for all tasks in the maintenance
workflow.
- Human resources may be necessary for auditing tasks during the
preparation and integration phases.
7. The data warehouse administrator must satisfy constraints on the
refresh process. Constraints on either the data warehouse or a source system may
restrict frequent refresh.
- Source access constraints can be due to legacy technology with restricted
scalability for internal data sources, or coordination problems for external data
sources.
- Integration constraints often involve identification of common entities, such as
customers and transactions, across source systems.
- Consistency constraints involve usage of change data from the same time
period across data sources.
- Completeness constraints involve inclusion of change data from each data
source.
- Data warehouse availability often involves conflicts between online
availability and warehouse loading.
Conclusion: Other lessons in module four will cover concepts about data
sources, data cleaning, and entity matching, as well as specific techniques
for pattern matching and distance measures for text.
L2: Lesson two emphasizes change data used in data integration processes to
populate and refresh your data warehouse.
Change data :
- Deletion of facts and dimensions is only needed to correct data that should
not have been inserted into a data warehouse.
Challenges :
The notification typically occurs at transaction time using a trigger.
A trigger involves software development and execution as part of a
source system. Cooperative change data can be input immediately
into a data warehouse, or placed in a queue or staging area for later
processing, possibly with other changes.
Because cooperative change data requires modifications to a source
system, it has traditionally been the least common format for
change data.
Because retrieving a source file can be resource intensive, there
may be constraints on the time and frequency of retrieving a
snapshot.
Data quality problems may occur in all types of change data, but are more
prevalent in legacy systems.
Data quality problems must be addressed in data integration procedures, unless
changes can be made in source systems.
Here are typical data quality problems encountered in change data.
-Multiple identifiers. Some data sources use different primary keys for the same
entity, such as different customer numbers.
-Different units. Different units of measure and granularities for measures may
be used in data sources.
-Missing values. Data may not exist in some data sources and default values
may vary across data sources.
-Non-standard text data. Data sources can combine multiple data into a single
text column, such as addresses containing multiple components. In addition, the
format of address components can vary across data sources.
-Conflicting data. Some data sources may have conflicting data, such as
different customer addresses.
-Different update times. Some data sources may perform updates at different
time intervals.
L3: Three types of data cleaning
1. Parsing decomposes complex objects, usually text, into their constituent parts.
For data integration, parsing is important for decomposing multi-purpose text
data into individual fields.
For example, parsing of physical addresses, phone numbers, and email addresses
is a typical transformation for marketing data warehouses.
To facilitate target marketing analysis, these composite fields should be
decomposed into standard parts. The standard tool for context-free parsing
(where the meaning of a symbol does not depend on its relationship to other
symbols in the text) is the regular expression.
Data sources that contain addresses in a single field typically require
parsing into standard components, such as the street number, street name, city,
state, country and postal code.
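As an illustration, a regular expression can parse a single-field address into standard components. This is a minimal sketch: the pattern and the sample address are assumptions, and real address data needs far more robust handling:

```python
import re

# A sketch of context-free address parsing with a regular expression.
# The pattern assumes a simplified "number street, city, ST zip" layout.
ADDRESS = re.compile(
    r"^(?P<street_number>\d+)\s+"
    r"(?P<street_name>[A-Za-z .]+),\s*"
    r"(?P<city>[A-Za-z .]+),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<postal_code>\d{5})$"
)

match = ADDRESS.match("1600 Pennsylvania Ave, Washington, DC 20500")
print(match.groupdict())
# {'street_number': '1600', 'street_name': 'Pennsylvania Ave',
#  'city': 'Washington', 'state': 'DC', 'postal_code': '20500'}
```

Named groups (`?P<...>`) keep each standard component addressable by name after the match.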
--A more complex approach for unknown values is to predict values using
relationships to other fields.
--More complex approaches will predict missing values using data mining
algorithms.
- For conflicting (contradictory) values, simple approaches, such as taking the
most recent value, may be used.
Determining a more credible value usually involves an investigation by a
domain expert.
Detailed investigations, possibly conducted using search services, can resolve
some cases of unknown values and conflicting values.
Example:
Data standardization services can be purchased for names, addresses,
and product details, although customization may be necessary.
This example extends the previous corrected example with standardization.
The job title, firm, street, and state are standardized using a dictionary of
standard names.
The dictionary contains the complete value for values that are typically
abbreviated.
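Dictionary-based standardization can be sketched in a few lines; the dictionary entries below are hypothetical abbreviations, not the lesson's actual dictionary:

```python
# A minimal sketch of dictionary-based standardization. The dictionary maps
# typically abbreviated values to their complete standard values.
STANDARD = {
    "Dr": "Drive",
    "St": "Street",
    "Mgr": "Manager",
    "CO": "Colorado",
}

def standardize(value: str) -> str:
    """Replace each abbreviated token with its standard full value."""
    return " ".join(STANDARD.get(token, token) for token in value.split())

print(standardize("123 Main St"))  # 123 Main Street
print(standardize("Sales Mgr"))    # Sales Manager
```

Tokens without a dictionary entry pass through unchanged, so the function is safe to apply to already-standard values.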
L4 : Regular expressions specify patterns for parsing text fields with multiple
components, common in data integration tasks.
Regular expression tools are widely supported in data integration tools, DBMSs,
application programming interfaces, and testing websites.
To perform pattern matching, the user provides a regular expression, known as
the search expression, and a target string. The search expression specifies the
pattern to search for in the target string.
3. To understand search expressions, you need to work many examples.
This table shows six examples with multiple target strings per example.
4. This table shows search expressions using position, iteration, and alternation
metacharacters.
Here are some brief notes about these examples.
- In example one, the search expression does not match the first target
string because win does not appear at the beginning of the target string.
- In example two, the search expression does not match the second target
string because win does not appear at the end of the target string.
- In example three, the caret inside the square brackets negates the enclosed
character range matching any non-digit.
- In example four, the period metacharacter in the search expression
requires a character following abc, so the search expression does not
match the first target string.
- In example five, the alternation metacharacters, that is, vertical bars, match
all three target strings, as each one contains one of the choices dog, cat, or
frog.
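The metacharacter examples above can be tried directly in Python's re module; the target strings below are illustrative, not the exact strings from the lecture table:

```python
import re

# Position, negated character class, wildcard, and alternation examples.
print(bool(re.search(r"^win", "window")))   # True: win at the start
print(bool(re.search(r"^win", "rewind")))   # False: win not at the start
print(bool(re.search(r"win$", "you win")))  # True: win at the end
print(bool(re.search(r"[^0-9]", "42a")))    # True: contains a non-digit
print(bool(re.search(r"abc.", "abc")))      # False: no character follows abc
print(bool(re.search(r"dog|cat|frog", "hot dog")))  # True: dog matches
```

`re.search` returns a match object (truthy) on success and `None` on failure, which mirrors the match/no-match columns of the table.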
The last three examples contain groups for matching parts of the target string,
denoted by parentheses.
Due to the complexity of these examples, I recommend that you use one of the
regular expression testing websites to try them.
After copying the search expression to the regular expression field, the tester
provides a detailed explanation on the right.
L5:
The classic application involves identification of duplicate customers in data
sources from different firms.
Because a common identifier does not exist, duplicates must be identified from
other common fields such as names, address components, phone numbers and
ages.
Because these common fields come from different data sources, inconsistencies
and non-standard representations may exist, complicating the matching process.
The entity matching process has been studied as a data mining problem for
decades in computer science, information systems, and statistics.
A number of names have been used for the problem, such as record linkage,
entity identification, and entity resolution.
-The data sources do not have a common identifier to reliably match, so non-
unique fields must be used.
-The red text indicates conflict between the two cases.
- Data source one has the maiden (pre-marriage) name and work address.
-Data source two has the marital name and home address.
-The middle name, job title, and firm also have different values.
The last name difference can be explained by combining last names after
marriage.
An entity matching algorithm, without this domain expertise, may indicate an
inconclusive match, rather than a likely match.
A costly investigation by a domain expert may be necessary to resolve this
inconclusive match.
4. In this matrix, the rows represent predictions, and the columns represent actual
results of matching two entities.
A true match involves a predicted match and an actual match, allowing the two
entities to be combined correctly.
A false match involves a predicted match, but an actual non match, resulting in
two entities combined that should have remained separate.
A false non match involves a prediction of non match, but an actual match,
resulting in two entities remaining separate that should be combined.
A true non match involves a prediction of non match, and actual non match,
resulting in two separate entities.
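Tallying predicted versus actual matches into the four matrix cells can be sketched as follows; the prediction pairs are hypothetical data:

```python
# A sketch of counting entity-matching outcomes into the four matrix cells.
# Each pair is (predicted_match, actual_match); the pairs are hypothetical.
pairs = [(True, True), (True, False), (False, True), (False, False), (True, True)]

counts = {"true_match": 0, "false_match": 0, "false_non_match": 0, "true_non_match": 0}
for predicted, actual in pairs:
    if predicted and actual:
        counts["true_match"] += 1       # combined correctly
    elif predicted and not actual:
        counts["false_match"] += 1      # combined but should stay separate
    elif not predicted and actual:
        counts["false_non_match"] += 1  # kept separate but should combine
    else:
        counts["true_non_match"] += 1   # correctly kept separate

print(counts)
# {'true_match': 2, 'false_match': 1, 'false_non_match': 1, 'true_non_match': 1}
```

The two error cells carry different costs: a false match corrupts an entity, while a false non-match merely leaves duplicates in place.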
The possible match situations involve predictions without enough certainty
to indicate a match or a non-match.
Investigation may be necessary to resolve inconclusive cases. Matched entities
can be merged or linked.
If merging two entities, sometimes old data from one source is discarded.
In addition, new fields can be added to capture data unique to each data
source.
Linking maintains separate entities, but notes relationships.
For households, linking combines individuals with family and other social
relationships.
For transactions, linking associates transactions, such as different insurance
policies or crimes, with the same individual or set of individuals.
5. This example shows a possible result of merging records from the previous
matching example.
In the merged record, the work address has been deleted, and the marital last
name, Parker-Lewis, has replaced the maiden name, Parker.
In addition, full values for the middle name, job title, and firm are used.
Household consolidation involves linking records from individuals living in the
same household.
This practice is sometimes known as householding.
In transaction linking, all accounts and transactions are associated with the same
person.
Often, different transaction details are stored in different operational databases
before a data warehouse is built.
An important benefit of a data integration effort is to link transactions to the
same individual across operational databases and external data sources.
L6: Lesson six covers quasi-identifiers and distance functions for text
comparisons and entity matching.
(Quasi-identifiers are attributes, or combinations of attributes, that do not
uniquely identify an individual on their own, but that could potentially lead
to the identification of individuals when combined with other data sources.)
D3
- A population can be identified by a combination of gender, birth date, and
postal code
- Other examples of quasi identifiers are name components, location
components, profession, and race.
D4
Entity matching approaches use distance functions to determine if
quasi-identifiers in two entities indicate the same entity.
Distance functions for text can be used to compare quasi-identifiers with these
differences.
**Edit distance is a common function for comparing relatively short text values
occurring in entity matching applications.
D5
The basic idea is to count the number of single-character editing operations
needed to transform a source text value into a target text value.
An operation can:
- delete a character,
- insert a character
- or substitute one character for another character.
Thus, the focus in this lesson is on counting the number of operations in examples.
D6 : example
This example shows only two sequences of operations, so finding the minimal
number of operations is easy.
For more complex text values, a large number of sequences must be
evaluated, to find the minimal solution.
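Rather than enumerating operation sequences, implementations typically compute the minimal count with dynamic programming; this is the classic Levenshtein distance, sketched here:

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of single-character deletes, inserts, and substitutions
    to transform source into target (Levenshtein distance)."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j                      # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a character
                dist[i][j - 1] + 1,         # insert a character
                dist[i - 1][j - 1] + cost,  # substitute (or keep) a character
            )
    return dist[m][n]

print(edit_distance("Parker", "Parker-Lewis"))  # 6 (insert "-Lewis")
```

The table avoids evaluating every operation sequence: each cell reuses the minimal costs of its three neighbors.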
L7: Lesson seven provides details about SQL statements for data integration,
statements that organizations still use for simple data integration tasks.
Module five presents tools with more robust features to satisfy requirements for
more complex data integration tasks.
D3
The SQL MERGE statement supports processing of changed data for dimension
tables through conditional UPDATE or INSERT using a single SQL statement.
If a row in a change table matches a row in a target table, the matching target
row is updated. Otherwise, the change row is inserted into the target table.
D4
The MERGE statement uses a source table, target table, and joint condition to
support conditional updating and inserting.
This diagram indicates that the MERGE statement not just rows in the
source table to a target table.
-The dark blue row indicates a table heading with the same columns in the
source and target tables.
-Blue rows in the source table indicate new rows to insert into the target table.
- The red rows indicate existing rows to update in the target table.
-Yellow rows in the target table indicate existing rows in the target table not
updated after the merge statement executes.
-> The MERGE statement uses the MERGE INTO keywords, followed by the target
table name.
The source table name follows the USING keyword.
The join condition follows the ON keyword.
The WHEN clauses provide the action for a matching target row, typically an
UPDATE statement, and for a non-matching row, typically an INSERT statement.