Module 4


Module four extends your background in data warehouse development from the emphasis on schema design in module three to data integration concepts, processes, and techniques.

Integration = initial DW load + refresh

L1 : Lesson one emphasizes the characteristics of data integration processes, particularly refresh processing and the initial population of a data warehouse.

Even with better tools, the initial population process remains the most difficult part of many data warehouse projects.

1. The primary goal of data integration is to provide a single source of truth for decision-making.

Integrating data sources involves challenges of large volumes of data, widely varying formats and units of measure, different update frequencies, missing data, and a lack of common identifiers.

Data integration is a critical success factor for data warehouse projects. Many
projects have failed due to unexpected difficulties in populating and maintaining
the data warehouse.

Organizations must make substantial investments in effort, hardware, and software to overcome the challenges of data integration.

2. Refresh processing updates data warehouse objects with new data (three phases).

Refresh processing involves internal and external data sources.

Internal data sources generate changes in both fact and dimension tables. For example, refresh processing should update a customer dimension row after a customer address change and insert new rows after customers are added to internal data sources.

External data sources primarily involve dimension changes for entities tracked by outside organizations.

3. Management of time differences between the updating of data sources and related warehouse objects is imperative in refresh processing.

- Valid time lag is the difference between the occurrence of an event in the real world (valid time) and the storage of the event in an operational database (transaction time).

- Load time lag is the difference between transaction time and the storage of the event in the data warehouse (load time).

4. Refresh workflow: this diagram depicts common phases of refresh processing and the tasks in each phase. The diagram is generic, so it must be customized for each refresh process (a minimal code sketch of the phases follows the list below).
-- The preparation phase manipulates change data from individual source systems.
   - Extraction retrieves data from individual data sources.
   - Transportation moves extracted data to a staging area.
   - Cleaning involves a variety of tasks to standardize and improve the quality of the extracted data.
   - Auditing records results of the cleaning process, performing completeness and reasonableness checks and handling exceptions.
-- The integration phase merges separate clean sources into one source.
   - Merging can involve removal of inconsistencies among the source data.
   - Auditing records results of the merging process, performing completeness and reasonableness checks, and handling exceptions.
-- The update phase involves propagating the integrated change data to various parts of the data warehouse.
   - After propagation, notification can be sent to user groups and administrators.
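The workflow below is a minimal Python sketch of the three phases, assuming list-of-dictionary change data; the function names and the sample sources are hypothetical placeholders, not part of the lesson.

```python
# A minimal sketch of the generic refresh workflow: preparation, integration, update.
def audit(step, rows_in, rows_out):
    """Record results of a step with a simple completeness check."""
    assert rows_out <= rows_in, f"{step}: more rows out than in"
    print(f"audit[{step}]: {rows_in} rows in, {rows_out} rows out")

def prepare(extract, clean):
    """Preparation phase: extract, stage, clean, and audit one source."""
    staged = list(extract())                      # extraction + transport to a staging area
    cleaned = [clean(row) for row in staged]      # cleaning
    audit("cleaning", len(staged), len(cleaned))
    return cleaned

def integrate(prepared_sources):
    """Integration phase: merge the separate clean sources into one change set."""
    merged = [row for rows in prepared_sources for row in rows]
    audit("merging", sum(len(rows) for rows in prepared_sources), len(merged))
    return merged

def update(warehouse_table, change_data):
    """Update phase: propagate the integrated change data, then notify."""
    warehouse_table.extend(change_data)
    print(f"notify: refresh loaded {len(change_data)} rows")

# Hypothetical sources: two callables standing in for operational systems.
source1 = lambda: [{"cust": "A", "city": "denver"}, {"cust": "B", "city": "boulder"}]
source2 = lambda: [{"cust": "C", "city": "GOLDEN"}]
clean_city = lambda row: {**row, "city": row["city"].title()}   # standardize city case

warehouse = []
update(warehouse, integrate([prepare(source1, clean_city), prepare(source2, clean_city)]))
```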

5. In addition to periodic refresh, data integration involves the initial population of a data warehouse. The initial loading process is more open-ended than refresh processing.
Main problem: time requirements for discovering and resolving data quality problems can be difficult to estimate.

Data quality problems are usually resolved through data integration procedures. If owners of source data cooperate, resolution can involve changes to source systems.

6. The primary objective in managing the refresh process is to determine the refresh frequency for each data source and set detailed refresh schedules.
The optimal refresh frequency maximizes the net refresh benefit while satisfying important constraints.
The net refresh benefit is the value of data timeliness minus the cost of refresh.
(Achieving the optimal refresh frequency involves balancing the benefits of timely data against the costs of refreshing that data. The key is to find the frequency that maximizes the value of up-to-date information while remaining efficient and cost-effective. Considerations include how often the data changes, how important current data is for decision-making, and the resources required for frequent refreshes.)

-- The value of data timeliness depends on the sensitivity of decision making to the currency of the data. (Data timeliness measures the gap between when data is collected or generated and when it is made available for use or analysis.)
   - Some decisions are very time sensitive, such as inventory decisions for the product mix in stores.
   - Other decisions are not so time sensitive, such as store location decisions.

-- The cost to refresh a data warehouse includes both computer and human resources.
   - Computer resources are necessary for all tasks in the maintenance workflow.
   - Human resources may be necessary for the auditing tasks during the preparation and integration phases.
(A small numeric sketch of the frequency trade-off follows.)
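As a rough illustration of the trade-off, the sketch below evaluates a few candidate refresh frequencies and picks the one with the largest net benefit; the value and cost figures are invented for the example, not taken from the lesson.

```python
# Hypothetical illustration: net refresh benefit = value of timeliness - cost of refresh.
# The numbers are made up; real values depend on the decisions supported and the
# computer/human resources each refresh consumes.
candidates = {        # refreshes per day -> (value of timeliness, cost of refresh)
    1:  (100,  20),
    4:  (160,  70),
    12: (180, 140),
    24: (185, 230),
}

def net_benefit(freq):
    value, cost = candidates[freq]
    return value - cost

best = max(candidates, key=net_benefit)
print(f"best frequency: {best} refreshes/day, net benefit = {net_benefit(best)}")
# -> 4 refreshes/day here: more frequent refreshes add little value but cost much more.
```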
7. The data warehouse administrator must satisfy constraints on the refresh process. Constraints on either the data warehouse or the source systems may restrict frequent refresh.
- Source access constraints can be due to legacy technology with restricted scalability for internal data sources, or coordination problems for external data sources.
- Integration constraints often involve identification of common entities, such as customers and transactions, across source systems.
- Consistency constraints involve usage of change data from the same time period.
- Completeness constraints involve inclusion of change data from each data source.
- Data warehouse availability often involves conflicts between online availability and warehouse loading.

Conclusion: Other lessons in module four will cover concepts about data sources, data cleaning and entity matching, as well as specific techniques for pattern matching and distance measures for text.

L2 : Lesson two emphasizes change data used in data integration processes to
populate and refresh your data warehouse.

Change data :

- The most common change data involve insertions of new facts.

- Insertions of new dimensions and updates of dimensions are less common, but still important to capture.

- Deletions of facts and dimensions are only needed to correct data that should not have been inserted into a data warehouse.

Challenges :

- Variety of formats and constraints in source systems.

- External source systems usually cannot be changed.

- Internal source systems can be changed if resources are available and performance is not impacted.

Change data classification:

Change data can be classified by source system requirements and processing level, as shown in this two-dimensional space.
- Source system requirements involve modifications to source systems to acquire change data. Typical changes to source systems are new columns, such as timestamps required for queryable change data, and trigger code required for cooperative change data. Since source systems are difficult to change, queryable and cooperative change data may not be available.
--Cooperative change data involve notification from a source system about changes.
The notification typically occurs at transaction time using a trigger. A trigger involves software development and execution as part of a source system. Cooperative change data can be input immediately into a data warehouse, or placed in a queue or staging area for later processing, possibly with other changes.
Because cooperative change data require modifications to a source system, they have traditionally been the least common format for change data.

--Queryable change data require time stamping in a data source.
Since few data sources contain time stamps for all data, queryable change data usually are augmented with other kinds of change data. Queryable change data are most applicable for fact tables using columns, such as order date, shipment date, and hire date, that are stored in operational data sources.

- Processing level involves the resource consumption and development required for data integration procedures. Logged and snapshot change data involve substantial processing.
--Logged change data involve files that record changes or other user activity. For example, a transaction log contains every change made by a transaction, and a web log contains page access histories, called clickstreams, of web site visitors.
Logged change data usually involve no changes to a source system, as logs are readily available for most source systems.
--Snapshot change data involve periodic dumps of source data.
To derive change data, a difference operation uses the two most recent snapshots. The result of a difference operation is called a delta. Generating a delta involves comparing source files to identify new rows, changed rows, and deleted rows (a small delta sketch follows below).
Snapshots are the only form of change data without requirements on a source system. Snapshots are mainly used for legacy systems and external data sources.
Because retrieving a source file can be resource intensive, there may be constraints on the time and frequency of retrieving a snapshot.
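The difference operation can be sketched as below; the key and column names are hypothetical, and real tools compare snapshot files rather than in-memory rows.

```python
# Sketch of deriving a delta from the two most recent snapshots, keyed by a
# hypothetical primary key (cust_id): new, changed, and deleted rows are identified.
def snapshot_delta(previous, current, key="cust_id"):
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    inserted = [curr[k] for k in curr.keys() - prev.keys()]
    deleted  = [prev[k] for k in prev.keys() - curr.keys()]
    updated  = [curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]]
    return inserted, updated, deleted

old_snapshot = [{"cust_id": 1, "city": "Denver"}, {"cust_id": 2, "city": "Boulder"}]
new_snapshot = [{"cust_id": 1, "city": "Aurora"}, {"cust_id": 3, "city": "Golden"}]
print(snapshot_delta(old_snapshot, new_snapshot))
# inserted: cust_id 3; updated: cust_id 1; deleted: cust_id 2
```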

Data quality problems may occur in all types of change data, but are more
prevalent in legacy systems.
Data quality problems must be addressed in data integration procedures, unless
changes can be made in source systems.
Here are typical data quality problems encountered in change data.
- Multiple identifiers. Some data sources use different primary keys for the same entity, such as different customer numbers.
- Different units. Different units of measure and granularities for measures may be used in data sources.
- Missing values. Data may not exist in some data sources, and default values may vary across data sources.
- Non-standard text data. Data sources can combine multiple data into a single text column, such as addresses containing multiple components. In addition, the format of address components can vary across data sources.
- Conflicting data. Some data sources may have conflicting data, such as different customer addresses.
- Different update times. Some data sources may perform updates at different time intervals.

L3 : Lesson three covers three types of data cleaning.
1- Parsing decomposes complex objects, usually text, into their constituent parts. For data integration, parsing is important for decomposing multi-purpose text data into individual fields. For example, parsing of physical addresses, phone numbers, and email addresses are typical transformations for marketing data warehouses. To facilitate target marketing analysis, these composite fields should be decomposed into standard parts. --> The standard tool for context-free parsing (where the meaning of a symbol does not depend on its relationship to other symbols in the text) is the regular expression.
Data sources that contain addresses in a single field typically require parsing into standard components, such as the street number, street name, city, state, country, and postal code.
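As a hedged illustration, the regular expression below parses one simple, assumed address layout; real address data varies far more than this single pattern handles.

```python
import re

# Assumed single-field layout: "<number> <street>, <city>, <state> <postal code>".
address_pattern = re.compile(
    r"^(?P<number>\d+)\s+(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+(?P<postal>\d{5})$"
)

match = address_pattern.match("1250 14th Street, Denver, CO 80202")
if match:
    print(match.groupdict())
# {'number': '1250', 'street': '14th Street', 'city': 'Denver',
#  'state': 'CO', 'postal': '80202'}
```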

2- Correcting values involves resolution of missing and conflicting values.

** For missing values, the resolution depends on the meaning of a missing value.
-- Missing values inapplicable to an entity can often be resolved through default values. For example, a missing employee value for an order taken without an employee can be replaced with a default value indicating a web order.
-- Missing values that are unknown rather than inapplicable are more difficult to resolve. For example, missing dates of birth, parts of an address, and grade point averages are more difficult to resolve.
-- One approach for unknown values involves typical values. For unknown numeric values, a median or average value can be used. For unknown non-numeric values, the mode, that is, the most frequent value, can be used.
-- More complex approaches predict unknown values from relationships to other fields, for example with data mining algorithms.
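The typical-value approach can be sketched with plain Python: the median for an unknown numeric value and the mode for an unknown non-numeric value; the column names below are invented for the example.

```python
from statistics import median, mode

# Hypothetical rows with unknown (None) values.
rows = [
    {"gpa": 3.2, "state": "CO"},
    {"gpa": None, "state": "CO"},
    {"gpa": 3.8, "state": None},
    {"gpa": 2.9, "state": "WA"},
]

gpa_typical = median(r["gpa"] for r in rows if r["gpa"] is not None)      # 3.2
state_typical = mode(r["state"] for r in rows if r["state"] is not None)  # 'CO'

for r in rows:
    r["gpa"] = gpa_typical if r["gpa"] is None else r["gpa"]
    r["state"] = state_typical if r["state"] is None else r["state"]
print(rows)
```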

** For conflicting (contradictory) values, simple approaches, such as taking the most recent value, may be used. Determining a more credible value usually involves an investigation by a domain expert. Detailed investigations, possibly conducted using search services, can resolve some cases of unknown values and conflicting values.

3- Standardization involves conversion rules to transform values into preferred representations. Conversion rules are usually developed for units of measure and abbreviations. Both standard and custom rules can be developed.

Example:
Data standardization services can be purchased for names, addresses, and product details, although customization may be necessary. This example extends the previous corrected example with standardization. The job title, firm, street, and state are standardized using a dictionary of standard names. The dictionary contains the complete value for values that are typically abbreviated.
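A dictionary of standard names can be applied as sketched below; the abbreviations and preferred values are illustrative, not the lesson's actual dictionary.

```python
# Dictionary-based standardization: map common abbreviations to complete values.
standard = {
    "st": "Street", "ave": "Avenue", "dr": "Drive",
    "sr": "Senior", "mgr": "Manager", "vp": "Vice President",
    "wa": "Washington", "co": "Colorado",
}

def standardize(value: str) -> str:
    """Replace each abbreviated word with its standard form (case-insensitive)."""
    words = value.replace(".", "").split()
    return " ".join(standard.get(word.lower(), word) for word in words)

print(standardize("Sr. MGR"))            # Senior Manager
print(standardize("12865 SE 85th St"))   # 12865 SE 85th Street
```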

L4 : Regular expressions specify patterns for parsing text fields with multiple components, common in data integration tasks.
Regular expression tools are widely supported in data integration tools, DBMSs, application programming interfaces, and testing web sites.

1. A regular expression, or regex for short, contains literals, metacharacters, and escape characters.

- A literal is a character to match exactly.
- Metacharacters, or pattern-matching characters, have special meaning within a search expression, providing the power of regular expressions.
- Escape sequences remove the special meaning of metacharacters to treat them as literals.

To perform pattern matching, the user provides a regular expression, known as the search expression, and a target string. The search expression specifies the pattern to search for in the target string.

This diagram displays prominent metacharacters:
- The iteration or quantifier metacharacters (the question mark, the asterisk, the plus symbol, and the curly braces) support matches on consecutive characters. The search expression uses the plus symbol to match one or more of the preceding character.
- The position metacharacters are anchors. The period, the circumflex or caret, and the dollar sign support matching at specified places in a string. The search expression uses the caret to match at the beginning, and the dollar sign to match at the end, of a target string.
- In the other category, the range metacharacters inside square brackets match a single character from a range of specified characters.

2. This table shows a convenient summary of the common metacharacters. You can study it in the slides as a reference.

3. To understand search expressions, you need to work many examples.
This table shows six examples with multiple target strings per example.

Here are some brief notes about these examples.
- In example one, the question mark matches the preceding character zero times in the first target string.
- In example two, the asterisk matches the preceding character zero times in the third target string.
- In the third example, the plus metacharacter does not match the third target string because the third character is o, not e.
- In the fourth example, the search expression does not match the third target string because it does not contain one of the letters inside the square brackets.
- The last two examples use iteration metacharacters to specify the number of matches.
- In example five, the first range must be matched three times, and the second range, four times.
- In the last example, the preceding character, a, must be matched between two and three times.

4. This table shows search expressions using position, iteration, and alternation metacharacters.
Here are some brief notes about these examples.
- In example one, the search expression does not match the first target string because win does not appear at the beginning of the target string.
- In example two, the search expression does not match the second target string because win does not appear at the end of the target string.
- In example three, the caret inside the square brackets negates the enclosed character range, matching any non-digit.
- In example four, the period in the search expression requires a character following abc, so the search expression does not match the first target string.
- In example five, the alternation metacharacters, that is, vertical bars, match all three target strings, as each one contains one of the choices dog, cat, or frog.
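Because the example table itself is not reproduced here, the snippet below re-creates the described patterns against invented target strings so the matching behavior can be checked; adjust the targets as needed.

```python
import re

# Search expressions from the notes above, tried against invented target strings.
tests = [
    (r"^win",         ["windows", "rewind"]),             # anchor at the beginning
    (r"win$",         ["win a prize", "darwin"]),         # anchor at the end
    (r"[^0-9]",       ["12345", "123a5"]),                # negated range: any non-digit
    (r"abc.",         ["abc", "abcd"]),                   # period requires a following character
    (r"dog|cat|frog", ["hot dog", "catalog", "frogman"]), # alternation
]

for expression, targets in tests:
    for target in targets:
        matched = re.search(expression, target) is not None
        print(f"{expression!r:>15} vs {target!r:<14} -> {'match' if matched else 'no match'}")
```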

5. This example shows more complex search expressions.
The last three examples contain groups for matching parts of a target string, denoted by parentheses. Due to the complexity of these examples, I recommend that you use one of the regular expression testing websites to try them. After copying the search expression to the regular expression field, the tester provides a detailed explanation on the right.

L5 : Lesson five covers entity matching.
The classic application involves identification of duplicate customers in data sources from different firms.

Because a common identifier does not exist, duplicates must be identified from other common fields such as names, address components, phone numbers, and ages.

Because these common fields come from different data sources, inconsistencies and non-standard representations may exist, complicating the matching process.

The entity matching problem has been studied as a data mining problem for decades in computer science, information systems, and statistics.

A number of names have been used for the problem, such as record linkage, entity identification, and entity resolution.

Many approaches have been developed, but no dominant approach has emerged.

In addition, commercial services with customization to individual data source requirements can match entities, but usually at relatively high cost. To improve entity matching results, an organization should consider investments to improve consistency and completeness in the underlying data sources.

2. This simple example depicts the difficulties of entity matching.

- The data sources do not have a common identifier to match on reliably, so non-unique fields must be used.
- The red text indicates conflicts between the two cases.
- Data source one has the maiden (pre-marriage) name and work address.
- Data source two has the marital name and home address.
- The middle name, job title, and firm also have different values.

Experience indicates these records are likely a match because of the proximity of Bothell and Redmond in Washington state, the matching first name with the same uncommon spelling, the shared part of the last name, and matching employment after standardizing the firm and job titles.

The last name difference can be explained by combining last names after
marriage.
An entity matching algorithm, without this domain expertise, may indicate an
inconclusive match, rather than a likely match.
A costly investigation by a domain expert may be necessary to resolve this
inconclusive match.

3. This example shows common fields for two data sources.
Matching is more complex if data sources contain unstructured data, such as text, images, and events, alongside the common structured fields.
Despite the difficulty of entity matching, it is important in many applications.
Marketing is the most prominent area, as firms often are interested in expanding their customer bases.
Merging of firms typically triggers a major customer matching effort.
Law enforcement agencies need to link crimes and suspects, and combine
aliases into one suspect.
Fraud detection must resolve individuals who claim benefits under different
identifiers, when the individual is the same person.
For example, the same person may fraudulently file multiple tax returns to
receive tax credits.
Business analysts in healthcare often want to combine health records of
individuals treated by different healthcare providers.
There are many other applications of entity matching in business and
government.
To obtain a more precise understanding of entity matching, the outcomes of
comparing two cases should be understood.

4. In this matrix, the rows represent predictions, and the columns represent actual
results of matching two entities.
A true match involves a predicted match and an actual match, allowing the two
entities to be combined correctly.
A false match involves a predicted match, but an actual non match, resulting in
two entities combined that should have remained separate.
A false non match involves a prediction of non match, but an actual match,
resulting in two entities remaining separate that should be combined.
A true non match involves a prediction of non match, and actual non match,
resulting in two separate entities.

The possible non match situations involve predictions without enough certainty
to indicate a match or non match.
Investigation may be necessary to resolve inconclusive cases. Matched entities
can be merged or linked.
If merging two entities, sometimes old data from one source is discarded.
In addition, new fields can be added to obtain data unique from each data
source.
Linking maintains separate entities, but notes relationships.
For households, linking combines individuals with family and other social
relationships.
For transactions, linking associates transactions, such as different insurance
policies or crimes, with the same individual or set of individuals.

5. This example shows a possible result of merging records from the previous
matching example.
In the merge record, the work address has been deleted, and the marital last
name, Parker-Lewis has replaced the maiden name, Parker.
In addition, full values for the middle name, job title, and firm are used.
Household consolidation involves linking records from individuals living in the
same household.
This practice is sometimes known as householding.
In transaction linking, all accounts and transactions are associated to the same
person.
Often, different transaction details are stored in different operational databases
before a data warehouse is built.
An important benefit of a data integration effort is to link transactions to the
same individual across operational databases and external data sources.

L6 : Lesson six covers quasi-identifiers and distance functions for text comparisons and entity matching.
(Quasi-identifiers are attributes, or combinations of attributes, that do not uniquely identify an individual on their own, but that could potentially lead to identification of individuals when combined with other data sources.)

D3
- A population can be identified by a combination of gender, birth date, and
postal code
- Other examples of quasi identifiers are name components, location
components, profession, and race.

D4
Entity matching approaches use distance functions to determine if
quasi identifiers in two entities indicate the same entity.

Distance functions for text can be used to compare quasi-identifiers that contain these differences.

**Edit distance is a common function for comparing relatively short text values
occurring in entity matching applications.

D5
The basic idea is to count the number of single-character editing operations needed to transform a source text value into a target text value.

An operation can:
- delete a character,
- insert a character,
- or substitute one character for another character.

Edit distance is defined as the minimal number of operations to transform a source text value into a target text value.

Determining the minimum number of operations involves an optimization algorithm that is beyond the details of this lesson. Thus the focus in this lesson is on counting the number of operations in examples.

D6 : example

In this example, the distance to transform Saturday into Sunday is three operations.

This example shows two sequences of operations.
- The first sequence involves two deletions, of a and t, followed by a substitution of n for r.
- The second sequence involves two substitutions, u for a and n for t, followed by two deletions, of u and r.
The first sequence is preferred because it contains fewer operations.

This example shows only two sequences of operations, so finding the minimal number of operations is easy. For more complex text values, a large number of sequences must be evaluated to find the minimal solution.

Q: What is the edit distance between “Break” and “Trick”? Answer: 3 (three substitutions).
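The minimal operation count can be computed with the standard dynamic programming algorithm for Levenshtein distance, sketched below; the lesson itself leaves the optimization algorithm out of scope.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: minimal insertions, deletions, and substitutions."""
    m, n = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all characters of source[:i]
    for j in range(n + 1):
        dist[0][j] = j                      # insert all characters of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution or match
    return dist[m][n]

print(edit_distance("Saturday", "Sunday"))   # 3, as in the lesson example
print(edit_distance("break", "trick"))       # 3, matching the quiz answer
```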

** Phonetic distance has many applications in law enforcement to account for different name spellings but similar pronunciations. Words with the same pronunciation should have the same phonetic value.
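The lesson does not name a specific phonetic code; Soundex is one widely used scheme and is sketched below as an illustration. It assigns the same four-character code to many names that sound alike.

```python
def soundex(name: str) -> str:
    """Return the four-character Soundex code for a name."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    first = name[0].upper()
    encoded = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")            # vowels, h, w, y have no code
        if code and code != prev:
            encoded.append(code)            # keep the code, collapsing adjacent duplicates
        if ch not in "hw":                  # h and w do not separate duplicate codes
            prev = code
    return (first + "".join(encoded) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163: same code, similar pronunciation
print(soundex("Smith"), soundex("Smyth"))     # S530 S530
```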

L7: Lesson seven provides details about SQL statements for data integration, useful statements that organizations still use for simple data integration tasks. Module five presents tools with more robust features to satisfy requirements for more complex data integration tasks.

You have three learning objectives in this lesson, ….

This lesson contains a script of example SQL statements, as well as an assignment to facilitate proficiency with writing SQL statements for data integration.

D3
The SQL MERGE statement supports processing of changed data for dimension
tables through conditional UPDATE or INSERT using a single SQL statement.
If a row in a change table matches a row in a target table, the matching target
row is updated. Otherwise, the change row is inserted into the target table.

D4
The MERGE statement uses a source table, a target table, and a join condition to support conditional updating and inserting.

This diagram indicates how the MERGE statement matches rows in the source table to rows in the target table.
- The dark blue row indicates a table heading with the same columns in the source and target tables.
- Blue rows in the source table indicate new rows to insert into the target table.
- Red rows indicate existing rows to update in the target table.
- Yellow rows in the target table indicate existing rows in the target table not updated after the MERGE statement executes.

-> The MERGE statement uses the MERGE INTO keywords, followed by the target table name. The source table name follows the USING keyword. The join condition follows the ON keyword. The WHEN clauses provide the action for a matching target row, typically an UPDATE statement, and for a non-matching row, typically an INSERT statement (a sketch follows).