DQ Standardization
It may be that standardization is the desired end result of using QualityStage. For
example, street address components such as “Street”, “Avenue”, or “Road” are often
represented differently in data, perhaps abbreviated differently in different records.
Standardization can convert all of the non-standard forms into whatever standard format
the organization has decided to use.
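As a minimal illustration of the idea (a toy sketch, not QualityStage itself), the following Python snippet maps a few assumed non-standard street-suffix spellings to one standard form:

    # Illustrative only: the mapping below is an assumed organizational standard,
    # not an actual QualityStage rule set.
    SUFFIX_STANDARDS = {
        "STREET": "ST", "STR": "ST", "ST.": "ST",
        "AVENUE": "AVE", "AV": "AVE", "AVE.": "AVE",
        "ROAD": "RD", "RD.": "RD",
    }

    def standardize_suffix(token: str) -> str:
        """Return the organization's chosen standard form of a street suffix."""
        return SUFFIX_STANDARDS.get(token.upper(), token.upper())

    print(standardize_suffix("Avenue"))   # -> AVE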
A QualityStage standardization job of this kind can be set up as a web service. For
example, a data entry application might send in an address to be standardized, and the
web service would return the standardized address to the caller.
More commonly, standardization is a preliminary step toward matching: more accurate
matching can be performed when standard forms of words/tokens are compared than when
the original forms of the data are compared.
[Diagram: input records pass through a Country Identifier step (COUNTRY rule set), then the Domain Pre-processor (USPREP), and then the domain-specific rule sets USNAME, USADDR, and USAREA.]
Output records (examples):
US  Y  100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA  Y  SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB  Y  28 GROSVENOR STREET LONDON W1X 9FE
US  N  123 MAIN STREET
Output record (example):
House Number = 211
Street Name = WASHINGTON
Street Suffix Type = DR
Box Type = PO BOX
Box Value = 52
Call subroutines for each sub-domain (e.g., country name, postal code, province, city)
Rule Sets are standardization processes used by the Standardize Stage. Each rule set
has three required components: a Classification Table, a Dictionary File, and a
Pattern-Action File. Optional components include:
User Overrides
Reference Tables
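For orientation, here is a sketch of how a rule set's components are conventionally laid out; the USADDR file names and extensions below follow common QualityStage conventions but are shown here as an assumption for illustration:

    # Hypothetical layout of the USADDR rule set components (illustrative).
    USADDR_RULE_SET = {
        "classification_table": "USADDR.CLS",  # key words and their classes
        "dictionary_file":      "USADDR.DCT",  # output columns and their order
        "pattern_action_file":  "USADDR.PAT",  # patterns and the actions they trigger
        "user_overrides":       None,          # optional
        "reference_tables":     None,          # optional
    }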
Standardization Example
Standardize Stage
The standardization process begins by parsing the input data into individual data
elements called tokens
Any character that is in the SEPLIST and not in the STRIPLIST will be used to
separate tokens and will also become a token itself
The best example of this is the space character: one or more spaces are
stripped, but the space indicates where one token ends and another begins
The parser behaves differently if the locale setting is Chinese, Japanese, or Korean
Spaces are not used to divide tokens, so each character, including a space, is
considered a token
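To make the SEPLIST/STRIPLIST behavior concrete, here is a small Python sketch of the parsing step; the two lists are assumptions chosen for illustration, not the defaults of any delivered rule set:

    # Illustrative parser: SEPLIST characters split the data, STRIPLIST
    # characters are discarded, and a separator that is not also stripped
    # becomes a token of its own (the "#" below, unlike the space).
    SEPLIST = " ,#"     # assumed separators
    STRIPLIST = " ,"    # assumed characters to strip

    def parse(data: str) -> list[str]:
        tokens, current = [], ""
        for ch in data:
            if ch in SEPLIST:
                if current:
                    tokens.append(current)
                    current = ""
                if ch not in STRIPLIST:   # separator kept as its own token
                    tokens.append(ch)
            elif ch in STRIPLIST:
                continue                  # stripped without separating
            else:
                current += ch
        if current:
            tokens.append(current)
        return tokens

    print(parse("100 SUMMER STREET APT #5"))
    # -> ['100', 'SUMMER', 'STREET', 'APT', '#', '5']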
Classification assigns a one-character tag (called a class) to each individual
parsed token to provide context
Classification – order
First, key words that can provide special context are classified
Provided by the standardization rule set classification table
Since these classes are context specific, they vary across rule sets
Next, default classes are assigned to the remaining tokens
These default classes are always the same regardless of the rule set used
Lexical patterns are assembled from the classification results
Concatenated string of the classes assigned to the parsed tokens
Classification Example
Default Classes
Class  Description
However, if a special character is included in the SEPLIST but omitted from the
STRIPLIST, then the default class for that special character becomes the special
character itself, and in this case the default class does describe an actual special
character value
It is important to note this can also happen to the “reserved” default classes
(for example: ^ = ^ if ^ is in the SEPLIST but omitted from the STRIPLIST)
Also, if a special character is omitted from both the SEPLIST and STRIPLIST (and it is
surrounded by spaces in the input data), then the “special” default class of ~ (tilde)
is assigned
If not surrounded by spaces, then the appropriate mixed token default class
would be assigned (for example: P.O. = @ if . is omitted from both lists)
Essentially, the NULL class does to complete tokens what the STRIPLIST does to
individual characters
Therefore, you will never see the NULL class represented in the assembled lexical
patterns
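The Python sketch below (an illustration, not the QualityStage implementation) shows the order of classification and how a lexical pattern is assembled; the key-word classes and the simplified default-class rules are assumptions based on the description above:

    # Illustrative classifier: key words first, default classes for the rest,
    # NULL-classed tokens dropped from the assembled pattern.
    KEYWORD_CLASSES = {"ST": "T", "STREET": "T", "N": "D", "PO": "B", "BOX": "B"}  # assumed

    def default_class(token: str) -> str:
        # Simplified stand-ins for the real default classes.
        if token.isdigit():
            return "^"   # all numeric
        if token.isalpha():
            return "?"   # unknown alphabetic word
        return "@"       # mixed alpha/numeric/special (e.g. "P.O.")

    def classify(tokens: list[str]) -> str:
        classes = [KEYWORD_CLASSES.get(t, default_class(t)) for t in tokens]
        # Tokens assigned the NULL class ("0") would be removed here, which is
        # why the NULL class never appears in the assembled lexical pattern.
        return "".join(c for c in classes if c != "0")

    print(classify(["123", "N", "MAIN", "ST"]))   # -> "^D?T"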
Classification Table
Classification Tables contain three required space delimited columns:
;-----------------------------------------------------------------------------
; USADDR Classification Table
;-----------------------------------------------------------------------------
; Classification Legend
;-----------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; T - Street Types
; U - Unit Types
;-----------------------------------------------------------------------------
PO "PO BOX" B
BOX "PO BOX" B
POBOX "PO BOX" B
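As a quick illustration of how these space-delimited entries could be consumed, here is a small Python sketch that reads lines in the format shown above (token, standard value, class); it is not QualityStage's own loader:

    import shlex

    def load_classification(lines):
        # Parse entries such as: PO "PO BOX" B
        table = {}
        for line in lines:
            line = line.strip()
            if not line or line.startswith(";"):   # skip blank and comment lines
                continue
            token, standard, cls = shlex.split(line)[:3]
            table[token] = (standard, cls)
        return table

    table = load_classification(['PO "PO BOX" B', 'BOX "PO BOX" B', 'POBOX "PO BOX" B'])
    print(table["PO"])   # -> ('PO BOX', 'B')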
Tokens in the Classification Table
A common misconception among new users is that every input alpha token
should be classified by the classification table
The classification table is intended for key words that provide special context,
meaning context essential to the proper processing of the data
Good candidates are tokens with both a high individual frequency and a low set cardinality
The order in which the columns are listed in the dictionary file defines the order in
which they appear in the standardization rule set output
Dictionary file entries are used to automatically generate the column metadata
available for mapping on the Standardize Stage output link
1. Business Intelligence
2. Matching
3. Reporting
Unhandled Data – the tokens left unhandled (i.e. unprocessed) by the rule set
Input Pattern – the lexical pattern representing the parsed and classified input
tokens
Exception Data – placeholder column for storing invalid input data (an alternative to
deletion)
User Override Flag – indicates whether or not a user override was applied (default =
NO)
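Putting the parsed fields and these reporting columns together, a standardized output record might look like the Python dictionary below; the column names and the pattern value are illustrative, not the exact dictionary-file definitions:

    # Hypothetical standardized record for the input "211 WASHINGTON DR PO BOX 52".
    output_record = {
        "HouseNumber": "211",
        "StreetName": "WASHINGTON",
        "StreetSuffixType": "DR",
        "BoxType": "PO BOX",
        "BoxValue": "52",
        # Reporting columns appended by the rule set:
        "UnhandledData": "",        # tokens left unprocessed by the rule set
        "InputPattern": "^?TBB^",   # assumed lexical pattern for this input
        "ExceptionData": "",        # invalid input parked here instead of being deleted
        "UserOverrideFlag": "NO",   # default when no override was applied
    }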
A pattern-action set consists of:
One line containing a pattern, which is tested against the current data
One or more lines of actions, which are executed if the pattern tests true
Main Pattern-Action Sets
Sequentially processed until the start of the subroutines or an EXIT
command is encountered
Subroutine Pattern-Action Sets
Each subroutine starts with a header line (\SUB) and ends
with a trailer line (\END_SUB)
Subroutines can be called by MAIN or by other subroutines
When called, they are processed sequentially until a RETURN command
or \END_SUB is encountered
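As a rough sketch of this flow (written in Python rather than in the pattern-action language itself), the following simulates one pattern-action set: a single pattern line tested against the current token classes, and action lines that execute only when the pattern tests true. The pattern, actions, and field names are assumptions for illustration:

    # Illustrative only: a toy model of one pattern-action set.
    def matches(pattern: list[str], token_classes: list[str]) -> bool:
        # True when the pattern's classes line up with the current tokens.
        return pattern == token_classes

    def apply_set(pattern, actions, tokens, token_classes, record):
        if matches(pattern, token_classes):
            for action in actions:          # executed only when the pattern tests true
                action(tokens, record)

    # Pattern ^ D ? T (numeric, directional, unknown word, street type), e.g. "123 N MAIN ST"
    tokens, classes = ["123", "N", "MAIN", "ST"], ["^", "D", "?", "T"]
    record = {}
    apply_set(
        ["^", "D", "?", "T"],
        [lambda t, r: r.update(HouseNumber=t[0]),
         lambda t, r: r.update(StreetName=t[2]),
         lambda t, r: r.update(StreetSuffixType=t[3])],
        tokens, classes, record,
    )
    print(record)   # -> {'HouseNumber': '123', 'StreetName': 'MAIN', 'StreetSuffixType': 'ST'}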
Standardization vs. Validation
In QualityStage, standardization and validation describe different, although related,
types of processing
Example standardized fields: House Number = 50; State Abbreviation = MA
Unhandled Data
Unhandled Pattern
Unhandled data may represent the entire input or a subset of the input
If there is no unhandled data, it does not necessarily mean the data is processed
correctly
Some unhandled data does not need to be processed if it does not belong to that
domain
User Overrides
User overrides provide the user with the ability to make modifications without
directly editing the classification table or the pattern-action file
The following pattern/text override objects are called based on logic in the
pattern-action file
input pattern
input text
unhandled pattern
unhandled text
Classification Override
Input Text Override
Input Pattern Override
There are two subroutines in each delivered rule set that are provided specifically for
users to add pattern-action language
Input Modifications
This subroutine is called after the Input User Overrides are applied
but before any of the rule set pattern actions are checked
Unhandled Modifications
This subroutine is called after all the pattern actions are checked and
the Unhandled User Overrides are applied
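To summarize where these hooks sit, here is a schematic Python sketch of the processing order described above; all of the function names are placeholders, not QualityStage APIs:

    # Placeholder stages; each would transform the record in a real flow.
    def apply_input_user_overrides(r): return r
    def input_modifications(r): return r           # user subroutine, runs before rule set patterns
    def run_rule_set_pattern_actions(r): return r  # MAIN plus delivered subroutines
    def apply_unhandled_user_overrides(r): return r
    def unhandled_modifications(r): return r       # user subroutine, runs last

    def standardize(record):
        record = apply_input_user_overrides(record)
        record = input_modifications(record)
        record = run_rule_set_pattern_actions(record)
        record = apply_unhandled_user_overrides(record)
        record = unhandled_modifications(record)
        return record

    print(standardize({"AddressLine": "123 N MAIN ST"}))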