DQ Standardization

Standardization

Standardization, as the name suggests, is the process of generating standard forms of
data that can be matched more reliably. For example, generating the standard form
“William” from “Bill” increases the likelihood of finding the match between
“William Gates” and “Bill Gates”. Other standard forms that can be generated
include phonetic equivalents (using NYSIIS and/or Soundex), and something like
“initials” – perhaps the first two characters from each of five fields.
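
To make the idea of phonetic standard forms concrete, here is a minimal Python sketch
of classic Soundex (QualityStage ships its own NYSIIS and Soundex implementations;
this toy version is for illustration only):

def soundex(word: str) -> str:
    """Classic Soundex: first letter plus three digits encoding consonant groups."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    result, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":          # H and W do not reset the previous code
            prev = code
    return (result + "000")[:4]

# Spelling variants collapse to the same phonetic standard form:
print(soundex("SMITH"), soundex("SMYTH"))   # S530 S530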
Each standardization invocation specifies a particular rule set. In addition to
word/token classification tables, a rule set specifies the format of an output record
structure, into which the original and standardized forms of the data, generated
fields (such as gender), and reporting fields (for example, whether a user override
was applied and, if so, what kind) may be written.

It may be that standardization is the desired end result of using QualityStage. For
example, street address components such as “Street”, “Avenue”, or “Road” are often
represented differently in data, perhaps abbreviated differently in different records.
Standardization can convert all the non-standard forms into whatever standard format
the organization has decided to use.
This kind of QualityStage job can be set up as a web service. For example, a data entry
application might send in an address to be standardized, and the web service would
return the standardized address to the caller.
More commonly, standardization is a preliminary step toward matching. Matching is more
accurate when the standard forms of words/tokens are compared than when the original
forms are compared.

Standardization Process Flow


Delivered Rule Sets Methodology in Standardization

Country Identifier (COUNTRY)
        |
        v
Domain Pre-processor (USPREP)
        |
        v
Domain Specific: USNAME, USADDR, USAREA

Example: Country Identifier Rule Set


Input Records:

    100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
    SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
    28 GROSVENOR STREET LONDON W1X 9FE
    123 MAIN STREET

Output Records (prefixed with the country code and a Y/N flag; N indicates
the default country was assumed because no country could be identified):

    USY 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
    CAY SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
    GBY 28 GROSVENOR STREET LONDON W1X 9FE
    USN 123 MAIN STREET

Example: Domain Pre-processor Rule Set


Input Record:

    Address Line 1: TINA FISHER
    Address Line 2: ATTN IBM
    Address Line 3: 211 WASHINGTON DR
    Address Line 4: PO BOX 52
    Address Line 5: WESTBORO, MA
    Address Line 6: 02140

Output Record:

    Name Domain:    TINA FISHER ATTN IBM
    Address Domain: 211 WASHINGTON DR PO BOX 52
    Area Domain:    WESTBORO , MA 02140

Example: Domain Specific Rule Set


Input Record:

    211 WASHINGTON DR PO BOX 52

Output Record:

    House Number:       211
    Street Name:        WASHINGTON
    Street Suffix Type: DR
    Box Type:           PO BOX
    Box Value:          52

Logic for NAME Rule Set

• Set variables for process option delimiters
• Process the most common patterns first
• Simplify the patterns
• Check for common patterns again
• Check for multiple names
• Process organization names
• Process individual names
• Default processing (based on process options)
• Post-process subroutine to populate matching fields


Logic of ADDR Rule Sets

• Process the most common patterns first
• Simplify the patterns
• Check for common patterns again
• Call subroutines for each secondary address element
• Check for street address patterns
• Post-process subroutine to populate matching fields

Logic of AREA Rule Sets

• Process input from right to left
• Call subroutines for each sub-domain (i.e. country name, post code, province, city)
• Post-process subroutine to populate matching fields


Rule Sets

• Rule sets are standardization processes used by the Standardize Stage and have
  three required components:

  1. Classification Table – contains the key words that provide special context,
     their standard values, and their user-defined classes
  2. Dictionary File – defines the output columns that will be created by the
     standardization process
  3. Pattern-Action File – drives the logic of the standardization process and
     decides how to populate the output columns

• Optional rule set components:
  - User Overrides
  - Reference Tables
Standardization Example

Standardize Stage

Parsing (the Standardization Adventure Begins…)

• The standardization process begins by parsing the input data into individual data
  elements called tokens
• Parsing parameters are provided by the pattern-action file
• The parsing parameters are two lists of individual characters:
  - SEPLIST – any character in this list is used to separate tokens
  - STRIPLIST – any character in this list is removed
• The SEPLIST is always applied first
• Any character that is in the SEPLIST and not in the STRIPLIST is used to separate
  tokens and also becomes a token itself
• Any character that is in both lists is used to separate tokens but does not become
  a token itself
• The space character should be included in both lists
  - The best example of a character in both lists is the space: one or more spaces
    are stripped, but each space indicates where one token ends and another begins
    (a sketch of these semantics follows this list)
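
The list above fully determines the tokenizer's behavior, so it can be sketched in a
few lines of Python (an illustration of the described semantics, not QualityStage's
actual parser):

def parse(value, seplist, striplist):
    """Tokenize using SEPLIST/STRIPLIST semantics: SEPLIST characters separate
    tokens; those also in STRIPLIST are discarded, the rest become tokens."""
    tokens, current = [], ""
    for ch in value:
        if ch in seplist:                 # SEPLIST is always applied first
            if current:
                tokens.append(current)
                current = ""
            if ch not in striplist:       # separator kept as a token of its own
                tokens.append(ch)
        elif ch in striplist:             # stripped without separating tokens
            continue
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(parse("123 MAIN ST.", seplist=" .", striplist=" "))
# ['123', 'MAIN', 'ST', '.']  -- spaces separate and vanish; '.' separates and survives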

Parsing (Chinese, Japanese, Korean)

• The parser behaves differently if the locale setting is Chinese, Japanese, or Korean
• Spaces are not used to divide tokens, so each character, including a space, is
  considered a token
• Spaces are classified by underscores (_) in the pattern
• The classification file allows multiple characters to be classified together
• Latin characters are transformed to double-byte representations

Classification
• Parsing separates the input data into individual tokens
• Each token is basically an alphabetic word, a number, a special character, or some
  mixture
• Classification assigns a one-character tag (called a class) to each and every
  parsed token to provide context
• First, key words that can provide special context are classified
  - Provided by the rule set's classification table
  - Since these classes are context specific, they vary across rule sets
• Next, default classes are assigned to the remaining tokens
  - These default classes are always the same regardless of the rule set used
• Lexical patterns are assembled from the classification results
  - A lexical pattern is the concatenated string of the classes assigned to the
    parsed tokens
Classification Example

Classify key words that can provide special context, using the rule set
classification table:
    T = Street Types (e.g., Street, Road, Avenue)
    D = Directionals (e.g., North, South, East, West)
Apply defaults to tokens not found in the classification table – system defaults
that are always the same regardless of the rule set used:
    ^ = a single numeric token
    + = a single unclassified alpha token

    Parsed tokens:     123  MAIN  ST  N  W
    Assigned classes:  ^    +     T   D  D

Default Classes

Class  Description
^      A single numeric token
+      A single unclassified alpha token
?      One or more consecutive unclassified alpha tokens
>      Leading numeric mixed token (e.g., 2B, 88WR)
<      Trailing numeric mixed token (e.g., B2, WR88)
@      Complex mixed token (e.g., NOT2B, C3PO, R2D2)
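
The table translates directly into code. The Python sketch below assigns default
classes to single tokens and assembles the lexical pattern; it is a simplification
(the ? class, which collapses runs of unclassified alpha tokens, is omitted):

def default_class(token):
    """Assign one of the default classes from the table above (simplified)."""
    if token.isdigit():
        return "^"                        # single numeric token
    if token.isalpha():
        return "+"                        # single unclassified alpha token
    head = len(token) - len(token.lstrip("0123456789"))
    tail = len(token) - len(token.rstrip("0123456789"))
    if head and token[head:].isalpha():
        return ">"                        # leading numeric mixed token, e.g. 2B
    if tail and token[:-tail].isalpha():
        return "<"                        # trailing numeric mixed token, e.g. B2
    return "@"                            # complex mixed token, e.g. NOT2B

def lexical_pattern(tokens, keyword_classes=None):
    """Keyword classes from the classification table win; defaults fill the rest."""
    keyword_classes = keyword_classes or {}
    return " ".join(keyword_classes.get(t, default_class(t)) for t in tokens)

print(lexical_pattern(["123", "MAIN", "ST", "N", "W"],
                      {"ST": "T", "N": "D", "W": "D"}))   # ^ + T D D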

Default Classes (Special Characters)


• Some special characters are “reserved” for use as default classes that describe
  token values that are not actual special character values
  - For example: ^ + ? > < @ (as described in the table above)
• However, if a special character is included in the SEPLIST but omitted from the
  STRIPLIST, then the default class for that special character becomes the special
  character itself; in this case the default class does describe an actual special
  character value
  - For example: periods (.), commas (,), hyphens (-)
  - Note that this can also happen to the “reserved” default classes
    (for example, ^ = ^ if ^ is in the SEPLIST but omitted from the STRIPLIST)
• Also, if a special character is omitted from both the SEPLIST and the STRIPLIST
  (and it is surrounded by spaces in the input data), then the “special” default
  class ~ (tilde) is assigned
  - If not surrounded by spaces, the appropriate mixed-token default class is
    assigned (for example, P.O. = @ if . is omitted from both lists)

Default Class (NULL Class)


• Has nothing to do with NULL values
• The NULL class is a special class
  - Represented by a numeric zero (0)
  - The only time that a number is used as a class
• Tokens classified as NULL are unconditionally removed
  - Essentially, the NULL class does to complete tokens what the STRIPLIST does to
    individual characters
  - Therefore, you will never see the NULL class represented in the assembled
    lexical patterns

Classification Table
Classification tables contain three required space-delimited columns:

1. Key word that can provide special context
2. Standard value for the key word
   - The standard value can be either an abbreviation or an expansion
   - The pattern-action file determines whether the standard value is used
3. Data class (one-character tag) assigned to each key word

Classification Table Example

;------------------------------------------------------------------------------
; USADDR Classification Table
;------------------------------------------------------------------------------
; Classification Legend
;------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; T - Street Types
; U - Unit Types
;------------------------------------------------------------------------------
PO     "PO BOX"  B
BOX    "PO BOX"  B
POBOX  "PO BOX"  B
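
As an illustration of this three-column layout, a reader for the format can be
sketched in Python (the real rule set format may carry more than is handled here;
this covers only the three required columns and ';' comments):

import shlex

def load_classification(lines):
    """Read the three required space-delimited columns:
    key word, standard value (may be quoted), class."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):      # ';' lines are comments
            continue
        token, standard, cls = shlex.split(line)[:3]
        table[token] = (standard, cls)
    return table

table = load_classification(['PO "PO BOX" B', 'BOX "PO BOX" B', 'POBOX "PO BOX" B'])
print(table["BOX"])   # ('PO BOX', 'B')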
Tokens in the Classification Table
• A common misconception among new users is that every input alpha token should be
  classified by the classification table
  - Unclassified != unhandled (i.e., unclassified tokens can still be processed)
• The classification table is intended for key words that provide special context,
  meaning context essential to the proper processing of the data
• General requirements for tokens in the classification table:
  - Tokens with standard values that need to be applied (within proper context)
    - Tokens that require standard values, especially standard abbreviations,
      will often map directly into their own dictionary columns
    - This does not mean that every dictionary column requires a user-defined class
  - Tokens with both a high individual frequency and a low set cardinality
    - Low set cardinality means that the token belongs to a group of related tokens
      with a relatively small number of possible values, so the complete token group
      can easily be maintained in the classification table
    - With high set cardinality, adjacent tokens can often provide the necessary
      context, as the example below shows
    Parsed tokens:                                123  MAIN  STREET  NORTH  WEST

    Without classification (all default classes): ^    +     +       +      +
    Classify directionals NORTH(N), WEST(W):      ^    +     +       D      D
    Classify street type STREET(ST):              ^    +     T       D      D
    Classify street name MAIN(MAIN):              ^    S     T       D      D
                                                  (is this last step necessary?)

(No: the street name is already identified by its position between the house number
and the street type, so high-cardinality tokens like MAIN need no entry in the
classification table.)

What is a Dictionary File?

• Defines the output columns created by the standardization rule set
• When data is moved to these output columns, it is called “bucketing”
• The order in which the columns are listed in the dictionary file defines the order
  in which the columns appear in the standardization rule set output
• Dictionary file entries are used to automatically generate the column metadata
  available for mapping on the Standardize Stage output link

Dictionary File Example


;------------------------------------------------------------------------------
; USADDR Dictionary File
;------------------------------------------------------------------------------
; Business Intelligence Fields
;------------------------------------------------------------------------------
HouseNumber       C 10 S HouseNumber
StreetName        C 25 S StreetName
StreetSuffixType  C  5 S StreetSuffixType

Dictionary File Fields (Output Columns)


• Standardization can prepare data for all of its uses; therefore most dictionary
  files contain three types of output columns:

1. Business Intelligence
   - Usually comprised of the parsed and standardized input tokens
2. Matching
   - Columns specifically intended to facilitate more effective matching
   - Commonly includes phonetic coding fields (NYSIIS and SOUNDEX)
3. Reporting
   - Columns specifically intended to assist with the evaluation of the
     standardization results

Standard Reporting Fields in the Dictionary File


• Unhandled Pattern – the lexical pattern representing the unhandled data
• Unhandled Data – the tokens left unhandled (i.e., unprocessed) by the rule set
• Input Pattern – the lexical pattern representing the parsed and classified input
  tokens
• Exception Data – placeholder column for storing invalid input data (an alternative
  to deletion)
• User Override Flag – indicates whether or not a user override was applied
  (default = NO)

What is a Pattern-Action File?

• Drives the logic of the standardization process
• Configures the parsing parameters (SEPLIST/STRIPLIST)
• Configures the phonetic coding (NYSIIS and SOUNDEX)
• Populates the standardization output structures
• Written in Pattern-Action Language, which consists of a series of patterns and
  associated actions structured into logical processing units called
  pattern-action sets
• Each pattern-action set consists of:
  - One line containing a pattern, which is tested against the current data
  - One or more lines of actions, which are executed if the pattern tests true

Pattern-Action Set Example

For the input “123 MAIN ST N W” (lexical pattern ^ + T D D):

^ | + | T | D | D | $    ; number, alpha, street type, directional, directional,
                         ; end-of-data
    COPY [1] {HouseNumber}             ; operand 1: 123
    COPY [2] {StreetName}              ; operand 2: MAIN
    COPY_A [3] {StreetSuffixType}      ; operand 3, standard abbreviation: ST
    COPY_A [4] temp                    ; operand 4: temp = "N"
    CONCAT_A [5] temp                  ; operand 5 appended: temp = "NW"
    COPY temp {StreetSuffixDirectional}
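
To show how such a pattern is tested against the lexical pattern built during
classification, here is a toy Python matcher covering only literal classes and the
end-of-data marker $ (real Pattern-Action Language has a far richer pattern syntax):

def matches(pattern, lexical):
    """True if the pattern's classes match the token classes one-for-one.
    Supports only literal classes and a trailing '$' (end of data)."""
    classes = [p.strip() for p in pattern.split("|")]
    tokens = lexical.split()
    if classes and classes[-1] == "$":
        return classes[:-1] == tokens             # must consume all tokens
    return tokens[:len(classes)] == classes       # match a leading prefix

print(matches("^ | + | T | D | D | $", "^ + T D D"))   # True
print(matches("^ | + | T | D | D | $", "^ + T D"))     # False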

Pattern-Action File Structure


Configuration Options
  - Parsing parameters (SEPLIST/STRIPLIST)
  - Phonetic coding (NYSIIS and SOUNDEX)

Main
  - Pattern-action sets
  - Processed sequentially until the start of the subroutines or an EXIT command
    is encountered

Subroutines
  - Pattern-action sets
  - Each subroutine starts with a header line (\SUB) and ends with a trailer line
    (\END_SUB)
  - Subroutines can be called by Main or by other subroutines
  - When called, processed sequentially until a RETURN command or \END_SUB is
    encountered
Standardization vs. Validation
• In QualityStage, standardization and validation describe different, although
  related, types of processing
• Validation extends the functionality of standardization
• For example: 50 Washington Street, Westboro, Mass. 01581
  - Standardization can parse, identify, and restructure the data as follows:
      House Number = 50
      Street Name = WASHINGTON
      Street Suffix Type = ST
      City Name = WESTBORO
      State Abbreviation = MA
      Zip Code = 01581
  - Validation can verify that the data describes an actual address and can also:
      Correct City Name = WESTBOROUGH
      Append Zip + 4 Code = 1013
• Validation provides this functionality by matching against a database
  (a toy sketch follows)
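
A toy Python version of that database lookup, using the example above (the reference
table contents and the field names are invented for illustration):

# Hypothetical reference data keyed by the standardized (city, state, zip).
REFERENCE = {
    ("WESTBORO", "MA", "01581"): {"CityName": "WESTBOROUGH", "Zip4": "1013"},
}

def validate(record):
    """Correct/enrich a standardized record by reference lookup; None if no match."""
    key = (record["CityName"], record["StateAbbreviation"], record["ZipCode"])
    match = REFERENCE.get(key)
    if match is None:
        return None
    return {**record, "CityName": match["CityName"], "Zip4": match["Zip4"]}

print(validate({"CityName": "WESTBORO", "StateAbbreviation": "MA",
                "ZipCode": "01581"}))
# {'CityName': 'WESTBOROUGH', 'StateAbbreviation': 'MA',
#  'ZipCode': '01581', 'Zip4': '1013'}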

How to Deal with Unhandled Data?

• There are two reporting fields in all delivered rule sets:
  - Unhandled Data
  - Unhandled Pattern
• To identify and review unhandled data (a review sketch follows this list):
  - Use the Investigate stage on the Unhandled Data and Unhandled Pattern columns
  - Use the SQA stage on the output of the Standardize stage
• Unhandled data may represent the entire input or a subset of the input
• If there is no unhandled data, it does not necessarily mean the data was processed
  correctly
• Some unhandled data does not need to be processed, if it does not belong to that
  domain
• The processing of a rule set may be modified through overrides or Pattern-Action
  Language
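
One quick way to review unhandled data outside the tooling is to rank the unhandled
patterns by frequency. A Python sketch (the column names follow the reporting fields
listed earlier; the sample records are invented):

from collections import Counter

records = [                                    # invented Standardize output rows
    {"UnhandledPattern": "", "UnhandledData": ""},
    {"UnhandledPattern": "+ ^", "UnhandledData": "BLDG 7"},
    {"UnhandledPattern": "+ ^", "UnhandledData": "BLDG 9"},
]

# Rank unhandled patterns so the most frequent ones are addressed first
# (via overrides or Pattern-Action Language).
counts = Counter(r["UnhandledPattern"] for r in records if r["UnhandledPattern"])
for pattern, n in counts.most_common(10):
    print(f"{n:5d}  {pattern}")                # prints: 2  + ^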

User Overrides

• Most standardization rule sets are enabled with user overrides
• User overrides give the user the ability to make modifications without directly
  editing the classification table or the pattern-action file
• User overrides are:
  - Entered via simple GUI screens
  - Stored in specific objects within the rule set
• Classification overrides can be used to add classifications for tokens not in the
  classification table, or to replace existing classifications already in the
  classification table
• The following pattern/text override objects are called based on logic in the
  pattern-action file:
  - Input pattern
  - Input text
  - Unhandled pattern
  - Unhandled text

Domain Specific Override Example

(GUI screens: Classification Override, Input Text Override, Input Pattern Override)

User Modification Subroutines

• There are two subroutines in each delivered rule set that exist specifically for
  users to add Pattern-Action Language
• User modifications within the pattern-action file:
  - Input Modifications
    - This subroutine is called after the Input user overrides are applied but
      before any of the rule set's pattern-actions are checked
  - Unhandled Modifications
    - This subroutine is called after all the pattern-actions are checked and the
      Unhandled user overrides are applied
