DQ Standardization

Standardization

Standardization, as the name suggests, is the process of generating standard forms of
data that can be matched more reliably. For example, generating the standard form
“William” from “Bill” increases the likelihood of finding the match between
“William Gates” and “Bill Gates”. Other standard forms that can be generated
include phonetic equivalents (using NYSIIS and/or Soundex), and something like
“initials” – perhaps the first two characters from each of five fields.
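
To make the idea of phonetic standard forms concrete, here is a minimal Python sketch
of classic Soundex (QualityStage ships its own NYSIIS and Soundex implementations;
this toy version is for illustration only):

def soundex(word: str) -> str:
    """Classic Soundex: first letter plus three digits encoding consonant groups."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    result, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":          # H and W do not reset the previous code
            prev = code
    return (result + "000")[:4]

# Spelling variants collapse to the same phonetic standard form:
print(soundex("SMITH"), soundex("SMYTH"))   # S530 S530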
Each standardization invocation specifies a particular rule set. In addition to
word/token classification tables, a rule set specifies the format of an output record
structure, into which the original and standardized forms of the data, generated
fields (such as gender), and reporting fields (for example, whether a user override
was applied and, if so, what kind) may be written.

It may be that standardization is the desired end result of using QualityStage. For
example, street address components such as “Street”, “Avenue”, or “Road” are often
represented differently in data, perhaps abbreviated differently in different records.
Standardization can convert all the non-standard forms into whatever standard format
the organization has decided to use.
This kind of QualityStage job can be set up as a web service. For example, a data entry
application might send in an address to be standardized, and the web service would
return the standardized address to the caller.
More commonly, standardization is a preliminary step toward matching. Matching is more
accurate when the standard forms of words/tokens are compared than when the original
forms are compared.

Standardization Process Flow


Delivered Rule Sets Methodology in Standardization

Country Identifier (COUNTRY)
        |
        v
Domain Pre-processor (USPREP)
        |
        v
Domain Specific: USNAME, USADDR, USAREA

Example: Country Identifier Rule Set


Input Records:

    100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
    SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
    28 GROSVENOR STREET LONDON W1X 9FE
    123 MAIN STREET

Output Records (prefixed with the country code and a Y/N flag; N indicates
the default country was assumed because no country could be identified):

    USY 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
    CAY SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
    GBY 28 GROSVENOR STREET LONDON W1X 9FE
    USN 123 MAIN STREET

Example: Domain Pre-processor Rule Set


Input Record:

    Address Line 1: TINA FISHER
    Address Line 2: ATTN IBM
    Address Line 3: 211 WASHINGTON DR
    Address Line 4: PO BOX 52
    Address Line 5: WESTBORO, MA
    Address Line 6: 02140

Output Record:

    Name Domain:    TINA FISHER ATTN IBM
    Address Domain: 211 WASHINGTON DR PO BOX 52
    Area Domain:    WESTBORO , MA 02140

Example: Domain Specific Rule Set


Input Record:

    211 WASHINGTON DR PO BOX 52

Output Record:

    House Number:       211
    Street Name:        WASHINGTON
    Street Suffix Type: DR
    Box Type:           PO BOX
    Box Value:          52

Logic for NAME Rule Set

• Set variables for process option delimiters
• Process the most common patterns first
• Simplify the patterns
• Check for common patterns again
• Check for multiple names
• Process organization names
• Process individual names
• Default processing (based on process options)
• Post-process subroutine to populate matching fields


Logic of ADDR Rule Sets

• Process the most common patterns first
• Simplify the patterns
• Check for common patterns again
• Call subroutines for each secondary address element
• Check for street address patterns
• Post-process subroutine to populate matching fields

Logic of AREA Rule Sets

• Process input from right to left
• Call subroutines for each sub-domain (i.e. country name, post code, province, city)
• Post-process subroutine to populate matching fields


Rule Sets

• Rule sets are standardization processes used by the Standardize Stage and have
  three required components:

  1. Classification Table – contains the key words that provide special context,
     their standard values, and their user-defined classes
  2. Dictionary File – defines the output columns that will be created by the
     standardization process
  3. Pattern-Action File – drives the logic of the standardization process and
     decides how to populate the output columns

• Optional rule set components:
  - User Overrides
  - Reference Tables
Standardization Example

Standardize Stage

Parsing (the Standardization Adventure Begins…)

• The standardization process begins by parsing the input data into individual data
  elements called tokens
• Parsing parameters are provided by the pattern-action file
• The parsing parameters are two lists of individual characters:
  - SEPLIST – any character in this list is used to separate tokens
  - STRIPLIST – any character in this list is removed
• The SEPLIST is always applied first
• Any character that is in the SEPLIST and not in the STRIPLIST is used to separate
  tokens and also becomes a token itself
• Any character that is in both lists is used to separate tokens but does not become
  a token itself
• The space character should be included in both lists
  - The best example of a character in both lists is the space: one or more spaces
    are stripped, but each space indicates where one token ends and another begins
    (a sketch of these semantics follows this list)
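
The list above fully determines the tokenizer's behavior, so it can be sketched in a
few lines of Python (an illustration of the described semantics, not QualityStage's
actual parser):

def parse(value, seplist, striplist):
    """Tokenize using SEPLIST/STRIPLIST semantics: SEPLIST characters separate
    tokens; those also in STRIPLIST are discarded, the rest become tokens."""
    tokens, current = [], ""
    for ch in value:
        if ch in seplist:                 # SEPLIST is always applied first
            if current:
                tokens.append(current)
                current = ""
            if ch not in striplist:       # separator kept as a token of its own
                tokens.append(ch)
        elif ch in striplist:             # stripped without separating tokens
            continue
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(parse("123 MAIN ST.", seplist=" .", striplist=" "))
# ['123', 'MAIN', 'ST', '.']  -- spaces separate and vanish; '.' separates and survives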

Parsing (Chinese, Japanese, Korean)

• The parser behaves differently if the locale setting is Chinese, Japanese, or Korean
• Spaces are not used to divide tokens, so each character, including a space, is
  considered a token
• Spaces are classified by underscores (_) in the pattern
• The classification file allows multiple characters to be classified together
• Latin characters are transformed to double-byte representations

Classification
• Parsing separates the input data into individual tokens
• Each token is basically an alphabetic word, a number, a special character, or some
  mixture
• Classification assigns a one-character tag (called a class) to each and every
  parsed token to provide context
• First, key words that can provide special context are classified
  - Provided by the rule set's classification table
  - Since these classes are context specific, they vary across rule sets
• Next, default classes are assigned to the remaining tokens
  - These default classes are always the same regardless of the rule set used
• Lexical patterns are assembled from the classification results
  - A lexical pattern is the concatenated string of the classes assigned to the
    parsed tokens
Classification Example

Classify key words that can provide special context, using the rule set
classification table:
    T = Street Types (e.g., Street, Road, Avenue)
    D = Directionals (e.g., North, South, East, West)
Apply defaults to tokens not found in the classification table – system defaults
that are always the same regardless of the rule set used:
    ^ = a single numeric token
    + = a single unclassified alpha token

    Parsed tokens:     123  MAIN  ST  N  W
    Assigned classes:  ^    +     T   D  D

Default Classes

Class  Description
^      A single numeric token
+      A single unclassified alpha token
?      One or more consecutive unclassified alpha tokens
>      Leading numeric mixed token (e.g., 2B, 88WR)
<      Trailing numeric mixed token (e.g., B2, WR88)
@      Complex mixed token (e.g., NOT2B, C3PO, R2D2)
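
The table translates directly into code. The Python sketch below assigns default
classes to single tokens and assembles the lexical pattern; it is a simplification
(the ? class, which collapses runs of unclassified alpha tokens, is omitted):

def default_class(token):
    """Assign one of the default classes from the table above (simplified)."""
    if token.isdigit():
        return "^"                        # single numeric token
    if token.isalpha():
        return "+"                        # single unclassified alpha token
    head = len(token) - len(token.lstrip("0123456789"))
    tail = len(token) - len(token.rstrip("0123456789"))
    if head and token[head:].isalpha():
        return ">"                        # leading numeric mixed token, e.g. 2B
    if tail and token[:-tail].isalpha():
        return "<"                        # trailing numeric mixed token, e.g. B2
    return "@"                            # complex mixed token, e.g. NOT2B

def lexical_pattern(tokens, keyword_classes=None):
    """Keyword classes from the classification table win; defaults fill the rest."""
    keyword_classes = keyword_classes or {}
    return " ".join(keyword_classes.get(t, default_class(t)) for t in tokens)

print(lexical_pattern(["123", "MAIN", "ST", "N", "W"],
                      {"ST": "T", "N": "D", "W": "D"}))   # ^ + T D D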

Default Classes (Special Characters)


• Some special characters are “reserved” for use as default classes that describe
  token values that are not actual special character values
  - For example: ^ + ? > < @ (as described in the table above)
• However, if a special character is included in the SEPLIST but omitted from the
  STRIPLIST, then the default class for that special character becomes the special
  character itself; in this case the default class does describe an actual special
  character value
  - For example: periods (.), commas (,), hyphens (-)
  - Note that this can also happen to the “reserved” default classes
    (for example, ^ = ^ if ^ is in the SEPLIST but omitted from the STRIPLIST)
• Also, if a special character is omitted from both the SEPLIST and the STRIPLIST
  (and it is surrounded by spaces in the input data), then the “special” default
  class ~ (tilde) is assigned
  - If not surrounded by spaces, the appropriate mixed-token default class is
    assigned (for example, P.O. = @ if . is omitted from both lists)

Default Class (NULL Class)


• Has nothing to do with NULL values
• The NULL class is a special class
  - Represented by a numeric zero (0)
  - The only time that a number is used as a class
• Tokens classified as NULL are unconditionally removed
  - Essentially, the NULL class does to complete tokens what the STRIPLIST does to
    individual characters
  - Therefore, you will never see the NULL class represented in the assembled
    lexical patterns

Classification Table
Classification tables contain three required space-delimited columns:

1. Key word that can provide special context
2. Standard value for the key word
   - The standard value can be either an abbreviation or an expansion
   - The pattern-action file determines whether the standard value is used
3. Data class (one-character tag) assigned to each key word

Classification Table Example

;------------------------------------------------------------------------------
; USADDR Classification Table
;------------------------------------------------------------------------------
; Classification Legend
;------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; T - Street Types
; U - Unit Types
;------------------------------------------------------------------------------
PO     "PO BOX"  B
BOX    "PO BOX"  B
POBOX  "PO BOX"  B
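
As an illustration of this three-column layout, a reader for the format can be
sketched in Python (the real rule set format may carry more than is handled here;
this covers only the three required columns and ';' comments):

import shlex

def load_classification(lines):
    """Read the three required space-delimited columns:
    key word, standard value (may be quoted), class."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):      # ';' lines are comments
            continue
        token, standard, cls = shlex.split(line)[:3]
        table[token] = (standard, cls)
    return table

table = load_classification(['PO "PO BOX" B', 'BOX "PO BOX" B', 'POBOX "PO BOX" B'])
print(table["BOX"])   # ('PO BOX', 'B')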
Tokens in the Classification Table
• A common misconception among new users is that every input alpha token should be
  classified by the classification table
  - Unclassified != unhandled (i.e., unclassified tokens can still be processed)
• The classification table is intended for key words that provide special context,
  meaning context essential to the proper processing of the data
• General requirements for tokens in the classification table:
  - Tokens with standard values that need to be applied (within proper context)
    - Tokens that require standard values, especially standard abbreviations,
      will often map directly into their own dictionary columns
    - This does not mean that every dictionary column requires a user-defined class
  - Tokens with both a high individual frequency and a low set cardinality
    - Low set cardinality means that the token belongs to a group of related tokens
      with a relatively small number of possible values, so the complete token group
      can easily be maintained in the classification table
    - With high set cardinality, adjacent tokens can often provide the necessary
      context, as the example below shows
    Parsed tokens:                                123  MAIN  STREET  NORTH  WEST

    Without classification (all default classes): ^    +     +       +      +
    Classify directionals NORTH(N), WEST(W):      ^    +     +       D      D
    Classify street type STREET(ST):              ^    +     T       D      D
    Classify street name MAIN(MAIN):              ^    S     T       D      D
                                                  (is this last step necessary?)

(No: the street name is already identified by its position between the house number
and the street type, so high-cardinality tokens like MAIN need no entry in the
classification table.)

What is a Dictionary File?

• Defines the output columns created by the standardization rule set
• When data is moved to these output columns, it is called “bucketing”
• The order in which the columns are listed in the dictionary file defines the order
  in which the columns appear in the standardization rule set output
• Dictionary file entries are used to automatically generate the column metadata
  available for mapping on the Standardize Stage output link

Dictionary File Example


;------------------------------------------------------------------------------
; USADDR Dictionary File
;------------------------------------------------------------------------------
; Business Intelligence Fields
;------------------------------------------------------------------------------
HouseNumber       C 10 S HouseNumber
StreetName        C 25 S StreetName
StreetSuffixType  C  5 S StreetSuffixType

Dictionary File Fields (Output Columns)


• Standardization can prepare data for all of its uses; therefore most dictionary
  files contain three types of output columns:

1. Business Intelligence
   - Usually comprised of the parsed and standardized input tokens
2. Matching
   - Columns specifically intended to facilitate more effective matching
   - Commonly includes phonetic coding fields (NYSIIS and SOUNDEX)
3. Reporting
   - Columns specifically intended to assist with the evaluation of the
     standardization results

Standard Reporting Fields in the Dictionary File


• Unhandled Pattern – the lexical pattern representing the unhandled data
• Unhandled Data – the tokens left unhandled (i.e., unprocessed) by the rule set
• Input Pattern – the lexical pattern representing the parsed and classified input
  tokens
• Exception Data – placeholder column for storing invalid input data (an alternative
  to deletion)
• User Override Flag – indicates whether or not a user override was applied
  (default = NO)

What is a Pattern-Action File?

• Drives the logic of the standardization process
• Configures the parsing parameters (SEPLIST/STRIPLIST)
• Configures the phonetic coding (NYSIIS and SOUNDEX)
• Populates the standardization output structures
• Written in Pattern-Action Language, which consists of a series of patterns and
  associated actions structured into logical processing units called
  pattern-action sets
• Each pattern-action set consists of:
  - One line containing a pattern, which is tested against the current data
  - One or more lines of actions, which are executed if the pattern tests true

Pattern-Action Set Example

For the input “123 MAIN ST N W” (lexical pattern ^ + T D D):

^ | + | T | D | D | $    ; number, alpha, street type, directional, directional,
                         ; end-of-data
    COPY [1] {HouseNumber}             ; operand 1: 123
    COPY [2] {StreetName}              ; operand 2: MAIN
    COPY_A [3] {StreetSuffixType}      ; operand 3, standard abbreviation: ST
    COPY_A [4] temp                    ; operand 4: temp = "N"
    CONCAT_A [5] temp                  ; operand 5 appended: temp = "NW"
    COPY temp {StreetSuffixDirectional}
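
To show how such a pattern is tested against the lexical pattern built during
classification, here is a toy Python matcher covering only literal classes and the
end-of-data marker $ (real Pattern-Action Language has a far richer pattern syntax):

def matches(pattern, lexical):
    """True if the pattern's classes match the token classes one-for-one.
    Supports only literal classes and a trailing '$' (end of data)."""
    classes = [p.strip() for p in pattern.split("|")]
    tokens = lexical.split()
    if classes and classes[-1] == "$":
        return classes[:-1] == tokens             # must consume all tokens
    return tokens[:len(classes)] == classes       # match a leading prefix

print(matches("^ | + | T | D | D | $", "^ + T D D"))   # True
print(matches("^ | + | T | D | D | $", "^ + T D"))     # False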

Pattern-Action File Structure


Configuration Options
  - Parsing parameters (SEPLIST/STRIPLIST)
  - Phonetic coding (NYSIIS and SOUNDEX)

Main
  - Pattern-action sets
  - Processed sequentially until the start of the subroutines or an EXIT command
    is encountered

Subroutines
  - Pattern-action sets
  - Each subroutine starts with a header line (\SUB) and ends with a trailer line
    (\END_SUB)
  - Subroutines can be called by Main or by other subroutines
  - When called, processed sequentially until a RETURN command or \END_SUB is
    encountered
Standardization vs. Validation
• In QualityStage, standardization and validation describe different, although
  related, types of processing
• Validation extends the functionality of standardization
• For example: 50 Washington Street, Westboro, Mass. 01581
  - Standardization can parse, identify, and restructure the data as follows:
      House Number = 50
      Street Name = WASHINGTON
      Street Suffix Type = ST
      City Name = WESTBORO
      State Abbreviation = MA
      Zip Code = 01581
  - Validation can verify that the data describes an actual address and can also:
      Correct City Name = WESTBOROUGH
      Append Zip + 4 Code = 1013
• Validation provides this functionality by matching against a database
  (a toy sketch follows)
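
A toy Python version of that database lookup, using the example above (the reference
table contents and the field names are invented for illustration):

# Hypothetical reference data keyed by the standardized (city, state, zip).
REFERENCE = {
    ("WESTBORO", "MA", "01581"): {"CityName": "WESTBOROUGH", "Zip4": "1013"},
}

def validate(record):
    """Correct/enrich a standardized record by reference lookup; None if no match."""
    key = (record["CityName"], record["StateAbbreviation"], record["ZipCode"])
    match = REFERENCE.get(key)
    if match is None:
        return None
    return {**record, "CityName": match["CityName"], "Zip4": match["Zip4"]}

print(validate({"CityName": "WESTBORO", "StateAbbreviation": "MA",
                "ZipCode": "01581"}))
# {'CityName': 'WESTBOROUGH', 'StateAbbreviation': 'MA',
#  'ZipCode': '01581', 'Zip4': '1013'}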

How to Deal with Unhandled Data?

• There are two reporting fields in all delivered rule sets:
  - Unhandled Data
  - Unhandled Pattern
• To identify and review unhandled data (a review sketch follows this list):
  - Use the Investigate stage on the Unhandled Data and Unhandled Pattern columns
  - Use the SQA stage on the output of the Standardize stage
• Unhandled data may represent the entire input or a subset of the input
• If there is no unhandled data, it does not necessarily mean the data was processed
  correctly
• Some unhandled data does not need to be processed, if it does not belong to that
  domain
• The processing of a rule set may be modified through overrides or Pattern-Action
  Language
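
One quick way to review unhandled data outside the tooling is to rank the unhandled
patterns by frequency. A Python sketch (the column names follow the reporting fields
listed earlier; the sample records are invented):

from collections import Counter

records = [                                    # invented Standardize output rows
    {"UnhandledPattern": "", "UnhandledData": ""},
    {"UnhandledPattern": "+ ^", "UnhandledData": "BLDG 7"},
    {"UnhandledPattern": "+ ^", "UnhandledData": "BLDG 9"},
]

# Rank unhandled patterns so the most frequent ones are addressed first
# (via overrides or Pattern-Action Language).
counts = Counter(r["UnhandledPattern"] for r in records if r["UnhandledPattern"])
for pattern, n in counts.most_common(10):
    print(f"{n:5d}  {pattern}")                # prints: 2  + ^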

User Overrides

• Most standardization rule sets are enabled with user overrides
• User overrides give the user the ability to make modifications without directly
  editing the classification table or the pattern-action file
• User overrides are:
  - Entered via simple GUI screens
  - Stored in specific objects within the rule set
• Classification overrides can be used to add classifications for tokens not in the
  classification table, or to replace existing classifications already in the
  classification table
• The following pattern/text override objects are called based on logic in the
  pattern-action file:
  - Input pattern
  - Input text
  - Unhandled pattern
  - Unhandled text

Domain Specific Override Example

(GUI screens: Classification Override, Input Text Override, Input Pattern Override)

User Modification Subroutines

• There are two subroutines in each delivered rule set that exist specifically for
  users to add Pattern-Action Language
• User modifications within the pattern-action file:
  - Input Modifications
    - This subroutine is called after the Input user overrides are applied but
      before any of the rule set's pattern-actions are checked
  - Unhandled Modifications
    - This subroutine is called after all the pattern-actions are checked and the
      Unhandled user overrides are applied
