Name and Address Algorithms White Paper 009
For Release
December, 2010
Table of Contents
Background
The Problem
Existing Technologies
Putting It All Together
Summary
References and Additional Reading
There are a number of vendors that claim to have KYC software solutions; however, most of these were
created as add-on modules without the proper research and planning. These solutions often compare
against incomplete watch list data or use rudimentary name and address matching algorithms that risk
missing an actual match (false negative) or providing too many potential matches (false positives) to be
useful. For example, a major provider of core banking software created a KYC module that only
compares against the US OFAC list, and only does partial matching using Soundex algorithms often built
into database software (which might match “Thom” and “Tom”, but would miss “Thomas”). Our goal is
to do one thing, do it well, and as a result of this singular focus we have created the premier KYC
software solution on the market. We are not content to rest, and are continually updating and
upgrading our software to ensure that we maintain our leadership position in this market.
This white paper is not meant to detail every possible name and address matching algorithm nor does it
provide specifics about Truth Technologies’ unique and proprietary name and address algorithms and
strategies; however, it does provide an overview on the basics of name and addressing matching
technology and provides insight on those items that one should consider when choosing a filter vendor
solution. This document is split into three sections – the first details the requirements and why they are
important; the second provides an overview of existing technology; and the third details items that
should be considered for vendor selection. Truth Technologies has carefully examined all the
technologies listed in this white paper and built upon this body of public research to create a unique and
proprietary approach to the problem that is superior to any other known applications on the market.
To begin, the following is a partial summary of issues that must be taken into account when developing
any successful name and address matching strategy. The following section deals specifically with name
and address matching algorithms.
Name Variations
Names, whether person, place, or entity, would be easy to match if they were consistent. If "John Smith" were always recorded as "John Smith", and never as "Jonathan J. Smith" or "John Smyth", we could use exact matching and there would be no need for name and address matching software. But that is not the case.
The most common issue with name matching is Name Variations, as shown above.
There are a number of sources of name variations including:
(1) Spelling variations. These can include interchanged or misplaced letters
due to typographical errors, substituted letters (as in Smyth and Smith),
additional letters (such as Smythe), or omissions (as with Collins and
Colins). Generally such variations do not affect the phonetic structure of
the name but still cause problems with matching algorithms.
(2) Phonetic variations. Where the phonemes of the name are modified, e.g.
through mishearing, the structure of the name is substantially altered.
Sinclair and St. Clair are related names but their phonetic structure is very
different. Indeed, phonetic variations in first names can be very large.
Such variant names may have been given by the parents, adopted from popular culture, or chosen by the individual; however they arose, they cause special problems when trying to resolve and match names and addresses. As one can imagine, this is just one name
and one set of possible English variations. There are a number of methods to address
this problem, including name substitution. To get a true idea of the amount of data that needs to be collected and used in name substitution, one need only imagine a list of every common name in every language with every cultural variation, a truly enormous undertaking.
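As a small illustration of name substitution, the sketch below maps familiar forms to formal ones with a lookup table before comparison; the table entries are illustrative only, not Truth Technologies' actual lexicon:

```python
# Minimal sketch of nickname substitution: map familiar forms to a
# canonical formal name before comparison. A production table would
# cover many names across languages and cultures; these few entries
# are illustrative only.
NICKNAMES = {
    "rick": "richard", "dick": "richard", "rich": "richard",
    "tom": "thomas", "thom": "thomas",
    "bill": "william", "will": "william", "liz": "elizabeth",
}

def canonicalize(name):
    """Replace each token with its formal form, if one is known."""
    tokens = name.lower().split()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

print(canonicalize("Rick Smith"))   # richard smith
print(canonicalize("Thom Jones"))   # thomas jones
```

With both sides canonicalized this way, "Rick Smith" and "Richard Smith" reduce to the same string before any fuzzy comparison is attempted.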
Capture Method
When applying name and address matching methodology, it is important to take into consideration the data capture method. Take an example where a user self-reports: a web user might create an on-line order which needs to be compared with credit card information recorded by the user in a prior transaction. For the shipping address they might use a familiar name ("Rick"), whereas for the credit card the user might have used the formal name ("Richard"). These differences and "errors" in translation are very
different from those that might arise when transcribed by a listener. If a person places a phone order, "Jean" and "Gene" are phonetically equivalent and either could easily be recorded during transcription, but such a confusion would be unlikely to result from self-reporting.
An entire area of computer science has grown up around "Master Data Management", which attempts to determine the "true" database record of fact. In the prior example,
should a company’s record include the informal “Rick” as the person may like to be
called, or “Richard” which might appear on his credit card, or something else?
In the financial services world, bank accounts and the like are often registered using the formal name of an individual, as are passports and other documents; watch lists, however, often include aliases, partial names, or familiar names.
Aliases
When performing any sort of name and address checking, it is necessary to consider that a single check may really be multiple checks for multiple aliases. This is closely aligned with the need for name substitution. For example, one of the FBI's most wanted fugitives is Boston crime family head James Joseph Bulger, Jr., who is more commonly known as "Whitey" Bulger or James Bulger. Any number of variations may include part or all of his real name and "Whitey" or other known aliases.
These substitutions are not common ones, but are specific to the individual, entity, or address. Addresses can have such substitutions as well; neighborhoods may refer to themselves by a name unrecognized by the government.
Data Quality / Data Completeness
The quality of the data under consideration should be accounted for in the algorithms. Use all possible available data for both the watch list and client data sets. Lack of first names, use of initials, abbreviations, alternative names or addresses, and the like will cause matching problems. Understand also that name data may not be as it appears. It is not uncommon to find the first and middle names in a first name field or a name like "Van" in the middle name field. Failing to use the middle name field as part of your client data being matched may cause additional false positives. In addition, although the practice is not unique to the financial services industry, data entry personnel have been known to misuse certain fields. For example, a core banking system
may not allow for the entry of beneficial accounts for minors (or data entry operators
may not be aware of additional fields specific for this use). So it is relatively common to
find a financial institution with account names that might appear as: “John F. Smith b/o
Jerry G. Smith”. Successful matching strategies should account for these common
occurrences.
When considering name and address matching, one should include the use of
secondary filter criteria whenever possible. If the client data has a birth date and a watch list record has a birth date, using this additional criterion in addition to the standard name and address can greatly enhance the quality of the results (reducing both false positives and false negatives). Other fields such as
country of birth or citizenship are unlikely to change and can be used in considering the
likelihood of a match. Note that non-matches should not necessarily eliminate the
candidate as a match, but should be a factor in the decision.
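A minimal sketch of how secondary criteria might adjust, rather than veto, a candidate's score; the weights and function shape here are illustrative assumptions, not Truth Technologies' actual algorithm:

```python
# Sketch: use secondary fields (birth date, country) to adjust a
# 0.0-1.0 name-match score rather than to hard-reject a candidate.
# The boost/penalty weights are illustrative only.
def combined_score(name_score, dob_match=None, country_match=None):
    """Blend a name score with optional secondary evidence.

    dob_match / country_match may be True, False, or None (None means
    the field is missing on one side and is ignored)."""
    score = name_score
    for match, boost, penalty in ((dob_match, 0.15, 0.30),
                                  (country_match, 0.05, 0.10)):
        if match is True:
            score += boost      # agreement strengthens the candidate
        elif match is False:
            score -= penalty    # disagreement lowers, but does not
                                # eliminate, the candidate
    return max(0.0, min(1.0, score))

print(combined_score(0.8, dob_match=True))                      # stronger
print(combined_score(0.8, dob_match=False, country_match=True)) # weaker
```

Note that a mismatching birth date only lowers the score, following the point above that non-matches should be a factor in the decision rather than an automatic elimination.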
Domain Limits
One should consider domain limits, if known, when dealing with names and addresses. For example, there are a limited number of countries in the world, so mistyping, transliteration, or other errors that may have been introduced can be corrected by matching against that known, limited set.
Frequency
One needs to be aware of name frequency in various cultures. For example, in Korea the surname Kim is shared by roughly one in five people. The same holds for common Western last names and even first names. A match on the first name "John" between a customer and a watch list name is less meaningful than a match on a first name of "Osama". Any successful name matching algorithm should be aware of name frequency and adjust its emphasis accordingly.
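One simple way to sketch frequency-aware emphasis is an inverse-log-frequency weight; the frequency table below is a tiny illustrative stand-in, not real census data:

```python
import math

# Sketch: weight a first-name match by how informative the name is.
# Rare names carry more evidence than common ones. The frequencies
# below are illustrative placeholders, not real statistics.
FIRST_NAME_FREQ = {"john": 0.03, "kim": 0.02, "osama": 0.0001}
DEFAULT_FREQ = 0.001   # assumed frequency for unlisted names

def name_weight(first_name):
    """Inverse log frequency: rarer names get larger weights."""
    freq = FIRST_NAME_FREQ.get(first_name.lower(), DEFAULT_FREQ)
    return -math.log(freq)

# A first-name match on "Osama" carries far more weight than one on "John".
print(name_weight("John"), name_weight("Osama"))
```

Such a weight can then scale the contribution of each matched name element to an overall score, so that agreement on a very common name does not by itself push a candidate over the alert threshold.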
These are but a few of the considerations that should be made when evaluating any name and address
matching algorithms. At a minimum, hopefully the reader better understands the complexity of the
problem and therefore the complexity of any solution. The next section describes some technical
approaches to this problem.
K-Strings and Q-Grams
K-Strings and Q-Grams are a family of comparison algorithms that compare words or text and determine the closeness of those strings. Since these algorithms find the delta, or differential, between two strings, they must be run in real time, when both strings to be compared are known.
The most basic of this family are the K-String and Guth algorithms. A K-String algorithm searches for a specific string of text within a larger section of text and counts the number of letter differences between the two. Guth enhances this by taking into account the actual positions of letters. Obviously this technique is rudimentary and not especially effective, but it is often used in simple spell-checking algorithms and does have the advantage of not being tied to the phonetics of any specific language.
The enhanced versions of the K-String and Guth count the number of
differences between strings and the number of individual letter edits
(insertions, deletions, and substitutions) required to achieve a match and
return a score, usually from 0.0 (no match) to 1.0 (exact match).
The Levenshtein, or edit distance, algorithm is an example that measures the number of edits required to turn one string into another. There are a number of variations, such as q-grams, which are substrings within the larger text. A variation on the q-gram is LCS (Longest Common Substring), which removes the longest common substring before processing. Others, such as Smith-Waterman, allow for gaps and character-specific match score comparisons. There are many twists to these strategies, such as positional q-grams.
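The edit-distance and q-gram ideas above can be sketched as follows; this is textbook Levenshtein plus a bigram Jaccard overlap, not any vendor's tuned implementation:

```python
# Two comparison styles described above: edit distance (Levenshtein)
# normalized to a 0.0-1.0 score, and q-gram (here bigram) overlap.
# Both must be computed at comparison time, when both strings are known.
def levenshtein(a, b):
    """Count single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """0.0 (no match) to 1.0 (exact match)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def qgrams(s, q=2):
    """Set of all length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a, b, q=2):
    """Jaccard overlap of the two q-gram sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

print(edit_similarity("smith", "smyth"))    # 0.8 (one substitution)
print(qgram_similarity("smith", "smyth"))   # 2 shared bigrams of 6
```

Notice how the two measures disagree: a single substitution barely dents the edit-distance score but destroys the two bigrams that touch the changed letter, which is why hybrid strategies weigh several measures together.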
Russell Soundex
The algorithms listed above work best for real-time comparisons, when both strings are known, and work for any type of text string. The next several algorithms take a different approach, applying strategies that are specific to person, place, and entity names, and reduce a name to a unique code that can be stored in a database for quick comparison.
The Russell Soundex solution was developed near the turn of the twentieth century and is one of the first name matching solutions. It creates a four-character alphanumeric code by applying the following rules:
1. Retain the first letter of the name, and drop all occurrences of a, e, h,
i, o, u, w, y in other positions.
2. Assign the following numbers to the remaining letters after the first:
b, f, p, v = 1
c, g, j, k, q, s, x, z = 2
d, t = 3
l = 4
m, n = 5
r = 6
3. If two or more letters with the same code were adjacent in the
original name (before step 1), omit all but the first.
4. Convert to the form 'letter, digit, digit, digit' by adding trailing zeros (if there are fewer than three digits), or by dropping the rightmost digits if there are more than three.
For example, the names Euler, Gauss, Hilbert, Knuth and Lloyd are given the
respective codes E460, G200, H416, K530, and L300. However, the algorithm also gives the same codes to Ellery, Ghosh, Heilbronn, Kant, and Ladd, names that are in reality unrelated.
Although the Russell Soundex method is more accurate than just relying on
character similarities between the names, the algorithm is not ideal. Indeed,
when comparing first names, different codes can be given for abbreviated forms of a name, so Tom, Thos., and Thomas would be classed as different names.
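The four rules above can be implemented directly; this sketch reproduces the example codes given in the text:

```python
# Sketch of the Russell Soundex rules listed above.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit

def soundex(name):
    name = name.lower()
    digits = []
    prev = SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        code = SOUNDEX_CODES.get(ch, "")   # a,e,h,i,o,u,w,y map to ""
        if code and code != prev:          # rule 3: collapse adjacent repeats
            digits.append(code)
        prev = code
    # rule 4: pad with zeros, or truncate, to exactly three digits
    return name[0].upper() + "".join(digits + ["0", "0", "0"])[:3]

for n in ("Euler", "Gauss", "Hilbert", "Knuth", "Lloyd"):
    print(n, soundex(n))   # E460 G200 H416 K530 L300
```

The abbreviation weakness mentioned above is easy to confirm: `soundex("Tom")` gives T500 while `soundex("Thomas")` gives T520, so an exact comparison of the codes misses the match.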
Metaphone Coding
Metaphone Coding was developed by Lawrence Philips to replace words with their phonetic equivalents based on commonplace rules of English pronunciation. A simplified version of the reduction rules:
1. Ignore vowels after the first letter and reduce the remaining alphabet to sixteen consonant sounds; vowels are retained when they are the first letter.
2. Duplicate letters are not added to the code.
3. Zero is used to represent the ‘th’ sound since it resembles the Greek
theta when it has a line through it, and ‘X’ is used for the ‘sh’ sound.
4. The sixteen consonant sounds are: B X S K J T F H L M N P R Ø W Y
For example, the English word "School" would be shortened to "SKL" and the name "Shubert" to "XBRT".
Metaphone Coding is able to catch mistakes Soundex doesn't. For example, Metaphone can differentiate between Bonner and Baymore (BNR and BMR), but it falls short on a number of key differentiations.
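A much-simplified sketch of Metaphone-style reduction, covering only enough rules ('sh' to X, 'ch' to K, 'th' to 0, vowel dropping, de-duplication) to reproduce the examples above; real Metaphone applies many more context-sensitive pronunciation rules:

```python
# Greatly simplified Metaphone-style reduction. Only a handful of the
# real algorithm's context-sensitive rules are included, just enough
# to reproduce the examples in the text.
def simple_metaphone(word):
    w = word.lower()
    # a couple of digraph rules applied up front
    w = w.replace("sh", "x").replace("ch", "k").replace("th", "0")
    out = []
    for i, ch in enumerate(w):
        if i > 0 and ch in "aeiouy":
            continue                 # vowels kept only as the first letter
        if out and out[-1] == ch:
            continue                 # duplicate letters are not added
        out.append(ch)
    return "".join(out).upper()

print(simple_metaphone("School"))   # SKL
print(simple_metaphone("Shubert"))  # XBRT
print(simple_metaphone("Bonner"))   # BNR
print(simple_metaphone("Baymore"))  # BMR
```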
Hybrid (Phonex)
Phonex is an example of a hybrid between the Metaphone and Soundex algorithms. The basic idea is to convert names into their phonetic equivalents and then use Soundex or a related method to encode them. The algorithm converts each name to a four-character code, which can be used to identify equivalent names, and is structured as follows:
Pre-process the name according to the following rules:
1. Remove all trailing 'S' characters at the end of the name.
2. Convert leading letter-pairs as follows:
Each of these algorithms has benefits and drawbacks. Some can be used separately, others in parallel to create hybrid solutions. Truth Technologies has examined each of these algorithms, extracted the best of their benefits, and combined them into a unique and proprietary set of algorithms and rules that are dynamically combined on a per-client basis to implement a complete strategy for accurately filtering customer names and addresses against watch lists. As discussed above, the issues any such strategy must address include:
o Name variations.
o Cultural issues.
o Capture method.
o Aliases.
o Data Quality/Data Completeness.
o Domain Limits.
o Frequency.
A number of coding rules can be created to deal with each of these issues, primarily through pre and
post processing data cleansing and enhancement. Some of these strategies include:
o Acronym substitution.
o Initial identification.
o Non-phonetic fuzzy matching.
o Name lexicons and nickname substitution.
o Element matching, transformation, identification, and standardization.
o Key word and noise word identification.
o Element identification and weighting.
The core of the data matching algorithm is the actual name (person, place, or entity) comparison with those on watch lists. We examined the most basic of these algorithms above.
For any successful strategy, it is critical to understand when and where to use each of the techniques.
Some can be combined or used in parallel. Each strategy and methodology requires an understanding of
the target (person names, entity names, and address) and the context (cultural and data completeness).
Through a judicious application of these rules, one starts with a universe of potential matches, which is then narrowed down to return only the most relevant.
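The narrowing described above can be sketched as a two-stage screen: a cheap blocking key cuts the candidate universe, then a more expensive fuzzy score ranks the survivors. The watch-list entries, the first-letter blocking key, and the 0.8 threshold are all illustrative assumptions, with the standard library's `difflib.SequenceMatcher` standing in for a proprietary score:

```python
from difflib import SequenceMatcher

# Two-stage narrowing sketch: block on a cheap key to shrink the
# candidate universe, then rank survivors with a fuzzy score and keep
# only those above a tolerance threshold. Watch-list entries are
# fabricated examples.
WATCH_LIST = ["james bulger", "john smith", "jean martin", "maria garcia"]

def screen(query, threshold=0.8):
    # stage 1: cheap blocking key (here, just the first letter)
    candidates = [w for w in WATCH_LIST if w[0] == query[0].lower()]
    # stage 2: expensive pairwise comparison on the survivors only
    scored = [(w, SequenceMatcher(None, query.lower(), w).ratio())
              for w in candidates]
    return sorted([(w, s) for w, s in scored if s >= threshold],
                  key=lambda t: -t[1])

print(screen("Jon Smith"))   # 'john smith' survives with a high score
```

A real system would block on a phonetic key such as Soundex rather than a first letter, so that "Catherine" and "Katherine" land in the same block, but the shape of the pipeline is the same.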
Truth Technologies has carefully considered the benefits of each possible strategy, combined this with an understanding of the nature of the client data and the watch list data, and created a highly accurate name and address "engine". Truth Technologies' Sentinel dynamically applies various pre-programmed sets of rules specific to each individual client's needs. Currently, our application has over 19,000 rules that are applied dynamically as needed for each individual situation.
Under the hood, Sentinel utilizes a complex matrix of rules, algorithms, and strategies to create the most advanced and effective filtering software on the market. These rule sets are automatically and dynamically tailored to account for the idiosyncrasies and specifics of the client data and the watch list data. Although most of this complexity is hidden from the user, there are two other layers of user configuration.
So where do we go from here? Truth Technologies has a number of major product enhancements in our product roadmap, a few of which we are able to share (in the abstract) here:
Cross Language Matching
With this enhancement, a name in a two-byte character set, such as Mandarin Chinese, could be compared against Western watch lists. This can be done either by transliterating the Chinese name into a Latin character set or by rendering the Western watch list in Chinese characters. As can be imagined, this task brings with it a host of issues and complexities.
An Enhanced Risk Scoring Algorithm
This white paper has dealt with the likelihood of a match. Our algorithm takes the complete possible universe of matches, eliminates all but those most likely, and returns those that remain with a score. This score represents the confidence level that our algorithms have determined a client record matches one or more watch list records. To further refine matches, the algorithm adjusts to specific risk tolerance levels and users are able to apply sensitivity settings to these scores.
Enhanced Name and Address Matching Rules, Algorithms, and Strategies
Truth Technologies is constantly examining the latest name and address matching technologies, strategies, and algorithms, always striving to reduce false positives while maintaining a level of rigor that prevents false negatives (missed matches).
Each of these enhancements represents a major step forward for Truth Technologies’ Sentinel product,
both in terms of capability and in the scope and size of the effort. Translation engines are plentiful; however, generic text translators are very different from those specific to names. The creation of a name translation engine will require both translation expertise and an entirely new matching strategy.
Due to the size and nature of fines being dispensed by regulators, many industry insiders are recognizing the shift from compliance being a "checklist" issue to a "results" issue. Recent fines include those against Wachovia, American Express, Lloyds TSB, and RBS. These fines now reach the nine-digit range and are being applied not just for specific instances of money laundering, but for systemic failures in AML compliance. No longer can financial institutions rely on low-end exact name matching, or even the built-in Soundex features of existing database applications. Increasingly, companies are turning to high-end, purpose-built, and highly tuned software to minimize false positives and reduce or eliminate false negatives.
When looking at vendors that provide filtering software, we recommend that you ask the following
questions:
What matching algorithms are being used? Is the vendor even bothering to apply a name matching algorithm, or just doing exact matching? Is the vendor using an off-the-shelf solution
or built-in Soundex algorithm, or have they done the research and invested to develop a tailored
solution? Even if they are applying one or more of the strategies here, many of the algorithms
described aren’t specific to the task at hand and many are used to match any type of text, not
specifically people, place, and entity names. Understand what you are buying.
Does the vendor have an algorithm or a strategy? The heart of filtering software is the name
and address matching algorithm, but is this being employed as part of an overall strategy? For
example, does the vendor utilize nickname substitution? How do they deal with honorifics or
cultural variations? How about noise words such as “Incorporated”, “Company”, and “Limited”
when dealing with entity names? What pre and post name and address processing is occurring
(such as address standardization)?
Is there a possibility of false negatives? Worse than false positives (which create additional work for a compliance officer to eliminate), a false negative may mean your institution is doing business with someone on a watch list, a time bomb waiting to explode. Revelation that your institution is doing business with someone on a watch list can result in hundreds of millions of dollars in penalties as well as civil and criminal charges. Be wary of any company that claims extremely low false positive rates: they may be doing an exact match between names, or using some other rudimentary algorithm that returns very few matches, and likely produces false negatives.
Is it a complete solution? Does the vendor’s package allow segmentation of data and workflow
management? Regulators are looking for risk-based approaches to KYC. By segmenting
customer data by product line (retail customers vs. brokerage customers, for example), by
geography, by industry, or by other risk criteria and applying different frequencies and tolerance
levels for screening, compliance officers can reduce risk and increase compliance. What is your
strategy and is your vendor flexible to adapt to the way you do business? Or are you expected
to adapt to the way the vendor software does business?
Can your vendor rescreen data? Increasingly, regulators are insisting on the periodic
rescreening of all client data. Just because a client may not have been on a watch list or been a
politically exposed person at the time they were on-boarded, doesn’t mean they haven’t
subsequently been added to a list. The key to any rescreening process is to only flag matches
when the underlying data has changed or it is a new match. For example, if a match occurs but
is marked as a false positive, subsequent rescreenings should recognize this and treat the match differently from a match against a newly added, never-before-matched watch list record. An alternative to periodically rescreening all client data is a method called Continuous Customer Monitoring (CCM). In CCM, the stored client data is screened against updates to the watch list as they are loaded, and any new match between an update and the client data results in a notification to the compliance officer. The downside of CCM is that the client data is also changing, which requires a second screening process covering new and changed client records.
As can be seen in this white paper, Truth Technologies’ singular focus on KYC technologies, strategies,
and algorithms means we have invested the time and effort to develop, and continue to enhance, the
most advanced and effective filtering solution on the market. Truth Technologies' flagship product, Sentinel, is a leader in filtering strategies, using a proprietary combination of
name and address matching algorithms that builds upon the techniques described in this white paper,
but also greatly expands on them to account for the various complexities and potential problems
specifically associated with client and watch list data.
For more information contact your local Truth Technologies’ Sentinel sales representative or email us at
sales@truthtechnologies.com.
Christen, Peter. A Comparison of Personal Name Matching: Techniques and Practical Issues, TR-CS-06-
02, Joint Computer Science Technical Report Series, Department of Computer Science, The Australian
National University, September, 2006.
Lait, AJ and Randell, B., An Assessment of Name Matching Algorithms, Department of Computer Science,
University of Newcastle upon Tyne.
Snae, Chakkrit, A Comparison and Analysis of Name Matching Algorithms, World Academy of Science,
Engineering and Technology, November 25, 2007.
Wang, Wen, Ji, Heng, and Grishman, Ralph, Phonetic Name Matching for Cross-Lingual Spoken Sentence Retrieval, New York University.