
Truth Technologies, Inc.

Name and Address Matching Strategy
White Paper

For Release: December 2010
Table of Contents

Background
The Problem
Existing Technologies
Putting It All Together
Summary
References and Additional Reading

Customizable, High-Speed Customer Verification

Truth Technologies, Inc.


Author: David Olenzak
www.TruthTechnologies.com | sales@truthtechnologies.com | 866.691.3867
© Copyright 2010, Truth Technologies. All rights reserved.
Background
Truth Technologies’ flagship product is Sentinel, a hosted Know-Your-Customer (KYC) compliance
solution which is part of a financial institution’s overall Anti-Money Laundering (AML) compliance
program. Our software helps a financial institution’s compliance officer answer the question: Is this
person (or entity) who they say they are, and should I be doing business with them?

A Venn Diagram of Truth Technologies’ Products within the Industry

There are a number of vendors that claim to have KYC software solutions; however, most of these were
created as add-on modules without the proper research and planning. These solutions often compare
against incomplete watch list data or use rudimentary name and address matching algorithms that risk
missing an actual match (false negative) or providing too many potential matches (false positives) to be
useful. For example, a major provider of core banking software created a KYC module that only
compares against the US OFAC list, and only does partial matching using Soundex algorithms often built
into database software (which might match “Thom” and “Tom”, but would miss “Thomas”). Our goal is
to do one thing, do it well, and as a result of this singular focus we have created the premier KYC
software solution on the market. We are not content to rest, and are continually updating and
upgrading our software to ensure that we maintain our leadership position in this market.

This white paper is not meant to detail every possible name and address matching algorithm nor does it
provide specifics about Truth Technologies’ unique and proprietary name and address algorithms and
strategies; however, it does provide an overview of the basics of name and address matching technology and insight into the items one should consider when choosing a filter vendor
solution. This document is split into three sections – the first details the requirements and why they are
important; the second provides an overview of existing technology; and the third details items that
should be considered for vendor selection. Truth Technologies has carefully examined all the
technologies listed in this white paper and built upon this body of public research to create a unique and
proprietary approach to the problem that is superior to any other known applications on the market.


The Problem
Name and address matching technology dates back to the early days of mainframes when companies
wanted to “merge/purge” lists. Usually this involved taking two sets of mailing lists or customer lists
and combining them into a single list without duplicates. These early lists brought to the fore a whole series of issues, including address standardization and data integrity. Ultimately, these problems have grown into an entire IT discipline, Master Data Management. Applying this to Know Your Customer (KYC) is much more complex than it first appears. Not only does the software need to know that “Bob”,
“Robert”, and “Roberto” are all the same person, it also has to deal with cultural anomalies (first
name/last name vs. last name/first name), honorifics, translations, and the like. When it comes to KYC
software, a financial institution cannot afford to miss a match. Accidentally accepting a customer that is
on a sanctions list could cause irreparable financial and reputational harm (known in the industry as a false negative). At the same time, financial institutions need to be able to manage matches and partial matches of
customers with similar sounding individual and entity names on watch lists (false positives). Since it
takes time to research and eliminate or verify any potential matches and most compliance officers have
limited resources, quality name and address matching is critical. The goal of this white paper is to help
readers understand these potential problems and the implications of these issues.

To begin, the following is a partial summary of issues that must be taken into account when developing
any successful name and address matching strategy. The following section deals specifically with name
and address matching algorithms.

Name Variations

Names, whether person, place, or entity, would be easy to match if they were consistent. If “John Smith” was always known as “John Smith” and never “Jonathan J. Smith”, or recorded as “John Smyth”, we could use exact matching and there would be no need for name and address matching software. But that is not the case. Name variations are the most common issue in name matching, and they arise from a number of sources, including:

(1) Spelling variations. These can include interchanged or misplaced letters due to typographical errors, substituted letters (as in Smyth and Smith), additional letters (such as Smythe), or omissions (as with Collins and Colins). Generally such variations do not affect the phonetic structure of the name but still cause problems with matching algorithms.

(2) Phonetic variations. Where the phonemes of the name are modified, e.g. through mishearing, the structure of the name is substantially altered. Sinclair and St. Clair are related names but their phonetic structure is very different. Indeed, phonetic variations in first names can be very large, as illustrated by Christina and its shortened form Tina.

(3) Double surnames. In some cases surnames are composed of two elements but both are not always shown. For example, a double surname such as Philips-Martin may be given in full, as Philips, or as Martin.

(4) Double first names. Although not common in English, in other languages such as French, names like Jean-Claude may be given in full, or as Jean and/or Claude.

(5) Name changes. Where an individual changes their name during the course of their life, or is called by one of their first names during one period of their life and by another later on, name matching becomes especially difficult. In these situations an algorithm that recognizes simple variations in spelling or phonetics would not be able to identify the two names as referring to the same person.
This last case illustrates the scope of the problem. Name matching difficulty is
accentuated by the personal nature of names (including place names). For example,
someone named Elisabeth could go by any number of spellings of the full name or by
any number of nicknames. The following is just a partial example of shortened names
or common nicknames for Elisabeth in English:

Babeth, Babette, Batty, Bee, Bess, Bessie, Bet, Beth, Bethanne, Bethey, Betsy, Betta, Bette, Bettina, Betty, Bezzy, Bi, Billie, Bitty, Bitsy, Biz, Buffy, Effy, Ela, Elbinard, Eli, Eliaz, Elisa, Elise, Eliza, Eliza-Beth, Ella, Elle, Ellie, Ellie-B, Els, Elsa, Elsie, Elsinue, Elspeth, Ely, Ilsa, Ibby, Isabel, Iz, Izabel, Izabeth, Izzie, Izzy, Leeza, Leezbeez, Lib, Libby, Lidabet, Lies, Liesel, Lila, Lili, Lilibet, Lilibeth, Lible, Lilie, Lilla, Lillah, Lilli, Lillibet, Lillibeth, Lillie, Lilly, Lily, Lisa, Lisbet, Lisbeth, Lisette, Lissy, Liz, Liza, Lizbet, Lizabeth, Lizbee, Lizine, Lizz, Lizzette, Lizzie, Lizzy, Sissi, Tetty, Tibby, Tizzy, Wiz, Wizzy, Yizzy, Zabe, Zabs, Zeebz

These names may have been given by the parents, adopted from popular culture, or chosen by the individual – however they arose, they cause special problems when trying to resolve and match names and addresses. As one can imagine, this is just one name and one set of possible English variations. There are a number of methods to address this problem, including name substitution. To get a true idea of the amount of data that needs to be collected and used in name substitution, one needs to imagine a list of every common name in every language and every cultural variation – a truly enormous undertaking.
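To illustrate the name-substitution idea in code, the following is a minimal sketch (not Truth Technologies’ proprietary implementation); the nickname table is a tiny hypothetical fragment of the kind of lexicon described above:

```python
# Sketch of nickname substitution: map every known variant to a canonical
# form before comparison. The table below is a tiny hypothetical fragment;
# a production lexicon would cover many names, languages, and cultures.
NICKNAMES = {
    "liz": "elisabeth", "lizzie": "elisabeth", "beth": "elisabeth",
    "betsy": "elisabeth", "elsa": "elisabeth", "libby": "elisabeth",
    "bob": "robert", "rob": "robert", "roberto": "robert",
    "rick": "richard", "dick": "richard",
}

def canonical_first_name(name: str) -> str:
    """Return the canonical form of a first name, if one is known."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)

# Two records that differ only by nickname now compare equal.
assert canonical_first_name("Liz") == canonical_first_name("Elisabeth")
```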


Cultural Issues

Cultural adaptations of names add further complexity. In Latin American countries it is common practice to include both maternal and paternal last names in a hyphenated format. In many Western cultures it is increasingly common for married couples to adopt both names in this hyphenated format. In other cultures, honorifics or other descriptors are often incorporated into the name. In Scotland, for example, “Mac” or “Mc” originally meant “son of” but has come to be part of many Scottish family names. Names can be reported with or without these honorifics or descriptors, and each form must be taken into account.

Capture Method

When applying a name and address matching methodology, it is important to take the data capture method into consideration. Take an example where a user self-reports: a web user might create an online order that needs to be compared with credit card information the user recorded in a prior transaction. For shipping, they might use a familiar name (e.g., Rick), whereas for the credit card they might have used the formal name (Richard). These differences and “errors” in translation are very different from those created when a listener transcribes a name. If a person placed a phone order, “Jean” and “Gene” are phonetically equivalent and could easily be confused during transcription, but would be unlikely to result from self-reporting.

An entire area of computer science, Master Data Management, has grown up around attempting to determine the “true” database record of fact. In the prior example, should a company’s record include the informal “Rick” the person may like to be called, the “Richard” that might appear on his credit card, or something else? In the financial services world, bank accounts and the like are often registered using an individual’s formal name, as are passports and other documents; however, watch lists often include aliases, partial names, or familiar names.

Aliases

When performing any sort of name and address checking, it is necessary to consider that a single check may really be multiple checks against multiple aliases. This is closely aligned with the need for name substitution. For example, one of the FBI’s most wanted individuals is Boston crime family head James Joseph Bulger, Jr., who is more commonly known as “Whitey” Bulger or James Bulger. Any number of variations may include part or all of his real name, “Whitey”, or other known aliases.

These are not common substitutions, but are specific to the individual, entity, or address. Addresses can have such substitutions as well – neighborhoods may refer to themselves by a name unrecognized by the government. For example, outside of Washington, DC there is a neighborhood where many residents use the city name of “North Potomac”, adopting the more prestigious name of their neighboring community of “Potomac” and forgoing their actual postal address of Rockville.

To add complexity, a number of known criminals use various aliases for the sole purpose of creating confusion. Watch lists may include part or all of these known aliases. Of course, an individual on a watch list who uses a new and different alias, combines it with a valid address not known to be associated with them, and is able to produce valid legal identification can likely get away with opening an account. The best prevention against this is for an institution to periodically recheck its account data against watch lists to pick up any new aliases as they become known to authorities.

Data Quality / Data Completeness

The quality of the data under consideration should be accounted for in the algorithms. Use all available data for both the watch list and client data sets. Missing first names and the use of initials, abbreviations, alternative names, or alternative addresses will cause matching problems. Understand also that name data may not be as it appears. It is not uncommon to find first and middle names together in a first name field, or a name like “Van” in the middle name field. Failing to include the middle name field in the client data being matched may cause additional false positives. In addition, although the practice is not unique to the financial services industry, data entry personnel have been known to misuse certain fields. For example, a core banking system may not allow for the entry of beneficiary accounts for minors (or data entry operators may not be aware of additional fields specific to this use), so it is relatively common to find a financial institution with account names such as “John F. Smith b/o Jerry G. Smith”. Successful matching strategies should account for these common occurrences.

When considering name and address matching, one should include secondary filter criteria whenever possible. If both the client data and a watch list record carry a birth date, using this secondary criterion in addition to the standard name and address can greatly enhance the quality of the results, reducing both false positives and false negatives. Other fields such as country of birth or citizenship are unlikely to change and can be used in weighing the likelihood of a match. Note that a non-matching secondary field should not necessarily eliminate a candidate, but should be a factor in the decision.
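As a simple illustration, a secondary criterion such as date of birth might adjust a name-match score along these lines (a sketch with illustrative weights, not Sentinel’s actual scoring):

```python
from datetime import date
from typing import Optional

def adjust_score(name_score: float,
                 client_dob: Optional[date],
                 watchlist_dob: Optional[date]) -> float:
    """Nudge a name-match score using date of birth as a secondary criterion.

    A matching DOB raises confidence, a conflicting DOB lowers it, and a
    missing DOB on either side leaves the name score unchanged. The weights
    are illustrative only.
    """
    if client_dob is None or watchlist_dob is None:
        return name_score                  # absence of data is not evidence
    if client_dob == watchlist_dob:
        return min(1.0, name_score + 0.15)
    return max(0.0, name_score - 0.25)     # a mismatch lowers, but does not eliminate

print(adjust_score(0.80, date(1960, 3, 1), date(1960, 3, 1)))  # 0.95
print(adjust_score(0.80, date(1960, 3, 1), None))              # 0.8
```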

Domain Limits

One should consider domain limits, if known, when dealing with names and addresses. For example, there are a limited number of countries in the world, and therefore mistyping, transliteration, or other errors that may have been introduced can be corrected. Additional data sources may be used to provide these domain limits – for example, a database of country names and common aliases (UK and United Kingdom, for example). In another example used by Truth Technologies, the US Postal Service provides a quarterly update that includes every possible legal US address. It would know that the street addresses on Main Street in a certain town run from 1000 to 1500, but it cannot tell whether “1234 Main Street” has an actual building at that address or is an empty parcel of land.

One must be careful when using domain lists. Suppose a builder extends a street or creates an entirely new street; this may not be recognized until the US Postal Service issues a new release of its database. So should you allow an “invalid” address? On the other side, suppose a watch list specifies an individual’s date of birth as February 30th: should you try to correct this data, and if so, should the month or the day change?

Even political considerations can come into play. Problems can exist with a country such as Myanmar, the official name of the country formerly known as Burma; many countries and people refuse to recognize the name change. Another example is Somaliland, a breakaway region in the northern part of Somalia. A more widely recognized problem is Palestine: a number of nations refuse to recognize the state of Israel and use the term Palestine to refer to the area including the current state of Israel, others use Palestine to refer to the Israeli-occupied territories where Palestinian refugees currently live (the West Bank and Gaza), while Israel may refuse to recognize Palestine on any current map.

Any solution incorporating domain limits should be flexible and provide a degree of tolerance.
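A minimal sketch of a tolerant domain lookup along these lines (the alias table is a tiny hypothetical fragment):

```python
# Sketch of a tolerant country-name lookup: normalize known aliases, but do
# not reject a value just because it is not in the table.
COUNTRY_ALIASES = {
    "uk": "United Kingdom", "united kingdom": "United Kingdom",
    "great britain": "United Kingdom",
    "burma": "Myanmar", "myanmar": "Myanmar",
    "usa": "United States", "united states": "United States",
}

def normalize_country(raw: str) -> tuple[str, bool]:
    """Return (normalized value, recognized flag); unknown values pass through."""
    key = " ".join(raw.split()).lower()
    if key in COUNTRY_ALIASES:
        return COUNTRY_ALIASES[key], True
    return raw, False   # flexible: keep the original rather than discard it

print(normalize_country("UK"))          # ('United Kingdom', True)
print(normalize_country("Somaliland"))  # ('Somaliland', False)
```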

Frequency

One needs to be aware of name frequency in various cultures. In Korea, for example, roughly a fifth of the population shares the last name Kim. The same is true of common Western last names and even first names. A match on the first name “John” between a customer and a watch list record is far less meaningful than a match on a first name of “Osama”. Any successful name matching algorithm should be aware of name frequency and adjust its emphasis accordingly.
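A sketch of frequency-aware weighting (the frequencies and the scaling are hypothetical, for illustration only):

```python
# Sketch of frequency-aware weighting: a match on a very common name carries
# less evidence than a match on a rare one. Frequencies are hypothetical.
FIRST_NAME_FREQUENCY = {"john": 0.032, "maria": 0.021, "osama": 0.0004}
DEFAULT_FREQUENCY = 0.001

def first_name_weight(name: str) -> float:
    """Rarer names get weights closer to 1.0, common names closer to 0."""
    freq = FIRST_NAME_FREQUENCY.get(name.lower(), DEFAULT_FREQUENCY)
    return 1.0 - min(freq / 0.05, 1.0)   # illustrative scaling only

print(first_name_weight("John"))   # noticeably lower weight
print(first_name_weight("Osama"))  # close to 1.0
```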

These are but a few of the considerations that should be made when evaluating any name and address
matching algorithms. At a minimum, hopefully the reader better understands the complexity of the
problem and therefore the complexity of any solution. The next section describes some technical
approaches to this problem.


Existing Technologies
The following section describes some of the most common publicly available name (and text) comparison
comparison algorithms. The name comparison algorithm is the most important component of any
overall strategy, but must be combined with other techniques used to remediate problems and issues
listed in the last section, with the goal of creating a comprehensive overall name and address matching
strategy. This section is not meant to be a comprehensive analysis of the entirety of research in this
area; rather, it is presented as an instructional tool to assist with the understanding of basic name and
address matching technology. Truth Technologies’ proprietary name and address matching algorithm
not only builds upon this existing body of work, but includes a continuous process of extending and
enhancing our algorithms to ensure the highest quality software with the fewest false negatives while
attempting to eliminate any false positives.

K-Strings and Q-Grams

K-Strings and Q-Grams are a family of comparison algorithms that compare words or text and determine the closeness of two strings. Since these algorithms find the delta, or differential, between two strings, they must be run in real time, when both strings to be compared are known.

The most basic members of this family are the K-String and Guth algorithms. A K-String algorithm searches for a specific string of text within a larger section of text and counts the number of letter differences between the two. Guth enhances this by taking into account the actual positions of letters. Obviously this technique is rudimentary and not especially effective, but it is often used for simple spell-checking and has the advantage of not being tied to the phonetics of any specific language.

Enhanced versions of K-String and Guth count the number of differences between strings and the number of individual letter edits (insertions, deletions, and substitutions) required to achieve a match, and return a score, usually from 0.0 (no match) to 1.0 (exact match). The Levenshtein or Edit Distance algorithm is an example that measures the number of edits required to turn one string into another. There are a number of variations, such as q-grams, which are substrings within the larger text. A variation on the q-gram is LCS (Longest Common Substring), which removes the longest common substring before processing. Others, such as Smith-Waterman, allow for gaps and character-specific match scores. There are many twists on these strategies, such as Positional q-grams, Skip-grams, Compression, Jaro, Winkler, Sorted-Winkler, and Permuted-Winkler.
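As an illustration of the edit-distance idea, here is a minimal sketch of the Levenshtein calculation, normalized to the 0.0 to 1.0 scoring range mentioned above (an instructional example, not the comparison used by Sentinel):

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, and substitutions
    needed to turn string a into string b (classic dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a 0.0 (no match) .. 1.0 (exact match) score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))

print(similarity("Smith", "Smyth"))     # 0.8
print(similarity("Collins", "Colins"))  # ~0.86
```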

Russell Soundex

The algorithms listed above work best for real-time comparisons, when both strings are known, and work for any type of text string. The next several algorithms take a different approach, applying strategies that are specific to person, place, and entity names and reducing a name to a code that can be stored in a database for quick comparison.

The Russell Soundex solution was developed in the early twentieth century and is one of the first name matching solutions. It applies an algorithm to create a four-character alphanumeric code using the following rules:

1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions.
2. Assign the following numbers to the remaining letters after the first:
   b, f, p, v = 1
   c, g, j, k, q, s, x, z = 2
   d, t = 3
   l = 4
   m, n = 5
   r = 6
3. If two or more letters with the same code were adjacent in the original name (before step 1), omit all but the first.
4. Convert to the form ‘letter, digit, digit, digit’ by adding trailing zeros (if there are fewer than three digits) or by dropping rightmost digits (if there are more than three).
For example, the names Euler, Gauss, Hilbert, Knuth, and Lloyd are given the respective codes E460, G200, H416, K530, and L300. However, the algorithm gives the same codes to Ellery, Ghosh, Heilbronn, Kant, and Ladd, which are unrelated names.

Although the Russell Soundex method is more accurate than simply relying on character similarities between names, the algorithm is not ideal. In particular, when comparing first names, different codes can be given for abbreviated forms of a name, so Tom, Thos., and Thomas would be classed as different names.
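The rules above can be sketched in a few lines of code; this follows the classic rules as listed and reproduces the example codes in the text (it is an instructional sketch, not one of the enhanced variants discussed next):

```python
SOUNDEX_CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}

def russell_soundex(name: str) -> str:
    """Four-character Soundex code built from the rules described above."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    # Rule 3: collapse runs of adjacent letters sharing a code in the original name.
    collapsed = [letters[0]]
    for c in letters[1:]:
        if SOUNDEX_CODES.get(c) is None or SOUNDEX_CODES.get(c) != SOUNDEX_CODES.get(collapsed[-1]):
            collapsed.append(c)
    # Rules 1 and 2: keep the first letter, encode the rest, drop unlisted letters.
    digits = [SOUNDEX_CODES[c] for c in collapsed[1:] if c in SOUNDEX_CODES]
    # Rule 4: pad with zeros or truncate to exactly three digits.
    return (collapsed[0].upper() + "".join(digits) + "000")[:4]

# Matches the examples in the text: related and unrelated names share codes.
print(russell_soundex("Euler"), russell_soundex("Ellery"))  # E460 E460
print(russell_soundex("Knuth"), russell_soundex("Kant"))    # K530 K530
```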


Enhanced Soundex

A number of researchers have expanded on the Russell Soundex, such as Henry Name Matching and the Daitch-Mokotoff Coding Method. These modifications were often developed to solve specific language issues, and the four-character alphanumeric code is often expanded and the coding rules changed to improve accuracy. However, each of these is still far from accurate.

Soundex and its enhanced derivatives are the best known and most often implemented algorithms. Since these strategies are simple to code, most database vendors provide a Soundex feature or a derivative thereof. Many of Truth Technologies’ competitors either use one of these built-in database features or have created their own implementation of these most fundamental and error-prone matching strategies.

Metaphone Coding

Metaphone Coding was developed by Lawrence Philips to replace words with their phonetic equivalents based on common rules of English pronunciation. A simplified version of the shortening rules is:

1. Ignore vowels after the first letter and reduce the remaining alphabet to sixteen consonant sounds; vowels are retained only when they are the first letter.
2. Duplicate letters are not added to the code.
3. Zero is used to represent the ‘th’ sound, since it resembles the Greek theta when written with a line through it, and ‘X’ is used for the ‘sh’ sound.
4. The sixteen consonant sounds are: B X S K J T F H L M N P R 0 W Y

For example, the English word “School” would be shortened to “SKL” and the name “Shubert” to “XBRT”. Metaphone Coding is able to catch mistakes Soundex does not. For example, Metaphone can differentiate between Bonner and Baymore (BNR and BMR), but it falls short on a number of key differentiations.
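A toy sketch of the simplified shortening rules listed above (the full Metaphone algorithm applies many more context-dependent rules, including consonant mappings not shown here):

```python
VOWELS = set("aeiou")

def simple_metaphone(word: str) -> str:
    """Toy version of the shortening rules above: map 'th'->'0' and 'sh'->'X',
    keep a leading vowel, drop other vowels, and skip duplicate letters.
    The real Metaphone algorithm has many more context-dependent rules."""
    w = word.lower().replace("th", "0").replace("sh", "x")
    code = []
    for i, c in enumerate(w):
        if i > 0 and c in VOWELS:
            continue                      # vowels kept only as the first letter
        if code and code[-1] == c:
            continue                      # duplicates are not added to the code
        code.append(c)
    return "".join(code).upper()

print(simple_metaphone("Shubert"))  # XBRT
print(simple_metaphone("Bonner"))   # BNR
```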

Hybrid (Phonex)

Phonex is an example of a hybrid between the Metaphone and Soundex algorithms. The basic idea is to convert names into their phonetic equivalents and then use Soundex or a related method to encode them. The algorithm converts each name to a four-character code, which can be used to identify equivalent names, and is structured as follows.

Pre-process the name according to the following rules:
1. Remove all trailing 'S' characters at the end of the name.
2. Convert leading letter pairs as follows:
   KN = N
   WR = R
   PH = F
3. Convert leading single letters as follows:
   H = remove
   E, I, O, U, Y = A
   K, Q = C
   P = B
   J = G
   V = F
   Z = S

Code the pre-processed name according to the following rules:
1. Retain the first letter of the name, and drop all occurrences of A, E, H, I, O, U, W, Y in other positions.
2. Assign the following numbers to the remaining letters after the first:
   B, F, P, V = 1
   C, G, J, K, Q, S, X, Z = 2
   D, T = 3 (if not followed by C)
   L = 4 (if not followed by a vowel or the end of the name)
   M, N = 5 (ignore the next letter if it is D or G)
   R = 6 (if not followed by a vowel or the end of the name)
3. Ignore the current letter if it has the same code digit as the last character of the code.
4. Convert to the form ‘letter, digit, digit, digit’ by adding trailing zeros (if there are fewer than three digits) or by dropping rightmost digits (if there are more than three). Although the resulting four-character code is identical in format to that produced by the Soundex coding algorithm, the two forms are not compatible.
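As a sketch, the pre-processing step above might look like the following (illustrative only; the subsequent coding step, and real Phonex implementations, involve further rules):

```python
LEADING_PAIRS = {"kn": "n", "wr": "r", "ph": "f"}
LEADING_SINGLES = {"h": "", "e": "a", "i": "a", "o": "a", "u": "a", "y": "a",
                   "k": "c", "q": "c", "p": "b", "j": "g", "v": "f", "z": "s"}

def phonex_preprocess(name: str) -> str:
    """Phonex pre-processing rules as described above: strip trailing 'S',
    rewrite certain leading letter pairs, then certain leading single letters."""
    w = name.lower().rstrip("s")               # 1. remove trailing S characters
    for pair, repl in LEADING_PAIRS.items():   # 2. convert leading letter pairs
        if w.startswith(pair):
            w = repl + w[2:]
            break
    if w and w[0] in LEADING_SINGLES:          # 3. convert leading single letters
        w = LEADING_SINGLES[w[0]] + w[1:]
    return w

print(phonex_preprocess("Knuth"))     # 'nuth'
print(phonex_preprocess("Phillips"))  # 'fillip'
```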
There are a number of variations on this technique (Phonix, NYSIIS, Double Metaphone, Fuzzy Soundex, and the like). Each has its own enhancements, which provide marginally better matching. For example, Phonix enhances the Phonex coding algorithm through more than one hundred additional transformation rules based on groups of letters. NYSIIS (the New York State Identification and Intelligence System) uses groups of letters rather than the more common Soundex four-character alphanumeric coding. One should keep in mind that different enhancements were often created and tailored with different data sets in mind; for example, some algorithms may work better on French names than on English names.

Each of these algorithms has benefits and drawbacks. Some can be used separately, others in parallel to
create hybrid solutions. Truth Technologies has examined each of these algorithms, extracted the best of their benefits, and combined them into a unique and proprietary set of algorithms and rules that are dynamically applied on a per-client basis to implement a complete strategy for accurately filtering customer names and addresses against watch lists.


Putting It All Together
We’ve examined a number of the issues involved in creating a quality name and address matching solution. These include understanding your data and why or where differences can originate.
Topics include:

o Name variations.
o Cultural issues.
o Capture method.
o Aliases.
o Data Quality/Data Completeness.
o Domain Limits.
o Frequency.

A number of coding rules can be created to deal with each of these issues, primarily through pre- and post-processing data cleansing and enhancement. Some of these strategies include:

o Acronym substitution.
o Initial identification.
o Non-phonetic fuzzy matching.
o Name lexicons and nickname substitution.
o Element matching, transformation, identification, and standardization.
o Key word and noise word identification.
o Element identification and weighting.

The core of the data matching algorithm is the actual name (person, place, or entity) comparison with
those on watch lists. We examined the most basic of these algorithms including:

o K-Strings and Q-Grams.
o Russell Soundex.
o Enhanced Soundex.
o Metaphone Coding.
o Hybrid (Phonex).

For any successful strategy, it is critical to understand when and where to use each of the techniques.
Some can be combined or used in parallel. Each strategy and methodology requires an understanding of
the target (person names, entity names, and addresses) and the context (culture and data completeness). Through a judicious application of these rules, one starts with a universe of potential matches, which is then narrowed down to return only the most relevant.


Truth Technologies Matching “Engine” Workflow

o Client Data Extraction pulls the relevant data, including all possible fields, from the client database. This data is encrypted and sent to Truth Technologies’ hosted system in a secure manner for processing.
o Pre-Processing applies a number of standardization, identification, and transformation methodologies.
o The core of any filtering solution is the application of one or more robust name and address Matching Algorithms. The result is the universe of possible matches with corresponding scores.
o In the Post-Processing step, the scores are weighted and reduced to produce a final subset of possible matches.
o The final step is a review by a Compliance Officer to determine actual matches and false positives, and to take appropriate action.
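In code, the overall flow can be pictured roughly as follows. This is a structural sketch under our own naming assumptions; the function and field names are illustrative, not Sentinel’s actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    client_id: str
    watchlist_id: str
    score: float   # 0.0 (no match) .. 1.0 (exact match)

def screen(client_records, watchlist, preprocess, match, postprocess):
    """Illustrative end-to-end flow: pre-process both sides, run the matching
    algorithms to get a universe of scored candidates, then weight and reduce
    them to the subset a compliance officer will review."""
    clients = [preprocess(r) for r in client_records]
    targets = [preprocess(w) for w in watchlist]
    candidates = [Candidate(c["id"], t["id"], match(c, t))
                  for c in clients for t in targets]
    return postprocess(candidates)        # weighted, reduced subset for review

# Example wiring with toy components:
hits = screen(
    client_records=[{"id": "C1", "name": "Rick Smith"}],
    watchlist=[{"id": "W7", "name": "Richard Smyth"}],
    preprocess=lambda r: {**r, "name": r["name"].lower()},
    match=lambda c, t: 0.9 if c["name"].split()[-1][:2] == t["name"].split()[-1][:2] else 0.1,
    postprocess=lambda cands: [c for c in cands if c.score >= 0.5],
)
print(hits)
```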

Truth Technologies has carefully considered the benefits of each possible strategy, combined this with an understanding of the nature of the client data and the watch list data, and created a highly accurate name and address matching “engine”. Truth Technologies Sentinel dynamically applies various pre-programmed sets of rules specific to each individual client’s needs. Currently, our application has over 19,000 rules that are applied dynamically as needed for each individual situation.

Under the hood, Sentinel utilizes a complex matrix of rules, algorithms, and strategies to create the most advanced and effective filtering software on the market. These rule sets are automatically and dynamically tailored to account for the idiosyncrasies and specifics of the client data and the watch list data. Although most of this complexity is hidden from the user, two other layers of user customization are applied. During setup, Sentinel allows users to make choices about the data and the sensitivity of the algorithms applied for the institution. In addition, some of these choices can be extended down to the user level, allowing different settings to be applied to individual users and data sets. Furthermore, Truth Technologies consultants can help clients understand how the matching engine works and the implications of changing these settings.

Truth Technologies Sentinel Architecture

So where do we go from here? Truth Technologies has a number of major product enhancements on our product roadmap, a few of which we can share here in the abstract:

Cross Language Matching

With this enhancement, a name in a two-byte character set, such as Mandarin Chinese, could be compared against Western watch lists. This can be done either by transliterating the Chinese name into a Latin character set or by converting the Western watch list into Chinese characters. As can be imagined, this task brings with it a host of issues and complexities.

An Enhanced Risk Scoring Algorithm

This white paper has dealt with the likelihood of a match. Our algorithm takes the complete possible universe of matches, eliminates all but those most likely, and returns the remainder with a score. This score represents the confidence level our algorithms have determined for a client record matching one or more watch list records. To further refine matches, the algorithm adjusts to specific risk tolerance levels, and users are able to apply sensitivity settings to these likelihood scores.

In addition to a score that indicates the level of confidence (likelihood) of a match, another score could be added to describe the importance of a match. Most of our customers use Sentinel in conjunction with the World-Check database, the leading provider of watch list data. The World-Check database can be divided into three categories of data: watch list data (people, countries, and entities that companies are prohibited from doing business with); politically exposed persons (PEPs; people that companies can do business with but must apply enhanced due diligence to); and high risk individuals (those with prior convictions for relevant crimes such as money laundering). Consider the case of two different name and address matches, each with the same likelihood of a match – however, one match is against multiple watch lists while the other is against a high risk individual. Sentinel would return the same likelihood score on each match, but it would assign a higher importance score to the former (the watch list hits) than to the latter (a high risk individual list hit). This new score measuring the importance of a match, combined with the score describing the likelihood of a match, is what we call the Enhanced Risk Scoring Algorithm.

The Enhanced Risk Scoring Algorithm allows compliance officers to deal first with what could be considered the most important issues at the highest confidence levels (failure to deal with a watch list hit can result in civil and criminal penalties in addition to financial and reputational risk, whereas a high risk individual leaves open only financial and reputational risk). Since importance is subjective, this new feature will incorporate a set of base rules that can be tailored via weights by the compliance officer for each specific case.
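A rough sketch of the two-score idea follows; the categories and weights are illustrative assumptions, not the values Sentinel will use:

```python
# Sketch of the two-score idea: likelihood (how confident the match is) plus
# importance (what kind of list produced it). Category weights are illustrative.
IMPORTANCE = {"sanctions": 1.0, "pep": 0.6, "high_risk": 0.4}

def prioritize(matches):
    """Sort candidate matches so the most important, most confident hits
    reach the compliance officer first."""
    return sorted(matches,
                  key=lambda m: (IMPORTANCE.get(m["category"], 0.0), m["likelihood"]),
                  reverse=True)

queue = prioritize([
    {"name": "A", "likelihood": 0.92, "category": "high_risk"},
    {"name": "B", "likelihood": 0.92, "category": "sanctions"},
])
print([m["name"] for m in queue])   # ['B', 'A'] - same likelihood, higher importance first
```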

Enhanced Name and Address Matching Rules, Algorithms, and Strategies

Truth Technologies is constantly examining the latest name and address matching technologies, strategies, and algorithms, and is always striving to reduce false positives while maintaining a level of integrity that prevents missed matches (false negatives).

Each of these enhancements represents a major step forward for Truth Technologies’ Sentinel product, both in terms of capability and in the scope and size of the effort. Translation engines are plentiful; however, generic text translators are very different from those specific to names. The creation of a name translation engine will require both translation expertise and an entirely new strategy, including new pre-processing, post-processing, and matching rules for each language implemented. The concept of risk is unique to each client due to their specific tolerance and situation. Compliance officers in a place like Switzerland, which is known as a safe depository for assets from those living in more volatile countries, may be more interested in PEPs than a financial services firm in a country like Colombia, which might be more concerned about narco-terrorism. An entirely new configurable risk engine with appropriate weighting will have to be developed. Finally, the on-going tailoring of our existing core technology, the name and address matching rules, algorithms, and strategies, is a non-trivial task, as we are constantly evaluating new advances in the science, getting feedback from the field on our existing engine, and tuning our name and address matching strategies.


Summary
Name and address matching may seem to be a straightforward and rudimentary problem at first
examination; however, the immense complexities quickly emerge. Many of Truth Technologies’
competitors have created filter solutions that grossly underestimate these complexities. The resulting “bolt-on” modules are a poor substitute for a product built by a company with a singular focus and extensive expertise on the KYC problem.

Due to the size and nature of the fines being dispensed by regulators, many industry insiders recognize that compliance has shifted from a “checklist” issue to a “results” issue. Recent fines have involved Wachovia, American Express, Lloyds TSB, and RBS. These fines now reach the nine-digit range and are being applied not just for specific instances of money laundering, but for systemic failures in AML compliance. No longer can financial institutions rely on low-end exact name matching, or even the built-in Soundex features of existing database applications. Increasingly, companies are turning to high-end, purpose-built, highly tuned software to minimize false positives and reduce or eliminate false negatives.

When looking at vendors that provide filtering software, we recommend that you ask the following
questions:

What matching algorithms are being used? Is the vendor even bothering to apply a name matching algorithm, or just doing exact matching? Is the vendor using an off-the-shelf solution
or built-in Soundex algorithm, or have they done the research and invested to develop a tailored
solution? Even if they are applying one or more of the strategies here, many of the algorithms
described aren’t specific to the task at hand and many are used to match any type of text, not
specifically people, place, and entity names. Understand what you are buying.

Does the vendor have an algorithm or a strategy? The heart of filtering software is the name
and address matching algorithm, but is this being employed as part of an overall strategy? For
example, does the vendor utilize nickname substitution? How do they deal with honorifics or
cultural variations? How about noise words such as “Incorporated”, “Company”, and “Limited”
when dealing with entity names? What pre- and post-processing of names and addresses is occurring (such as address standardization)?

Is there a possibility of false negatives? Worse than false positives (which create additional
work for a compliance officer to eliminate), a false negative may mean your institution is doing business with someone on a watch list, which is a time bomb waiting to explode. Revelation
that your institution is doing business with someone on a watch list can result in hundreds of
millions of dollars in penalties as well as civil and criminal charges. Be wary of any company that
claims extremely low false positive rates – they may be doing an exact match between names,
or some other very rudimentary algorithm that results in very few matches, and likely results in
false negatives.


What watch lists are being used? Is the vendor simply comparing against the OFAC list or a
subset of watch lists? Fewer watch lists mean fewer potential matches, but also fewer actual
matches. A compliance officer needs to seriously weigh the risks. A local bank may claim it
doesn’t need to examine foreign watch lists because it only does local business, but can it really
afford to miss a PEP or be sanctioned for a foreign wire transfer?

Is it a complete solution? Does the vendor’s package allow segmentation of data and workflow
management? Regulators are looking for risk-based approaches to KYC. By segmenting
customer data by product line (retail customers vs. brokerage customers, for example), by
geography, by industry, or by other risk criteria and applying different frequencies and tolerance
levels for screening, compliance officers can reduce risk and increase compliance. What is your
strategy and is your vendor flexible to adapt to the way you do business? Or are you expected
to adapt to the way the vendor software does business?

Can your vendor rescreen data? Increasingly, regulators are insisting on the periodic rescreening of all client data. Just because a client was not on a watch list or a politically exposed person at the time they were on-boarded does not mean they have not subsequently been added to a list. The key to any rescreening process is to flag matches only when the underlying data has changed or the match is new. For example, if a match occurs but is marked as a false positive, subsequent rescreenings should recognize this and treat the match differently from a match against a newly added, never-before-matched watch list record. An alternative to periodically rescreening all client data is a method called Continuous Customer Monitoring (CCM). In CCM, the stored client data is screened against updates to the watch list data as they are loaded, and any new matches between the update and the client data result in a notification to the compliance officer. The downside of CCM is that the client data is also changing, so a second screening process is still needed for new and changed client records.
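A minimal sketch of the flag-only-what-changed rescreening logic described above (the data structures are illustrative assumptions, not Sentinel’s data model):

```python
# Sketch of rescreening logic: only surface a hit if it is new or if the
# underlying records have changed since it was last reviewed.
def new_alerts(current_hits, previously_reviewed):
    """current_hits / previously_reviewed map (client_id, watchlist_id) to a
    hash (or version) of the underlying records at screening time."""
    alerts = []
    for key, data_version in current_hits.items():
        if key not in previously_reviewed:
            alerts.append(key)                          # brand-new match
        elif previously_reviewed[key] != data_version:
            alerts.append(key)                          # underlying data changed
        # otherwise: already reviewed (e.g., marked a false positive) - suppress
    return alerts

print(new_alerts({("C1", "W7"): "v2", ("C2", "W9"): "v1"},
                 {("C1", "W7"): "v1"}))   # ('C1','W7') changed, ('C2','W9') new
```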

As can be seen in this white paper, Truth Technologies’ singular focus on KYC technologies, strategies,
and algorithms means we have invested the time and effort to develop, and continue to enhance, the
most advanced and effective filter solution on the market. Truth Technologies’ flagship product, Sentinel, is a leader in filtering strategies, using a proprietary combination of
name and address matching algorithms that builds upon the techniques described in this white paper,
but also greatly expands on them to account for the various complexities and potential problems
specifically associated with client and watch list data.

For more information contact your local Truth Technologies’ Sentinel sales representative or email us at
sales@truthtechnologies.com.


References and Additional Reading
The following publications were used as input in the creation of this white paper and can be used by readers wishing to better understand the complexities of the name and address matching problem.

Christen, Peter. A Comparison of Personal Name Matching: Techniques and Practical Issues, TR-CS-06-
02, Joint Computer Science Technical Report Series, Department of Computer Science, The Australian
National University, September, 2006.

Lait, A. J., and Randell, B., An Assessment of Name Matching Algorithms, Department of Computer Science, University of Newcastle upon Tyne.

Snae, Chakkrit, A Comparison and Analysis of Name Matching Algorithms, World Academy of Science,
Engineering and Technology, November 25, 2007.

Truth Technologies Incorporated, Sentinel Product Brochure, www.TruthTechnologies.com, 2010.

Truth Technologies Incorporated, Sentinel Product Roadmap, www.TruthTechnologies.com, 2010.

Wang, Wen, Ji, Heng, and Grishman, Ralph, Phonetic Name Matching for Cross-Lingual Spoken Sentence Retrieval, New York University.
