
Identification of Confusable Drug Names: A New Approach and Evaluation Methodology

Grzegorz Kondrak
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada, T6G 2E8
kondrak@cs.ualberta.ca

Bonnie Dorr
Institute for Advanced Computer Studies & Department of Computer Science
University of Maryland
College Park, 20742, USA
bonnie@umiacs.umd.edu

Abstract

This paper addresses the mitigation of medical errors due to the confusion of sound-alike and look-alike drug names. Our approach involves the application of two new methods—one based on orthographic similarity ("look-alike") and the other based on phonetic similarity ("sound-alike"). We present a new recall-based evaluation methodology for determining the effectiveness of different similarity measures on drug names. We show that the new orthographic measure (BI-SIM) outperforms other commonly used measures of similarity on a set containing both look-alike and sound-alike pairs, and that the feature-based phonetic approach (ALINE) outperforms orthographic approaches on a test set containing solely sound-alike confusion pairs. However, an approach that combines several different measures achieves the best results on both test sets.

1 Introduction

Many hundreds of drugs have names that either look or sound so much alike that doctors, nurses, and pharmacists can get them confused, dispensing the wrong one in errors that can injure or even kill patients. In the United States alone, an estimated 1.3 million people are injured each year by medication errors, such as administering the wrong dose or the wrong drug (Lazarou et al., 1998).¹ The U.S. Food and Drug Administration has sought to mitigate this threat by ensuring that proposed drug names that are too similar to pre-existing drug names are not approved (Meadows, 2003).

¹ For example, a patient needed an injection of Narcan but instead got the drug Norcuron and went into cardiac arrest.

A number of different lexical similarity measures have been applied to the problem of identifying confusable drug names. Lambert et al. (1999) tested twenty-two distinct methods on a set of drug names extracted from published reports of medication errors. The methods included well-known universal measures, such as edit distance and longest common subsequence; several variations of measures based on counting common letter n-grams; and measures designed specifically for associating phonetically similar names, such as Soundex and Editex. They identified the normalized edit distance, Editex, and a trigram-based measure as the most accurate.

The evaluation methodology of Lambert et al. (1999) involves repeated selection of cut-off thresholds in order to compute precision and recall on a test set that contains an equal number of positive and negative examples of confusable drug name pairs. However, our own experience with systems for automatic detection of potential drug-name confusions suggests that the usual approach is to examine a fixed number of the most similar candidates rather than all candidates with similarity above a certain threshold. Moreover, the number of non-confusable pairs can be expected to greatly exceed the number of confusable pairs.

We present a different method of evaluating the accuracy of a measure. Starting from a set of confusable drug name pairs, we combinatorially induce a much larger set of negative examples. The recall is calculated against an on-line gold standard for each potentially confusable drug name, considering only the top k candidate names returned by a similarity measure. The recall values are then aggregated using the technique of macro-averaging (Salton, 1971).

We formulate a general framework for representing word similarity measures based on n-grams, and propose a new measure of orthographic similarity called BI-SIM that combines the advantages of several known measures. Using our recall-based evaluation methodology, we show that this new measure performs better on a U.S. pharmacopeial gold standard than the measures identified as the most accurate by Lambert et al. (1999).

Some potential drug-name confusions can be attributed solely to high phonetic similarity. Consider the example of Xanax vs. Zantac—two brand names that the Physicians' Desk Reference (PDR) warns may be "mistaken for each other ... lead[ing] to serious medication errors" (24th Ed., 2003).
The phonetic transcription of the two names, [zænæks] and [zæntæk], reveals their sound-alike similarity, which is not apparent in their orthographic form. For the detection of sound-alike confusion pairs, we apply the ALINE phonetic aligner (Kondrak, 2000), which estimates the similarity between two phonetically-transcribed words. We demonstrate that ALINE outperforms orthographic approaches on a test set containing sound-alike confusion pairs.

The next section describes several commonly used measures of word similarity. After this, we present two new methods for identifying look-alike and sound-alike drug names. We then compare the effectiveness of the various measures using our recall-based evaluation methodology on a U.S. pharmacopeial gold standard and on another test set containing sound-alike confusion pairs. We conclude with a discussion of our experimental results.

2 Background

Drug-name matching refers to the process of string matching used to rank the similarity between drug names. There are two classes of string matching: orthographic and phonetic. For each of these, there are two methods of matching: distance and similarity. If two drug names are confusable, their distance should be small and their similarity should be large. Some examples of orthographic and phonetic algorithms for both distance- and similarity-based approaches are shown in Table 1.

                Distance          Similarity
  Orthographic  EDIT, NED         n-GRAM, LCSR
  Phonetic      SOUNDEX, EDITEX   ALINE

Table 1: Classification of word distance and similarity measures.

In the remainder of this section, we describe a number of measures that have been applied to the problem of identifying confusable drug names. Specific examples of values obtained by the measures are provided in Table 2.

  Measure      Zantac/Xanax  Zantac/Contac  Xanax/Contac
  EDIT             3             2              4
  NED              0.500         0.333          0.667
  LCSR             0.500         0.667          0.333
  BIGRAM           0.222         0.600          0.000
  TRIGRAM-2B       0.000         0.333          0.000
  SOUNDEX          3             1              3
  EDITEX           5             2              7
  ALINE            9.542         9.333          8.958
  BI-SIM           0.417         0.583          0.250
  TRI-SIM          0.333         0.500          0.167
  PREFIX           0.000         0.000          0.000

Table 2: Examples of values returned by various measures.

String-edit distance (Wagner and Fischer, 1974) (EDIT), also known as Levenshtein distance, counts the number of steps it takes to transform one string into another, where the cost of substitution is the same as the cost of insertion or deletion. A normalized edit distance (NED) is calculated by dividing the total edit cost by the length of the longer string.

The longest common subsequence ratio (Melamed, 1999) (LCSR) is computed by dividing the length of the longest common subsequence by the length of the longer string. LCSR is closely related to normalized edit distance: if the cost of substitution is at least twice the cost of insertion/deletion and the strings are of equal length, LCSR is equivalent to the normalized edit distance.

In n-gram measures, the number of n-grams that are shared by two strings is doubled and then divided by the total number of n-grams in each string:

    Dice(x, y) = 2 * |n-grams(x) ∩ n-grams(y)| / (|n-grams(x)| + |n-grams(y)|)

where n-grams(x) is the multi-set of letter n-grams in x. This formula is often referred to as the Dice coefficient. A slight variation of this measure is obtained by adding extra symbols, such as spaces, before and/or after each string (Lambert et al., 1999). The modification is designed to increase sensitivity to the beginnings and endings of words. For example, TRIGRAM-2B is calculated by applying the Dice formula with n = 3 after adding two spaces before each string. In this paper, we consider two specific variants: BIGRAM, which is the most basic formulation, and TRIGRAM-2B.²

² TRIGRAM-2B was identified by Lambert et al. (1999) as particularly effective for identifying confusable drug name pairs.

SOUNDEX (Hall and Dowling, 1980) is an approximation to phonetic name matching. SOUNDEX transforms all but the first letter to numeric codes (see Table 3) and, after removing zeroes, truncates the resulting string to 4 characters. For the purposes of comparison, we implemented a SOUNDEX-based similarity measure that returns the edit distance between the corresponding codes.

EDITEX (Zobel and Dart, 1996) is another quasi-phonetic measure that combines edit distance with a letter-grouping scheme similar to SOUNDEX (Table 3). As in SOUNDEX, the codes are designed to identify letters that have similar pronunciations, but the corresponding sets of letters are not disjoint. The edit distance between letters that belong to the same group is smaller than the edit distance between other letters. Additional rules are aimed at eliminating silent and reduplicated letters.
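To make the measures above concrete, here is a short illustrative reimplementation in Python (our sketch, not code from the paper; the function names are ours). It reproduces the EDIT, NED, LCSR, and BIGRAM rows of Table 2, plus the SOUNDEX-code comparison, for which we assume the standard convention of zero-padding codes to four characters.

```python
from collections import Counter

def edit_distance(x, y):
    # EDIT: Levenshtein distance, with substitution costing the same
    # as insertion or deletion.
    d = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        prev, d[0] = d[0], i
        for j, cy in enumerate(y, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (cx != cy))
    return d[len(y)]

def ned(x, y):
    # NED: total edit cost divided by the length of the longer string.
    return edit_distance(x, y) / max(len(x), len(y))

def lcsr(x, y):
    # LCSR: longest common subsequence over the longer string's length.
    m, n = len(x), len(y)
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t[m][n] / max(m, n)

def dice_bigram(x, y):
    # BIGRAM: Dice coefficient over multi-sets of letter bigrams.
    bx = Counter(x[i:i + 2] for i in range(len(x) - 1))
    by = Counter(y[i:i + 2] for i in range(len(y) - 1))
    return 2 * sum((bx & by).values()) / (sum(bx.values()) + sum(by.values()))

# SOUNDEX letter groups (the SOUNDEX column of Table 3).
SOUNDEX_CODES = {"aehiouwy": "0", "bfpv": "1", "cgjkqsxz": "2",
                 "dt": "3", "l": "4", "mn": "5", "r": "6"}

def soundex(name):
    # Keep the first letter, map the rest to digits, drop zeroes, and
    # truncate to 4 characters (zero-padding is our assumed convention).
    code = name[0].lower()
    for ch in name[1:].lower():
        digit = next(d for g, d in SOUNDEX_CODES.items() if ch in g)
        if digit != "0":
            code += digit
    return (code + "000")[:4]

def soundex_distance(x, y):
    # The SOUNDEX-based comparison: edit distance between the codes.
    return edit_distance(soundex(x), soundex(y))
```

For Zantac/Xanax these yield EDIT 3, NED 0.500, LCSR 0.500, BIGRAM 0.222, and SOUNDEX distance 3, matching Table 2.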
  Code  SOUNDEX    EDITEX
  0     aehiouwy   aeiouy
  1     bfpv       bp
  2     cgjkqsxz   ckq
  3     dt         dt
  4     l          lr
  5     mn         mn
  6     r          gj
  7                fpv
  8                sxz
  9                csz

Table 3: Character conversion codes in SOUNDEX and EDITEX.

3 Phonetic Similarity: ALINE

The ALINE cognate matching algorithm (Kondrak, 2000) assigns a similarity score to pairs of phonetically-transcribed words on the basis of the decomposition of phonemes into elementary phonetic features. The algorithm was initially designed to identify and align cognates in vocabularies of related languages (e.g. colour and couleur). Nevertheless, thanks to its grounding in universal phonetic principles, the algorithm can be used for estimating the similarity of any pair of words, including drug names. Furthermore, unlike SOUNDEX and EDITEX, ALINE is completely language-independent.

The principal component of ALINE is a function that calculates the similarity of two phonemes expressed in terms of about a dozen binary or multi-valued phonetic features (Place, Manner, Voice, etc.). Feature values are encoded as floating-point numbers in the range [0, 1]. For example, the feature Manner can take any of the following seven values: stop = 1.0, affricate = 0.9, fricative = 0.8, approximant = 0.6, high vowel = 0.4, mid vowel = 0.2, and low vowel = 0.0. The numerical values reflect the distances between vocal organs during speech production. The phonetic features are assigned salience weights that express their relative importance.

The overall similarity score and optimal alignment of two words—computed by a dynamic programming algorithm (Wagner and Fischer, 1974)—is the sum of the individual similarity scores between pairs of phonemes. A constant insertion/deletion penalty is applied for each unaligned phoneme. Another constant penalty is set to reduce the relative importance of vowel—as opposed to consonant—phoneme matches. The similarity value is normalized by the length of the longer word.

ALINE's behavior is controlled by a number of parameters: the maximum phonemic score, the insertion/deletion penalty, the vowel penalty, and the feature salience weights. The parameters have default settings for the cognate matching task, but these settings may not be appropriate for drug-name matching. The settings can be optimized (tuned) on a training set that includes positive and negative examples of confusable name pairs.

4 Orthographic Similarity: BI-SIM

An analysis of the reasons behind the unsatisfactory performance of commonly used measures led us to propose a new measure of orthographic similarity: BI-SIM.³ Below, we describe the inherent strengths and weaknesses of n-gram and subsequence-based approaches. Next, we present a new, generalized framework that characterizes a number of commonly used similarity measures. Following this, we describe the parametric settings for BI-SIM—a specific instantiation of this generalized framework.

³ BI-SIM was developed before we conducted the experiments described in Section 6.

4.1 Problems with Commonly Used Measures

The Dice coefficient computed for bigrams (BIGRAM) is an example of a measure that is demonstrably inappropriate for estimating word similarity. Because it is based exclusively on complete bigrams, it often fails to discover any similarity between words that look very much alike. For example, it returns zero on the pair Verelan/Virilon. In addition, it violates a desirable requirement of any similarity measure: that the maximum similarity of 1 should result only when comparing identical words. In particular, non-identical pairs⁴ like Xanex/Nexan—where all bigrams are shared—are assigned a similarity value of 1. Moreover, it sometimes associates bigrams that occur in radically different word positions, as in the pair Voltaren/Tramadol. Finally, the initial segment, which is arguably the most important in determining drug-name confusability,⁵ is actually given a lower weight than other segments because it participates in only one bigram. It is therefore surprising that BIGRAM has been such a popular choice of measure for computing word similarity.

⁴ This observation is due to Ukkonen (1992).
⁵ 74.2% of the confusable pairs in the pharmacopeial gold standard (Section 6) have identical initial segments.

LCSR is more appropriate for identifying potential drug-name confusability because it does not rely on (frequently imprecise) bigram matching.
However, LCSR is weak in its tendency to posit non-intuitive links, such as the ones between segments in Benadryl/Cardura. The fact that it returns the same value for both Amaryl/Amikin and Amaryl/Altoce can be attributed to its lack of context sensitivity.

4.2 A Generalized n-gram Measure

Although it may not be immediately apparent, LCSR can be viewed as a variant of the n-gram approach. If n is set to 1, the Dice coefficient formula returns the number of shared letters divided by the average length of the two strings. Let us call this measure UNIGRAM. The main difference between LCSR and UNIGRAM is that the former obeys the no-crossing-links constraint, which stipulates that the matched unigrams must form a subsequence of both of the compared strings, whereas the latter disregards the order of unigrams. E.g., for pat/tap, LCSR returns 0.33 because the length of the longest common subsequence is 1, while UNIGRAM returns 1.0 because all letters are shared. The other, minor difference is that the denominator of LCSR is the length of the longer string, as opposed to the average length of the two strings in UNIGRAM. (In fact, LCSR is sometimes defined with the average length in the denominator.)

We define a generalized measure based on n-grams with the following parameters:

1. The value of n.

2. The presence or absence of the no-crossing-links constraint.

3. The number of segments appended to the beginning and the end of the strings.

4. The length normalization factor: either the maximum or the average length of the strings.

A number of commonly used similarity measures can be expressed in the above framework. The combination of n = 1 with the no-crossing-links constraint produces LCSR. By selecting n = 2 and the average normalization factor, we obtain the BIGRAM measure. Thirteen of the twenty-two measures tested by Lambert et al. (1999) are variants that combine either n = 2 or n = 3 with various lengths of appended segments.

So far, we have assumed that there are only two possible values of n-gram similarity: identical or non-identical. This need not be the case. Obviously, some non-identical n-grams are more similar than others. We can define a similarity scale for two n-grams as the number of identical segments in the corresponding positions divided by n:

    s(u, v) = (1/n) * sum_{i=1..n} id(u_i, v_i)

where id(a, b) returns 1 if a and b are identical, and 0 otherwise. The scale distinguishes n + 1 levels of similarity, including 1 for identical bigrams and 0 for completely distinct bigrams.⁶

⁶ The scale could be further refined to include more levels of similarity. For example, bigrams that are frequently confused because of their typographic or cursive shape, such as en/im, could be assigned a similarity value that corresponds to the frequency of their confusions.

The notion of a similarity scale between n-grams requires clarification in the case of n-grams partially composed of segments appended to the beginning or end of strings. Normally, extra affixes are composed of one or more copies of a unique special symbol, such as a space, that does not belong to the string alphabet. We define an alphabet of special symbols that contains a unique symbol for each of the symbols in the original string alphabet. The extra affixes are assumed to contain copies of the special symbols that correspond to the initial symbol of the string. In this way, the similarity between pairs of n-grams in which one or both of the n-grams overlap with an extra affix is guaranteed to be either 0 or 1.

4.3 BI-SIM

We propose a new measure of orthographic similarity, called BI-SIM, that aims at combining the advantages of the context inherent in bigrams, the precision of unigrams, and the strength of the no-crossing-links constraint. BI-SIM belongs to the class of n-gram measures defined above. Its parameters are: n = 2, the no-crossing-links constraint enforced, a single segment appended to the beginning of the string, normalization by the length of the longer string, and multi-valued n-gram similarity. The rationale behind these specific settings is as follows. n = 2 is the minimum value that provides context for matching segments within a string. The no-crossing-links constraint guarantees the sequentiality of segment matches. The segment added to the beginning increases the importance of a match of the initial segment. The normalization method favors associations between words of similar length. Finally, the refined n-gram similarity scale increases the resolution of the measure.

BI-SIM is defined by the following recurrence:

    d(i, j) = max { d(i-1, j-1) + s(x_{i-1}x_i, y_{j-1}y_j),  d(i-1, j),  d(i, j-1) }

where s refers to the n-gram similarity scale defined in Section 4.2, and x_0 and y_0 are the appended segments; d(i, j) is defined to be 0 if i = 0 or j = 0. BI-SIM(x, y) is then d(|x|, |y|) divided by the length of the longer string.
The recurrence relation exhibits a strong similarity to the relation for computing the longest common subsequence, except that the subsequence is composed of bigrams rather than unigrams, and the bigrams are weighted according to their similarity. Assuming that the segments appended to the beginning of each string are chosen according to the rule specified in Section 4.2, the value returned by BI-SIM always falls in the interval [0, 1]. In particular, it returns 1 if and only if the strings are identical, and 0 if and only if the strings have no segments in common.

BI-SIM can be seen as a generalization of LCSR: the setting n = 1 reduces BI-SIM to LCSR (which could also be called UNI-SIM). On the other hand, the setting n = 3 yields TRI-SIM. TRI-SIM requires two extra symbols at the beginning of the string.

5 Evaluation Methodology

We designed a new method for evaluating the accuracy of a measure. For each drug name, we sort all the other drug names in the test set in order of decreasing similarity. We calculate the recall by dividing the number of true positives among the top k names by the total number of true positives for this particular drug name, i.e., the fraction of the confusable names that are discovered by taking the top k most similar names. At the end, we apply an information-retrieval technique called macro-averaging (Salton, 1971), which averages the recall values across all drug names in the test set.⁷

⁷ We could have also chosen to micro-average the recall values by dividing the total number of true positives discovered among the top k candidates by the total number of true positives in the test set. The choice of macro-averaging over micro-averaging does not affect the relative ordering of similarity measures implied by our results.

Because there is a trade-off between recall and the threshold, it is important to measure the recall at different values of k. Table 4 shows the top 8 names that are most similar to Toradol according to the BI-SIM similarity measure. A '+'/'–' mark indicates whether the pair is a true confusion pair. The pairs are listed in rank order, according to the score assigned by the indicated algorithm; names with the same similarity value are listed in reverse lexicographic order. Since the test set contains four drug names that have been identified as confusable with Toradol (Tramadol, Torecan, Tegretol, and Inderal), the recall values are 0.25 for k ≤ 2, 0.50 for 3 ≤ k ≤ 6, and 0.75 for k = 7 and 8.

       Name       Score    +/–  Recall
  1.   Tramadol   0.6875    +   0.25
  2.   Tobradex   0.6250    –   0.25
  3.   Torecan    0.5714    +   0.50
  4.   Stadol     0.5714    –   0.50
  5.   Torsemide  0.5000    –   0.50
  6.   Theraflu   0.5000    –   0.50
  7.   Tegretol   0.5000    +   0.75
  8.   Taxol      0.5000    –   0.75

Table 4: The top 8 names most similar to Toradol according to the BI-SIM similarity measure, and the corresponding recall values.

6 Experiments and Results

We conducted two experiments with the goal of evaluating the relative accuracy of several measures of similarity in identifying confusable drug names. The first experiment was performed against an on-line gold standard: the United States Pharmacopeial Convention Quality Review, 2001 (henceforth the USP set). The USP set contains both look-alike and sound-alike confusion pairs. We used 582 unique drug names from this source to combinatorially induce 169,071 possible pairs. Out of these, 399 were true confusion pairs in the gold standard. The maximum number of true positives was 6, but for the majority of names (436 out of 582), only one confusable name is identified in the gold standard. On average, the task was to identify 1.37 true positives among 581 candidate names.

We computed the similarity of each name pair using the following similarity measures: BIGRAM, TRIGRAM-2B, LCSR, EDIT, NED, SOUNDEX, EDITEX, BI-SIM, TRI-SIM, ALINE, and PREFIX. PREFIX is a baseline-type similarity measure that returns the length of the common prefix divided by the length of the longer string. In addition, we calculated the COMBINED measure by taking the simple average of the values returned by PREFIX, EDIT, BI-SIM, and ALINE.

In order to apply ALINE to the USP set, all drug names were transcribed into phonetic symbols. The transcription was approximated by applying a simple set of about thirty regular-expression rules. (It is likely that a more sophisticated transcription method would result in an improvement of ALINE's performance.) In the first experiment, the parameters of ALINE were not optimized; rather, they were set according to the values used for the distinct task of cross-language cognate identification.

In Figure 1, the macro-averaged recall values achieved by several measures on the USP set are plotted against the cut-off k. Some measures have been left out in order to preserve the clarity of the plot. Table 5 contains detailed results at k = 10 and k = 20 for all measures.
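For concreteness, the BI-SIM recurrence of Section 4.3 and the PREFIX baseline translate into a few lines of Python. This is our illustrative reconstruction (not the authors' code): the string-initial special symbol is encoded as a marked copy of the first letter, so two prefix symbols match only when the underlying initial letters match. The sketch reproduces the BI-SIM and PREFIX rows of Table 2.

```python
def bi_sim(x, y):
    # Pad each string with a special symbol tied to its initial letter;
    # two special symbols are identical only if those letters are.
    px = [("*", x[0])] + list(x)
    py = [("*", y[0])] + list(y)

    def s(i, j):
        # Bigram similarity scale: fraction of identical positions.
        return (int(px[i - 1] == py[j - 1]) + int(px[i] == py[j])) / 2.0

    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # No-crossing-links DP over similarity-weighted bigram matches.
            d[i][j] = max(d[i - 1][j - 1] + s(i, j), d[i - 1][j], d[i][j - 1])
    # Normalize by the length of the longer string.
    return d[m][n] / max(m, n)

def prefix_sim(x, y):
    # PREFIX baseline: common-prefix length over the longer string.
    p = 0
    while p < min(len(x), len(y)) and x[p] == y[p]:
        p += 1
    return p / max(len(x), len(y))
```

For example, bi_sim("zantac", "xanax") evaluates to 2.5/6 ≈ 0.417, the value shown in Table 2; setting the n-gram size to 1 or 3 in the same scheme would yield LCSR and TRI-SIM, respectively.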
Figure 1: Recall at various thresholds for the USP test set.

Figure 2: Recall at various thresholds for the sound-alike test set.
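The recall-at-k computation of Section 5 is easy to state in code. The sketch below is illustrative (the function names are ours), using the Toradol ranking of Table 4 as a worked example: with four gold-standard confusables, one of which (Inderal) falls outside the top 8, the recall is 0.25 at k = 1, 0.50 at k = 3, and 0.75 at k = 7.

```python
def recall_at_k(ranked_flags, total_positives, k):
    # ranked_flags marks, down the ranked candidate list, which
    # candidates are true confusables of the query name.
    return sum(ranked_flags[:k]) / total_positives

def macro_average(recalls):
    # Macro-averaging (Salton, 1971): average per-name recall values.
    return sum(recalls) / len(recalls)

# Table 4 ranking for Toradol ('+' rows are True); 4 true positives in all.
toradol = [True, False, True, False, False, False, True, False]
```

A full evaluation run would compute recall_at_k for every query name in the test set at a fixed k and macro-average the results.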
              USP Set            Phono Set
              top 10   top 20    top 10   top 20
  PREFIX      0.5651   0.6658    0.2981   0.3478
  EDIT        0.7506   0.8130    0.5139   0.6410
  NED         0.7846   0.8489    0.5590   0.6639
  LCSR        0.7375   0.8333    0.4663   0.5769
  BIGRAM      0.6362   0.7148    0.3560   0.4400
  TRIGRAM-2B  0.7335   0.8251    0.4674   0.5355
  SOUNDEX     0.3965   0.4898    0.2331   0.3326
  EDITEX      0.7558   0.8155    0.5864   0.6911
  ALINE       0.7503   0.8303    0.5825   0.6873
  BI-SIM      0.8220   0.8927    0.4838   0.6590
  TRI-SIM     0.8324   0.8946    0.4782   0.6245
  COMBINED    0.8560   0.9137    0.6462   0.7737

Table 5: Recall at k = 10 and k = 20 for both the USP and the sound-alike test sets.

Since the USP set contains both look-alike and sound-alike name pairs, we conducted a second experiment to compare the performance of the various measures on sound-alike pairs only. We used a proprietary list of 276 drug names identified by experts as "names of concern" for 83 "consult" names. None of the "consult" names and only about 25% of the "names of concern" are in the USP set, i.e., there are no true positive pairs shared between the two sets. The maximum number of true positives was 11, while the average for all names was 3.33.

The measures were applied to calculate the similarity between each of the 83 "consult" names and a list of 2596 drug names. The results are shown in Figure 2. Since the task, which involved identifying, on average, 3.33 true positives among 2596 candidates, was more challenging, the recall values are lower than in Figure 1. All drug names were first converted into a phonetic notation by means of a set of regular-expression rules. (We found that phonetic transcription led to a slight improvement in the recall values achieved by the orthographic measures.) The parameters of ALINE used in this experiment were optimized beforehand on the USP set.

7 Discussion

The results described in Section 6 clearly indicate that BI-SIM and TRI-SIM, the newly proposed measures of similarity, outperform several currently used measures on the USP test set regardless of the choice of the cutoff parameter k. However, a simple combination of several measures achieves even higher accuracy. On the sound-alike confusion set, EDITEX and ALINE are the most effective. The accuracy achieved by the best measures is impressive: for the combined measure, the average recall on the USP set exceeds 90% with only the top 15 candidates considered.

The USP test set has its limitations. The set includes pairs that are considered confusable for reasons other than just phonetic or orthographic similarity, including illegible handwriting, incomplete knowledge of drug names, newly available products, similar packaging or labeling, and incorrect selection of a similar name from a computerized product list. In many cases, the names do not sound or look alike, but when handwritten or communicated verbally, these names have caused or could cause a mix-up. On the other hand, many clearly confusable name pairs are not identified as such (e.g. Erythromycin/Erythrocin, Neosar/Neoral, Lorazepam/Flurazepam, Erex/Eurax/Urex, etc.).

All similarity measures have their own strengths and weaknesses.
n-GRAM is effective at recognizing pairs such as Chlorpromazine/Prochlorperazine, where a shorter name closely matches parts of the longer name. However, this advantage is offset by its poor performance on similar-sounding names with few shared bigrams (Nasarel/Nizoral). LCSR is able to identify pairs where common subsequences are interleaved with dissimilar segments, such as Asparaginase/Pegaspargase, but fails on similar-sounding names where the overlap of identical segments is minimal (Luride/Lortab). ALINE detects phonetic similarity even when it is obscured by the orthography (e.g. Xanax/Zantac), but phonetic transcription is required beforehand.

The idiosyncrasies of the individual measures are attenuated when they are combined, which may explain the excellent performance of the combined measure. Each measure focuses on a particular facet of string similarity: initial segments in PREFIX, phonetic sound-alike quality in ALINE, common clusters in the bigram-based measures, overall transformability in EDIT, etc. For this reason, a synergistic blend of several measures achieves higher accuracy than any of its components.

Our experiments confirm that orthographic approaches are superior to their phonetic counterparts in tasks involving string matching (Zobel and Dart, 1995). Nevertheless, phonetic approaches identify many sound-alike names that are beyond the reach of orthographic approaches. In applications where the gap between spelling and pronunciation plays an important role, it is advisable to employ phonetic approaches as well. The two most effective ones are EDITEX and ALINE, but whereas ALINE is language-independent, EDITEX incorporates English-specific letter groups and rules.

8 Conclusion

We have investigated the problem of identifying confusable drug name pairs. The effectiveness of several word similarity measures was evaluated using a new recall-based evaluation methodology. We have proposed a new measure of orthographic similarity that outperforms several commonly used similarity measures when tested on a publicly available list of confusable drug names. On a test set containing solely sound-alike confusion pairs, the phonetic approaches ALINE and EDITEX achieve the best results. Our results suggest that a linear combination of several measures benefits from the strengths of its components, and is likely to outperform any individual measure. Such a combined approach has the potential to provide the basis for the automatic minimization of medication errors.

The task of computing similarity between words is also important in other contexts. When an entered name does not exist in a bibliographic database, it is desirable to retrieve names that sound similar. Information retrieval systems may need to expand a search when a typed query contains errors or variations in spelling. The related task of cognate identification arises in statistical machine translation. The techniques discussed in this paper may also be applicable in those areas.

References

PDR 24th Ed. 2003. Physicians' Desk Reference for Nonprescription Drugs and Dietary Supplements. Thomson PDR, New York, NY.

Patrick A. V. Hall and Geoff R. Dowling. 1980. Approximate string matching. Computing Surveys, 12(4):381–402.

Grzegorz Kondrak. 2000. A New Algorithm for the Alignment of Phonetic Sequences. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics, pages 288–295, Seattle, WA.

Bruce L. Lambert, Swu-Jane Lin, Ken-Yu Chang, and Sanjay K. Gandhi. 1999. Similarity As a Risk Factor in Drug-Name Confusion Errors: The Look-Alike (Orthographic) and Sound-Alike (Phonetic) Model. Medical Care, 37(12):1214–1225.

J. Lazarou, B. H. Pomeranz, and P. N. Corey. 1998. Incidence of Adverse Drug Reactions in Hospitalized Patients. Journal of the American Medical Association, 279:1200–1205.

Michelle Meadows. 2003. Strategies to Reduce Medication Errors. U.S. Food and Drug Administration Consumer Magazine, May-June.

I. Dan Melamed. 1999. Bitext maps and alignment via pattern recognition. Computational Linguistics, 25(1):107–130.

Gerard Salton. 1971. The Smart System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ.

Esko Ukkonen. 1992. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92:191–211.

Robert A. Wagner and Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM, 21(1):168–173.

Justin Zobel and Philip W. Dart. 1995. Finding approximate matches in large lexicons. Software—Practice and Experience, 25(3):331–345.

Justin Zobel and Philip Dart. 1996. Phonetic string matching: Lessons from information retrieval. In Proceedings of the 19th International Conference on Research and Development in Information Retrieval, pages 166–172.