
Enterprise Master Patient Index

Healthcare data is captured in many different settings such as hospitals, clinics, labs, and
physician offices. According to a report by the CDC, patients in the United States made an
estimated 1.1 billion visits to physician offices, hospital outpatient departments, and hospital
emergency departments in 2006 alone, which corresponds to a rate of about four visits per patient.
In addition to the large volume of visit data generated on an annual basis, the data is also
distributed across different healthcare settings as patients visit hospitals, primary and specialty
physicians, and move across the country. In order to put together the longitudinal health record of
a patient, all of their data needs to be integrated accurately and efficiently despite the fact that the
data is captured using disparate and heterogeneous systems.
The heterogeneity of the systems used to capture patient data across different healthcare
organizations causes patient records to have multiple unrelated patient identifiers assigned to
them, and a given patient may even have multiple identifiers assigned within a single institution.
The lack of precise standards on the format of patient identifying and patient demographic data
results in incomplete data sharing among healthcare professionals, patients, and data repositories.
In addition to the syntactic heterogeneity of the data, the data capture process is often neither
carefully controlled for quality nor defined in a common way across different data sources,
resulting in unavoidable and all too common data entry errors which further exacerbate the
inconsistency of the data. Common data management design issues include lack of normalization
or de-normalization and missing integrity constraints, whereas improper data handling results in
wrong or missing data or uncontrolled data duplication.
In order to integrate all this healthcare data, the various patient identifiers assigned to a given
patient, either at different institutions or erroneously by a single institution, must be linked
together despite the presence of syntactic and semantic differences in the associated demographic
data captured for the patient. This problem has been known for more than five decades as the
record linkage or the record matching problem. The goal in the record matching problem is to
identify records that refer to the same real world entity, even if the records do not match
completely. If each record carried a unique, universal, and error-free identification code, the only
problem would be to find an optimal search sequence that would minimize the total number of
record comparisons. The syntactic and semantic differences between the data sources as well as
the data entry errors introduced during capture, coupled with poor quality control measures, result
in the need to use identification codes that are neither unique nor error-free. The following
example illustrates a situation where, even though two records refer to the same patient, semantic
and syntactic variations between the records mean that the task of matching them cannot be easily
automated; only a sophisticated record matching algorithm could automatically make the decision
on whether to link the identifiers assigned to the two patient records below.

Name               Address                  Age
Javier Martinez    49 E. Applecross Road    33
Havier Marteenez   49 Applecross Road       36


Sophisticated record matching algorithms approach this complex problem by
decomposing the overall process into three tasks: the data preparation phase, the searching
phase and the matching phase. The data preparation phase is a pre-processing phase which
parses and transforms the data in an effort to remove the syntactic and semantic
heterogeneity of the patient demographic data. In the absence of unique patient identifiers,
the matching algorithms must use patient demographic data as the matching variables, so in
the data preparation phase the individual fields are transformed to conform to the data types
of their corresponding domains. In the searching phase, the algorithms search through the
data to identify candidates for potential matches. The brute-force approach of an exhaustive
search across the Cartesian product of the data sets is of quadratic order with respect to the
number of records, so it is not feasible for linking large data sets.
For example, attempting to link two data sets with only 10,000 records each using an
exhaustive search would require 100 million comparisons. To reduce the total number of
comparisons, blocking algorithms are used which partition the full Cartesian product of
possible record pairs into smaller subsets. The topic of blocking algorithms is addressed in
more detail in a separate section.
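
To make the data preparation phase concrete, the following Python sketch shows the kind of
field-level standardization such a phase might perform. The field names, accepted date formats,
and gender codes are illustrative assumptions, not the behavior of any particular system.

```python
import re
from datetime import datetime

def standardize_record(raw):
    """Illustrative field-level standardization of one patient record."""
    record = {}
    # Upper-case names and strip punctuation so that values conform to a
    # common format before matching.
    record["last_name"] = re.sub(r"[^A-Z ]", "", raw.get("last_name", "").upper()).strip()
    record["first_name"] = re.sub(r"[^A-Z ]", "", raw.get("first_name", "").upper()).strip()

    # Accept a few common date formats and normalize to YYYY-MM-DD.
    dob = None
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y"):
        try:
            dob = datetime.strptime(raw.get("dob", ""), fmt).date().isoformat()
            break
        except ValueError:
            continue
    record["dob"] = dob  # None if the value could not be parsed

    # Map free-text gender values onto a single-letter code.
    gender_map = {"M": "M", "MALE": "M", "F": "F", "FEMALE": "F"}
    record["gender"] = gender_map.get(raw.get("gender", "").strip().upper())
    return record

print(standardize_record({"last_name": "Marteenez,", "first_name": "Havier",
                          "dob": "3/14/1975", "gender": "male"}))
```
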
In the matching phase, the record pairs identified during the searching phase are
compared to identify matches. Typically, a subset of the patient’s demographic attributes,
which are referred to as the matching variables, is used to identify matches. The
corresponding matching variables for each pair of records are compared with one another
forming the comparison vector. The matching decision must determine whether a pair of
records should be regarded as linked, not linked or possibly linked, depending upon the
various agreements or disagreements of items of identifying information. The specification of
a record linking procedure requires both a method for measuring closeness of agreement
between records as well as an algorithm that uses this measure for deciding when to classify
records as matches. For example, if we are linking person records, a possible measurement
would be to compare the family names of the two records and assign the value of “1” to those
pairs where there is absolute agreement and “0” to those pairs where there is absolute
disagreement. Values between 0 and 1 are frequently used to indicate how close the two field
values are. Matching algorithms are classified as deterministic
or probabilistic. The section on matching algorithms defines these two classifications in
detail and describes the advantages and disadvantages of each approach.
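
As a simple illustration of how a comparison vector might be formed, the following sketch scores
each matching variable with a value between 0 and 1 using a generic string similarity from
Python's standard library. The chosen matching variables and the use of difflib are assumptions
made only for this example.

```python
from difflib import SequenceMatcher

# Matching variables assumed for this illustration.
MATCHING_VARIABLES = ["first_name", "last_name", "dob"]

def field_similarity(a, b):
    """Return 1 for absolute agreement, 0 for absolute disagreement (or a
    missing value), and an intermediate value for partial agreement."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()

def comparison_vector(rec_a, rec_b):
    """Compare the matching variables of a record pair, producing one
    similarity value per field."""
    return [field_similarity(rec_a.get(f), rec_b.get(f)) for f in MATCHING_VARIABLES]

pair = ({"first_name": "JAVIER", "last_name": "MARTINEZ", "dob": "1975-03-14"},
        {"first_name": "HAVIER", "last_name": "MARTEENEZ", "dob": "1975-03-14"})
print(comparison_vector(*pair))  # roughly [0.83, 0.82, 1.0]
```
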

Blocking Algorithms
In most record linkage cases, the two datasets that need to be integrated do not possess
a unique identifier, so a collection of record attributes must be used to match records, and
record pairs from the two datasets need to be compared to one another to identify
matching records. For example, a typical scenario is to use the first name, last name,
gender and birth date of each record as the identifying attributes for a distinct patient
record. The next step is to compare record pairs from the two data sources and apply a
matching algorithm to detect whether the two records of a pair match or not. The naïve
approach of comparing every possible record pair from the two datasets is of quadratic
complexity and is thereby computationally infeasible for large sets. As was mentioned
previously, comparing all the record pairs in an effort to match two data sets with 10,000
records each, requires 100 million comparisons, making it prohibitively expensive. In
addition to the computational cost, comparing this many records increases the demands on
the memory subsystem to the point where the data set cannot all be kept in memory, so the
algorithm needs to page records into memory as needed, which complicates the
implementation. To reduce the huge number of possible record pair comparisons, record
linkage systems employ “blocking” algorithms.
The most commonly used blocking algorithm uses one or more record attributes,
referred to as the blocking key. Examples of blocking keys include the first four
characters of the last name, or the zip code attribute combined with an age category.
The attributes of the blocking key are used to split the datasets into blocks. Then only
records that share the same blocking key value, and therefore fall in the same block, need
to be compared for
possible matches. There is a cost-benefit trade-off to be considered in choosing the
blocking keys. If the number of records in each block is very large, as would be the case
when using the gender as the only blocking attribute, then more record pairs than
necessary will be generated. If the number of records in each block is very small, many
potential record matches will be missed since the matching algorithm will only compare
records that fall in the same block. An example of this situation would be using the SSN
as the only attribute in the blocking key. In this case only records with the same SSN
would end up in the same block, which means that records with incorrect or missing
values in the SSN field would never be compared and therefore could never match.
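
A minimal sketch of standard blocking is shown below; the choice of blocking key (the first
four characters of the last name combined with the year of birth) is an assumption made only
for illustration.

```python
from collections import defaultdict
from itertools import product

def blocking_key(record):
    """Illustrative blocking key: first four characters of the last name
    plus the year of birth."""
    return (record.get("last_name", "")[:4], record.get("dob", "")[:4])

def candidate_pairs(dataset_a, dataset_b):
    """Partition both datasets into blocks by the blocking key and generate
    candidate record pairs only within blocks that share the same key."""
    blocks_a, blocks_b = defaultdict(list), defaultdict(list)
    for rec in dataset_a:
        blocks_a[blocking_key(rec)].append(rec)
    for rec in dataset_b:
        blocks_b[blocking_key(rec)].append(rec)
    for key, records_a in blocks_a.items():
        for pair in product(records_a, blocks_b.get(key, [])):
            yield pair
```

Only the pairs produced by the generator are passed on to the matching phase.
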
A conflicting goal for a blocking algorithm is that while attempting to reduce the
number of record pairs that need to be compared, the algorithm must not erroneously
assign potentially matching records to two different blocks, because then they will never
be compared with one another. It is preferable to use the least error-prone attributes in the
blocking key and as a further safeguard, multiple passes of blocking are sometimes used
with different blocking keys in each phase so that potential matches that are missed with
a certain blocking key will be considered for matching in a subsequent phase.
A number of more advanced blocking algorithms have been proposed over the years
that aim to achieve a reduction in the number of record pairs that need to be compared
while reducing the errors in separating potential matches. The sorted neighborhood
blocking algorithm partitions each dataset using the blocking key, sorts the records using
the blocking key values and then moves a sliding window across the records. Records
within the window are then paired with each other and included in the candidate record
pair list. For a window of size w and a total number of records of n, this limits the
number of possible record pair comparisons for each record to 2w-1 and the total number
of generated record pairs to O(wn).
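
A minimal sketch of the sliding-window pairing step is shown below, assuming the records
from both sources have been merged into a single list; the key function and window size are
illustrative.

```python
def sorted_neighborhood_pairs(records, key, window):
    """Sort the records on the blocking key and slide a fixed-size window
    over the sorted list, pairing each record only with records that fall
    inside the same window."""
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        # Pair the current record with the next (window - 1) records.
        for other in ordered[i + 1:i + window]:
            yield rec, other

# Example usage with a window of 3:
# pairs = sorted_neighborhood_pairs(
#     all_records, key=lambda r: (r["last_name"], r["first_name"]), window=3)
```
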
Another blocking algorithm is the Bigram Indexing method. The Bigram Indexing
method converts each blocking key value into a list of bigrams and generates one block
for each bigram value in the list. It then assigns each record to all of the blocks labeled
with one of the bigrams of its blocking key value.
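
A sketch of this idea is shown below: each record is indexed under every bigram of its
blocking key value, so records whose keys share at least one bigram end up in a common
block. The key function is an assumption for illustration.

```python
from collections import defaultdict

def bigrams(value):
    """Split a blocking key value into its list of bigrams."""
    return [value[i:i + 2] for i in range(len(value) - 1)]

def bigram_index(records, key):
    """Build one block per bigram and assign each record to every block
    labeled with a bigram of its blocking key value."""
    blocks = defaultdict(list)
    for rec in records:
        for bg in set(bigrams(key(rec))):
            blocks[bg].append(rec)
    return blocks

# e.g. blocks = bigram_index(records, key=lambda r: r["last_name"][:4])
```
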
The last advanced blocking algorithm we describe here is the Canopy Clustering
algorithm, which forms clusters by choosing a record at random and putting in its canopy
all the records within a certain loose threshold distance of it. The record chosen at random
and any records within a certain tight threshold distance of it are then removed from the candidate
set of records. Figure 1 illustrates an example of four data clusters and the canopies that
cover them. Points belonging to the same cluster are colored in the same shade of gray.
Point A was selected at random and forms a canopy consisting of all points within the
outer (solid) threshold. Points inside the inner (dashed) threshold are excluded from
becoming the centers of new canopies. Canopies for B, C, D, and E were formed
similarly to A. Note that the optimality condition holds: for each cluster there exists at
least one canopy that completely contains that cluster. Note also that while there is some
overlap, there are many points excluded by each canopy. Expensive distance
measurements will only be made between pairs of points in the same canopies, far fewer
than all possible pairs in the data set [MCCAL]. The number of record pair comparisons
resulting from canopy clustering is O(fn²/c), where n is the number of records in each of
the two data sets, c is the number of canopies, and f is the average number of canopies a
record belongs to. The threshold parameter should be set so that f is small and c is large,
in order to reduce the amount of computation.
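
A minimal sketch of the canopy formation step is shown below, assuming the caller supplies a
cheap distance function and loose/tight thresholds (with tight ≤ loose); it is an illustration
rather than a production implementation.

```python
import random

def canopy_blocks(records, distance, loose, tight):
    """Repeatedly pick a random record as a canopy center, add every
    remaining record within the loose threshold to its canopy, and remove
    the center and all records within the tight threshold from the
    candidate set."""
    candidates = list(records)
    canopies = []
    while candidates:
        center = random.choice(candidates)
        canopies.append([r for r in candidates if distance(center, r) <= loose])
        candidates = [r for r in candidates if distance(center, r) > tight]
    return canopies

# The expensive record pair comparisons are then made only between records
# that share at least one canopy.
```
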

Figure 1 Illustration of the Canopy Clustering Blocking Algorithm

Matching Algorithms
The subject of record linkage has received active attention from the research community
for more than fifty years, and during that time many algorithms have been proposed. At
the same time, no algorithm has emerged as the most appropriate for every possible
population of patient demographic data. This plethora of matching algorithms can be
classified into one of two major categories: deterministic or probabilistic algorithms.
Deterministic (or exact or all-or-none) algorithms employ a set of rules based on
exact agreement/disagreement results between corresponding fields in potential record
pairs. Deterministic algorithms are the most appropriate choice when records on both
sources of data that need to be integrated contain a variable or characteristic of an
individual that is ideally (i) universally available, (ii) fixed, (iii) easily recorded, (iv)
unique to each individual, and (v) readily verifiable. Few, if any, variables meet all these
requirements although some come close enough to be useable. The advantage of
deterministic algorithms is that they are easy to implement but their disadvantage is that
they only produce accurate matching results after careful analysis and extensive
preprocessing of the data sets that need to be matched.
In one of the few published studies of the application of deterministic matching
algorithms, the authors describe in detail the process that they employed in order to
achieve accurate matching results using a deterministic algorithm. Their objective was to
match records from two hospital systems’ patient registries with the Social Security
Death Master File. The source data was preprocessed using field-specific approaches
such as encoding names using phonetic compression algorithms, imputing missing gender
field values by looking up gender-specific first names in Census data, and parsing birth
dates into month, day, and year followed by correction or elimination
of invalid values. They used number theory-based algorithms to detect cases where the
order of the names varied between sources. After the preprocessing phase, the authors
extracted samples from the two data sets and manually reviewed the matches found using
a social security-only matching phase in order to generate a gold standard for measuring
the error rates of linkage variables and for comparing the matching accuracy of various
combinations of these variables. After extensive analysis of various combinations of
matching variables, they concluded that the best matching attributes for their dataset were
the SSN, the first name transformed by the NYSIIS phonetic encoding algorithm, month
of birth and gender. A more recent study that compares a deterministic with a
probabilistic matching algorithm using an empirical evaluation comes to similar
conclusions regarding the advantages and disadvantages of deterministic algorithms.
Deterministic matching algorithms can produce good results if the quality of the source
data is fairly high, the data is preprocessed carefully, and the matching attributes are
selected after a detailed analysis of their performance on small subsets of the data.
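
As an illustration of such an all-or-none rule, in the spirit of the study described above, the
following sketch declares a match only when a handful of fields agree exactly. The field names,
including a first_name_code field assumed to hold a phonetic (e.g., NYSIIS-style) encoding
produced during preprocessing, are hypothetical.

```python
def deterministic_match(rec_a, rec_b):
    """Declare a match only if every rule field is present and agrees
    exactly; any disagreement or missing value means no match."""
    rule_fields = ("ssn", "first_name_code", "birth_month", "gender")
    return all(
        rec_a.get(field) and rec_a.get(field) == rec_b.get(field)
        for field in rule_fields
    )
```
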
Probabilistic algorithms have gained popularity due to their applicability in most
common scenarios where the data sets that need to be integrated do not possess one or
more attributes with the characteristics described earlier. Most data sets have attributes
whose values contain errors and omissions, and they typically do not possess a unique,
universally available, high-quality identifier. Probabilistic record matching relies on
calculating scores to determine whether two records are a match, and those underlying
scores are based on probabilities. A set of record attributes is first selected
after analyzing the data set and those attributes form what is referred to as the match key.
Candidate record pairs are then compared to one another by using a distance metric to
calculate the similarity between corresponding fields of the match key. The values of the
distance metric for each record form a comparison vector. The probability of each
comparison vector is then evaluated to determine whether the two records are “close”
enough to each other to be classified as a match. The probabilistic approach to record
linkage originated with the work of geneticist Howard Newcombe who introduced odds
ratios of frequencies and the decision rules for delineating matches and non-matches.
Fellegi and Sunter then organized Newcombe’s ideas into a rigorous mathematical
framework which formalized the record linkage problem in probabilistic terms.
Before presenting the rigorous formulation of the record linkage problem, some
notation needs to be introduced. The assumption is that there are two sets of records that
need to be linked to one another. The two sets of records are denoted as A and B and a
single record from each set is denoted by lower case letters a and b, respectively. The
set of all possible record pairs consists of the Cartesian product of the two sets

A × B = {(a, b); a ∈ A, b ∈ B}.

The objective of the matching algorithm is to partition the set of record pairs into the
two subsets

M = {(a, b) ∈ A × B | a = b}

of matching pairs and

U = {(a, b) ∈ A × B | a ≠ b}
of non-matching pairs. Note that since this is a partition of the set of all record pairs,
M ∪ U = A × B and M ∩ U = ∅. Let's assume that the match key consists of K attributes.
The two records in each record pair are compared with one another, forming a comparison
vector γ of length K. The space of all possible values of γ is Γ:

γ_j = [γ_1, γ_2, ..., γ_i, ..., γ_K]^T

with γ_i ∈ {0, 1}. In particular, γ_ij = 1 indicates that the corresponding values for field i
in the two records of record pair j are close enough, based on the selected distance metric,
to be considered a match, and γ_ij = 0 otherwise. If we assume that the observation of γ is an
event generated from some probability distribution, we can then consider the conditional
probability of observing γ given that the record pair is a match and denote that
distribution as:
m(γ) = P(γ | (a, b) ∈ M) = P(γ | M)
as well as the conditional probability of observing γ given that the record pair is a non-
match, denoted as:
u(γ) = P(γ | (a, b) ∈ U) = P(γ | U)
Given this model, a matching algorithm is simply a mapping or decision rule that, upon
observing the comparison vector γ for record pair (a, b), decides whether it is a matched
pair, labeling the record pair as A_1, or an unmatched pair, labeling it as A_3. There will
be some cases in which neither label can be assigned with sufficient confidence, and those
cases will be labeled as A_2. A linkage rule can now be defined as a mapping from Γ, the
comparison space, onto a set of random decision functions D = {d(γ)} where

d(γ) = {P(A_1 | γ), P(A_2 | γ), P(A_3 | γ)}; γ ∈ Γ

and

∑_{i=1}^{3} P(A_i | γ) = 1.

There are two types of error associated with a linkage rule. The first occurs when an
unmatched record pair is assigned to be a match (also referred to as a Type I error) and
the probability of this error is:
P(A_1 | U) = ∑_{γ ∈ Γ} u(γ) P(A_1 | γ).

The second type of error is when a matched pair is assigned to be a non-match (also
referred to as a Type II error):
P(A_3 | M) = ∑_{γ ∈ Γ} m(γ) P(A_3 | γ).

For fixed values of the false match rate µ and the false non-match rate λ, Fellegi and Sunter
define the optimal linkage rule at levels µ and λ, denoted by L(µ, λ, Γ), as the rule for
which P(A_2 | L) ≤ P(A_2 | L′) over all possible rules L′(µ, λ, Γ). They then define a linkage
rule by first assigning an ordering to all comparison vectors γ such that the corresponding
sequence of ratios m(γ)/u(γ) is monotone increasing, indexing the ordered set {γ} by the
subscript i (i = 1, 2, ..., N_Γ), and writing u_i = u(γ_i) and m_i = m(γ_i), where N_Γ = |Γ|.
They finally prove that if

µ = ∑_{i=1}^{n} u_i,   λ = ∑_{i=n′}^{N_Γ} m_i,   and n < n′,

then L_0(µ, λ, Γ) is the optimal linkage rule at the levels (µ, λ), where the decision rule is
defined as:

d(γ_i) = (1, 0, 0) if 1 ≤ i ≤ n
d(γ_i) = (0, 1, 0) if n < i < n′
d(γ_i) = (0, 0, 1) if n′ ≤ i ≤ N_Γ

The proof for the optimality of the algorithm as well as details on the implementation of
the algorithm can be found in the original paper by Fellegi and Sunter.
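
In practice, the decision rule is often implemented under a conditional independence assumption
by summing per-field log likelihood ratios, log(m_i/u_i) for agreeing fields and
log((1-m_i)/(1-u_i)) for disagreeing fields, and comparing the total weight against an upper and
a lower threshold. The following sketch illustrates this; the field names, the m and u
probabilities, and the thresholds are illustrative values rather than estimates from real data.

```python
import math

# Illustrative per-field m and u probabilities; in a real system these are
# estimated from the data, for example with the EM algorithm.
M_PROB = {"first_name": 0.95, "last_name": 0.95, "dob": 0.99, "gender": 0.98}
U_PROB = {"first_name": 0.01, "last_name": 0.005, "dob": 0.003, "gender": 0.5}

def match_weight(gamma):
    """Sum the log likelihood ratios for a binary comparison vector, given
    as a mapping from field name to 0 or 1."""
    weight = 0.0
    for field, agree in gamma.items():
        m, u = M_PROB[field], U_PROB[field]
        weight += math.log(m / u) if agree else math.log((1 - m) / (1 - u))
    return weight

def decide(gamma, upper=8.0, lower=0.0):
    """Three-way decision corresponding to A_1 (link), A_2 (possible link,
    for clerical review) and A_3 (non-link)."""
    w = match_weight(gamma)
    if w >= upper:
        return "link"
    if w <= lower:
        return "non-link"
    return "possible link"

print(decide({"first_name": 1, "last_name": 1, "dob": 1, "gender": 1}))  # link
```
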
OpenEMPI
OpenEMPI is an Open Source Enterprise Master Patient Index (EMPI) that originated
from the remnants of the Care Data Exchange software that were turned over to the open
source community after the Santa Barbara County (California) Care Data Exchange
(SBCCDE) ceased operations. Since it was originally released it has undergone considerable
refactoring and redesign in an effort to achieve the following goals: a. decompose the overall
system into a collection of services, b. make it extensible so that new algorithms can be
easily embedded into the system over time, c. optimize the data model to support instances
with large numbers of patients, and d. provide standards-based integration access points into
the system so that OpenEMPI can be easily integrated into existing healthcare environments.
To achieve the first goal, the system was re-designed so that the overall architecture of
the software is now based on Service Oriented Architecture (SOA) principles; the overall
system architecture consists of a collection of loosely coupled components and interaction
among components only takes place through well defined interfaces. Figure 2 illustrates the
new architecture of OpenEMPI with only some of the services included in the figure for
conciseness. The figure also illustrates the layered nature of the architecture where services
are allocated to the data access layer, the service layer or the UI layer and where each layer
builds on top of the layer below. Layered architectures provide flexibility by allowing entire
layers to be removed and replaced without affecting the rest of the system. For example,
moving OpenEMPI to utilize storage in the cloud would simply require implementation of a
new data layer to be deployed on a Storage-as-a-Service (SaaS) infrastructure without
requiring modifications to the rest of the application.
Some of the key services that comprise the architecture of OpenEMPI include the
Blocking Service, which abstracts the algorithms that reduce the number of record pairs that
need to be compared for matching purposes, the Matching Service, which provides an
abstraction for the algorithms that determine whether two or more patient records in the
system identify a unique patient, the String Comparison Service, which determines the
measure of similarity between two patient demographic attributes, and the Standardization
Service, which supports the data preparation phase and transforms patient attributes into a
standard format for the purpose of improving matching performance.
By decomposing the system into components and specifying interfaces as the only means
of interaction between those components, the new architecture achieved the extensibility
goal. The system can be easily extended by simply implementing a different version of a
service and plugging it into the system without requiring any modifications to the rest of the
system. This capability of allowing for the plugging in of new and interchangeable
algorithms in any of the services is crucial since it allows for OpenEMPI to become a
platform for the testing and validation of various intelligent algorithms in any of the three
phases of the record linkage process.
Figure 2 SOA-based Architecture of OpenEMPI

An EMPI internally utilizes algorithms from the field of record linkage to detect and
link duplicate records. Over the years many algorithms have been proposed with different
operating assumptions and performance characteristics as described in previous sections
of this report. By utilizing the extensibility of OpenEMPI, multiple, alternative matching
algorithms may be implemented and the most appropriate choice may be selected during
deployment. The software distribution currently includes a fully functional
implementation of both a simple, deterministic algorithm as well as a probabilistic
matching algorithm, which is an implementation of the Fellegi-Sunter algorithm that uses
Expectation-Maximization (EM) for estimating the marginal probabilities of the model.
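
As an illustration of how EM can be used for this estimation, the following sketch estimates
the per-field m and u probabilities and the proportion of matching pairs from a set of binary
comparison vectors under a conditional independence assumption; it is a simplified sketch, not
OpenEMPI's actual implementation.

```python
def em_estimate(vectors, n_iter=50, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate per-field m and u probabilities and the match proportion p
    from binary comparison vectors (lists of 0/1), assuming the fields are
    conditionally independent given the match status."""
    k = len(vectors[0])
    m, u = [m_init] * k, [u_init] * k
    for _ in range(n_iter):
        # E-step: probability that each comparison vector comes from a match.
        g = []
        for gamma in vectors:
            pm, pu = p, 1.0 - p
            for i, agree in enumerate(gamma):
                pm *= m[i] if agree else (1.0 - m[i])
                pu *= u[i] if agree else (1.0 - u[i])
            g.append(pm / (pm + pu))
        # M-step: re-estimate p, m and u from the expected memberships.
        total_g = sum(g)
        total_ng = len(vectors) - total_g
        p = total_g / len(vectors)
        for i in range(k):
            m[i] = sum(gj * gamma[i] for gj, gamma in zip(g, vectors)) / total_g
            u[i] = sum((1 - gj) * gamma[i] for gj, gamma in zip(g, vectors)) / total_ng
    return p, m, u
```
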
The same extensibility mechanism is also available in the blocking service
implementation and in the string comparison service. The current string comparison
service provides a number of string distance metrics including an exact string
comparator, the Jaro, Jaro-Winkler, and the Levenshtein distance metrics among others.
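
As an example of the kind of metric such a service exposes, the following sketch implements the
Levenshtein distance with the classic dynamic-programming recurrence and normalizes it into a
similarity between 0 and 1; it is an illustrative implementation, not the code of OpenEMPI's
String Comparison Service.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions needed to
    turn s into t, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(s, t):
    """Normalize the distance into a similarity in [0, 1]."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))

print(levenshtein("MARTINEZ", "MARTEENEZ"))                       # 2
print(round(levenshtein_similarity("MARTINEZ", "MARTEENEZ"), 2))  # 0.78
```
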
The next objective for the redesign of OpenEMPI was the optimization of the data
model to support the persistence and retrieval of patient identifying and demographic
data. This objective was selected for two reasons. The intent is for OpenEMPI to be
suitable for implementation in large-scale healthcare environments with hundreds of
thousands to millions of patient records, and where the system needs to sustain intense
user workloads. This goal can only be achieved if the data model is designed from the
beginning with high performance as a key criterion. The second reason for the focus on
optimization of the data model was to form a solid basis on top of which multiple
additional matching algorithms could be developed, tested and deployed in a production
environment. With a data model where query performance is not considered during
design, it would be impossible to develop matching algorithms that are efficient and
suitable for implementation in large-scale environments.
The final objective was the provision of integration points into OpenEMPI in order to
facilitate its integration into the Information Technology (IT) infrastructure of existing
healthcare environments. To achieve this goal, OpenEMPI provides support for the
Patient Identifier Cross-Referencing (PIX) and Patient Demographics Query (PDQ)
standards defined by the Integrating the Healthcare Enterprise (IHE) organization. IHE
defines the workflow and specifies the standards to be used in the implementation of
those workflows for promoting the sharing of medical information. The PIX profile
supports the cross-referencing of patient identifiers from multiple Patient Identifier
Domains and the PDQ profile provides ways for multiple distributed applications to
query a patient information server for a list of patients, based on user-defined search
criteria, and retrieve a patient’s demographic information. OpenEMPI utilizes the open
source implementation of the PIX/PDQ profiles in OpenPIXPDQ and the two projects
combined have been tested successfully at both the 2009 and 2010 IHE Connectathons. To
further simplify the integration of OpenEMPI into existing IT infrastructure, there are
plans for the development of both a REST-based and a SOAP-based web services
interface to the full functionality provided by OpenEMPI.
