Deep Sequence-To-Sequence Entity Matching For Heterogeneous Entity Resolution
Hao Nie1,3, Xianpei Han1,2, Ben He3,1∗, Le Sun1,2, Bo Chen1, Wei Zhang4, Suhui Wu4, Hao Kong4
1 Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
2 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
3 University of Chinese Academy of Sciences, Beijing, China
4 Alibaba Group, Hangzhou, China
1 {niehao2016, xianpei, sunle, chenbo}@iscas.ac.cn 3 benhe@ucas.ac.cn
4 {lantu.zw, linnai.wsh, konghao.kh}@alibaba-inc.com
Session: Long - Heterogeneous Data CIKM ’19, November 3–7, 2019, Beijing, China
in real-world ER applications, entity records usually come from different data sources which are described using different schemas. This hinders us from comparing corresponding attribute values directly. For example, in Figure 1, entity records $e_1$ and $e_2$ are heterogeneous because they are described using different schemas: (Name, Brand, Address) vs (Name, Manufacturer, Location, Price). Traditional structure matching models cannot be directly applied to these heterogeneous cases. Secondly, attribute values of records are often missing, noisy, or misplaced. For example, in record $e_1$ in Figure 1, the value “Apple”, which belongs under attribute Brand, is misplaced under attribute Name, and the value for attribute Brand is missing. Because traditional structure matching approaches cannot compare attribute values across different attributes, they ignore important matching evidence in such dirty cases. As depicted in Figure 1, a structure matching model cannot identify entity records $e_1$ and $e_2$ as being associated with the same Brand, “Apple”, as the values reside in different attributes.
In this paper, we propose a new entity resolution approach, Seq2Seq entity matching, which aims to solve the heterogeneous and dirty cases by modeling ER as a token-level sequence-to-sequence matching task. Figure 1 shows the general architecture of our approach, where each record is linearized as a token sequence, and each token is a pair of the form <attribute, word>. Our Seq2Seq entity matching approach 1) compares records at the token level instead of the attribute level, so no attribute alignment knowledge is needed and heterogeneous schemas are handled naturally; and 2) allows tokens to be compared across attributes and learns the contribution of each token to the final ER decision automatically, so the dirty cases can be effectively solved. For example, our method is able to resolve $e_1$ and $e_2$ by identifying token pairs (<Name, Apple>, <Manufacturer, Apple>), (<Address, CA>, <Location, California>), etc., which is robust to heterogeneous schemas and dirty values.
To this end, we design an align-compare-aggregate [37] neural network for Seq2Seq entity matching, which can learn the representations of tokens, capture the semantic relevance between tokens and aggregate matching evidence for accurate ER decisions in an end-to-end manner. Specifically, given two entities $S$ and $T$, our approach first represents each token by concatenating its word embedding and the corresponding attribute embedding, then finds the local relatedness between tokens using neural attention [1] from two directions [47], and eventually aggregates all local semantic matching evidence for the final ER decision. In this way, our Seq2Seq matching network can automatically learn the global relatedness between different entities.
We evaluate our approach on three kinds of datasets: the attribute-aligned dataset, the dirty dataset and the heterogeneous dataset. The experimental results show that the proposed model substantially outperforms the state-of-the-art approaches in all of these settings. The main contributions of this paper can be summarized as follows:
• We propose a new entity resolution framework, Seq2Seq entity matching, which can effectively solve the heterogeneous and dirty entity resolution problems by transforming attribute-level structure-to-structure matching into token-level sequence-to-sequence matching.
• We propose a deep align-compare-aggregate matching neural network, Seq2SeqMatcher, for Seq2Seq entity matching, which can effectively learn the token representations, capture the semantic relevance between tokens, and aggregate all matching evidence for accurate ER decisions in an end-to-end manner.
• The proposed approach achieves state-of-the-art results on multiple benchmarks in all attribute-aligned, dirty and heterogeneous cases.

2 RELATED WORK
Entity resolution has received widespread attention [12, 17, 20, 30, 34, 49, 50] in the past few decades. Traditional ER approaches can be roughly grouped into two categories: machine-learning based methods and rule-based methods. The machine-learning based methods, e.g. [3, 7, 39, 46], usually learn a classifier (e.g., decision tree or SVM) based on a list of pre-defined handcrafted features from a given annotated dataset to predict if an entity pair is a “match” or a “non-match”. However, such approaches require feature engineering and the results are not explainable. To address the issues faced by machine-learning based methods, rule-based solutions [14, 41, 42] estimate similarity between corresponding attribute values using various similarity metrics [3, 5], and make the ER decisions based on pre-defined rules and thresholds, in order to mitigate the need for feature engineering and at the same time allow for interpretable matching results. However, the rule-based methods require heavy involvement of domain experts to select the matching rules and appropriate thresholds. To address these issues, solutions have been proposed to learn matching functions and thresholds automatically [26, 43], without the need for human effort from domain experts.
With the development of deep learning (DL) [19, 31, 40] theories, many natural language processing (NLP) [18, 22] and computer vision (CV) [21, 29] applications have improved the state of the art, owing to DL’s ability in abstraction and representation learning in an end-to-end fashion. In the meantime, there has also been growing interest in applying deep learning techniques to entity resolution. In particular, there are two lines of such studies, namely the representation based methods and the compare-aggregate based methods.
The Representation Based Methods first learn a distributed representation for each entity record, then estimate the similarity between two records by comparing their vector representations. This kind of method tries to condense the overall semantics of an entity into a single low-dimensional vector, which cannot capture minor differences between entities [12].
DeepER [12], a recent pioneering work following this direction, focuses on designing deep learning solutions to entity resolution. In DeepER, two models, AVG and LSTM-RNN [23], are employed to model the entity representations. Compared with the representation based methods, our approach makes ER decisions by modeling token-level interactions, without encoding the whole entity into a low-dimensional vector. Such a technique has shown superiority in a number of natural language processing applications. For example, in [32], strong interactions of sentence pairs are modeled via two coupled-LSTMs to avoid sentence encoding.
Compare-Aggregate Based Methods take a different way to solve the ER problems. Specifically, the subunits (e.g., words or phrases) in the entities are first compared to get multiple matching signals. Then these matching signals are aggregated to make the final ER decisions. DeepMatcher [34] is a representative work following this direction, which attempts to explore the design space of entity resolution. Specifically, it first compares pairwise attribute values to get corresponding matching vectors. Then, these matching vectors are aggregated to generate the final similarity score.
While this work shares similarities with DeepMatcher, a major difference is that DeepMatcher is not applicable to scenarios where entities are heterogeneous, or the attribute values are misplaced, since both prevent direct comparisons of corresponding attribute values. In contrast, our model utilizes the token-level alignments to perform entity resolution without using any attribute alignment information, allowing adaptive use in heterogeneous environments or dirty cases.

3 PROBLEM FORMULATION
3.1 Problem Settings.
In this paper, entities are real-world objects (e.g., place, product, scholar article, person, organization, etc.), which are structured as <attribute, value> pairs. Given a pair of entities, the aim of the entity resolution task is to judge whether it is a “match” or “non-match”.
Let $E$ and $E'$ be two collections of entities from the same or different data sources. This paper assumes that entities in $E$ and $E'$ are either described by the same attributes (e.g., the same schema with attributes $A_1, \ldots, A_m$ in the case of attribute-aligned data) or different attributes (e.g., the different schemas with attributes $A_1, \ldots, A_m$ and $B_1, \ldots, B_n$ in the case of heterogeneous data). The goal of entity resolution is to identify all pairs of entity records between $E$ and $E'$ that refer to the same real-world entity, and we call these pairs matches. To find all matches, an entity resolution system first finds candidate matches $C$ by performing a blocking step, then a matching algorithm is performed on the candidate set $C$ to identify correct matches. This paper mainly focuses on the matching step of entity resolution, since the blocking step has been well addressed by blocking algorithms as in [11, 36, 48].
Formally, given two collections of entities $E$ and $E'$ and a candidate set $C$ containing entity pairs $(e_1 \in E, e_2 \in E')$, in order to apply the learning-based techniques to entity resolution problems, we further assume that we have a labeled dataset $\mathcal{T}$ of triples $\{(e_{i1}, e_{i2}, l)\}_{i=1}^{|\mathcal{T}|}$, where $\{(e_{i1}, e_{i2})\}_{i=1}^{|\mathcal{T}|} \subset C$ and $l$ is a label with values of either 1 or 0, indicating that the entities $e_{i1}$ and $e_{i2}$ are a “match” or a “non-match”. Given the labeled dataset $\mathcal{T}$, our goal is to design a matching function $M$ that can accurately distinguish between “match” and “non-match” pairs $(e_1, e_2)$ in $C$.

3.2 Types of ER Problems.
As discussed previously, we consider three types of entity resolution problems in this paper:
• Attribute-Aligned ER. In this category, entity records from $E$ and $E'$ are described using the same schema with attributes $A_1, \ldots, A_m$. See Table 1 for examples.
• Dirty ER. In this category, entity records in $E$ and $E'$ are described by the same schema with attributes $A_1, \ldots, A_m$. However, attribute values may be misplaced under the wrong attribute (i.e., attribute values are not associated with their corresponding attribute in the schema). See Table 2 for an example: the italic word Adobe in entity $e_1$, which belongs under attribute Brand, is misplaced in attribute Name. To effectively resolve dirty records, an ER system must be able to compare across attributes and be robust to noise.
• Heterogeneous ER. In this category, entities in $E$ and $E'$ are described using different schemas $(A_1, \ldots, A_m)$ and $(B_1, \ldots, B_n)$. These entities may come from different data sources where different schemas, (Name, Brand, Location) and (Name, Manufacturer, Location, Price), are used to describe the same real-world entity (see Table 3). The heterogeneity of the entities hinders the direct comparisons of corresponding attribute values.

Table 1: An entity record pair of the attribute-aligned ER problem.
Record  Name           City       Age
e1      Dave Smith     New York   18
e2      David Smith    NYC        18

Table 2: An entity record pair of the dirty ER problem.
Record  Name             Brand   Price
e1      Adobe Acrobat8   -       299.99
e2      Acrobat 8        Adobe   299.99

Table 3: An entity record pair of the heterogeneous ER problem.
Record  Name            Brand          Location
e1      iphone 8 plus   Apple          CA
Record  Name            Manufacturer   Location     Price
e2      iphone 8p       Apple Inc.     California   699.0

4 SEQ2SEQ ENTITY MATCHING FOR HETEROGENEOUS ENTITY RESOLUTION
In this section, we describe how to solve the entity resolution problem via sequence-to-sequence entity matching. Figure 2 shows the framework of our Seq2Seq entity matching model, which consists of an embedding layer for token representations, an alignment layer for capturing local relevance, a comparison layer for local similarity, and an aggregation layer for the global similarity. For each pair of entities $S$ and $T$, our model is devised to output match or non-match based on their similarity $sim(S, T)$. In the following, we first explain each layer step by step, before a description of model training.
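The linearization that feeds this model, in which each record becomes a sequence of <attribute, word> tokens, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `linearize` and the whitespace tokenization are assumptions, and the two records are taken from Table 3.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (attribute, word)


def linearize(record: Dict[str, Optional[str]]) -> List[Token]:
    """Flatten a record into a sequence of <attribute, word> tokens.

    Whitespace tokenization is an assumption made for this sketch;
    a missing or empty attribute value simply contributes no tokens.
    """
    tokens: List[Token] = []
    for attribute, value in record.items():
        for word in (value or "").split():
            tokens.append((attribute, word))
    return tokens


# The heterogeneous pair from Table 3: different schemas, no alignment needed.
e1 = {"Name": "iphone 8 plus", "Brand": "Apple", "Location": "CA"}
e2 = {"Name": "iphone 8p", "Manufacturer": "Apple Inc.",
      "Location": "California", "Price": "699.0"}

seq_s = linearize(e1)
seq_t = linearize(e2)
```

Because both records end up as flat token sequences, a token such as <Brand, Apple> can later be compared with <Manufacturer, Apple> without any schema mapping.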
[Figure 2: framework of the Seq2Seq entity matching model. Recoverable labels: Prediction Layer (match / non-match?), Aggregation Layer, Comparison Matrix C1, Comparison Matrix C2.]
Table 4: Entity pair example showing various word importance. Italic tokens photo-editing software in entity e2 are not as important as other tokens when compared with entity e1.
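The comparison step of this section (k-max weighted attention followed by an element-wise absolute difference, Equations (3) to (5)) can be sketched in a few lines of plain Python. This is a hedged illustration only: the helper names, the toy vectors and the toy attention row are assumptions, and the attention row is taken as already normalized rather than computed from Equation (1).

```python
from typing import List

Vector = List[float]


def k_max_weights(row: Vector, k: int) -> Vector:
    """Equation (3): keep the k largest weights of one row, renormalize, zero the rest."""
    k = min(k, len(row))
    top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
    kept = set(top)
    total = sum(row[j] for j in top)
    return [row[j] / total if j in kept and total > 0 else 0.0
            for j in range(len(row))]


def soft_attend(weights: Vector, context: List[Vector]) -> Vector:
    """Equation (4): weighted sum of the context entity's token vectors."""
    dim = len(context[0])
    return [sum(w * t[d] for w, t in zip(weights, context)) for d in range(dim)]


def compare(token: Vector, attended: Vector) -> Vector:
    """Equation (5): element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(token, attended)]


# Toy data (hypothetical): one token s_i of S and three tokens of T.
s_i = [1.0, 0.0]
context_t = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attention_row = [0.6, 0.1, 0.3]  # assumed to be an already normalized row of A

weights = k_max_weights(attention_row, k=2)  # the weakest weight is dropped
s_hat = soft_attend(weights, context_t)
c_i = compare(s_i, s_hat)
```

With `k` equal to the row length the function reduces to ordinary soft attention, and with `k=1` it behaves like hard attention over the single most relevant token.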
Comparison Layer. Based on the alignment matrices and weight vectors computed above, the aim of the comparison layer is to generate a series of matching signals. Intuitively, we first compute a soft-attended representation for each token in one entity using all tokens in the other entity, as described below:
$$\hat{s}_i = \sum_{j=1}^{n} a_{ij} t_j \tag{2}$$
where $\hat{s}_i$ is the soft-attended representation of the $i$th word in entity $S$. There exist two problems in such an attention mechanism. On the one hand, only a few fragments in the two entities share a relevant meaning with each other. The semantic relation is obscured when irrelevant fragments are involved, and it is more reasonable to combine only the relevant fragments to obtain the corresponding semantic matching vector. On the other hand, in the case that $s_i$ in $S$ is irrelevant to all fragments in $T$, $s_i$ needs to be discounted, since we know there is no match for it in $T$ and it should not be involved in the further matching process. However, after the normalization of attention weights, an aligned sub-phrase is still generated, which acts as noise for the following phases.
Inspired by [2], we propose to filter the alignment matrix to address this issue using a mechanism called k-max weighted attention. Specifically, let $S$ be the primary entity and $T$ be the context entity. The elements in the $i$th row of the alignment matrix $A_{m \times n}$ constitute a vector $\vec{a} = \{a_{i1}, a_{i2}, \cdots, a_{in}\}$. Elements in $\vec{a}$ are sorted in descending order, and the indexes of the $k$ largest attention weights are denoted as $L = \{i_1, i_2, \cdots, i_k\}$. We keep $a_{ij}$ (renormalized over $L$) if it is among the top $k$ in $\vec{a}$ and set it to zero otherwise, as follows:
$$a_{ij} = \begin{cases} \dfrac{a_{ij}}{\sum_{j' \in L} a_{ij'}} & j \in L \\ 0 & j \notin L \end{cases} \tag{3}$$
Thus we modify the computation of the soft-attended representation for token $s_i$ as below:
$$\hat{s}_i = \sum_{j \in L} a_{ij} t_j \tag{4}$$
By virtue of k-max weighted attention, relevant fragments are preserved while irrelevant ones are discarded in obtaining the corresponding soft-attended representation of token $s_i$. Moreover, a fragment without semantic matching relations in the other entity does not take effect in the subsequent matching phases. In terms of the choice of $k$, we conduct detailed analysis in section 5.3.3 and find that it is dataset specific.
After the steps above, we get the soft-attended representations of all the words in entity $S$ and entity $T$. For these two entities, we compare each word in one entity with its soft-attended representation computed using the $k$ most relevant words in the other entity. Specifically, we use the element-wise absolute difference as the comparison function to calculate comparison vectors $\vec{c}_i$ and $\vec{c}\,'_i$ for the $i$th word in entity $S$ and $T$ respectively.
$$\vec{c}_i = |s_i - \hat{s}_i|, \qquad \vec{c}\,'_i = |t_i - \hat{t}_i| \tag{5}$$
Then we can gather these two sets of matching vectors for the entity pair $S$ and $T$ to form two matrices $C$ and $C'$. Recall that the weight vectors $\vec{w} = [w_1, w_2, \cdots, w_m]$ and $\vec{w}' = [w'_1, w'_2, \cdots, w'_n]$ measure the relative importance of words for each entity. So we transform the two matrices $C$ and $C'$ with the corresponding weight vector as below:
$$C_s := [w_1 \vec{c}_1, w_2 \vec{c}_2, \cdots, w_m \vec{c}_m], \qquad C_t := [w'_1 \vec{c}\,'_1, w'_2 \vec{c}\,'_2, \cdots, w'_n \vec{c}\,'_n] \tag{6}$$
In this way, we compare entity $S$ and $T$ from two directions ($S \to T$ and $T \to S$) simultaneously and get two comparison matrices which are used as the matching signals.
Aggregation Layer. The aim of the aggregation layer is to aggregate matching signals from the two comparison matrices induced by both directions of the comparison phase. Inspired by [24], we utilize convolutional neural networks and design filters based on various n-grams to serve as feature extractors. The CNN module involves two sequential operations: convolution and max-pooling. For the convolution operation, we define a list of filters $\{w_o\}$. The shape of each filter is $d \times h$, where $d$ is the dimension of the comparison vector and $h$ is the window size of the filter. The following equation expresses this process.
$$c_{o,i} = f\left(w_o * C_{[i:i+h]} + b_o\right), \qquad c'_{o,i} = f\left(w_o * C'_{[i:i+h]} + b_o\right) \tag{7}$$
where the operation $A * B$ sums up all elements in $B$ with the corresponding weights in $A$, $C_{[i:i+h]}$ and $C'_{[i:i+h]}$ indicate the patches from $C$ and $C'$ respectively, $b_o$ is a bias term and $f$ is a non-linear activation function (we use ReLU in this work). We apply this filter to all possible patches, and produce two feature vectors $\vec{c}_o = [c_{o,1}, c_{o,2}, \ldots, c_{o,O}]$ and $\vec{c}\,'_o = [c'_{o,1}, c'_{o,2}, \ldots, c'_{o,O}]$. To deal with variable feature size, we perform a max-pooling operation over $\vec{c}_o$ and $\vec{c}\,'_o$ by selecting the maximum values $c_o = \max \vec{c}_o$ and $c'_o = \max \vec{c}\,'_o$. Therefore, after these operations, each filter generates only one feature. We define several filters by varying the window size and the initial values. Eventually, two feature vectors are generated, one for each of the comparison matrices, and the feature dimension is equal to the number of filters.
Prediction Layer. The prediction layer performs similarity assessment based on the two feature vectors generated in the previous step. Specifically, taking the two feature vectors as input, we first concatenate them and then pass the resultant vector to a two-layer fully-connected network followed by a softmax classifier to get the final similarity score of the entity pair $(S, T)$. We set the similarity threshold to 0.5 in this work, and predict the entity pair as a match if the similarity score is above the threshold and a non-match otherwise.

4.2 Model Learning
Given a training corpus $\mathcal{T} = \{(e_{i1}, e_{i2}, l_i)\}_{i=1}^{|\mathcal{T}|}$, where $e_{i1}$ and $e_{i2}$ are a pair of entities, and $l_i \in \{0, 1\}$ indicates the similarity between
them, we assign $l_i = 1$ if $S_i$ and $T_i$ are a match, and $l_i = 0$ otherwise. We train our Seq2Seq entity matching model by minimizing the cross-entropy objective function:
$$J = -\sum_{i=1}^{|\mathcal{T}|} \left[ l^{(i)} \log \hat{p}^{(i)} + \left(1 - l^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right] \tag{8}$$
In ER datasets, the positive instances are often far fewer than the negative ones. To avoid the data-imbalance problem [45], we penalize errors corresponding to positive instances and negative instances (entity pairs labeled as “non-match”) with different weights. The ratio $\rho$ of the weight of the positive class to that of the negative class is a dataset-specific hyper-parameter, which is set as the ratio of the number of negative instances to that of positive instances. The final weighted cross-entropy objective is:
$$J = -\sum_{i=1}^{|\mathcal{T}|} \left[ \rho\, l^{(i)} \log \hat{p}^{(i)} + \left(1 - l^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right] \tag{9}$$

5 EXPERIMENTS
5.1 Experimental Settings
Datasets. To evaluate different approaches on different ER settings, this paper conducts experiments on three types of datasets (see Table 5 for statistics of the datasets). For attribute-aligned entity resolution, we use 4 datasets. The first 3 datasets are publicly available and have been used for entity resolution (e.g., [28, 34]). The last one, the Chinese dataset Hema-Taobao1, describes white spirits products and is created from two e-commerce platforms, Hema¹ and Taobao². For dirty entity resolution, we use 3 openly released datasets from [34], where attribute values are randomly misplaced under attribute Title with 50% probability. For heterogeneous entity resolution, we use two datasets. One is Hema-Taobao2, which is about olive oil and also comes from the two e-commerce platforms Hema and Taobao, with the entity pairs described using different attributes. Specifically, entities in the “Hema” platform are described by the attribute list (Title, Name, Brand, Area, Net content, Packaging), while for entities in the “Taobao” platform, the attribute list is (Title, Brand, Series, Area, Net content, Packaging, Province, City). The two diverse attribute lists from “Hema” and “Taobao” constitute a typical heterogeneous entity resolution problem in real-world applications. The other dataset for the heterogeneous ER problem is a pseudo dataset created by concatenating the two specific attribute values in the attribute-aligned Walmart-Amazon (see Table 5) dataset. These three types of datasets come with pre-defined train, validation, and test partitions.

Table 5: Statistics of three kinds of datasets used in our experiments. Columns “Size”, “#Pos.”, and “#Attr.” list the total numbers of examples, positive examples, and attributes for each dataset. For the heterogeneous datasets, the number of attributes in the left entity is unequal to that of the right entity, so the “#Attr.” entries for the last two datasets are “4/5” and “6/8”.
Type               Dataset           Domain         Size     #Pos.   #Attr.
Attribute-Aligned  Walmart-Amazon1   electronics    10,242   962     5
Attribute-Aligned  DBLP-ACM1         citation       12,363   2,220   4
Attribute-Aligned  DBLP-Scholar1     citation       28,707   5,347   4
Attribute-Aligned  Hema-Taobao1      white spirits  22,190   2,219   7
Dirty              Walmart-Amazon2   electronics    10,242   962     5
Dirty              DBLP-ACM2         citation       12,363   2,220   4
Dirty              DBLP-Scholar2     citation       28,707   5,347   4
Heterogeneous      Walmart-Amazon3   electronics    10,242   962     4/5
Heterogeneous      Hema-Taobao2      olive oil      13,368   2,228   6/8

Baselines. In this paper, we compare with two state-of-the-art ER baselines: DeepMatcher [34] and Magellan [26].
• DeepMatcher is a state-of-the-art deep learning based ER approach, which is a structure matching model that works by: 1) learning the representations of each attribute value, 2) comparing the similarity between corresponding attribute representations, and 3) aggregating the matching signals for the final ER decision. For the heterogeneous datasets, the attributes are heterogeneous for different data sources, so we cannot compare the corresponding attribute values directly. As a result, DeepMatcher is not directly applicable to this kind of dataset. Under this scenario, we concatenate all the attribute values for each entity to form a text set and then use DeepMatcher to experiment on these datasets.
• Magellan is a classical learning-based approach which has achieved competitive performance on multiple datasets. Magellan trains a list of classifiers (decision tree, random forest, Naive Bayes, SVM and logistic regression) based on a large set of features, and selects the best model on the validation set. Note that, due to the use of homogeneous attribute pairs for the feature generation, Magellan is not applicable to scenarios where entities are heterogeneous. We therefore omit the results of this approach on the heterogeneous datasets.
To evaluate the performance of our model and all the baselines, we use precision (P), the fraction of match predictions that are correct; recall (R), the fraction of correct matches being predicted as matches; and F1, defined as 2PR/(P+R), as used in [34].
Model Training. We implement the model with PyTorch³, an open-source deep learning framework with extensive support for accelerating training using GPUs, and run experiments on a server equipped with Intel(R) Xeon(R) E5-2683 CPU, 128GB memory, and Nvidia TITAN X (Pascal) GPU.
For all competing methods, the best models on the validation set are selected based on the F1 score of the positive instances, and the subsequent performances on the test set are reported. The word embeddings for the English datasets are initialized using the 300-dimensional FastText [4], a character-level embedding, which can approximate the embeddings of the out-of-vocabulary (OOV) words using character-level word representations. For the Chinese datasets, the word embeddings are initialized with the 200-dimensional Tencent AILab Chinese Embedding [44]. The OOV words are randomly initialized from a normal distribution with a mean of 0 and a standard deviation of 0.1. All word embeddings are fixed during training. The attribute embeddings are again randomly initialized from a normal distribution with 0 mean and 0.1 standard deviation, and are trained simultaneously with other parameters. The dimension of the attribute embeddings is 50 for English datasets,
¹ https://www.freshhema.com
² https://www.taobao.com
³ https://pytorch.org/
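As a small worked example of the weighted cross-entropy objective in Equation (9), the sketch below computes the loss and the class ratio ρ in plain Python. The function names, the clipping constant `eps` and the toy labels are assumptions made for illustration; the paper's PyTorch implementation is not shown in the text.

```python
import math
from typing import Sequence


def class_ratio(labels: Sequence[int]) -> float:
    """rho: number of negative instances divided by number of positive instances."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return negatives / positives


def weighted_cross_entropy(labels: Sequence[int],
                           probs: Sequence[float],
                           rho: float) -> float:
    """Equation (9): errors on positive pairs are weighted by rho."""
    eps = 1e-12  # clipping to avoid log(0); an assumption of this sketch
    total = 0.0
    for label, prob in zip(labels, probs):
        prob = min(max(prob, eps), 1.0 - eps)
        total += rho * label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return -total


# Toy batch (hypothetical): one match among four candidate pairs.
labels = [1, 0, 0, 0]
probs = [0.5, 0.5, 0.5, 0.5]
rho = class_ratio(labels)                      # 3 negatives / 1 positive
loss = weighted_cross_entropy(labels, probs, rho)
unweighted = weighted_cross_entropy(labels, probs, 1.0)  # reduces to Equation (8)
```

Setting `rho` to 1 recovers the unweighted objective of Equation (8), which makes the effect of the class weighting easy to inspect on a given batch.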
Table 6: Experimental results on three types of datasets. ∆F1 denotes the performance gap between the proposed model, Seq2SeqMatcher, and the state-of-the-art approaches.
                                     Seq2SeqMatcher    DeepMatcher       Magellan
Type               Dataset           P     R     F1    P     R     F1    P     R     F1    ∆F1
Attribute-Aligned  Walmart-Amazon1   86.3  71.5  78.2  70.9  64.6  67.6  72.3  71.5  71.9  +6.3
Attribute-Aligned  DBLP-ACM1         98.4  99.3  98.9  98.0  98.9  98.4  97.4  99.6  98.4  +0.5
Attribute-Aligned  DBLP-Scholar1     92.8  97.9  95.3  94.8  94.5  94.7  94.3  90.4  92.3  +0.6
Attribute-Aligned  Hema-Taobao1      81.3  79.4  80.3  53.5  78.4  63.6  -     -     -     +16.7
Dirty              Walmart-Amazon2   72.3  64.8  68.3  56.3  51.6  53.8  33.8  42.0  37.4  +14.5
Dirty              DBLP-ACM2         98.9  98.0  98.4  98.6  97.5  98.1  93.7  90.1  91.9  +0.3
Dirty              DBLP-Scholar2     92.7  95.6  94.1  94.3  93.4  93.8  87.1  78.4  82.5  +0.3
Heterogeneous      Walmart-Amazon3   83.1  68.9  75.4  74.9  67.9  71.2  -     -     -     +4.2
Heterogeneous      Hema-Taobao2      95.9  94.0  94.9  71.0  66.0  68.4  -     -     -     +26.5
Table 7: Experimental results of the modified models on three types of datasets. Seq2SeqMatcher(-AttrInfo) denotes the model with attribute information discarded from the main model as described in section 5.3.1. Seq2SeqMatcher(-BiAlign) denotes the model with the bidirectional alignment mechanism removed from the main model as described in section 5.3.2. ∆F1 under each column denotes the performance gap between the corresponding modified model and the main model Seq2SeqMatcher.
                                     Seq2SeqMatcher(-AttrInfo)   Seq2SeqMatcher(-BiAlign)
Type               Dataset           P     R     F1    ∆F1       P     R     F1    ∆F1
Attribute-Aligned  Walmart-Amazon1   43.1  56.5  48.9  -29.3     74.3  69.4  71.8  -6.4
Attribute-Aligned  DBLP-ACM1         97.3  97.1  97.2  -1.7      96.9  98.9  97.9  -1.0
Attribute-Aligned  DBLP-Scholar1     86.3  92.4  89.3  -6.0      89.8  90.2  90.0  -5.3
Attribute-Aligned  Hema-Taobao1      66.8  81.7  73.5  -6.8      77.8  81.5  79.6  -0.7
Dirty              Walmart-Amazon2   53.0  40.9  46.2  -22.1     65.0  68.4  66.7  -1.6
Dirty              DBLP-ACM2         96.0  96.6  96.3  -2.1      94.3  96.4  95.3  -3.1
Dirty              DBLP-Scholar2     89.1  89.1  89.1  -5.0      86.1  87.8  86.9  -7.2
Heterogeneous      Walmart-Amazon3   68.4  53.9  60.3  -15.1     72.2  74.1  73.1  -2.3
Heterogeneous      Hema-Taobao2      70.1  79.6  74.5  -20.4     89.3  97.2  93.1  -1.8
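The P, R and F1 columns in Tables 6 and 7 follow the definitions given in Section 5.1. A minimal sketch of those metrics is shown below; the function names and the toy predictions are assumptions, while the final check reuses the Walmart-Amazon1 row of Table 6 to show how F1 and ∆F1 relate to the reported P and R.

```python
from typing import Sequence, Tuple


def f1_score(precision: float, recall: float) -> float:
    """F1 = 2PR / (P + R), defined as 0 when both P and R are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold: Sequence[int],
                        predicted: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of the positive ("match") class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy predictions (hypothetical): 2 true positives, 1 false positive, 1 false negative.
gold = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
p, r, f1 = precision_recall_f1(gold, predicted)

# Walmart-Amazon1 in Table 6: P = 86.3, R = 71.5, best baseline F1 = 71.9.
table_f1 = round(f1_score(86.3, 71.5), 1)
delta_f1 = round(table_f1 - 71.9, 1)
```

The same arithmetic applies to every row: ∆F1 in Table 6 is the Seq2SeqMatcher F1 minus the best baseline F1 on that dataset.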
and 40 for Chinese datasets. Adam [25] is used for parameter optimization with an initial learning rate of 0.001. The mini-batch size is set to 32 for all datasets. The filter types include {unigram, bigram, trigram} and each type of filter contains 100 filters. The choice of value $k$ used in the k-max weighted attention mechanism is dataset-specific and based on the analysis in the following part.

5.2 Overall Performance
Table 6 shows the overall performance on all attribute-aligned, dirty and heterogeneous datasets. For DeepMatcher and Magellan, we directly use their reported performance from [34] on attribute-aligned and dirty settings. For heterogeneous settings, we only report the performance of DeepMatcher, obtained by concatenating all attribute values of each entity into a composite attribute value, as it is not directly applicable to this scenario. We do not report the performance of Magellan on the heterogeneous datasets, since it is not applicable to this case. In contrast, the proposed model, Seq2SeqMatcher, is applicable to all of the above settings. From Table 6, it can be seen that:
• The proposed model, Seq2SeqMatcher, can effectively solve the heterogeneous and dirty ER problems. Compared with DeepMatcher and Magellan, our method achieves a performance gain on all dirty and heterogeneous datasets, ranging from +0.3 to +26.5 F1. We believe that is because token-level comparison can capture the token-level relatedness between entity records, which is more fine-grained and more robust for similarity judgment.
• The proposed model can achieve robust performance in different settings. In the attribute-aligned settings, our model does not use any attribute alignment information, but can still achieve the best performance on all datasets. We believe this is because our neural model can automatically learn the correspondence between aligned attribute values by utilizing the latent semantic relatedness.
• By comparing tokens across different attributes, our model appears to be robust when dealing with the dirty ER problems. On the dirty datasets, our model achieves notable performance gain compared with DeepMatcher and Magellan. We believe this is mainly because the token-level model is
[Figure 3 panels: (a) Attribute alignment on attribute-aligned Walmart-Amazon dataset. (b) Attribute alignment on dirty DBLP-ACM dataset. (c) Attribute alignment on heterogeneous Hema-Taobao2 dataset.]
Figure 3: Visualization of attribute alignments between different data sources on three kinds of datasets.
more robust than the attribute-level model, which can better entity provide structural information to the overall semantics of
address the misplacing, missing, and noisy data problems. the entity, which help in the entity matching process. To analyze
• By comparing entities in token level and learning matching the effectiveness of attribute information in entity resolution, we
models in an end-to-end manner, our proposed model can model ER as a text matching problem by discarding attribute in-
effectively address entity heterogeneity problem. As we can formation from our model Seq2SeqMatcher, i.e., we represent each
see, Seq2SeqMatcher shows remarkable performance on all token t just by its word embedding ew . Seq2SeqMatcher(-AttrInfo)
heterogeneous datasets. We owe this to the powerful abil- from Table 7 denotes the ablated model. From the results, we can
ity of our model to capture the latent associations between see that attributes provide useful information for entity resolution.
relevant tokens across different attributes. On all datasets, removing attributes results in performance drop
to some degree. This observation verifies that attributes provide
useful information for solving ER problems, and that it is beneficial
to incorporate such information in our Seq2Seq matching models,
which helps induce token-level correspondence and disambiguate
tokens under different attributes.
To analyze whether our end-to-end model can learn attribute
alignments between different data sources, Figure 3 shows several
learned attribute alignment examples of the corresponding
datasets. Specifically, we extract the learned attribute embeddings
for the left data source and the right data source. For each attribute
in the left data source, we align it with the attribute in the
right data source that has the largest embedding similarity with
it, and normalize the resultant similarity matrix to form an
attribute alignment matrix. Note that we learn the attribute
embeddings without utilizing any semantic information of the attributes.
The results in Figure 3 confirm that our model can accurately
learn attribute alignments by capturing token-level
correspondence.
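The alignment procedure described above can be illustrated with a minimal sketch. It assumes cosine similarity between attribute embeddings and a row-wise softmax as the normalization step; the text specifies neither choice, so both are assumptions. The function names and the toy 3-dimensional embeddings are hypothetical and are not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_matrix(left_emb, right_emb):
    """Row-normalized similarity matrix between left and right attributes.

    Assumption: cosine similarity followed by a row-wise softmax.
    """
    rows = []
    for u in left_emb:
        sims = [cosine(u, v) for v in right_emb]
        top = max(sims)
        exps = [math.exp(s - top) for s in sims]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def nearest_attribute(matrix):
    """Index of the best-matching right attribute for each left attribute."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]

# Hypothetical embeddings, invented for illustration only.
left = [[1.0, 0.1, 0.0],    # e.g. Name
        [0.0, 1.0, 0.2]]    # e.g. Brand
right = [[0.1, 0.9, 0.1],   # e.g. Manufacturer
         [0.9, 0.0, 0.1],   # e.g. Name
         [0.0, 0.2, 1.0]]   # e.g. Price

matrix = alignment_matrix(left, right)
best = nearest_attribute(matrix)
```

With these toy vectors, the left Name attribute is paired with the right Name attribute and Brand with Manufacturer, which mirrors the kind of schema correspondence visualized in Figure 3.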
Figure 5: Performance on all kinds of datasets when k takes values in {1, 2, 3, 4, all}: (a) attribute-aligned datasets; (b) dirty datasets; (c) heterogeneous datasets.
entities from only one direction. We denote the model with the
bidirectional comparison mechanism removed as Seq2SeqMatcher(-BiAlign)
and experiment on three types of datasets; the results are shown in
Table 7. We can see that the bidirectional alignment mechanism always
outperforms the unidirectional alignment mechanism. The results
consistently verify the effectiveness of the bidirectional comparison
mechanism. We believe this is because the bidirectional comparison
mechanism can exactly model the many-to-many correspondence
of tokens between two entities.
For analysis, we also show a token alignment example of our
Seq2SeqMatcher in Figure 4. This is done by visualizing the attention
weights a_ij in Equation (1). Each row of the matrix in each plot
indicates the similarities of the corresponding word in the primary
entity with all words in the context entity, which will be used as
weights to compute the soft representations. We can see from the
alignments in Figure 4 that our model learns accurate correlations
between the same or similar words, even though the embedding
for each word is concatenated with the corresponding attribute
embedding.

5.3.3 Effect of K-max Weighted Attention Mechanism. In
entity resolution tasks, our model aligns one token with several
other tokens using soft attention. The traditional attention mechanism
obtains the corresponding soft representation by an attention-weighted
sum of all word representations [37], which will inevitably
bring in noise and redundancy. In this paper, we propose a k-max
weighted attention mechanism to alleviate redundancy. The basic
idea is to filter out irrelevant words and keep only the k largest
attention weights.
In order to verify the effectiveness of the k-max weighted attention
mechanism and select the best value of k, we conduct experiments
with k taking values in {1, 2, 3, 4, all}. In these experimental
settings, k = 1 denotes hard attention and k = all denotes the
traditional attention mechanism. From Figure 5, we can see that:

• The soft attention mechanism (k ∈ {2, 3, 4, all}) is more
effective than hard attention (k = 1) in most cases.
We believe that is because soft attention can capture
a soft-attended correspondence between entities, while
hard attention only selects the most relevant token to
compute the entity representation.
• The k-max weighted attention mechanism (k ∈ {2, 3, 4}) is
usually more effective than the traditional attention mechanism
(k = all). We attribute this to the fact that the proposed k-max
weighted attention mechanism can filter out redundant words and
focus more on relevant information when calculating the
soft-attended representations of tokens.
• Under the k-max weighted attention mechanism, the appropriate
choice of k is dataset dependent. The properties of the datasets
can determine the proportion of relevant words, thus influencing
the best choice of k.

6 CONCLUSION
In this paper, we propose a sequence-to-sequence entity matching
model Seq2SeqMatcher for heterogeneous entity resolution. By
modeling ER as a token-level sequence-to-sequence matching task,
our model can effectively solve the heterogeneous and dirty
problems. Furthermore, by learning the representation of tokens,
capturing the semantic relevance between tokens, and aggregating
matching evidence for accurate ER decisions in an end-to-end manner,
our Seq2Seq entity matching model achieves substantial performance
improvements on 9 standard entity resolution benchmarks.
For future work, we find that the token embedding in entity
resolution is a big challenge, because many words in ER are
out-of-vocabulary (OOV) words or numbers, and a word's context is
usually sparse and structural. To address these issues, we would
like to develop an embedding algorithm which is robust in entity
resolution and can effectively take the schema and structure of
the knowledge base into consideration.

ACKNOWLEDGMENTS
This work is supported by the National Key R&D Program of China
under Grant 2018YFB1005100; the National Natural Science Foundation
of China under Grants no. 61433015, 61572477 and 61772505;
the Young Elite Scientists Sponsorship Program no. YESS20160177;
and the University of Chinese Academy of Sciences.
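
As a supplementary illustration of the k-max weighted attention mechanism analyzed in Section 5.3.3, the following minimal sketch keeps the k largest attention weights, zeroes the rest, and renormalizes before computing the soft representation. Several details are assumptions made for the sake of a runnable example: dot-product scoring, a softmax over the scores, and renormalization of the kept weights. The function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def k_max_weights(weights, k=None):
    """Keep the k largest weights, zero the others, and renormalize.

    k=None plays the role of k = all (traditional attention);
    k=1 reduces to hard attention. Renormalization is an assumption.
    """
    if k is None or k >= len(weights):
        return list(weights)
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    keep = set(order[:k])
    kept = [w if j in keep else 0.0 for j, w in enumerate(weights)]
    total = sum(kept)
    return [w / total for w in kept]

def soft_representation(query, context, k=None):
    """Attention-weighted sum of context vectors under k-max filtering."""
    # Assumption: dot-product relevance between the query and each context token.
    scores = [sum(q * c for q, c in zip(query, vec)) for vec in context]
    weights = k_max_weights(softmax(scores), k)
    dim = len(context[0])
    return [sum(w * vec[d] for w, vec in zip(weights, context)) for d in range(dim)]

# Toy example with invented vectors.
query = [1.0, 0.0]
context = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hard = soft_representation(query, context, k=1)    # hard attention
top2 = k_max_weights([0.5, 0.3, 0.15, 0.05], k=2)  # keeps the two largest weights
```

In this sketch, k = 1 returns exactly the single most relevant context vector, while larger k blends the top-k vectors, which corresponds to the hard versus soft attention comparison in Figure 5.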