
Deep Sequence-to-Sequence Entity Matching for Heterogeneous Entity Resolution

Hao Nie1,3, Xianpei Han1,2, Ben He3,1*, Le Sun1,2, Bo Chen1, Wei Zhang4, Suhui Wu4, Hao Kong4
1 Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China
2 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
3 University of Chinese Academy of Sciences, Beijing, China
4 Alibaba Group, Hangzhou, China
1 {niehao2016, xianpei, sunle, chenbo}@iscas.ac.cn  3 benhe@ucas.ac.cn  4 {lantu.zw, linnai.wsh, konghao.kh}@alibaba-inc.com

ABSTRACT
Entity Resolution (ER) identifies records from different data sources that refer to the same real-world entity. Conventional ER approaches usually employ a structure matching mechanism, where attributes are aligned, compared, and aggregated for the ER decision. The structure matching approaches, unfortunately, often suffer from heterogeneous and dirty ER problems. That is, entities from different data sources are described using different schemas, and attribute values may be misplaced, missing, or noisy. In this paper, we propose a deep sequence-to-sequence entity matching model, denoted Seq2SeqMatcher, which can effectively solve the heterogeneous and dirty problems by modeling ER as a token-level sequence-to-sequence matching task. Specifically, we propose an align-compare-aggregate neural network for Seq2Seq entity matching, which can learn the representations of tokens, capture the semantic relevance between tokens, and aggregate matching evidence for accurate ER decisions in an end-to-end manner. Experimental results show that, by comparing entity records at the token level and learning all components in an end-to-end manner, our Seq2Seq entity matching model achieves remarkable performance improvements on 9 standard entity resolution benchmarks.

CCS CONCEPTS
• Information systems → Entity resolution; Deduplication; • Computing methodologies → Ontology engineering.

KEYWORDS
entity resolution; attribute heterogeneity; matching; deep learning

ACM Reference Format:
Hao Nie, Xianpei Han, Ben He, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, Hao Kong. 2019. Deep Sequence-to-Sequence Entity Matching for Heterogeneous Entity Resolution. In The 28th ACM International Conference on Information and Knowledge Management (CIKM '19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3357384.3358018

*Corresponding authors.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CIKM '19, November 3–7, 2019, Beijing, China
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6976-3/19/11…$15.00
https://doi.org/10.1145/3357384.3358018

Figure 1: An entity resolution example under the Seq2Seq entity matching framework. Record e1, with schema (Name, Brand, Address), has Name "Apple iphone 8 plus", an empty Brand, and Address "CA"; record e2, with schema (Name, Manufacturer, Location, Price), has Name "iphone 8p", Manufacturer "Apple", Location "California", and Price "699.9". Both are linearized into <attribute, word> token sequences, e.g. [Name, Apple] [Name, iphone] [Name, 8] [Name, plus] [Address, CA] and [Name, iphone] [Name, 8p] [Manufacturer, Apple] [Location, California] [Price, 699.9], which are matched in a sequence-to-sequence manner to decide match or non-match.

1 INTRODUCTION
Entity resolution (ER) aims to identify records referring to the same real-world entity. For example, in Figure 1, the entity records e1 and e2, from Walmart and Amazon respectively, would be resolved as the same entity because they refer to the same real-world product. ER is important in knowledge integration [8–10], data cleaning [6], and information management [16], and has therefore received significant research attention in recent years [13, 15, 17, 27, 35].
In entity resolution, each record is a structured object composed of one or more <attribute, value> pairs. Conventional approaches usually model entity resolution as a structure-to-structure matching task. Concretely, attributes are first aligned, either manually or automatically, then similarities between corresponding attribute values are computed, and finally the similarity scores of the aligned attributes are aggregated into the final similarity between records. For example, to resolve e1 and e2 in Figure 1, structure matching systems first identify Name and Address (Location) as aligned attributes, then the total similarity is computed by aggregating the similarities of the corresponding attribute values ("Apple iphone 8 plus", "iphone 8p") under Name and (CA, California) under Address (Location).
The structure matching approaches, unfortunately, often face problems when entity records are heterogeneous or dirty.


Firstly, in real-world ER applications, entity records usually come from different data sources that are described using different schemas. This hinders us from comparing corresponding attribute values directly. For example, in Figure 1, entity records e1 and e2 are heterogeneous because they are described using different schemas: (Name, Brand, Address) vs. (Name, Manufacturer, Location, Price). Traditional structure matching models cannot be directly applied to these heterogeneous cases. Secondly, attribute values of records are often missing, noisy, or misplaced. For example, in record e1 in Figure 1, the attribute value "Apple", which belongs under attribute Brand, is misplaced under attribute Name, and the value for attribute Brand is missing. Because traditional structure matching approaches cannot compare attribute values across different attributes, they ignore important matching evidence in such dirty cases. As depicted in Figure 1, a structure matching model cannot identify entity records e1 and e2 as being associated with the same Brand, "Apple", since the values reside in different attributes.
In this paper, we propose a new entity resolution approach, Seq2Seq entity matching, aiming to effectively solve the heterogeneous and dirty cases by modeling ER as a token-level sequence-to-sequence matching task. Figure 1 shows the general architecture of our approach, where each record is linearized as a token sequence and each token is a pair of the form <attribute, word>. Our Seq2Seq entity matching approach: 1) compares records at the token level instead of the attribute level, so no attribute alignment knowledge is needed and heterogeneous schemas are handled naturally; and 2) compares tokens across attributes and automatically learns the contribution of each token to the final ER decision, so dirty cases can be effectively solved. For example, our method is able to resolve e1 and e2 by identifying token pairs (<Name, Apple>, <Manufacturer, Apple>), (<Address, CA>, <Location, California>), etc., which is robust to heterogeneous schemas and dirty values.
To this end, we design an align-compare-aggregate [37] neural network for Seq2Seq entity matching, which can learn the representations of tokens, capture the semantic relevance between tokens, and aggregate matching evidence for accurate ER decisions in an end-to-end manner. Specifically, given two entities S and T, our approach first represents each token by concatenating its word embedding and the corresponding attribute embedding, then finds the local relatedness between tokens using neural attention [1] from two directions [47], and eventually aggregates all local semantic matching evidence for the final ER decision. In this way, our Seq2Seq matching network can automatically learn the global relatedness between different entities.
We evaluate our approach on three kinds of datasets: attribute-aligned, dirty, and heterogeneous. The experimental results show that the proposed model substantially outperforms the state-of-the-art approaches in all of the above settings. The main contributions of this paper can be summarized as follows:
• We propose a new entity resolution framework, Seq2Seq entity matching, which can effectively solve the heterogeneous and dirty entity resolution problems by transforming attribute-level structure-to-structure matching into token-level sequence-to-sequence matching.
• We propose a deep align-compare-aggregate matching neural network, Seq2SeqMatcher, for Seq2Seq entity matching, which can effectively learn the token representations, capture the semantic relevance between tokens, and aggregate all matching evidence for accurate ER decisions in an end-to-end manner.
• The proposed approach achieves state-of-the-art results on multiple benchmarks covering the attribute-aligned, dirty, and heterogeneous cases.

2 RELATED WORK
Entity resolution has received widespread attention [12, 17, 20, 30, 34, 49, 50] in the past few decades. Traditional ER approaches can be roughly grouped into two categories: machine-learning based methods and rule-based methods. The machine-learning based methods, e.g., [3, 7, 39, 46], usually learn a classifier (e.g., a decision tree or an SVM) over a list of pre-defined handcrafted features from a given annotated dataset to predict whether an entity pair is a "match" or a "non-match". However, such approaches require feature engineering and the results are not explainable. To address the issues faced by machine-learning based methods, rule-based solutions [14, 41, 42] estimate the similarity between corresponding attribute values using various similarity metrics [3, 5] and make ER decisions based on pre-defined rules and thresholds, in order to mitigate the need for feature engineering and, at the same time, allow for interpretable matching results. However, the rule-based methods require heavy involvement of domain experts to select the matching rules and appropriate thresholds. To address these issues, solutions have been proposed to learn matching functions and thresholds automatically [26, 43], without the need for human effort from domain experts.
With the development of deep learning (DL) [19, 31, 40], many natural language processing (NLP) [18, 22] and computer vision (CV) [21, 29] applications have improved the state of the art, owing to DL's ability in abstraction and representation learning in an end-to-end fashion. In the meantime, there has also been growing interest in applying deep learning techniques to entity resolution. In particular, there are two lines of such studies, namely the representation based methods and the compare-aggregate based methods.
The Representation Based Methods first learn a distributed representation for each entity record, then estimate the similarity between two records by comparing their vector representations. This kind of method tries to condense the overall semantics of an entity into a single low-dimensional vector, which cannot capture minor differences between entities [12].
DeepER [12], a recent pioneering work following this direction, focuses on designing deep learning solutions for entity resolution. In DeepER, two models, AVG and LSTM-RNN [23], are employed to model the entity representations. Compared with the representation based methods, our approach makes ER decisions by modeling token-level interactions, without encoding the whole entity into a low-dimensional vector. Such a technique has shown superiority in a number of natural language processing applications. For example, in [32], strong interactions of sentence pairs are modeled via two coupled LSTMs to avoid sentence encoding.


Compare-Aggregate Based Methods take a different way to solve the ER problem. Specifically, the subunits (e.g., words or phrases) in the entities are first compared to get multiple matching signals, and these matching signals are then aggregated to make the final ER decision. DeepMatcher [34] is a representative work following this direction, which attempts to explore the design space of deep learning for entity resolution. Specifically, it first compares pairwise attribute values to get corresponding matching vectors; these matching vectors are then aggregated to generate the final similarity score. While our work shares similarities with DeepMatcher, a major difference is that DeepMatcher is not applicable to scenarios where entities are heterogeneous or the attribute values are misplaced, which prevents direct comparison of corresponding attribute values. In contrast, our model utilizes token-level alignments to perform entity resolution without using any attribute alignment information, allowing adaptive use in heterogeneous environments or dirty cases.

3 PROBLEM FORMULATION
3.1 Problem Settings.
In this paper, entities are real-world objects (e.g., places, products, scholarly articles, persons, organizations, etc.), which are structured as <attribute, value> pairs. Given a pair of entities, the aim of the entity resolution task is to judge whether it is a "match" or a "non-match".
Let E and E' be two collections of entities from the same or different data sources. This paper assumes that entities in E and E' are either described by the same attributes (i.e., the same schema with attributes A1, ..., Am in the case of attribute-aligned data) or by different attributes (i.e., different schemas with attributes A1, ..., Am and B1, ..., Bn in the case of heterogeneous data). The goal of entity resolution is to identify all pairs of entity records between E and E' that refer to the same real-world entity; we call these pairs matches. To find all matches, an entity resolution system first finds candidate matches C by performing a blocking step, then a matching algorithm is performed on the candidate set C to identify the correct matches. This paper mainly focuses on the matching step of entity resolution, since the blocking step has been well addressed by blocking algorithms as in [11, 36, 48].
Formally, given two collections of entities E and E' and a candidate set C containing entity pairs (e1 ∈ E, e2 ∈ E'), in order to apply learning-based techniques to entity resolution, we further assume that we have a labeled dataset T of triples {(e_{i1}, e_{i2}, l)}_{i=1}^{|T|}, where {(e_{i1}, e_{i2})}_{i=1}^{|T|} ⊂ C and l is a label with value 1 or 0, indicating that the entities e_{i1} and e_{i2} are a "match" or a "non-match". Given the labeled dataset T, our goal is to design a matching function M that can accurately distinguish between "match" and "non-match" pairs (e1, e2) in C.

3.2 Types of ER Problems.
As discussed previously, we consider three types of entity resolution problems in this paper:
• Attribute-Aligned ER. In this category, entity records from E and E' are described using the same schema with attributes A1, ..., Am. See Table 1 for an example.
• Dirty ER. In this category, entity records in E and E' are described by the same schema with attributes A1, ..., Am. However, attribute values may be misplaced under the wrong attribute (i.e., attribute values are not associated with their corresponding attribute in the schema). See Table 2 for an example: the word "Adobe" in entity e1, which belongs under attribute Brand, is misplaced into attribute Name. To effectively resolve dirty records, an ER system must be able to compare across attributes and be robust to noise.
• Heterogeneous ER. In this category, entities in E and E' are described using different schemas (A1, ..., Am) and (B1, ..., Bn). These entities may come from different data sources where different schemas, e.g., (Name, Brand, Location) and (Name, Manufacturer, Location, Price), are used to describe the same real-world entity (see Table 3). The heterogeneity of the entities hinders the direct comparison of corresponding attribute values.

Table 1: An entity record pair of the attribute-aligned ER problem.

     Name          City      Age
e1   Dave Smith    New York  18
e2   David Smith   NYC       18

Table 2: An entity record pair of the dirty ER problem.

     Name             Brand   Price
e1   Adobe Acrobat8           299.99
e2   Acrobat 8        Adobe   299.99

Table 3: An entity record pair of the heterogeneous ER problem.

     Name            Brand          Location
e1   iphone 8 plus   Apple          CA
     Name            Manufacturer   Location     Price
e2   iphone 8p       Apple Inc.     California   699.0

4 SEQ2SEQ ENTITY MATCHING FOR HETEROGENEOUS ENTITY RESOLUTION
In this section, we describe how to solve the entity resolution problem via sequence-to-sequence entity matching. Figure 2 shows the framework of our Seq2Seq entity matching model, which consists of an embedding layer for token representations, an alignment layer for capturing local relevance, a comparison layer for local similarity, and an aggregation layer for the global similarity. For each pair of entities S and T, our model is devised to output match or non-match based on their similarity sim(S, T). In the following, we first explain each layer step by step, before a description of model training.


Figure 2: Model overview. From bottom to top, the model consists of a Representation Layer (word embedding concatenated with attribute embedding), an Alignment Layer producing the alignment matrix A_{m×n}, a Comparison Layer producing the comparison matrices C1 and C2, an Aggregation Layer, and a Prediction Layer that outputs match or non-match, illustrated on the records e1 (Name: "Apple iphone 8 plus", Location: "CA") and e2 (Name: "iphone 8p", Manufacturer: "Apple", Location: "California", Price: "699").

4.1 Seq2Seq Entity Matching Network
Given a pair of entity records S and T, our approach first linearizes each record as a token sequence S := [s_1, s_2, ..., s_m] and T := [t_1, t_2, ..., t_n], where each token is an <attribute, word> pair. For example, in Figure 2, the entity record e1 is linearized as the token sequence "<Name, Apple>, <Name, iphone>, <Name, 8>, <Name, plus>, <Location, CA>". Given the token sequences of S and T, our matching network performs entity resolution through the five layers in Figure 2, which we describe step by step as follows.
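As an illustration of this linearization step, the sketch below (our own minimal example, not the authors' released code) turns a record given as attribute-value pairs into the <attribute, word> token sequence used by the matcher; simple whitespace tokenization is an assumption here.

```python
def linearize(record):
    """Flatten a structured record into a sequence of <attribute, word> tokens.

    `record` maps attribute names to string values; missing or empty values
    simply contribute no tokens.
    """
    tokens = []
    for attribute, value in record.items():
        if not value:
            continue
        for word in str(value).split():  # assumption: whitespace tokenization
            tokens.append((attribute, word))
    return tokens


e1 = {"Name": "Apple iphone 8 plus", "Brand": "", "Address": "CA"}
print(linearize(e1))
# [('Name', 'Apple'), ('Name', 'iphone'), ('Name', '8'), ('Name', 'plus'), ('Address', 'CA')]
```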
Representation Layer. This layer embeds each token in S and T into a low-dimensional vector, so that it can be compared and aligned in the following layers. Unlike in previous word embedding studies, each token in our model is a structured pair <attribute, word>. Therefore, the main challenge is how to take both attribute information and word information into consideration. Furthermore, the attribute provides useful context for disambiguating words: although "Apple" is ambiguous, <Brand, Apple> is not.
Specifically, we represent an input token t := <a, w> as [e_a, e_w] by concatenating the word embedding with the corresponding attribute embedding, where e_w is the word embedding of w and e_a is the corresponding attribute embedding. In recent years, pre-trained word embeddings such as word2vec [33], GloVe [38], and character-level embeddings (e.g., FastText [4]) have been widely applied. Following DeepMatcher [34], this paper uses FastText to embed words, which we suggest is well suited to the entity resolution task, as it can handle out-of-vocabulary (OOV) words through character-level representations. The attribute embeddings are randomly initialized and learned simultaneously with the other parameters of the model. In this way, entity S is represented as a vector sequence S := [e_{a1}, e_{w1}], [e_{a2}, e_{w2}], ..., [e_{am}, e_{wm}]. We represent entity T in the same way.
For different data sources, our method utilizes independent attribute embeddings but shared word embeddings. Notice that the learned attribute embeddings can also be used to induce attribute alignments between different schemas. For example, our model would learn similar embeddings for attribute Location in entity e1 and attribute Address in entity e2.
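A minimal sketch of this representation layer is given below, assuming pre-computed (FastText-style) word vectors passed in as a tensor and a small trainable attribute embedding table; the dimensions and module name are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class TokenRepresentation(nn.Module):
    """Represent each <attribute, word> token as the concatenation [e_a ; e_w]."""

    def __init__(self, num_attributes, attr_dim=50, word_dim=300):
        super().__init__()
        # Attribute embeddings are randomly initialized and trained with the model.
        self.attr_emb = nn.Embedding(num_attributes, attr_dim)
        self.out_dim = attr_dim + word_dim

    def forward(self, attr_ids, word_vecs):
        # attr_ids: (seq_len,) attribute index of every token
        # word_vecs: (seq_len, word_dim) pre-computed, frozen word vectors
        return torch.cat([self.attr_emb(attr_ids), word_vecs], dim=-1)

# toy usage: 3 tokens of a record with 2 distinct attributes
rep = TokenRepresentation(num_attributes=2)
attr_ids = torch.tensor([0, 0, 1])
word_vecs = torch.randn(3, 300)          # stand-in for FastText vectors
print(rep(attr_ids, word_vecs).shape)    # torch.Size([3, 350])
```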
Table 4: Entity pair example showing varying word importance. The tokens "photo-editing software" in entity e2 are not as important as the other tokens when compared with entity e1.

     Title                                                         Manufacturer  Price
e1   adobe photoshop elements 4.0 (mac)                            adobe         89.99
e2   adobe photoshop elements 4.0 photo-editing software for mac   Adobe         85.95

Alignment Layer. In entity resolution, one critical step is to identify the correspondence between tokens. For example, to compare entities e1 and e2 in Figure 2, we need to find the correspondence between <Name, Apple> and <Manufacturer, Apple>, between <Location, CA> and <Address, California>, etc. In our model, the alignment layer uses an attention mechanism to compute the correspondence between different tokens.
Specifically, given the vector sequences of S and T, we compute a bidirectional alignment (i.e., S → T and T → S). In the forward direction S → T, we construct an alignment matrix A_{m×n} (m and n denote the numbers of tokens in entities S and T, respectively), where each element a_{ij} ∈ A_{m×n} is the normalized attention score between the representations of tokens s_i and t_j, computed using neural attention [1] as:

a_{ij} = s_i \cdot t_j, \quad \forall s_i \in S, \forall t_j \in T        (1)

Each row of the alignment matrix A_{m×n} represents the attention scores of a word in entity S with respect to all of the words in entity T. For example, for token <Name, Apple> in entity e1, its main attention score focuses on <Manufacturer, Apple>. For the alignment from T → S, we construct an alignment matrix A'_{n×m} in the same way.
In entity resolution, the importance of tokens varies when it comes to the overall semantics of the entity. In our setting, if a word in one entity has a counterpart in the other entity, this word shall be more important and should be paid more attention when doing word comparisons (as depicted in Table 4). Based on this intuition, we put forward a new mechanism to estimate the relevance of words. The distribution of each row in the word alignment matrix represents the similarities of the corresponding word in one entity with all the words in the other entity. If a word in one entity has similar words in the other entity, the distribution curve of the corresponding row shall have some sharp points at its counterparts; if not, the distribution curve of this row shall be flat. As a result, we can induce the relative importance of a word in the entity from the distribution of the corresponding row in the alignment matrix. Inspired by this analysis, we compute the standard deviation of each row of the alignment matrix to form a weight vector with normalization. The resultant weight vector can be utilized to estimate the relative importance of all words in the entity. We denote the weight vectors for the entity pair S and T as w = [w_1, w_2, ..., w_m] and w' = [w'_1, w'_2, ..., w'_n]. The proposed mechanism can learn word importance automatically without introducing additional parameters.
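The sketch below shows one way the alignment matrices and importance weights could be computed in PyTorch, under our own simplifying assumptions: dot-product scores as in Eq. (1), a row-wise softmax as the normalization, and the row standard deviations normalized to sum to one. It is an illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def align(S, T):
    """S: (m, d) token vectors of entity S; T: (n, d) token vectors of entity T.

    Returns the S->T alignment matrix A (m, n), the T->S alignment matrix
    A_rev (n, m), and the word-importance weights w (m,) and w_rev (n,)
    obtained from the row-wise standard deviation of each alignment matrix.
    """
    scores = S @ T.t()                       # raw dot-product scores, Eq. (1)
    A = F.softmax(scores, dim=1)             # S -> T, rows normalized (assumption)
    A_rev = F.softmax(scores.t(), dim=1)     # T -> S

    w = A.std(dim=1)                         # sharp rows -> important words
    w = w / w.sum()                          # normalize to a weight vector (assumption)
    w_rev = A_rev.std(dim=1)
    w_rev = w_rev / w_rev.sum()
    return A, A_rev, w, w_rev

S, T = torch.randn(5, 350), torch.randn(7, 350)
A, A_rev, w, w_rev = align(S, T)
print(A.shape, A_rev.shape, w.shape, w_rev.shape)  # (5,7) (7,5) (5,) (7,)
```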


Comparison Layer. Based on the alignment matrices and weight vectors computed above, the aim of the comparison layer is to generate a series of matching signals. Intuitively, we first compute a soft-attended representation for each token in one entity using all tokens in the other entity, as described below:

\hat{s}_i = \sum_{j=1}^{n} a_{ij} t_j        (2)

where \hat{s}_i is the soft-attended representation of the i-th word in entity S. There exist two problems with such an attention mechanism. On the one hand, only a few fragments in the two entities share relevant meaning with each other; the semantic relation is obscured when irrelevant fragments are involved, and it is clearly more reasonable to combine only the relevant fragments to obtain the corresponding semantic matching vector. On the other hand, in the case that s_i in S is irrelevant to all fragments in T, s_i needs to be discarded, since we know there is no match for it in T and it should not be involved in the further matching process. However, after the normalization of attention weights, an aligned subphrase is still generated, which acts as noise for the following phases.
Inspired by [2], we propose to filter the alignment matrix to address this issue, using a mechanism called k-max weighted attention. Specifically, let S be the primary entity and T be the context entity; the elements in the i-th row of the alignment matrix A_{m×n} constitute a vector a = {a_{i1}, a_{i2}, ..., a_{in}}. The elements in a are sorted in descending order, and the indexes of the top k largest attention weights are denoted as L = {i_1, i_2, ..., i_k}. We keep a_{ij} if it is among the top k in a and set it to zero otherwise, as follows:

a_{ij} = \begin{cases} \dfrac{a_{ij}}{\sum_{j' \in L} a_{ij'}} & j \in L \\ 0 & j \notin L \end{cases}        (3)

Thus we modify the computation of the soft-attended representation of token s_i as below:

\hat{s}_i = \sum_{j \in L} a_{ij} t_j        (4)

By virtue of k-max weighted attention, relevant fragments are preserved while irrelevant ones are discarded in obtaining the corresponding soft-attended representation of token s_i. In this way, irrelevant fragments are not involved in the corresponding soft-attended representation. Moreover, a fragment without semantic matching relations in the other entity does not take effect in the subsequent matching phases. In terms of the choice of k, we conduct a detailed analysis in Section 5.3.3 and find that it is dataset-specific.
After the steps above, we get the soft-attended representations concatenate them and then pass the resultant vector to a two layer
of all the words in entity S and entity T. For these two entities, we fully-connected layer followed by a softmax classifier to get the fi-
compare each word in one entity with its soft-attended represen- nal similarity score of the entity pair (S, T ). We set the similarity
tation computed using k-most relevant words in the other entity. threshold to 0.5 in this work, and predict the entity pair as a match
Specifically, we use the comparison function element-wise absolute if the similarity score is above the threshold and a non-match oth-
0
difference to calculate a comparison vector c®i and c®i for the ith erwise.
word in entity S and T respectively.
4.2 Model Learning
c®i = |si − sˆi | |T |
(5) Given a training corpus T = {(ei1 , ei2 , li )}i=1 , where ei1 and ei2 are
0
c®i = ti − tˆi a pair of entities, and li ∈ {0, 1} indicates the similarity between

633
Session: Long - Heterogeneous Data CIKM ’19, November 3–7, 2019, Beijing, China

them, we assign l i = 1 if Si and Ti is a match, and l i = 0 other- Table 5: Statistics of three kinds of datasets used in our ex-
wise. we train our Seq2Seq entity matching model by minimizing periments. Columns “Size”, “#Pos.”, and “#Attr.” list the total
the cross-entropy objective function: numbers of examples, positive examples, and attributes for
each dataset. For the heterogeneous datasets, the number of
|T | h
Õ    i attributes in the left entity is unequal to that of the right
J =− l (i) log p̂ (i) + 1 − l (i) log 1 − p̂ (i) (8) entity. So that column “#Attr.” for the last two datasets are
i=1 “4/5” and “6/8”.
In ER datasets, the positive instances are often far less than neg-
ative ones. To avoid the data-imbalance problem [45], we penalize Type Dataset Domain Size #Pos. #Attr.
errors corresponding to positive instances and negative instances Walmart-Amazon1 electronics 10,242 962 5
DBLP-ACM1 citation 12,363 2,220 4
(entity pairs labeled as “non-match”) with different weights. The Attribute-Aligned
DBLP-Scholar1 citation 28,707 5,347 4
ratio of the weight of the positive class to that of the negative class Hema-Taobao1 white spirits 22,190 2,219 7
ρ is a dataset-specific hyper-parameter, which is set as the ratio of Walmart-Amazon2 electronics 10,242 962 5
the number of negative instances to that of positive instances. The Dirty DBLP-ACM2 citation 12,363 2,220 4
DBLP-Scholar2 citation 28,707 5,347 4
final weighted cross-entropy objective is: Walmart-Amazon3 electronics 10,242 962 4/5
Heterogeneous
Hema-Taobao2 olive oil 13,368 2,228 6/8
|T | h
Õ    i
J =− ρl (i) log p̂ (i) + 1 − l (i) log 1 − p̂ (i) (9)
i=1 compare the corresponding attribute values directly. As a re-
sult, the DeepMater is not applicable to this kind of datasets.
5 EXPERIMENTS Under this scenario, we concatenate all the attribute values
5.1 Experimental Settings for each entity to form a text set and then use DeepMatcher
Datasets. To evaluate different approaches on different ER set- to experiment on these datasets.
tings, this paper conducts experiments on three types of datasets • Magellan is a classical learning-based approach which has
(see Table 5 for statistics of the datasets). For attribute-aligned en- achieved competitive performance on multiple datasets. Mag-
tity resolution, we use 4 datasets. The first 3 datasets are publicly ellan trains a list of classifiers (decision tree, random forest,
available and have been used for entity resolution (e.g., [28, 34]). Naive Bayes, SVM and logistic regression) based on a large
The last Chinese dataset Hema-Taobao1 describes white spirits prod- set of features, and selects the best model on the validation
ucts and is created from two e-commerce platforms Hema1 and set. Note that, due to the use of homogeneous attribute pairs
Taobao2 . For dirty entity resolution, we use 3 open released datasets for the feature generation, Magellan is not applicable to sce-
from [34], where attribute values are randomly misplaced under at- narios where entities are heterogeneous. We therefore omit
tribute Title with 50% probability. For heterogeneous entity res- the results of this approach on the heterogeneous datasets.
olution, we use two datasets, one is Hema-Taobao2 , which is about To evaluate the performance of our model and all the baselines,
olive oil and also from the two e-commerce platforms Hema and we use precision (P), the fraction of correct match predictions; re-
Taobao, with the entity pairs described using different attributes. call (R), the fraction of correct matches being predicted as matches,
Specifically, for entities in the “Hema” platform, they are described and F1 , defined as 2PR/(P+R), as used in [34].
by attributes list of (Title, Name, Brand, Area, Net content, Model Training. We implement the model with PyTorch3 , an
Packaging), while for entities in the “Taobao” platform, the at- open-source deep learning framework with extensive support for
tribute list is (Title, Brand, Series, Area, Net content, accelerating training using GPUs, and run experiments on a server
Packaging, Province, City). The two diverse attribute lists equipped with Intel(R) Xeon(R) E5-2683 CPU, 128GB memory, and
from “Hema” and “Taobao” constitute a typical heterogeneous en- Nvidia TITAN X (Pascal) GPU.
tity resolution problem in real-world applications. The other dataset Using all competing methods, the best models on the validation
for heterogeneous ER problem is a pseudo dataset created by con- set are selected based on the F1 score of the positive instances,
catenating the two specific attribute values in the attribute-aligned and the subsequent performances on the test set are reported. The
Walmart-Amazon (see Table 5) dataset. Those threes types of datasets word embeddings for the English dataset are initialized using the
come with pre-defined train, validation, and test partitions. 300 dimensional FastText [4], a character-level embedding, which
Baselines. In this paper, we compare with two state-of-the-art can approximate the embeddings of the out-of-vocabulary (OOV)
ER baselines – DeepMatcher [34] and Magellan [26]. words using character-level word representations. For the Chinese
• DeepMatcher is a state-of-the-art deep learning based ER ap- datasets, the word embeddings are initialized with the 200 dimen-
proach, which is a structure matching model by: 1) learning sional Tencent AILab Chinese Embedding [44]. The OOV words are
the representations of each attribute value, 2) comparing randomly initialized from a normal distribution with the mean of
the similarity between corresponding attribute representa- 0 and the standard deviation of 0.1. All word embeddings are fixed
tions, and 3) aggregating the matching signals for the final during training. The attribute embeddings are again randomly ini-
ER decision. For the heterogeneous datasets, the attributes tialized from a normal distribution with 0 mean and 0.1 standard
are heterogeneous for different data sources, so we cannot deviation, and are trained simultaneously with other parameters.
The dimension of the attribute embeddings is 50 for English datasets,
1 https://www.freshhema.com
2 https://www.taobao.com 3 https://pytorch.org/

634
Session: Long - Heterogeneous Data CIKM ’19, November 3–7, 2019, Beijing, China

Table 6: Experimental results on three types of datasets. ∆F1 denotes the performance gap between the proposed model –
SeqSeqMatcher and the state-of-the-art approaches.

Seq2SeqMatcher DeepMatcher Magellan


Type Dataset P R F1 P R F1 P R F1 ∆ F1

Walmart-Amazon1 86.3 71.5 78.2 70.9 64.6 67.6 72.3 71.5 71.9 +6.3
DBLP-ACM1 98.4 99.3 98.9 98.0 98.9 98.4 97.4 99.6 98.4 +0.5
Attribute-Aligned DBLP-Scholar1 92.8 97.9 95.3 94.8 94.5 94.7 94.3 90.4 92.3 +0.6
Hema-Taobao1 81.3 79.4 80.3 53.5 78.4 63.6 - - - +16.7
Walmart-Amazon2 72.3 64.8 68.3 56.3 51.6 53.8 33.8 42.0 37.4 +14.5
Dirty DBLP-ACM2 98.9 98.0 98.4 98.6 97.5 98.1 93.7 90.1 91.9 +0.3
DBLP-Scholar2 92.7 95.6 94.1 94.3 93.4 93.8 87.1 78.4 82.5 +0.3
Walmart-Amazon3 83.1 68.9 75.4 74.9 67.9 71.2 - - - +4.2
Heterogeneous Hema-Taobao2 95.9 94.0 94.9 71.0 66.0 68.4 - - - +26.5

Table 7: Experimental results of the modified models on three types of datasets. SeqSeqMatcher(-AttrInfo) denotes the model
with attribute information discarded from the main model as described in section 5.3.1. SeqSeqMatcher(-BiAlign) denotes the
model with the bidirectional alignment mechanism removed from the main model as described in section 5.3.2. ∆F1 under
each column denotes the performance gap between the corresponding modified model and the main model Seq2SeqMacher.

SeqSeqMatcher(-AttrInfo) SeqSeqMatcher(-BiAlign)
Type P R F1 ∆ F1 P R F1 ∆ F1
Walmart-Amazon1 43.1 56.5 48.9 -29.3 74.3 69.4 71.8 -6.4
DBLP-ACM2 97.3 97.1 97.2 -1.7 96.9 98.9 97.9 -1.0
Attribute-Aligned DBLP-Scholar2 86.3 92.4 89.3 -6.0 89.8 90.2 90.0 -5.3
Hema-Taobao2 66.8 81.7 73.5 -6.8 77.8 81.5 79.6 -0.7
Walmart-Amazon2 53.0 40.9 46.2 -22.1 65.0 68.4 66.7 -1.6
Dirty DBLP-ACM2 96.0 96.6 96.3 -2.1 94.3 96.4 95.3 -3.1
DBLP-Scholar2 89.1 89.1 89.1 -5.0 86.1 87.8 86.9 -7.2
Walmart-Amazon2 68.4 53.9 60.3 -15.1 72.2 74.1 73.1 -2.3
Heterogeneous Hema-Taobao2 70.1 79.6 74.5 -20.4 89.3 97.2 93.1 -1.8

and 40 for Chinese datasets. Adam [25] is used for parameters op- • The proposed model – Seq2SeqMatcher can effectively solve
timization with an initial learning rate of 0.001. The mini-batch the heterogeneous and dirty ER problems. Compared with
size is set to 32 for all datasets. The filter types include {unigram, DeepMatcher and Magellan, our method can achieve con-
bigram, trigram} and each type of filter contains 100 filters. The siderable performance gain on all dirty and heterogeneous
choice of value k used in the k-max weighted attention mechanism datasets. We believe that is because token-level compari-
is dataset-specific and based on the analysis in the following part. son can capture the token-level relatedness between entity
records, which are more fine-grained and more robust for
similarity judgment.
• The proposed model can achieve robust performance on dif-
5.2 Overall Performance ferent settings. In the attribute-aligned settings, our model
Table 6 shows the overall performance on all attribute-aligned, dirty does not use any attribute alignment information, but can
and heterogeneous datasets. For DeepMatcher and Magellan, we still achieve the best performance on all datasets. We believe
directly use their reported performance from [34] on attribute-aligned this is because our neural model can automatically learn the
and dirty settings. For heterogeneous settings, we only report the correspondence between aligned attribute values by utiliz-
performance of DeepMatcher by concatenating all attribute values ing the latent semantic relatedness.
of each entity to a composite attribute value as it is not directly • By comparing tokens across different attributes, our model
applicable to this scenario. We do not report the performance of appears to be robust when dealing with the dirty ER prob-
Magellan on the heterogeneous datasets, since it is not applicable lems. On the dirty datasets, our model achieves notable per-
to this case. However, the proposed model – Seq2SeqMatcher is ap- formance gain compared with DeepMatcher and Magellan.
plicable to all above settings. From Table 6, it can be seen that: We believe this is mainly because the token-level model is


• By comparing entities at the token level and learning the matching models in an end-to-end manner, our proposed model can effectively address the entity heterogeneity problem. As we can see, Seq2SeqMatcher shows remarkable performance on all heterogeneous datasets. We owe this to the powerful ability of our model to capture the latent associations between relevant tokens across different attributes.

5.3 Detailed Analysis
5.3.1 Effect of Attribute Information. In entity resolution, entities are structured as <attribute, value> pairs, and each word in an entity resides in a specific attribute. The attributes of an entity provide structural information about its overall semantics, which helps in the entity matching process. To analyze the effectiveness of attribute information in entity resolution, we model ER as a text matching problem by discarding attribute information from our model Seq2SeqMatcher, i.e., we represent each token t just by its word embedding e_w. Seq2SeqMatcher(-AttrInfo) in Table 7 denotes this ablated model. From the results, we can see that attributes provide useful information for entity resolution: on all datasets, removing attributes results in a performance drop to some degree. This observation verifies that attributes provide useful information for solving ER problems, and it is beneficial to incorporate such information in our Seq2Seq matching model, where it helps induce token-level correspondences and disambiguate tokens under different attributes.

Figure 3: Visualization of attribute alignments between different data sources on three kinds of datasets. (a) Attribute alignment on the attribute-aligned Walmart-Amazon dataset. (b) Attribute alignment on the dirty DBLP-ACM dataset. (c) Attribute alignment on the heterogeneous Hema-Taobao2 dataset.

To analyze whether our end-to-end model can learn attribute alignments between different data sources, Figure 3 shows several learned attribute alignment examples for the corresponding datasets. Specifically, we extract the learned attribute embeddings of the left data source and the right data source. For each attribute in the left data source, we align it with the nearest attribute in the right data source, i.e., the one with the largest embedding similarity, and normalize the resultant similarity matrix to form an attribute alignment matrix. Note that we learn the attribute embeddings without utilizing any semantic information about the attributes. The results in Figure 3 confirm that our model can accurately learn attribute alignments by capturing token-level correspondence.
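The attribute-alignment analysis described above can be reproduced with a few lines: each attribute of the left source is matched to the right-source attribute whose embedding is most similar, and the similarity matrix is normalized for visualization. The sketch below uses cosine similarity and a row-wise softmax as assumptions about the exact similarity and normalization used.

```python
import torch
import torch.nn.functional as F

def attribute_alignment(left_emb, right_emb, left_names, right_names):
    """left_emb: (L, d), right_emb: (R, d) learned attribute embeddings."""
    sim = F.normalize(left_emb, dim=1) @ F.normalize(right_emb, dim=1).t()  # cosine
    sim = F.softmax(sim, dim=1)                   # row-normalized alignment matrix
    pairs = {left_names[i]: right_names[int(j)]
             for i, j in enumerate(sim.argmax(dim=1))}
    return sim, pairs

left_names = ["Name", "Brand", "Address"]
right_names = ["Name", "Manufacturer", "Location", "Price"]
sim, pairs = attribute_alignment(torch.randn(3, 50), torch.randn(4, 50),
                                 left_names, right_names)
print(pairs)   # with trained embeddings, e.g. {'Name': 'Name', 'Brand': 'Manufacturer', 'Address': 'Location'}
```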

Figure 4: Visualization of token-level alignment between two entities.

5.3.2 Effect of Bidirectional Alignment. In our model, we compare entity records in two directions (S → T and T → S). Each word in the primary entity may be covered by several fragments in the context entity, and vice versa. Thus we should compare the entities from both directions to exactly model the many-to-many matching relations between two entities. To verify the effectiveness of bidirectional token alignment, we conduct ablation studies on this module. Specifically, we remove the bidirectional alignment mechanism from the main model and compare entities from only one direction.


We denote the model with the bidirectional comparison mechanism removed as Seq2SeqMatcher(-BiAlign) and experiment on the three types of datasets; the results are shown in Table 7. We can see that the bidirectional alignment mechanism always outperforms the unidirectional alignment mechanism. The results consistently verify the effectiveness of the bidirectional comparison mechanism. We believe this is because the bidirectional comparison mechanism can exactly model the many-to-many correspondence of tokens between two entities.
For further analysis, we also show a token alignment example from our Seq2SeqMatcher in Figure 4. This is done by visualizing the attention weights a_{ij} in Equation (1). Each row of the matrix in each plot indicates the similarities of the corresponding word in the primary entity with all words in the context entity, which are used as weights to compute the soft representations. We can see from the alignments in Figure 4 that our model learns accurate correlations between the same or similar words, even though the embedding of each word is concatenated with the corresponding attribute embedding.

Figure 5: Performance on all kinds of datasets when k takes values in {1, 2, 3, 4, all}. (a) Performance on attribute-aligned datasets. (b) Performance on dirty datasets. (c) Performance on heterogeneous datasets.

5.3.3 Effect of the K-max Weighted Attention Mechanism. In entity resolution tasks, our model aligns one token with several other tokens using soft attention. The traditional attention mechanism obtains the corresponding soft representation by an attention-weighted sum of all word representations [37], which inevitably brings in noise and redundancy. In this paper, we propose a k-max weighted attention mechanism to alleviate redundancy. The basic idea is to filter out irrelevant words and keep only the top k largest attention weights.
In order to verify the effectiveness of the k-max weighted attention mechanism and select the best choice of the value k, we conduct experiments with k varying over {1, 2, 3, 4, all}. In these experimental settings, k = 1 denotes hard attention and k = all denotes the traditional attention mechanism. From Figure 5, we can see that:
• The soft attention mechanism (k = [2, 3, 4, all]) is more effective than the hard attention (k = 1) in most cases. We believe this is because soft attention can capture a soft-attended correspondence between entities, while hard attention only selects the most relevant token to compute the entity representation.
• The k-max weighted attention mechanism (k = [2, 3, 4]) is usually more effective than the traditional attention mechanism (k = all). We owe this to the fact that the proposed k-max weighted attention mechanism can filter out redundant words and focus more on relevant information when calculating the soft-attended representations of tokens.
• Under the k-max weighted attention mechanism, the appropriate choice of the value k is dataset-dependent. The properties of the datasets determine the proportion of relevant words, thus influencing the best choice of the value k.

6 CONCLUSION
In this paper, we propose a sequence-to-sequence entity matching model, Seq2SeqMatcher, for heterogeneous entity resolution. By modeling ER as a token-level sequence-to-sequence matching task, our model can effectively solve the heterogeneous and dirty problems. Furthermore, by learning the representations of tokens, capturing the semantic relevance between tokens, and aggregating matching evidence for accurate ER decisions in an end-to-end manner, our Seq2Seq entity matching model achieves substantial performance improvements on 9 standard entity resolution benchmarks.
For future work, we find that token embedding in entity resolution is a big challenge, because many words in ER are out-of-vocabulary (OOV) words or numbers, and a word's context is usually sparse and structural. To address these issues, we would like to develop an embedding algorithm that is robust for entity resolution and can effectively take the schema and structure of the knowledge base into consideration.

ACKNOWLEDGMENTS
This work is supported by the National Key R&D Program of China under Grant 2018YFB1005100; the National Natural Science Foundation of China under Grants no. 61433015, 61572477 and 61772505; the Young Elite Scientists Sponsorship Program no. YESS20160177; and the University of Chinese Academy of Sciences.


REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations (2015).
[2] Weijie Bian, Si Li, Zhao Yang, Guang Chen, and Zhiqing Lin. 2017. A compare-aggregate model with dynamic-clip attention for answer selection. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 1987–1990.
[3] Mikhail Bilenko and Raymond J Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 39–48.
[4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[5] Michelle Cheatham and Pascal Hitzler. 2013. String similarity metrics for ontology alignment. In International Semantic Web Conference. Springer, 294–309.
[6] Xu Chu, Ihab F Ilyas, Sanjay Krishnan, and Jiannan Wang. 2016. Data cleaning: Overview and emerging challenges. In Proceedings of the 2016 International Conference on Management of Data. ACM, 2201–2206.
[7] William W Cohen and Jacob Richman. 2002. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 475–480.
[8] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 601–610.
[9] Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, and Wei Zhang. 2014. From data fusion to knowledge fusion. Proceedings of the VLDB Endowment 7, 10 (2014), 881–892.
[10] Xin Luna Dong and Felix Naumann. 2009. Data fusion: resolving data conflicts for integration. Proceedings of the VLDB Endowment 2, 2 (2009), 1654–1655.
[11] Uwe Draisbach and Felix Naumann. 2009. A comparison and generalization of blocking and windowing algorithms for duplicate detection. In Proceedings of the International Workshop on Quality in Databases (QDB). 51–56.
[12] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed representations of tuples for entity resolution. Proceedings of the VLDB Endowment 11, 11 (2018), 1454–1467.
[13] Ahmed K Elmagarmid, Panagiotis G Ipeirotis, and Vassilios S Verykios. 2007. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19, 1 (2007), 1–16.
[14] Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. 2009. Reasoning about record matching rules. Proceedings of the VLDB Endowment 2, 1 (2009), 407–418.
[15] Cheng Fu, Xianpei Han, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, and Hao Kong. 2019. End-to-End Multi-Perspective Matching for Entity Resolution. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 4961–4967. https://doi.org/10.24963/ijcai.2019/689
[16] Amir Gandomi and Murtaza Haider. 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management 35, 2 (2015), 137–144.
[17] Lise Getoor and Ashwin Machanavajjhala. 2012. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment 5, 12 (2012), 2018–2019.
[18] Yoav Goldberg. 2016. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research 57 (2016), 345–420.
[19] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. Vol. 1. MIT Press, Cambridge.
[20] Binbin Gu, Zhixu Li, Xiangliang Zhang, An Liu, Guanfeng Liu, Kai Zheng, Lei Zhao, and Xiaofang Zhou. 2017. The interaction between schema matching and record matching in data integration. IEEE Transactions on Knowledge and Data Engineering 29, 1 (2017), 186–199.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[22] Julia Hirschberg and Christopher D Manning. 2015. Advances in natural language processing. Science 349, 6245 (2015), 261–266.
[23] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[24] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Empirical Methods in Natural Language Processing (2014), 1746–1751.
[25] Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (2015).
[26] Pradap Konda, Sanjib Das, Paul Suganthan GC, AnHai Doan, Adel Ardalan, Jeffrey R Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeff Naughton, et al. 2016. Magellan: Toward building entity matching management systems. Proceedings of the VLDB Endowment 9, 12 (2016), 1197–1208.
[27] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2009. Comparative evaluation of entity resolution approaches with FEVER. Proceedings of the VLDB Endowment 2, 2 (2009), 1574–1577.
[28] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment 3, 1-2 (2010), 484–493.
[29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[30] Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore Graepel, and Zoubin Ghahramani. 2013. Sigma: Simple greedy matching for aligning large knowledge bases. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 572–580.
[31] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436.
[32] Pengfei Liu, Xipeng Qiu, Yaqian Zhou, Jifan Chen, and Xuanjing Huang. 2016. Modelling Interaction of Sentence Pair with Coupled-LSTMs. In Empirical Methods in Natural Language Processing (2016), 1703–1712.
[33] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[34] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data. ACM, 19–34.
[35] Felix Naumann and Melanie Herschel. 2010. An introduction to duplicate detection. Synthesis Lectures on Data Management 2, 1 (2010), 1–87.
[36] George Papadakis, George Papastefanatos, and Georgia Koutrika. 2014. Supervised meta-blocking. Proceedings of the VLDB Endowment 7, 14 (2014), 1929–1940.
[37] Ankur P Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. In Empirical Methods in Natural Language Processing (2016), 2249–2255.
[38] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[39] Sunita Sarawagi and Anuradha Bhamidipaty. 2002. Interactive deduplication using active learning. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 269–278.
[40] Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural Networks 61 (2015), 85–117.
[41] Warren Shen, Xin Li, and AnHai Doan. 2005. Constraint-based entity matching. In AAAI. 862–867.
[42] Rohit Singh, Vamsi Meduri, Ahmed Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, and Nan Tang. 2017. Generating concise entity matching rules. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 1635–1638.
[43] Parag Singla and Pedro Domingos. 2006. Entity resolution with markov logic. In Sixth International Conference on Data Mining (ICDM'06). IEEE, 572–582.
[44] Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. 2018. Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, New Orleans, Louisiana, 175–180. https://doi.org/10.18653/v1/N18-2028
[45] Yanmin Sun, Andrew KC Wong, and Mohamed S Kamel. 2009. Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence 23, 04 (2009), 687–719.
[46] Sheila Tejada, Craig A Knoblock, and Steven Minton. 2002. Learning domain-independent string transformation weights for high accuracy object identification. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 350–359.
[47] Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral Multi-Perspective Matching for Natural Language Sentences. In International Joint Conference on Artificial Intelligence (2017), 4144–4150.
[48] Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. 2009. Entity resolution with iterative blocking. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. ACM, 219–232.
[49] Yang Yang, Yizhou Sun, Jie Tang, Bo Ma, and Juanzi Li. 2015. Entity matching across heterogeneous sources. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1395–1404.
[50] Dongxiang Zhang, Long Guo, Xiangnan He, Jie Shao, Sai Wu, and Heng Tao Shen. 2018. A graph-theoretic fusion framework for unsupervised entity resolution. In 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 713–724.
