A Supervised Machine Learning Framework With Combined Blocking For Detecting Serial Crimes
https://doi.org/10.1007/s10489-021-02942-x
Abstract
Serial crime detection aims to find criminals who have committed multiple crimes. Classification techniques are often used for
serial crime detection, but the pairwise comparison of crimes is of quadratic complexity, and the number of nonserial
case pairs far exceeds the number of serial case pairs. The blocking method can play a role in reducing pairwise calculation
and eliminating nonserial case pairs. However, most previous studies use a single criterion to
select blocks, which makes it difficult to guarantee a good blocking result. Some studies integrate multiple criteria into one
comprehensive index. However, the performance is easily affected by the weighting method. In this paper, we propose a
combined blocking (CB) approach. Each criminal behaviour is defined as a behaviour key (BHK) and used to form a block.
CB learns several weak blocking schemes by different blocking criteria and then combines them to form the final blocking
scheme. The final blocking scheme consists of several BHKs. Because rare behaviour can better identify crime series, each
BHK is assigned a score according to its rarity. BHKs and their scores are used to determine whether a case pair needs to be
compared. Compared with multiple other blocking methods, CB effectively preserves the number of serial case pairs
while greatly reducing unnecessary nonserial case pairs. The CB is embedded in a supervised machine learning framework.
Experiments on real-world robbery cases demonstrate that it can effectively reduce pairwise comparison, alleviate the class
imbalance problem and improve detection performance.
Keywords Serial crime detection · Classification · Pairwise calculation · Blocking · Class imbalance
11518 Y. Li, X. Shao
series detection. Because each case must be compared with all the remaining cases, scalability becomes an issue. For example, a crime set consisting of n cases requires n × (n-1)/2 pairwise comparisons. In this situation, two problems need to be solved. (1) Data imbalance. Pairwise comparison generates case pairs, and serial crime pairs are always the minority class compared with nonserial crime pairs. Previous studies have mentioned this class imbalance problem [16]. (2) Massive number of pairwise calculations. Because each crime must be compared with every remaining crime to determine whether the same offender committed them, the computational effort of pairwise similarity measures grows quadratically with the number of crimes. Thus, it is problematic to handle large-scale crime sets. Because serial case pairs are much fewer in number than nonserial case pairs, most similarity calculations are performed on nonserial case pairs, which are not the target of the task. For large-scale crime sets, pairwise comparisons are computationally intensive, and class imbalance is more serious.

To solve the above two problems, some measures need to be taken to reduce the number of nonserial case pairs. Blocking technology can achieve this goal. In the blocking step, the objects to be compared are divided into several subsets called blocks, and only objects within the same block are compared with each other [17]. Thus, only a portion of case pairs is selected as promising candidates for similarity computation [18], which can reduce the class imbalance ratio and unnecessary pairwise calculations. Blocking can be divided into two steps: block formation and block selection. Because the behaviours of recidivists are similar in different crimes, the behaviour of offenders can be used to form blocks, and the cases with similar behaviours are placed in the same block. The pairs completeness (PC), reduction ratio (RR), harmonic mean (FM) of RR and PC, and pair quality (PQ) are generally used to select blocks [19].

Blocking methods have been widely used for record linkage and entity matching [19]. However, they are not specifically designed for serial crime detection and do not consider the characteristics of crime series. In addition, some blocking methods use a single criterion or integrate multiple criteria into one comprehensive index to select blocks. It is difficult to guarantee an excellent blocking result via a single criterion, and comprehensive indices are easily affected by the weighting method. Due to the limitations mentioned above, we want to consider the characteristics of crime series and design a new blocking method. When some crimes have the same behaviours, they are more likely to be serial crimes if these behaviours are rare [20], so we aim to take the rarity of behaviours into account. For the blocking method, we aim to consider multiple criteria to form a combined blocking method (CB) without integrating them into a comprehensive index. CB learns several weak blocking schemes by multiple blocking criteria and then combines them to form the final blocking scheme. Each criminal behaviour is called a behaviour key (BHK). A BHK forms a block that contains the cases sharing this BHK. The final blocking scheme consists of several BHKs, and each BHK is assigned a score according to its rarity. Each case pair receives a combined score according to the BHKs to which it belongs. Two crimes are more likely to be compared when they share more identical BHKs, and even more so if their common BHKs are rare. Based on this key point, CB is able to pick out serial case pairs and filter out nonserial case pairs. The contributions of our study are as follows:

(1) We use the criminal behaviours to divide crimes into different blocks and put the crimes with the same behaviours into the same block.
(2) We propose a combined blocking method that uses multiple criteria for block selection. Each criterion forms a weak blocking scheme, and the final strong blocking scheme is formed by combining all the weak blocking schemes. According to the characteristics of serial crimes, the rarity of criminal behaviours is considered. BHKs and their scores are used to select case pairs that need to be compared.
(3) We embed the CB into a supervised machine learning framework for crime series detection. Real-world robbery cases are used to evaluate the performance of CB and serial crime detection. The results show that the CB can reduce pairwise calculations, alleviate the class imbalance problem and improve detection performance.

Section 2 reviews the related studies concerning serial crime detection and blocking. Section 3 gives the basic definitions and the problem to be solved in this paper. Section 4 describes the methodology and proposes the combined blocking, including the learning and application of the blocking scheme. Section 5 reports experiments on a set of robbery cases from China that evaluate the effectiveness of the combined blocking. Section 6 presents our conclusion and future plans.
2 Literature review

In this section, we review the related literature on serial crime detection from the two aspects of similarity measures and detection methods. Then, we discuss the research on blocking technology regarding the formation and selection of blocks.

2.1 Serial crime detection

Given the importance and challenges of serial crime detection, there have been many studies in this field, and various methods have been used to detect crime series. Recently, machine learning methods have been applied, and both supervised and unsupervised methods are involved. The unsupervised method mainly finds similar cases to form clusters. Borg and Boldt used four clustering algorithms to identify crime series based on behavioural similarity [21]. Zhu and Xie employed the Restricted Boltzmann Machine (RBM) to obtain the co-occurrence pattern in criminal behaviour to link crimes [22]. Supervised methods treat serial crime detection as a binary classification task, and the methods used include neural networks [13], logistic regression [7, 23, 24], decision trees [25], and Bayesian classification [26], etc. Apart from these, various other approaches are also applied. Reich and Porter designed a semisupervised Bayesian model-based clustering algorithm to group similar crimes [27]. Some researchers have applied fuzzy multicriteria decision making (MCDM) to combine several attributes into a single value denoting the overall similarity between crimes [28, 29]. Qazi and William extracted reasonable correlations by combining human interaction with the machine learning method to identify crime series [30]. Supervised methods can achieve better results because they learn domain knowledge from historical data.

According to the process of detecting serial crimes in Fig. 1, the attribute similarity needs to be calculated after the cases form case pairs. Different attribute types have their own similarity measurement methods. The absolute distance is often used for the comparison of numerical attributes [31]. For categorical attributes, binarization is usually the first step [32], and then a binary distance measurement method such as Jaccard’s coefficient is used to calculate the distance of the attribute value [33]. The values of some crime attributes are keywords in the case narration. To understand their real meaning, word2vec can be used to measure the distance of keyword attributes [34]. In reality, after cases are paired, the number of nonserial case pairs is far greater than the number of serial case pairs [16, 35]. To improve the detection effect, this class imbalance problem needs to be solved.

2.2 Blocking

Blocking methods divide objects into several blocks (subsets of all objects) according to attribute values, and only objects within the same block are compared with each other [36]. Some studies focus on how to produce better blocks (block formation). Others study which blocks should be selected to constitute the blocking scheme (block selection).

In the formation of blocks, the traditional block method inserts each object into only one block [37]. Its disadvantage is its low fault tolerance: improper formation can easily place two matching objects in different blocks. To make up for such shortcomings, one solution is that each object can be inserted into multiple blocks [38]. In addition, researchers change the string of attribute values to make one block contain more matching objects and fewer nonmatching objects. Q-gram assumes that the attribute values are all strings and uses a substring of length q to form blocks [39]. Objects only need to have the same substring. Even if the attribute values do not match exactly, they can be inserted into the same block. String-map based blocking maps strings into multidimensional Euclidean space so that the distance between strings will not change and then groups similar strings in a block [40]. There are also methods to sort strings. Sorted Neighbourhood sorts all preset strings in alphabetical order and then slides a fixed-size window over the sorted records to group the records in each window into a block [41]. Suffix Array is a combination of Sorted Neighbourhood and Q-gram [42, 43]. The attribute values and their suffixes are both sorted, which is the input of Sorted Neighbourhood blocking technology. The above methods can form good blocks, but they all assume that attribute values are strings.

In the selection of blocks, the goal of the blocking method is to select several blocks to form a blocking scheme so that all or nearly all matching pairs can be contained, and the number of nonmatching pairs is as small as possible. According to this goal, blocking schemes are usually appraised by the reduction ratio (RR), pairs completeness (PC), harmonic mean (FM) of RR and PC, and pair quality (PQ). These criteria can be used to select the optimal blocks to form a blocking scheme [44]. Bilenko, Kamath, and Mooney used PQ as a criterion for evaluating blocks, sorted the PQ values of each block in descending order, and added the blocks to the blocking scheme one by one until the matching pairs covered exceeded the threshold [45]. Based on this learning process, Kejriwal and Miranker also introduced a new Fisher score criterion to learn the blocking scheme [46]. Nascimento, Pires and Mestre exploited the co-occurrence of entities to prune, split and merge blocks to guarantee the block size [47]. O’Hare et al. assigned the FM of a block as its weight, selected the top 3 as a blocking filter, and then iteratively clustered the records into blocks [48]. To divide the blocks better, some scholars have attempted the use of multiple criteria to select the block. Michelson and Knoblock devised a blocking method that selects each block with the higher RR while maintaining the PC above a threshold
[49]. Ramadan and Christen considered the Fisher score, block size, and block size distribution to construct a comprehensive index by weighting [50]. Song, Luo and Heflin designed a block selection scheme in which records are divided into a set of triples composed of subject, attribute and object, and a block needs to have high discriminability and coverage to be selected [51]. When multiple indicators are integrated into one criterion, the weight is difficult to determine. Based on these methods and criteria, we propose a combined blocking method that considers multiple blocking criteria and does not need to combine multiple criteria into a comprehensive index. Every single criterion learns a weak blocking scheme, and they together form the final blocking scheme.

3 Preliminary and problem definition

Letting $C = \{c_1, \dots, c_i, \dots, c_j, \dots, c_n\}$ be a set of cases, this study aims to find serial crimes in $C$. In this section, we first give several basic definitions and then describe the problems in this paper. The notations are listed in Table 1.

Table 1 Notation information
Notation | Description
$C$ | The crime set
$C_S$ | The solved crime set
$C_u$ | The unsolved crime set
$n$ | The number of crimes
$N$ | The number of crime pairs, $N = n \times (n-1)/2$
$p_{ij}$ | The crime pair consisting of crimes $c_i$ and $c_j$, where $i, j \in \{1, 2, \dots, n\}$, $i < j$
$M$ | The number of attributes
$a^m$ | The attribute $m$
$a_i^m$ | The value of the attribute $a^m$ of crime $c_i$
$a_i$ | The attribute vector of $c_i$, $a_i = (a_i^1, \dots, a_i^m, \dots, a_i^M)$
$sim^m$ | The similarity measure of the attribute $a^m$
$S_{ij}^m$ | The similarity of $p_{ij}$ in attribute $a^m$
$S_{ij}$ | The similarity vector of $p_{ij}$, where $S_{ij} = (S_{ij}^1, \dots, S_{ij}^m, \dots, S_{ij}^M)$
$y_{ij}$ | The label of the crime pair $p_{ij}$, $y_{ij} \in \{0, 1\}$

Definition 1 (Crime pair) Each crime $c_i$ has $M$ attributes $a_i = \langle a_i^1, \dots, a_i^m, \dots, a_i^M \rangle$. A crime pair $p_{ij}$ consists of two crimes $c_i$ and $c_j$. For $n$ crimes, the number of crime pairs is $n \times (n-1)/2$.

Definition 2 (Blocking) Given a crime set $C$, the blocking method divides it into several subsets $C_1, C_2, \dots, C_i$. Each subset is a block, and crime pairs are formed by crimes within the same block.

Definition 3 (Similarity measure) The similarity measure $sim^m$ quantifies the attribute similarity of a crime pair $p_{ij}$ on the attribute $a^m$. The calculation can be represented as $S_{ij}^m = sim^m(a_i^m, a_j^m)$. A high $S_{ij}^m$ means that crimes $c_i$ and $c_j$ are very similar on the attribute $a^m$.

Definition 4 (Similarity vector) The similarities of crime pair $p_{ij}$ on the $M$ attributes are $S_{ij}^1, \dots, S_{ij}^m, \dots, S_{ij}^M$, which can be collected in a similarity vector $S_{ij}$.

$$S_{ij} = \langle S_{ij}^1, \dots, S_{ij}^m, \dots, S_{ij}^M \rangle \tag{1}$$

Definition 5 (Supervised serial crime detection) Given solved crimes $C_S$, form labelled crime pairs $T = \{S_{ij}, y_{ij}\}_{u \times (M+1)}$, including serial crimes ($y_{ij} = 1$) and nonserial crimes ($y_{ij} = 0$), to train a classifier. For unsolved crimes $C_u$, form unlabelled crime pairs $V = \{S_{ij}\}_{v \times M}$ in $C_u$ and between $C_S$ and $C_u$. Apply the trained classifier to determine the label of unlabelled crime pairs.

The pairwise calculations grow quadratically with the number of crimes, so scaling the above tasks to large-scale crime sets is problematic. After pairing, because the number of nonserial case pairs is much larger than that of serial ones, there are plenty of unnecessary calculations. This paper aims to reduce the comparison between crimes and the computational overhead during crime series detection while ensuring the quality of the pairwise comparison.

4 Methodology

4.1 Overview of the methodology

Crime linkage mainly has three steps: pairing, comparison and classification (Fig. 1). The blocking method plays a role in the pairing step. Figure 2 details the process of detecting serial crimes with blocking. The process contains two parts. (1) Training, which uses the solved cases to learn the blocking scheme and train the classifier. (2) Detection, the goal of which is to determine whether an unlabelled crime pair is serial. There are four main phases as follows.

• Learning blocking schemes: Based on the solved crimes, learn the combined blocking, i.e., rules for dividing the crime set into blocks.
• Applying blocking schemes: For the training part, employ the blocking scheme on solved crimes to generate labelled crime pairs for training. When detecting serial crimes, a blocking scheme is applied to the union of solved crimes and unsolved crimes to generate unlabelled crime pairs related to unsolved cases because serial case pairs may form between unsolved cases or between unsolved cases and solved cases.
• Comparison: Similarity vectors of case pairs are calculated by the corresponding similarity measure.
• Classification: Employ the machine learning classification algorithm to train on labelled case pairs, obtain a classifier, and classify similarity vectors of unlabelled crime pairs. The classification output is whether a case pair is serial.

4.2 Learning blocking schemes

4.2.1 Description of blocking

The blocking for serial crime detection aims to identify case pairs that share common behaviour characteristics to produce candidate case pairs for comparison. In this paper, we define each behaviour as a behaviour key (BHK). Crimes covered by the same BHK indicate that the criminals have the same behav-

Example. Given a set of individual behaviour keys, $K = \{k_1, \dots, k_k\}$, the disjunctive blocking scheme is $k_1 \cup \dots \cup k_j$, the conjunctive blocking scheme is $k_1 \cap \dots \cap k_j$, and the disjunctive normal form blocking scheme is $(k_1 \cap \dots \cap k_j) \cup \dots \cup (k_1 \cap \dots \cap k_j)$. Figure 3c is a disjunctive blocking scheme formed by <Nsuspects, 2> ∪ <Wthreat, knife threat> ∪ <Wproperty, snatched>.

The blocking problem for crime series detection can be described as follows. Given a set of BHKs, the optimization goal is to determine the optimal blocking scheme so that all or nearly all serial crime pairs are covered and as few nonserial crime pairs as possible are generated. Formally, this objective can be expressed by Eq. (2):

$$\Gamma^{*} = \arg\min_{\Gamma} |\Gamma(p)| \quad \text{s.t.} \quad \sum_{p_{ij} \in \Gamma(p)} y_{ij} \ge \varepsilon \tag{2}$$
(a)
Case | Number of suspects (Nsuspects) | Ways of threat (Wthreat) | Ways to get property (Wproperty)
c1 | 2 | Violence threat | Snatched
c2 | 1 | Violence threat | Searched
c3 | 2 | Violence threat | Snatched
c4 | 1 | Violence threat | Asked for
c5 | 1 | Speech threat | Searched
c6 | 1 | Knife threat | Asked for
c7 | 1 | Knife threat | Asked for
c8 | 3 | Speech threat | Snatched

(b)
Behaviour key (BHK)
Nsuspects | Wthreat | Wproperty
1 | Violence threat | Snatched
2 | Knife threat | Searched
3 | Speech threat | Asked for

(c)
BHK | Cases in the block
<Nsuspects, 2> | c1, c3
<Wthreat, Knife threat> | c6, c7
<Wproperty, Snatched> | c1, c3, c8

Fig. 3 An example of crimes, behaviour keys, and a blocking scheme. a Eight cases and their three attributes. b BHKs for the 8 cases. c A blocking scheme
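The way a disjunctive blocking scheme turns BHKs into candidate case pairs can be sketched in a few lines of Python on the Fig. 3 data. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the three serial pairs (p13, p25, p67) are borrowed from the worked example given later with Table 2.

```python
from itertools import combinations

# Cases from Fig. 3a as (Nsuspects, Wthreat, Wproperty).
ATTRS = ("Nsuspects", "Wthreat", "Wproperty")
CASES = {
    "c1": ("2", "Violence threat", "Snatched"),
    "c2": ("1", "Violence threat", "Searched"),
    "c3": ("2", "Violence threat", "Snatched"),
    "c4": ("1", "Violence threat", "Asked for"),
    "c5": ("1", "Speech threat", "Searched"),
    "c6": ("1", "Knife threat", "Asked for"),
    "c7": ("1", "Knife threat", "Asked for"),
    "c8": ("3", "Speech threat", "Snatched"),
}

def build_blocks(cases):
    """Map every BHK (attribute, value) to the set of cases that exhibit it."""
    blocks = {}
    for case_id, values in cases.items():
        for attr, value in zip(ATTRS, values):
            blocks.setdefault((attr, value), set()).add(case_id)
    return blocks

def pairs_in(case_ids):
    """All unordered case pairs that can be formed inside one block."""
    return {tuple(sorted(p)) for p in combinations(case_ids, 2)}

def disjunctive_pairs(blocks, scheme):
    """Candidate pairs of a disjunctive scheme: the union over its blocks."""
    covered = set()
    for bhk in scheme:
        covered |= pairs_in(blocks[bhk])
    return covered

blocks = build_blocks(CASES)
# The disjunctive scheme of Fig. 3c.
scheme = [("Nsuspects", "2"), ("Wthreat", "Knife threat"), ("Wproperty", "Snatched")]
candidates = disjunctive_pairs(blocks, scheme)
all_pairs = pairs_in(CASES)

# Assumed labels: the serial pairs supposed in the Table 2 example.
serial = {("c1", "c3"), ("c2", "c5"), ("c6", "c7")}
covered_serial = candidates & serial

print(len(candidates), "of", len(all_pairs), "pairs kept;",
      len(covered_serial), "of", len(serial), "serial pairs covered")
```

On this toy data the scheme keeps 4 of the 28 possible pairs and covers p13 and p67 but not p25, which shows the trade-off in Eq. (2) between the size of Γ(p) and the number of serial pairs it retains.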
• Weak blocking scheme

Solved cases are used to learn weak blocking schemes. The pseudocode is described in Algorithm 1. First, discard the BHKs that do not cover serial case pairs, because serial crimes do not exhibit the characteristics of these BHKs; r(k) denotes the serial case pairs covered by the BHK k. Second, calculate the score of each BHK according to the blocking criterion in use and sort the BHKs by score. Third, add BHKs to the weak blocking scheme Γ until |r(Γ)|, the number of serial case pairs that Γ covers, is no less than ε or the set of BHKs is empty. A large ε requires the blocking scheme to cover more serial case pairs.

• Strong blocking scheme

After the above steps, four weak blocking schemes are obtained, i.e., four sets of BHKs. The strong blocking scheme combines them. The BHKs included in all weak blocking schemes are strong because they are selected under different blocking criteria. They are denoted as K′ (Eq. (7)).

$$K' = \{k \mid k \in (\Gamma_{RR} \cap \Gamma_{PC} \cap \Gamma_{PQ} \cap \Gamma_{BS})\} \tag{7}$$

Γ_RR represents the weak blocking scheme learned by the criterion RR. BS means the block size. The remaining BHKs in each blocking scheme are called candidate BHKs, and the set is expressed as K″ (Eq. (8)).

$$K'' = \{k \mid k \in (\Gamma_{RR} \cup \Gamma_{PC} \cup \Gamma_{PQ} \cup \Gamma_{BS}),\ k \notin K'\} \tag{8}$$

Whether a case pair should be compared is determined from the BHKs in the strong blocking scheme as follows. The key quantity is the combined score of a case pair, which depends on the BHKs that cover the pair. Each BHK is given a score, and the combined score of a case pair is the sum of the scores of all the BHKs that cover it.
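The three steps of the weak-scheme learner and the set operations of Eqs. (7) and (8) can be sketched as follows. This is an illustrative reading of the description above, not Algorithm 1 itself: the criterion functions `pair_quality` and `serial_coverage` are hypothetical stand-ins (the paper uses four criteria, RR, PC, PQ and BS, whose formulas are not reproduced here), and the block contents and serial labels are taken from the toy example of Figs. 3 and 5.

```python
from itertools import combinations

def pairs_in(case_ids):
    """All unordered case pairs inside one block."""
    return {tuple(sorted(p)) for p in combinations(case_ids, 2)}

def learn_weak_scheme(blocks, serial_pairs, criterion, epsilon):
    """Greedy sketch: filter, rank by one criterion, add until epsilon is reached."""
    covered_by = {k: pairs_in(cases) for k, cases in blocks.items()}
    # Step 1: discard BHKs that cover no serial case pair.
    useful = {k: p for k, p in covered_by.items() if p & serial_pairs}
    # Step 2: rank the remaining BHKs by the criterion, best first.
    ranked = sorted(useful, key=lambda k: criterion(useful[k], serial_pairs),
                    reverse=True)
    # Step 3: add BHKs until at least epsilon serial pairs are covered.
    scheme, covered = set(), set()
    for k in ranked:
        if len(covered) >= epsilon:
            break
        scheme.add(k)
        covered |= useful[k] & serial_pairs
    return scheme

def combine(weak_schemes):
    """Strong BHKs K' (in every weak scheme) and candidate BHKs K'' (the rest)."""
    strong = set.intersection(*weak_schemes)
    candidates = set.union(*weak_schemes) - strong
    return strong, candidates

# Hypothetical criteria, used only to make the sketch runnable.
def pair_quality(pairs, serial_pairs):
    return len(pairs & serial_pairs) / len(pairs)

def serial_coverage(pairs, serial_pairs):
    return len(pairs & serial_pairs)

BLOCKS = {
    "Nsuspects-1": {"c2", "c4", "c5", "c6", "c7"},
    "Nsuspects-2": {"c1", "c3"},
    "Nsuspects-3": {"c8"},
    "Wthreat-Violence threat": {"c1", "c2", "c3", "c4"},
    "Wthreat-Knife threat": {"c6", "c7"},
    "Wthreat-Speech threat": {"c5", "c8"},
    "Wproperty-Snatched": {"c1", "c3", "c8"},
    "Wproperty-Searched": {"c2", "c5"},
    "Wproperty-Asked for": {"c4", "c6", "c7"},
}
SERIAL = {("c1", "c3"), ("c2", "c5"), ("c6", "c7")}  # assumed labels

weak_pq = learn_weak_scheme(BLOCKS, SERIAL, pair_quality, epsilon=3)
weak_cov = learn_weak_scheme(BLOCKS, SERIAL, serial_coverage, epsilon=3)
strong, candidates = combine([weak_pq, weak_cov])
print("K' =", sorted(strong))
print("K'' =", sorted(candidates))
```

With these two stand-in criteria, only "Nsuspects-2" appears in both weak schemes and becomes a strong BHK, while the other selected BHKs remain candidates whose pairs still have to pass the score threshold.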
Because BHKs in K′ are strong BHKs, we give them a score of infinity. The scores of BHKs in K″ are determined based on the number of case pairs they cover. If a BHK covers some cases, these cases use this type of behaviour. The fewer case pairs a BHK covers, the rarer the behaviour pattern is. Research has shown that when some crimes have the same behaviours, they are more likely to be serial crimes if these behaviours are rare [20]. Therefore, if a BHK covers fewer case pairs, these case pairs are more likely to be serial case pairs, and the BHK receives a higher score. The formula is shown as Eq. (9):

$$\mu(k) = \begin{cases} +\infty, & k \in K' \\ \dfrac{1}{|k(p)|}, & k \in K'' \end{cases} \tag{9}$$

of case pairs they cover. For example, the BHK “Nsuspects-1” covers 10 case pairs, so its score is 1/10 (Fig. 5b). A case pair may be covered by multiple BHKs, indicating that there are multiple identical behaviours, so the pair is likely to be serial. Therefore, the combined score of a case pair is the sum of the scores of the BHKs covering it. If a case pair has a high combined score, indicating that the shared behaviour is rare, the case pair is more likely to be serial. The combined score calculation of a case pair is shown in Eq. (10).

$$\pi(p_{ij}) = \sum_{k_i \in K(p_{ij})} \mu(k_i) \tag{10}$$
(a)
BHK | Cases covered
Nsuspects-1 | c2, c4, c5, c6, c7
Nsuspects-2 | c1, c3
Nsuspects-3 | c8
Wthreat-Violence threat | c1, c2, c3, c4
Wthreat-Knife threat | c6, c7
Wthreat-Speech threat | c5, c8
Wproperty-Snatched | c1, c3, c8
Wproperty-Searched | c2, c5
Wproperty-Asked for | c4, c6, c7

Fig. 5 An example of BHKs and their scores. a Cases covered by each BHK. b Case pairs contained in each block
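As a worked instance of Eqs. (9) and (10), consider the pair p13 under the assumption, consistent with the values in Table 2, that every BHK in Fig. 5a is a candidate BHK. Cases c1 and c3 share three BHKs, which cover 1, 6 and 3 case pairs respectively:

```latex
\begin{aligned}
\mu(\text{Nsuspects-2}) &= \tfrac{1}{1}, \qquad
\mu(\text{Wthreat-Violence threat}) = \tfrac{1}{6}, \qquad
\mu(\text{Wproperty-Snatched}) = \tfrac{1}{3}, \\
\pi(p_{13}) &= 1 + \tfrac{1}{6} + \tfrac{1}{3} = \tfrac{3}{2}.
\end{aligned}
```

The rare behaviour (two suspects, shared only by c1 and c3) contributes most of the score, whereas the common behaviour (violence threat) contributes little.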
The higher the combined score of a case pair is, the more likely the case pair is serial. In order to include more serial case pairs, we use the minimum score among serial case pairs as the threshold θ, and a strong blocking scheme only covers the case pairs whose scores are at least as large as this threshold. Table 2 lists the combined scores for all case pairs mentioned in Fig. 5b. Among all the case pairs, suppose p13, p25 and p67 are serial crime pairs; their combined scores are 3/2, 11/10 and 43/30, respectively, so the threshold θ is 11/10.

The details of learning the strong blocking scheme are shown in Algorithm 2. First, four weak blocking schemes are obtained, and strong BHKs K′ and candidate BHKs K″ are found. Second, the score μ(k) of each candidate BHK is calculated, as well as the minimum score of serial crime pairs r(K″) covered by candidate BHKs K″.
Table 2 The combined score for all case pairs
Case pair | p12 | p13 | p14 | p18 | p23 | p24 | p25 | p26 | p27
Combined score | 1/6 | 3/2 | 1/6 | 1/3 | 1/6 | 4/15 | 11/10 | 1/10 | 1/10
Case pair | p34 | p38 | p45 | p46 | p47 | p56 | p57 | p58 | p67
Combined score | 1/6 | 1/3 | 1/10 | 13/30 | 13/30 | 1/10 | 1/10 | 1 | 43/30
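The scores in Table 2 can be reproduced with a short Python sketch of Eqs. (9) and (10). It assumes that all BHKs of Fig. 5a are candidate BHKs (strong BHKs, whose score is infinite, are left out), uses exact fractions so the results can be compared with the table directly, and keeps a pair when its score is not below θ so that the lowest-scoring serial pair is retained. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Cases covered by each BHK (Fig. 5a), written as case indices.
BLOCKS = {
    "Nsuspects-1": {2, 4, 5, 6, 7},
    "Nsuspects-2": {1, 3},
    "Nsuspects-3": {8},
    "Wthreat-Violence threat": {1, 2, 3, 4},
    "Wthreat-Knife threat": {6, 7},
    "Wthreat-Speech threat": {5, 8},
    "Wproperty-Snatched": {1, 3, 8},
    "Wproperty-Searched": {2, 5},
    "Wproperty-Asked for": {4, 6, 7},
}

def bhk_scores(blocks):
    """Eq. (9), candidate branch: reciprocal of the number of pairs covered."""
    scores = {}
    for bhk, cases in blocks.items():
        n_pairs = len(cases) * (len(cases) - 1) // 2
        if n_pairs:  # a block with a single case covers no pair
            scores[bhk] = Fraction(1, n_pairs)
    return scores

def combined_scores(blocks, scores):
    """Eq. (10): sum the scores of all BHKs that cover a pair."""
    totals = {}
    for bhk, cases in blocks.items():
        for pair in combinations(sorted(cases), 2):
            totals[pair] = totals.get(pair, 0) + scores[bhk]
    return totals

scores = bhk_scores(BLOCKS)
totals = combined_scores(BLOCKS, scores)

# Serial pairs supposed in the text: p13, p25 and p67.
serial = [(1, 3), (2, 5), (6, 7)]
theta = min(totals[p] for p in serial)
selected = sorted(p for p, s in totals.items() if s >= theta)

print("theta =", theta)
print("pairs kept for comparison:", selected)
```

On this example θ equals 11/10 and only p13, p25 and p67 are kept out of the 18 pairs that share at least one BHK, so every nonserial pair is filtered out before the comparison step.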
4.3 Applying blocking schemes

An unlabelled serial case pair may be formed by two unsolved cases or by a solved case and an unsolved case. In the former situation, crime analysts can treat the two crimes as one crime. The latter situation indicates that the offender of the unsolved case had previously committed a crime, and further investigation of the suspect can be conducted.

The blocking scheme application can be divided into two parts: the generation of labelled case pairs and of unlabelled case pairs. The labelled case pairs are used to train a classifier. The unlabelled case pairs are obtained in the detection phase. For labelled case pairs (Algorithm 3), both cases of a case pair come from the solved cases Cs. Case pairs covered by strong BHKs K′ and their corresponding labels (pij, yij) are directly added to the set of labelled case pairs T. The combined scores of the remaining case pairs are calculated according to the scores of candidate BHKs K″, and only those whose scores reach the threshold θ are added to T. In this phase, a strong blocking scheme is learned. For unlabelled case pairs (Algorithm 4), at least one of the two cases in a case pair belongs to the unsolved case set Cu, and the label of the case pair is unknown. The selection of unlabelled case pairs is similar to that of labelled case pairs.

After obtaining the case pairs for comparison, their similarity vectors will be calculated according to similarity measures, as shown in Section 4.4.

Table 4 Confusion matrix
 | Predicted positive | Predicted negative | Total
Real positive | TP | FN | P
Real negative | FP | TN | N
Total | P′ | N′ | P + N

• Similarity measures

There are three types of crime attributes: numeric attributes, categorical attributes, and keyword attributes. Each attribute type corresponds to a similarity measure method. Absolute distance is applied to measure the similarity of numeric values, Jaccard’s coefficient is applied for categorical attributes, and word2vec is used to find the difference between keywords. Our previous work discussed the detailed calculation process of these similarity measures [53]. The basic formulations are as follows.

• The similarity of numeric attributes

The similarity measure of numeric attributes is shown as Eq. (11).
$$\mathrm{Sim}^{m}_{\mathrm{ab\_dist}}(a_i^m, a_j^m) = 1 - \frac{|a_i^m - a_j^m|}{a_{\max}^m - a_{\min}^m} \tag{11}$$

where $a_i^m$ is the value of the case $c_i$ on the attribute $a^m$, $a_{\max}^m$ denotes the maximum value of the attribute $a^m$ and $a_{\min}^m$ denotes the minimum value.

• The similarity of categorical attributes

The similarity of categorical attributes is shown as Eq. (12). It is a modified Jaccard’s coefficient.

$$\mathrm{Sim}^{m}_{\mathrm{Jaccard}}(a_i^m, a_j^m) = \begin{cases} \dfrac{\sum_{f \in q} \omega_f^m}{|q| + |r| + |s|} \\ \omega_{\mathrm{NULL}}^m, & a_i^m = a_j^m = \mathrm{NULL} \\ 0, & a_i^m \text{ or } a_j^m \text{ is NULL} \end{cases} \tag{12}$$
$$\omega_f^m = 1 - \frac{n_f^m}{n} \tag{13}$$

where q represents the set of attribute values that are present in both cases, and r and s represent the sets of attribute values that are present in one case but absent in the other case. In the traditional Jaccard’s coefficient, the similarity of two sets is calculated as |q|/(|q| + |r| + |s|). We change its numerator by assigning weights $\omega_f^m$ to different attribute values. Here $n_f^m$ denotes the number of cases with the attribute value f on the attribute $a^m$ and n denotes the total number of cases, so $\omega_f^m \in [0, 1]$ and $\sum_{f \in q} \omega_f^m$ is equal to $|q| - \sum_{f \in q} n_f^m / n$. If the value is missing on the attribute $a^m$ in both cases, the similarity of the case pair on this attribute is $\omega_{\mathrm{NULL}}^m$ (calculated according to Eq. (13)). When one of the two cases is missing on attribute $a^m$ but the other case is not missing, the similarity is set to 0.

• The similarity of keyword attributes

The similarity measure of keyword attributes is shown as Eq. (14). The similarity of two words w1 and w2 is measured by word vectors [34], which is expressed as W2V(w1, w2).

$$\mathrm{Sim}^{m}_{\mathrm{word2vec}}(a_i^m, a_j^m) = \begin{cases} \max\limits_{w} \mathrm{W2V}(a_{i,w}^m, a_{j,w}^m), & a_{i,w}^m \in a_i^m,\ a_{j,w}^m \in a_j^m \\ \omega_{\mathrm{NULL}}^m, & a_i^m = a_j^m = \mathrm{NULL} \\ 0, & a_i^m \text{ or } a_j^m \text{ is NULL} \end{cases} \tag{14}$$

In Eq. (14), when a case $c_i$ has multiple keywords on the attribute $a^m$, each keyword is represented as $a_{i,w}^m$ ($a_{i,w}^m \in a_i^m$). The similarity of the keyword attribute is the maximum value of word similarity among all keywords. When the two cases are both missing on the attribute $a^m$, the similarity of the case pair on this attribute is $\omega_{\mathrm{NULL}}^m$, which is calculated in the same way as Eq. (13).

5 Experiments

5.1 Datasets

The crime dataset in this work was collected from judicial documents on the OpenLaw website. The dataset covers solved robbery crimes from January 2013 to October 2018 in Zhengzhou City, Henan Province, China.

The case set contains 364 cases involving 292 offenders, 111 of which are serial crimes. Among crime series, the maximum number of cases committed by a single offender was 9. After pairing, a total of 66,066 case pairs were formed, of which 158 were serial case pairs and 65,908 were nonserial ones. The M.O. of each crime is represented by 11 attributes based on previous studies. The attribute values are manually obtained. Table 3 describes the attribute information and similarity measures.

5.2 Evaluation criteria

Table 4 shows the confusion matrix for detecting serial crimes. We evaluate the detection performance using the indexes of this matrix. ‘Positive’ indicates that the case pair is serial, and ‘Negative’ indicates that the case pair is nonserial. P is the number of actual positive samples, P′ is the number of predicted positive samples, N is the number of actual negative samples and N′ is the number of predicted negative samples.

According to the above parameters, the metrics precision, recall, F-measure (FM), and G-mean (GM) are defined. For FM, the F1 (b = 1) is used to evaluate the detection effect.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{15}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{16}$$

$$FM = \frac{(1 + b^2) \times \mathrm{Precision} \times \mathrm{Recall}}{b^2 \times \mathrm{Precision} + \mathrm{Recall}} \tag{17}$$

$$GM = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} \tag{18}$$

We also evaluate the predictive performance of classifiers. The area under the receiver operating characteristics curve (AUC) [54] is independent of the choice of classification threshold because it considers the probabilities of all samples. The equivalent calculation for AUC is shown as Eq. (19), where $rank_i$ is the rank value of the sample i, P is the number of positive samples and N is the number of negative samples. The Lift measures how much better the predictive ability of the model is compared to random selection (see Eq. (20)), and it is often used in practical problems [55]. Because the proportion of serial crime pairs is small, we use the top 5-th percentile lift, i.e., the proportion of serial crime pairs in the top 5% is compared with the proportion of serial crime pairs in the whole crime set.

$$AUC = \frac{\sum_{i \in \mathrm{positiveClass}} rank_i - \dfrac{P \times (1 + P)}{2}}{N \times P} \tag{19}$$

$$\mathrm{Lift} = \frac{TP / (TP + FP)}{P / (P + N)} \tag{20}$$
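The evaluation criteria of Eqs. (15) to (20) can be computed as in the following sketch. It is a plain-Python illustration with hypothetical function names and toy inputs. The F-measure uses the standard F-beta denominator, the AUC function ranks samples in ascending order of score and does not average tied ranks, and the lift function treats the top-scoring fraction of pairs as the predicted positives.

```python
from math import sqrt

def confusion_metrics(tp, fn, fp, tn, b=1.0):
    """Precision, recall, F-measure and G-mean from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fm = (1 + b ** 2) * precision * recall / (b ** 2 * precision + recall)
    gm = sqrt(recall * tn / (tn + fp))
    return precision, recall, fm, gm

def auc_from_ranks(scores, labels):
    """Rank-sum AUC (Eq. 19); rank 1 is the lowest score, ties are not averaged."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = {idx: r for r, idx in enumerate(order, start=1)}
    p = sum(labels)
    n = len(labels) - p
    positive_rank_sum = sum(rank[i] for i, y in enumerate(labels) if y == 1)
    return (positive_rank_sum - p * (1 + p) / 2) / (n * p)

def top_lift(scores, labels, fraction=0.05):
    """Lift (Eq. 20) when the top-scoring fraction is predicted positive."""
    k = max(1, int(len(scores) * fraction))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hit_rate = sum(labels[i] for i in top) / k
    base_rate = sum(labels) / len(labels)
    return hit_rate / base_rate

# Toy inputs, invented for illustration only.
precision, recall, fm, gm = confusion_metrics(tp=8, fn=2, fp=4, tn=86)
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
auc = auc_from_ranks(toy_scores, toy_labels)
lift = top_lift(toy_scores, toy_labels, fraction=0.25)
print(precision, recall, fm, gm, auc, lift)
```

For the robbery dataset described above, the base rate in the denominator of Eq. (20) would be 158/66,066, i.e. roughly 0.24% of all case pairs are serial before blocking.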
This section studies the changes in the number and weight of BHKs with different values of the parameter ε. We change the value of ε by setting it to ρ ∙ α, where α is the number of serial case pairs in the training fold. The value of ρ varies from 0.9 to 1.0, with ρ = 0.9 meaning that the blocking scheme needs to cover 90% of serial case pairs.

We analyse the number and weight of BHKs for the dataset containing all cases (Fig. 6a). With the growth of ρ, the number of BHKs gradually increases. This is because, to cover more serial case pairs, more BHKs are needed. When the number of BHKs reaches 59, all serial case pairs have

5.3.2 Experiment 2: The performance of the blocking methods

We compare CB with the blocking methods Adaptive [45], Fisher [46], BL [48], ICSKD [51] and MCBS [47]. The descriptions of these methods are as follows:

Adaptive: It uses PQ to evaluate blocks, sorts the PQ values of each block in descending order, and adds them to the blocking scheme one by one until the covered matching pairs exceed the threshold.
strong
candidates 40 0.9
total 0.91
60 58 58 58 59 59 59 59
57 57 0.92
54 55 35
8 9 0.93
0.94
30
0.95
Number of BHKs
Number of BHKs
0.96
40 25 0.97
0.98
20 0.99
57 57 58 58 58 59 59 1
54 55
51 50 15
20
10
0 0
0.90 0.92 0.94 0.96 0.98 1.00 0.0 0.2 0.4 0.6 0.8 1.0
Weight
(a) (b)
Fig. 6 Information on BHKs with different values of ρ. a The number of BHKs with different ρ; ‘same’ is the strong BHKs, ‘candidates’ is the
candidates BHKs, and ‘means’ is the total of BHKs. b The weight distributions of BHKs with different values of ρ
13
11530 Y. Li, X. Shao
Table 5 Comparison of blocking methods BL: It assigns the FM of a block as its weight and selects
PC RR FPC,RR PQ the top 3 features as a blocking filter. A record will be
assigned to an existing block if it shares the same value
Fisher .9747 .3963 .5635 .0042 with the representative record on the top 3 features. The
Adaptive .9843 .4507 .6183 .0046 first record is a representative record and records that do
BL .9715 .1652 .2810 .0028 not match existing representative records will become a
ICSKD .8286 .6208 .7051 .0053 new representative record. Blocks are formed based on
ARCS .9933 .5495 .7074 .0053 representative records.
ECBS .9933 .5702 .7243 .0055 ICSKD: It iteratively discovers blocks with high dis-
CB .9820 .6487 .7785 .0071 criminability and the block needs to have a high value
in the comprehensive indicators (FL) of discriminabil-
ity and coverage. High -discriminability means that few
1.2 instances have the same value on this feature and it can
Fisher Adaptive
ICSKD BL get a suitable reduction ratio. High coverage means that
1.0
ARCS
CB
ECBS
many instances have a value on this feature. There are
two thresholds η, ψ of discriminability, FL. We test this
0.8 method with η values {0.02, 0.04, 0.06, 0.08, 0.1} and ψ
values {0.05, 0.1, 0.15, 0.2, 0.25, 0.3}.
MCBS: It exploits the co-occurrence of entities among
0.6
the generated blocks for pruning, splitting and merging
blocks to control the size of blocks. It has two different
0.4
weighting schemes (ARCS and ECBS) to calculate the
co-occurrence of entities. Its default algorithm can pro-
0.2 duce the highest PC results, so we adopt it in the experi-
ment. There are two important parameters Smin and
0.0 Smax. We evaluate ARCS and ECBS weighting schemes
with Smin values {5, 10, 15, 20, 25, 30} and Smax values
PC RR FPC , RR {40, 50, 60, 70, 80, 90, 100}.
Fig. 7 Box plot of blocking methods The effects of these blocking strategies are shown in
Table 5. To ensure more serial crime pairs to be covered,
the table presents the results when PC reaches the highest in
Fisher: It designs a Fisher score to sort eligible blocks and these methods. The PC of CB is lower than that of Adaptive,
find the excellent blocks. The algorithm terminates when ARCS and ERBS, but its RR, FPC,RR and PQ have the best
the covered matching pairs in the blocking scheme exceed performance. We have drawn a box plot of these blocking
a certain percentage. methods for the 20-repetition 10-fold cross-validation test
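The measures reported in Table 5 and the Adaptive-style selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: PC, RR and PQ follow their usual record-linkage definitions (the paper defines them in an earlier section), F_{PC,RR} is taken as their harmonic mean, the stopping rule uses the coverage target ρ ∙ α, and all function names and the toy block keys are hypothetical.

```python
from itertools import combinations


def pairs_in_block(block):
    """All unordered case pairs that fall into the same block."""
    return {tuple(sorted(pair)) for pair in combinations(block, 2)}


def blocking_metrics(candidate_pairs, true_pairs, n_cases):
    """PC, RR, their harmonic mean F and PQ for a set of candidate pairs."""
    total_pairs = n_cases * (n_cases - 1) // 2
    covered = len(candidate_pairs & true_pairs)
    pc = covered / len(true_pairs)                  # serial pairs kept
    rr = 1 - len(candidate_pairs) / total_pairs     # comparisons avoided
    pq = covered / len(candidate_pairs) if candidate_pairs else 0.0
    f = 2 * pc * rr / (pc + rr) if pc + rr else 0.0
    return {"PC": pc, "RR": rr, "F": f, "PQ": pq}


def greedy_by_pq(blocks, true_pairs, rho):
    """Add blocks in descending PQ order until rho * |true_pairs| serial pairs are covered."""
    def block_pq(block):
        pairs = pairs_in_block(block)
        return len(pairs & true_pairs) / len(pairs) if pairs else 0.0

    target = rho * len(true_pairs)   # plays the role of the threshold rho * alpha
    selected, candidate_pairs = [], set()
    for key in sorted(blocks, key=lambda k: block_pq(blocks[k]), reverse=True):
        if len(candidate_pairs & true_pairs) >= target:
            break
        selected.append(key)
        candidate_pairs |= pairs_in_block(blocks[key])
    return selected, candidate_pairs


# Toy data: six cases, two serial pairs, three hypothetical behaviour keys.
toy_true_pairs = {(0, 1), (2, 3)}
toy_blocks = {"knife": [0, 1], "night": [2, 3, 4], "street": [0, 1, 2, 3, 4, 5]}
toy_selected, toy_pairs = greedy_by_pq(toy_blocks, toy_true_pairs, rho=1.0)
toy_metrics = blocking_metrics(toy_pairs, toy_true_pairs, n_cases=6)
```

In the toy data, the broad "street" block is never selected because the two more specific blocks already cover both serial pairs, which is how a high PC can be kept while RR and PQ stay high.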
Fig. 10 Precision comparison of using and not using CB with different algorithms. ‘Normal’ means the result of not using CB; ‘CB’ means the
result of using combined blocking
Fig. 11 Recall comparison of using and not using CB with different algorithms. ‘Normal’ means the result of not using CB; ‘CB’ means the
result of using combined blocking
Fig. 12 FM comparison of using and not using CB with different algorithms. ‘Normal’ means the result of not using CB; ‘CB’ means the result
of using combined blocking
Fig. 13 GM comparison of using and not using CB with different algorithms. ‘Normal’ means the result of not using CB; ‘CB’ means the result
of using combined blocking
• Random forest (RF): RF consists of several decision trees, and random attribute selection is used when forming each tree. In our experiment, the number of trees is 200.
• K-Nearest Neighbour (KNN): The idea is to set the class label of a sample to the majority class of its k nearest neighbours. In our experiment, k = 5.
• Gradient Boosting Decision Tree (GDBT): It expands and enhances classification trees based on gradient boosting. The residual of a decision tree is the basis for training the next tree. In our experiment, the number of trees was 200.
• Neural Network (NN): NN processes information by adjusting the network connection parameters. We set the number of hidden layers of the network to 100.
• Logistic Regression (LR): LR uses logistic functions in statistical techniques to classify samples. In our experiments, the l1 penalty was used.

In addition to the above five algorithms, we averaged the predicted probabilities of these five algorithms to form a stacked ensemble (Stack). We use Precision, Recall, FM, GM, AUC and Lift for comparison. It should be mentioned that the blocking method may reduce the number of serial case pairs. To enable fair comparison, the denominator of the recall formula is the actual number of serial case pairs before blocking (Fig. 9).

In this experiment, the parameter ρ of CB was set to 1.0. The results on Precision, Recall, FM, and GM are shown in Figs. 10, 11, 12, and 13. Algorithms using CB outperform those not using CB on most data sets. CB also has a positive effect on the Stack, and the Stack has the best performance on many crime sets in terms of Precision. There is a large gap between the number of samples of the minority and the majority in the imbalanced data set, which hinders the algorithm from learning minority patterns [57]. The minority is more likely to be misclassified. The imbalance of case pairs can easily lead to the misclassification of serial case pairs. As presented in Figs. 10, 11, 12, and 13, as the number of cases grows and the IR increases (see Fig. 9), the classification performance shows a downward trend after applying CB. This shows that data imbalance harms classification.

On the other hand, the use of CB improves the detection effect. Across different datasets, the performance is relatively stable, and the downward trend has also been improved. In Fig. 9, the IR of the training set and test set decreased after using CB. As the number of cases increases, the reduction in IR grows sharply. In general, the IR change of the data set using CB is even smaller than that without CB. Because of the reduction trend of IR, the classification performance can be improved and is relatively stable. The improvement of Precision and Recall shows that after blocking, the algorithm can find more serial case pairs correctly. This is because the imbalance problem is alleviated when the CB method is applied.

Tables 7 and 8 present the results of AUC and Lift. Except for NN, the results of RF, KNN, GDBT, and LR are improved on most crime sets after using CB, and the predictive performance of the final Stack algorithm is also improved. On many crime sets, the results of Stack using CB are also superior to other algorithms in terms of AUC and Lift. According to Eq. (20), the denominator of Lift is the proportion of positive samples, which is small in crime sets because of the class imbalance. As the number of cases increases, the imbalance ratio gradually increases and the Lift of the algorithms also becomes larger.

Table 8 Lift comparison of using and not using CB with different algorithms. 'Normal' means the result of not using CB; 'CB' means the result of using combined blocking; the better results in 'Normal' and 'CB' are emphasized in bold
RF KNN GDBT NN LR Stack

We also draw the classification difference between using and not using CB in Fig. 14. For the different algorithms, most of the classification differences are positive, which means that the use of CB can improve the classification effect. In addition, the more cases there are in the dataset, the more significant the improvement is. This is because the imbalance ratio is larger if the dataset contains more cases. As the number of cases increases, CB can eliminate more case pairs and significantly diminish IR (Fig. 9), and then the improvement becomes prominent.

To show the significant difference between using CB and not using CB, the Wilcoxon signed-rank test [58] is applied to test the precision, recall, FM and GM of the different classification algorithms. As shown in Table 9, the p values are less than 0.05, which means that the indicators after using CB are significantly better than those without CB.
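The signed-rank comparison reported in Table 9 can be sketched in a few lines. The function names are hypothetical, and the p value below uses the two-sided normal approximation without continuity or tie correction; this is an assumption about the procedure, chosen because it appears consistent with the R+, R− and p values printed in Table 9, not a statement of how the authors computed them.

```python
import math


def signed_rank_sums(with_cb, without_cb):
    """Return (R+, R-, n) for paired results; zero differences are dropped."""
    diffs = [a - b for a, b in zip(with_cb, without_cb) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1  # average rank for ties
        start = end + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus, len(diffs)


def normal_approx_p_value(r_plus, r_minus, n):
    """Two-sided p value from the normal approximation (no continuity correction)."""
    w = min(r_plus, r_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))
```

With 26 case sets, the rank sums R+ and R− always add up to 26 × 27 / 2 = 351, which matches every row of Table 9. Under this approximation, R+ = 351 and R− = 0 give a p value of about 8.3E-06.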
Fig. 14 Difference between using and not using CB with different algorithms (panels: Precision, Recall, FM and GM; curves: RF, KNN, GDBT, NN, LR and Stack; x-axis: Nc; y-axis: Difference). Taking GM as an example, the value of the difference is calculated by the formula: Difference = GM_CB − GM_Normal. 'GM_Normal' means the result of not using CB. 'GM_CB' means the result of using combined blocking
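The quantities compared in Figs. 10 to 14 can be sketched as follows. This is a hedged illustration: FM and GM are taken as the usual F-measure and geometric mean of recall and specificity (the paper defines them in an earlier section), the 0.5 decision threshold is an assumption, and the function names are hypothetical. The recall denominator is the number of serial pairs before blocking, and the Stack score is the average of the base classifiers' predicted probabilities, as described in the text.

```python
import math


def stack_probabilities(per_model_probs):
    """Average the predicted serial-pair probabilities of several classifiers."""
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


def pair_metrics(y_true, y_prob, serial_pairs_before_blocking, threshold=0.5):
    """Precision, recall, FM and GM on the case pairs that survived blocking."""
    tp = fp = tn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0   # threshold is an assumption
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Serial pairs removed by blocking still count against recall.
    recall = tp / serial_pairs_before_blocking
    fm = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    gm = math.sqrt(recall * specificity)
    return {"Precision": precision, "Recall": recall, "FM": fm, "GM": gm}


def difference(metrics_cb, metrics_normal, name="GM"):
    """Difference = value with CB minus value without CB, as in Fig. 14."""
    return metrics_cb[name] - metrics_normal[name]
```

For example, if three of the serial pairs existed before blocking but only two reach the classifier and both are detected, `pair_metrics` reports a recall of 2/3 rather than 1, so blocking cannot inflate recall by discarding hard pairs.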
6 Conclusion

In this work, we propose a combined blocking method to efficiently identify serial cases and reduce comparisons of case pairs. It defines criminal behaviours as BHKs and consists of four weak blocking schemes, i.e. four subsets of all BHKs. The blocking criteria PC, RR, PQ and block size are used to form each weak blocking scheme. In the final strong blocking scheme, each BHK is assigned a combined score according to its rarity (which is the number of case pairs it covers), and whether to compare a case pair or not is determined by the sum of the scores of the BHKs covering it. The CB method uses multiple blocking criteria, but it does not integrate them into a comprehensive index, which provides a new idea of multi-criteria blocking.

The practical implication of the CB method is that it takes the rarity of criminal behaviours into account, and it is able to greatly reduce the number of pairwise calculations and improve the efficiency of large-scale serial crime detection. A real-world robbery case set was used to evaluate the experimental effect. In the experiments, 98.2% of the serial case pairs can be retained and 64.87% of the case pairs are eliminated. This result is better than our comparison methods Fisher and Adaptive. On the other hand, the CB is embedded in a supervised machine learning framework, with RF, KNN, GDBT, NN and LR as the baseline classification algorithms. Based on the predictions of these algorithms, we also build a stacked ensemble method. Some cases were selected randomly from the original case set to form 26 case sets with different numbers of cases. They are used to evaluate the performance of serial crime detection. Compared with the classification without CB, the imbalance ratio of data after using CB is greatly alleviated and the downward trend becomes more and more obvious as the number of cases increases. After using CB, the detection effect performs better on the precision, recall, FM and GM. Precision is increased by 30% at most, and the maximum improvement of recall, FM and GM is 20%. In terms of predictive performance, the algorithms except NN have improved AUC and Lift after using CB.

In our future research, we will attempt to design a blocking criterion that is more suitable for crime series detection.
While reducing pairwise comparisons, it will also aim to improve the separability of data so that the algorithm can find more crime series.

Table 9 The Wilcoxon signed-rank test of Precision, Recall, FM, and GM in different algorithms
Measure    Method  R+   R−   p value    Hypothesis (0.05)
Precision  RF      350  1    9.34E-06   Rejected
Precision  KNN     340  11   2.94E-05   Rejected
Precision  GDBT    351  0    8.30E-06   Rejected
Precision  NN      311  40   5.79E-04   Rejected
Precision  LR      340  11   2.94E-05   Rejected
Precision  Stack   342  9    2.35E-05   Rejected
Recall     RF      351  0    8.30E-06   Rejected
Recall     KNN     351  0    8.30E-06   Rejected
Recall     GDBT    351  0    8.30E-06   Rejected
Recall     NN      272  79   1.42E-02   Rejected
Recall     LR      350  1    9.34E-06   Rejected
Recall     Stack   342  9    2.35E-05   Rejected
FM         RF      351  0    8.30E-06   Rejected
FM         KNN     351  0    8.30E-06   Rejected
FM         GDBT    351  0    8.30E-06   Rejected
FM         NN      295  56   2.40E-03   Rejected
FM         LR      350  1    9.34E-06   Rejected
FM         Stack   345  6    1.67E-05   Rejected
GM         RF      351  0    8.30E-06   Rejected
GM         KNN     351  0    8.30E-06   Rejected
GM         GDBT    351  0    8.30E-06   Rejected
GM         NN      277  74   9.94E-03   Rejected
GM         LR      350  1    9.34E-06   Rejected
GM         Stack   342  9    2.35E-05   Rejected

References

1. Tonkin M et al (2017) Using offender crime scene behavior to link stranger sexual assaults: a comparison of three statistical approaches. J Crim Just 50:19–28. https://doi.org/10.1016/j.jcrimjus.2017.04.002
2. Mota CMdM, Figueiredo CJJd, Pereira DVeS (2020) Identifying areas vulnerable to homicide using multiple criteria analysis and spatial analysis. Omega 102211. https://doi.org/10.1016/j.omega.2020.102211
3. Chohlas-Wood A, Levine ES (2019) A recommendation engine to aid in identifying crime patterns. INFORMS Journal on Applied Analytics. https://doi.org/10.1287/inte.2019.0985
4. Isafiade OE, Bagula AB (2020) Series mining for public safety advancement in emerging smart cities. Future Generation Computer Systems 108:777–802. https://doi.org/10.1016/j.future.2020.03.002
5. Porter MD (2016) A statistical approach to crime linkage. Am Stat 70(2):152–165. https://doi.org/10.1080/00031305.2015.1123185
6. Hazelwood RR, Warren JI (2004) Linkage analysis: modus operandi, ritual, and signature in serial sexual crime. Aggress Violent Behav 9(3):307–318. https://doi.org/10.1016/j.avb.2004.02.002
7. Woodhams J et al (2018) Linking serial sexual offences: Moving towards an ecologically valid test of the principles of crime linkage. Legal and Criminological Psychology 24:12S–140S
8. Canter D, Hammond L (2006) A comparison of the efficacy of different decay functions in geographical profiling for a sample of US serial killers. Journal of Investigative Psychology and Offender Profiling 3(2):91–103. https://doi.org/10.1002/jip.45
9. Wang T, Rudin C, Wagner D, Sevieri R (2015) Finding patterns with a rotten core: data mining for crime series with cores. Big Data 3(1):3–21. https://doi.org/10.1089/big.2014.0021
10. Markson L, Woodhams J, Bond JW (2010) Linking serial residential burglary: comparing the utility of modus operandi behaviours, geographical proximity, and temporal proximity. Journal of Investigative Psychology and Offender Profiling. https://doi.org/10.1002/jip.120
11. Woodhams J, Hollin CR, Bull R (2007) The psychology of linking crimes: a review of the evidence. Leg Criminol Psychol 12(2):233–249. https://doi.org/10.1348/135532506x118631
12. Burrell A, Bull R, Bond J (2012) Linking personal robbery offences using offender behaviour. J Investig Psychol Offender Profiling 9(3):201–222. https://doi.org/10.1002/jip.1365
13. Chi H, Lin Z, Jin H, Xu B, Qi M (2017) A decision support system for detecting serial crimes. Knowl-Based Syst 123:88–101. https://doi.org/10.1016/j.knosys.2017.02.017
14. Phua C, Gayler R, Lee V, Smith-Miles K (2009) On the communal analysis suspicion scoring for identity crime in streaming credit applications. Eur J Oper Res 195(2):595–612. https://doi.org/10.1016/j.ejor.2008.02.015
15. Gee D, Belofastov A (2007) Profiling sexual fantasy. In: Kocsis RN (ed) Criminal profiling: international theory, research, and practice. Humana Press, Totowa, NJ, pp 49–71. https://doi.org/10.1007/978-1-60327-146-2_3
16. Borg A, Boldt M, Lavesson N, Melander U, Boeva V (2014) Detecting serial residential burglaries using clustering. Expert Syst Appl 41(11):5252–5266. https://doi.org/10.1016/j.eswa.2014.02.035
17. Chen L, Gu W, Tian X, Chen G (2019) AHAB: aligning heterogeneous knowledge bases via iterative blocking. Inf Process Manag 56(1):1–13. https://doi.org/10.1016/j.ipm.2018.08.006
18. O'Hare K, Jurek A, de Campos C (2018) A new technique of selecting an optimal blocking method for better record linkage. Inf Syst 77:151–166. https://doi.org/10.1016/j.is.2018.06.006
19. Christen P (2011) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555
20. Lin S, Brown DE (2006) An outlier-based data association method for linking criminal incidents. Decis Support Syst 41(3):604–615. https://doi.org/10.1016/j.dss.2004.06.005
21. Borg A, Boldt M (2016) Clustering residential burglaries using modus operandi and spatiotemporal information. International Journal of Information Technology & Decision Making 15(01):23–42. https://doi.org/10.1142/s0219622015500339
22. Zhu S, Xie Y (2019) Crime event embedding with unsupervised feature selection. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp 3922–3926
23. Bennell C, Canter DV (2002) Linking commercial burglaries by modus operandi: tests using regression and ROC analysis. Sci Justice 42(3):153–164. https://doi.org/10.1016/s1355-0306(02)71820-0
24. Tonkin M, Grant T, Bond JW (2008) To link or not to link: a test of the case linkage principles using serial car theft data. Journal of Investigative Psychology and Offender Profiling 5(1–2):59–77. https://doi.org/10.1002/jip.74
25. Tonkin M, Woodhams J, Bull R, Bond JW, Santtila P (2012) A comparison of logistic regression and classification tree analysis for behavioural case linkage. J Investig Psychol Offender Profiling 9(3):235–258. https://doi.org/10.1002/jip.1367
26. Ku C-H, Leroy G (2014) A decision support system: automated crime report analysis and classification for e-government. Gov Inf Q 31(4):534–544. https://doi.org/10.1016/j.giq.2014.08.003
27. Reich BJ, Porter MD (2015) Partially supervised spatiotemporal clustering for burglary crime series identification. Journal of the Royal Statistical Society Series A Statistics in Society 178(2):465–480. https://doi.org/10.1111/RSSA.12076
28. Goala S, Dutta P (2018) A fuzzy multicriteria decision-making approach to crime linkage. International Journal of Information Technologies and Systems Approach 11(2):31–50. https://doi.org/10.4018/ijitsa.2018070103
29. Albertetti F, Cotofrei P, Grossrieder L, Ribaux O, Stoffel K (2013) The CriLiM methodology: crime linkage with a fuzzy MCDM approach. In: Proceedings - 2013 European Intelligence and Security Informatics Conference, EISIC, vol 2013, pp 67–74. https://doi.org/10.1109/EISIC.2013.17
30. Qazi N, Wong BLW (2019) An interactive human centered data science approach towards crime pattern analysis. Information Processing & Management 56(6):102066. https://doi.org/10.1016/j.ipm.2019.102066
31. Brown DE, Hagen S (2003) Data association methods with applications to law enforcement. Decis Support Syst 34(3):369–378
32. Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 2008 SIAM International Conference on Data Mining, pp 243–254. https://doi.org/10.1137/1.9781611972788.22
33. Bennell C, Jones NJ, Melnyk T (2009) Addressing problems with traditional crime linking methods using receiver operating characteristic analysis. Leg Criminol Psychol 14(2):293–310. https://doi.org/10.1348/135532508x349336
34. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv e-prints. Available: https://ui.adsabs.harvard.edu/#abs/2013arXiv1301.3781M
35. Tonkin M, Lemeire J, Santtila P, Winter JM (2019) Linking property crime using offender crime scene behaviour: a comparison of methods. Journal of Investigative Psychology and Offender Profiling. https://doi.org/10.1002/jip.1525
36. Papadakis G, Skoutas D, Thanos E, Palpanas T (2020) Blocking and filtering techniques for entity resolution: a survey. ACM Computing Surveys 53(2):1–42. https://doi.org/10.1145/3377455
37. Fellegi I, Sunter A (1969) A theory for record linkage. Journal of the American Statistical Association 64:1183–1210. https://doi.org/10.1080/01621459.1969.10501049
38. Whang SE, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H (2009) Entity resolution with iterative blocking. In: Presented at the international conference on Management of Data. https://doi.org/10.1145/1559845.1559870
39. Gravano L (2001) Approximate string joins in a database (almost) for free. In: VLDB 01: international conference on very large data bases
40. Jin L, Li C, Mehrotra S (2003) Efficient record linkage in large data sets. In: Eighth International Conference on Database Systems for Advanced Applications, 2003 (DASFAA 2003). Proceedings, pp 137–146
41. Hernández MA, Stolfo SJ (1995) The merge/purge problem for large databases. ACM SIGMOD Rec 24(2):127–138
42. Aizawa A, Oyama K (2005) A fast linkage detection scheme for multi-source information integration. In: International Workshop on Challenges in Web Information Retrieval and Integration, pp 30–39
43. Allam A, Skiadopoulos S, Kalnis P (2018) Improved suffix blocking for record linkage and entity resolution. Data Knowl Eng 117:98–113. https://doi.org/10.1016/j.datak.2018.07.005
44. O'Hare K, Jurek-Loughrey A, Campos C (2019) A review of unsupervised and semi-supervised blocking methods for record linkage. In: Linking and Mining Heterogeneous and Multi-view Data. Springer, pp 79–105. https://doi.org/10.1007/978-3-030-01872-6_4
45. Bilenko M, Kamath B, Mooney RJ (2006) Adaptive blocking: learning to scale up record linkage. In: Sixth International Conference on Data Mining (ICDM'06). IEEE, pp 87–96
46. Kejriwal M, Miranker DP (2013) An unsupervised algorithm for learning blocking schemes. In: 2013 IEEE 13th International Conference on Data Mining. IEEE, pp 340–349
47. Nascimento DC, Pires CES, Mestre DG (2019) Exploiting block co-occurrence to control block sizes for entity resolution. Knowl Inf Syst 62(1):359–400. https://doi.org/10.1007/s10115-019-01347-0
48. O'Hare K, Jurek-Loughrey A, de Campos C (2019) An unsupervised blocking technique for more efficient record linkage. Data Knowl Eng 122:181–195
49. Michelson M, Knoblock CA (2006) Learning blocking schemes for record linkage. In: AAAI, vol 6, pp 440–445
50. Ramadan B, Christen P (2015) Unsupervised blocking key selection for real-time entity resolution. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp 574–585
51. Song D, Luo Y, Heflin J (2017) Linking heterogeneous data in the semantic web using scalable and domain-independent candidate selection. IEEE Trans Knowl Data Eng 29(1):143–156. https://doi.org/10.1109/tkde.2016.2606399
52. Carr RD, Doddi S, Konjevod G, Marathe M (2000) On the red-blue set cover problem. In: Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pp 345–353
53. Li Y-S, Qi M-L (2019) An approach for understanding offender modus operandi to detect serial robbery crimes. Journal of Computational Science 36:101024. https://doi.org/10.1016/j.jocs.2019.101024
54. Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186
55. De Caigny A, Coussement K, De Bock KW (2018) A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees. Eur J Oper Res 269(2):760–772. https://doi.org/10.1016/j.ejor.2018.02.009
56. Pedregosa F et al (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12(85):2825–2830
57. Su C, Ju S, Liu Y, Yu Z (2015) Improving random forest and rotation forest for highly imbalanced datasets. Intelligent Data Analysis 19(6):1409–1432. https://doi.org/10.3233/ida-150789
58. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.