
Applied Intelligence (2021) 52:11517–11538

https://doi.org/10.1007/s10489-021-02942-x

A supervised machine learning framework with combined blocking for detecting serial crimes
Yusheng Li1,2 · Xueyan Shao1,2

Accepted: 19 October 2021 / Published online: 27 January 2022


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021

Abstract
Detecting serial crimes is to find criminals who have committed multiple crimes. A classification technique is often used to
process serial crime detection, but the pairwise comparison of crimes is of quadratic complexity, and the number of nonserial
case pairs far exceeds the number of serial case pairs. The blocking method can play a role in reducing pairwise calculation
and eliminating nonserial case pairs. However, most previous studies use a single criterion to
select blocks, which makes it difficult to guarantee an excellent blocking result. Some studies integrate multiple criteria into one
comprehensive index, but the performance is then easily affected by the weighting method. In this paper, we propose a
combined blocking (CB) approach. Each criminal behaviour is defined as a behaviour key (BHK) and used to form a block.
CB learns several weak blocking schemes by different blocking criteria and then combines them to form the final blocking
scheme. The final blocking scheme consists of several BHKs. Because rare behaviour can better identify crime series, each
BHK is assigned a score according to its rarity. BHKs and their scores are used to determine whether a case pair needs to be
compared. Compared with multiple blocking methods, CB effectively preserves the number of serial case pairs
while greatly reducing unnecessary nonserial case pairs. The CB is embedded in a supervised machine learning framework.
Experiments on real-world robbery cases demonstrate that it can effectively reduce pairwise comparison, alleviate the class
imbalance problem and improve detection performance.

Keywords  Serial crime detection · Classification · Pairwise calculation · Blocking · Class imbalance

* Xueyan Shao
xyshao@casisd.cn
Yusheng Li
liyusheng17@mails.ucas.ac.cn
1 Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China
2 School of Public Policy and Management, University of Chinese Academy of Sciences, Beijing 100049, China

1 Introduction

There are many recidivists in the real world, and multiple crimes committed by recidivists are called serial crimes. Studies have shown that serial crimes represent a large proportion of all cases. For example, in the United States (US) and the United Kingdom (UK), nearly 50% of crimes involve only 6%–10% of criminals [1]. Therefore, many countries have asked police to strengthen their detection of serial crimes. Because the decision-making process in public security is a difficult task and the number of crimes is massive [2], it is inefficient, time-consuming, and error-prone to manually judge serial crimes based on human memory [3]. Therefore, an automated serial case linkage algorithm has great significance for reducing crime [4], improving the efficiency of case detection, and maintaining community safety.

Serial crime detection is generally based on the similarity of criminal behaviour [5]. Research on the use of behavioural similarity to detect serial crimes concerns various types of crimes, including serious crimes, such as rape [6, 7] and murder [8], and large-scale crimes, such as theft [9, 10], robbery [11–13] and fraud [14]. The modus operandi (M.O.) concerns the behaviour of criminals when they commit a crime, escape, and protect themselves from arrest [15]. According to crime types, the M.O. of criminals is characterized by different attributes, and each attribute value is a type of criminal behaviour.

Many researchers have applied classification techniques to serial crime detection. Every two crimes constitute a crime pair, and serial crime detection is the task of identifying whether a pair of crimes is serial based on the similarity of their attribute values. Figure 1 depicts a typical workflow for crime series detection. Because each case must be compared with all the remaining cases, scalability becomes an issue. For example, given a crime set consisting of n cases, there would be n × (n − 1)/2 pairwise comparisons. Given this situation, there are two problems to be solved. (1) Data imbalance. Pairwise comparison generates case pairs, so serial crime pairs are always the minority class compared with nonserial crime pairs. Previous studies have mentioned this class imbalance problem [16]. (2) Massive number of pairwise calculations. Because each crime must be compared with every remaining crime to determine whether the same offender committed them, the computational effort of pairwise similarity measures grows quadratically with the number of crimes. Thus, it is problematic to handle large-scale crime. Because serial case pairs are much fewer in number than nonserial case pairs, most similarity calculations are performed on nonserial case pairs, which are not the target of the task. For large-scale crimes, pairwise comparisons are computationally intensive, and class imbalance is more serious.

To solve the above two problems, some measures need to be taken to reduce the number of nonserial case pairs. Blocking technology can achieve this goal. In the blocking step, the objects to be compared are divided into several subsets called blocks, and only objects within the same block are compared with each other [17]. Thus, only a portion of case pairs is selected as promising candidates for similarity computation [18], which can reduce the class imbalance ratio and unnecessary pairwise calculations. Blocking can be divided into two steps: block formation and block selection. Because the behaviours of recidivists are similar in different crimes, the behaviour of offenders can be used to form blocks, and the cases with similar behaviours are divided into the same block. The pairs completeness (PC), reduction ratio (RR), harmonic mean (FM) of RR and PC, and pair quality (PQ) are generally used to select blocks [19].

Blocking methods have been widely used for record linkage and entity matching [19]. However, they are not specifically designed for serial crime detection and do not consider the characteristics of crime series. In addition, some blocking methods use a single criterion or integrate multiple criteria into one comprehensive index to select blocks. It is difficult to guarantee an excellent blocking result via a single criterion, and comprehensive criteria are easily affected by the weighting method.

Due to the limitations mentioned above, we want to consider the characteristics of crime series and design a new blocking method. When some crimes have the same behaviours, they are more likely to be serial crimes if these behaviours are rare [20], so we aim to take the rarity of behaviours into account. For the blocking method, we aim to consider multiple criteria to form a combined blocking method (CB) and do not integrate them into a comprehensive index. CB learns several weak blocking schemes by multiple blocking criteria and then combines them to form the final blocking scheme. Each criminal behaviour is called a behaviour key (BHK). A BHK forms a block that contains the cases sharing this BHK. The final blocking scheme consists of several BHKs, and each BHK is assigned a score according to its rarity. Each case pair receives a combined score according to the BHKs that cover it. Two crimes are more likely to be compared when they share more identical BHKs, and even more so if their common BHKs are rare. Based on this key point, CB is able to pick out serial case pairs and filter out nonserial case pairs. The contributions of our study are as follows:

(1) We use the criminal behaviours to divide crimes into different blocks and put the crimes with the same behaviours into the same block.
(2) We propose a combined blocking method that uses multiple criteria for block selection. Each criterion forms a weak blocking scheme, and the final strong blocking scheme is obtained by combining all the weak blocking schemes. According to the characteristics of serial crimes, the rarity of criminal behaviours is considered. BHKs and their scores are used to select case pairs that need to be compared.
(3) We embed the CB into a supervised machine learning framework for crime series detection. Real-world robbery cases are used to evaluate the performance of CB and serial crime detection. The results show that the CB can reduce pairwise calculations, alleviate the class imbalance problem and improve detection performance.

Section 2 reviews the related studies concerning serial crime detection and blocking. Section 3 gives the basic definitions and the problem to be solved in this paper. Section 4 describes the methodology and proposes the combined blocking, including the learning and application of the blocking scheme. Section 5 conducts some experiments to prove the effectiveness of the combined blocking, and a robbery case set of China is employed. Section 6 presents our conclusion and future plans.

Fig. 1 Typical serial crime detection workflow. Pairing: every two crimes form a crime pair. Comparison: calculate the similarity of criminal behaviour in each crime pair. Classification: determine whether a crime pair is serial


2 Literature review

In this section, we review the related literature on serial crime detection from the two aspects of similarity measures and detection methods. Then, we discuss the research on blocking technology regarding the formation and selection of blocks.

2.1 Serial crime detection

Given the importance and challenges of serial crime detection, there have been many studies in this field, and various methods have been used to detect crime series. Recently, machine learning methods have been applied, and both supervised and unsupervised methods are involved. The unsupervised method mainly finds similar cases to form clusters. Borg and Boldt used four clustering algorithms to identify crime series based on behavioural similarity [21]. Zhu and Xie employed the Restricted Boltzmann Machine (RBM) to obtain the co-occurrence pattern in criminal behaviour to link crimes [22]. Supervised methods treat serial crime detection as a binary classification task, and the methods used include neural networks [13], logistic regression [7, 23, 24], decision trees [25], and Bayesian classification [26], etc. Apart from these, various other approaches are also applied. Reich and Porter designed a semisupervised Bayesian model-based clustering algorithm to group similar crimes [27]. Some researchers have applied fuzzy multicriteria decision making (MCDM) to combine several attributes into a single value denoting the overall similarity between crimes [28, 29]. Qazi and William extracted reasonable correlations by combining human interaction with the machine learning method to identify crime series [30]. Supervised methods can achieve better results because they learn domain knowledge from historical data. According to the process of detecting serial crimes in Fig. 1, the attribute similarity needs to be calculated after the cases form case pairs. Different attribute types have their own similarity measurement methods. The absolute distance is often used for the comparison of numerical attributes [31]. For categorical attributes, binarization is usually the first step [32], and then a binary distance measurement method such as Jaccard's coefficient is used to calculate the distance of the attribute value [33]. The values of some crime attributes are keywords in the case narration. To understand their real meaning, word2vec can be used to measure the distance of keyword attributes [34]. In reality, after cases are paired, the number of nonserial case pairs is far greater than the number of serial case pairs [16, 35]. To improve the detection effect, this class imbalance problem needs to be solved.

2.2 Blocking

Blocking methods divide objects into several blocks (subsets of all objects) according to attribute values, and only objects within the same block are compared with each other [36]. Some studies focus on how to produce better blocks (block formation). Others study which blocks should be selected to constitute the blocking scheme (block selection).

In the formation of blocks, the traditional block method inserts each object into only one block [37]. Its disadvantage is its low fault tolerance: improper formation will easily insert two matching objects into different blocks. To make up for such shortcomings, one solution is that each object can be inserted into multiple blocks [38]. In addition, researchers change the string of attribute values to make one block contain more matching objects and fewer nonmatching objects. Q-gram assumes that the attribute values are all strings and uses a substring of length q to form blocks [39]. Objects only need to have the same substring. Even if the attribute values do not match exactly, they can be inserted into the same block. String-map based blocking maps strings into multidimensional Euclidean space so that the distance between strings will not change and then groups similar strings in a block [40]. There are also methods to sort strings. Sorted Neighbourhood sorts all preset strings in alphabetical order and then slides a fixed-size window on the sorted record to divide the record in each window into a block [41]. Suffix Array is a combination of Sorted Neighbourhood and Q-gram [42, 43]. The attribute values and their suffixes are both sorted, which is the input of Sorted Neighbourhood blocking technology. The above methods can form good blocks, but they all assume that attribute values are strings.

In the selection of blocks, the goal of the blocking method is to select several blocks to form a blocking scheme so that all or nearly all matching pairs can be contained, and the number of nonmatching pairs is as small as possible. According to this goal, blocking schemes are usually appraised by the reduction ratio (RR), pairs completeness (PC), harmonic mean (FM) of RR and PC, and pair quality (PQ). These criteria can be used to select the optimal blocks to form a blocking scheme [44]. Bilenko, Kamath, and Mooney used PQ as a criterion for evaluating blocks, sorted the PQ values of each block in descending order, and added blocks to the blocking scheme one by one until the matching pairs covered exceeded the threshold [45]. Based on this learning process, Kejriwal and Miranker also introduced a new Fisher score criterion to learn the blocking scheme [46]. Nascimento, Pires and Mestre exploited the co-occurrence of entities to prune, split and merge blocks to guarantee the block size [47]. O'Hare et al. assigned the FM of a block as its weight, selected the top 3 as a blocking filter, and then iteratively clustered the records into blocks [48]. To divide the blocks better, some scholars have attempted the use of multiple criteria to select the block. Michelson and Knoblock devised a blocking method that selects each block with the higher RR while maintaining the PC above a threshold
[49]. Ramadan and Christen considered the Fisher score, block size, and block size distribution to construct a comprehensive index by weighting [50]. Song, Luo and Heflin designed a block selection scheme in which records are divided into a set of triples composed of subject, attribute and object, and a block needs to have high discriminability and coverage to be selected [51]. When multiple indicators are integrated into one criterion, the weight is difficult to determine. Based on these methods and criteria, we propose a combined blocking method that considers multiple blocking criteria and does not need to combine multiple criteria into a comprehensive index. Every single criterion learns a weak blocking scheme, and they together form the final blocking scheme.

3 Preliminary and problem definition

Letting $C = \{c_1, \dots, c_i, \dots, c_j, \dots, c_n\}$ be a set of cases, this study aims to find serial crimes in $C$. In this section, we first give several basic definitions and then describe the problems in this paper. The notations are listed in Table 1.

Definition 1 (Crime pair) Each crime $c_i$ has $M$ attributes, $a_i = \langle a_i^1, \dots, a_i^m, \dots, a_i^M \rangle$. A crime pair $p_{ij}$ consists of two crimes $c_i$ and $c_j$. For $n$ crimes, the number of crime pairs is $n \times (n-1)/2$.

Definition 2 (Blocking) Given a crime set $C$, the blocking method divides it into several subsets $C_1, C_2, \dots, C_i$. Each subset is a block, and crime pairs are formed by crimes within the same block.

Table 1 Notation information

Notation    Description
$C$         The crime set
$C_S$       The solved crime set
$C_u$       The unsolved crime set
$n$         The number of crimes
$N$         The number of crime pairs, $N = n \times (n-1)/2$
$p_{ij}$    The crime pair formed by crimes $c_i$ and $c_j$, where $i, j \in \{1, 2, \dots, n\}$, $i < j$
$M$         The number of attributes
$a^m$       The attribute $m$
$a_i^m$     The value of the attribute $a^m$ of crime $c_i$
$a_i$       The attribute vector of $c_i$, $a_i = (a_i^1, \dots, a_i^m, \dots, a_i^M)$
$sim^m$     The similarity measure of the attribute $a^m$
$S_{ij}^m$  The similarity of $p_{ij}$ in attribute $a^m$
$S_{ij}$    The similarity vector of $p_{ij}$, where $S_{ij} = (S_{ij}^1, \dots, S_{ij}^m, \dots, S_{ij}^M)$
$y_{ij}$    The label of the crime pair $p_{ij}$, $y_{ij} \in \{0, 1\}$

Definition 3 (Similarity measure) The similarity measure $sim^m$ quantifies the attribute similarity of a crime pair $p_{ij}$ on the attribute $a^m$. The calculation can be represented as $S_{ij}^m = sim^m(a_i^m, a_j^m)$. A high $S_{ij}^m$ means that crimes $c_i$ and $c_j$ are very similar on the attribute $a^m$.

Definition 4 (Similarity vector) The similarities of crime pair $p_{ij}$ on the $M$ attributes are $S_{ij}^1, \dots, S_{ij}^m, \dots, S_{ij}^M$, which can be collected in a similarity vector $S_{ij}$.

$$S_{ij} = \langle S_{ij}^1, \dots, S_{ij}^m, \dots, S_{ij}^M \rangle \tag{1}$$

Definition 5 (Supervised serial crime detection) Given solved crimes $C_S$, form labelled crime pairs $T = \{S_{ij}, y_{ij}\}_{u \times (M+1)}$, including serial crimes ($y_{ij} = 1$) and nonserial crimes ($y_{ij} = 0$), to train a classifier. For unsolved crimes $C_u$, form unlabelled crime pairs $V = \{S_{ij}\}_{v \times M}$ within $C_u$ and between $C_S$ and $C_u$. Apply the trained classifier to determine the labels of the unlabelled crime pairs.

The pairwise calculations grow quadratically with the number of crimes. Scaling the above tasks to large-scale crimes is problematic. After pairing, because the scale of nonserial case pairs is much larger than that of serial ones, there are plenty of unnecessary calculations. This paper aims to reduce the comparison between crimes and the computational overhead during crime series detection while ensuring the quality of the pairwise comparison.

4 Methodology

4.1 Overview of the methodology

Crime linkage mainly has three steps: pairing, comparison and classification (Fig. 1). The blocking method plays a role in the pairing step. Figure 2 details the process of detecting serial crimes with blocking. The process contains two parts. (1) Training, which uses the solved cases to learn the blocking scheme and train the classifier. (2) Detection, of which the goal is to determine whether an unlabelled crime pair is serial. There are four main phases as follows.

• Learning blocking schemes: Based on the solved crimes, learn the combined blocking, i.e., rules for dividing the crime set into blocks.
• Applying blocking schemes: For the training part, employ the blocking scheme on solved crimes to generate labelled crime pairs for training. When detecting serial crimes, a blocking scheme is applied to the union of solved crimes and unsolved crimes to generate unlabelled crime pairs related to unsolved cases because serial case pairs may form between unsolved cases or between unsolved cases and solved cases.
• Comparison: Similarity vectors of case pairs are calculated by the corresponding similarity measure.
• Classification: Employ the machine learning classification algorithm to train on labelled case pairs, obtain a classifier, and classify similarity vectors of unlabelled crime pairs. The classification output is whether a case pair is serial.

Fig. 2 The framework for detecting serial crimes with combined blocking (panel labels: Training and Detection rows; solved crimes and unsolved crimes; labelled and unlabelled crime pairs; combined blocking; selected crime pairs; similarity measures; similarity vectors; detection algorithm; Y/N serial crime pair; phases: learning blocking scheme, applying blocking scheme, comparison, classification)

4.2 Learning blocking schemes

4.2.1 Description of blocking

The blocking for serial crime detection aims to identify case pairs that share common behaviour characteristics to produce candidate case pairs for comparison. In this paper, we define each behaviour as a behaviour key (BHK). Crimes covered by the same BHK indicate that the criminals have the same behaviour. These crimes are more likely to be serial crimes. BHK is the primary element that constitutes the blocking scheme.

Definition 6 (Behaviour key). A behaviour key is a type of criminal behaviour, $k = \langle a^m, v \rangle$, where $a^m$ is an attribute and $v$ represents a value of the attribute $a^m$. BHKs determine which crimes are covered, and each BHK corresponds to a specific block.

Example. Given eight cases and their three attributes (Fig. 3a), Figure 3b shows the set of BHKs for these eight cases. The number of suspects has three values, i.e., 1, 2, and 3. <Nsuspects, 2> is a behaviour key, and the cases it covers, {c1, c3}, constitute a block (see Fig. 3c).

Fig. 3 An example of crimes, behaviour keys, and a blocking scheme. a Eight cases and their three attributes. b BHKs for the 8 cases. c A blocking scheme

(a) Attributes: number of suspects (Nsuspects), ways of threat (Wthreat), ways to get property (Wproperty)

Case  Nsuspects  Wthreat          Wproperty
c1    2          Violence threat  Snatched
c2    1          Violence threat  Searched
c3    2          Violence threat  Snatched
c4    1          Violence threat  Asked for
c5    1          Speech threat    Searched
c6    1          Knife threat     Asked for
c7    1          Knife threat     Asked for
c8    3          Speech threat    Snatched

(b) Behaviour keys (BHKs)

Nsuspects  Wthreat          Wproperty
1          Violence threat  Snatched
2          Knife threat     Searched
3          Speech threat    Asked for

(c) Blocking scheme

<Nsuspects, 2>:           c1, c3
<Wthreat, Knife threat>:  c6, c7
<Wproperty, Snatched>:    c1, c3, c8

Definition 7 (Blocking scheme). Because criminal behaviours are various, it is difficult for a single block to match all serial crimes. Multiple BHKs are required to form a blocking scheme. A blocking scheme $\Gamma$ is a disjunction or conjunction of BHKs.

Example. Given a set of individual behaviour keys, $K = \{k_1, \dots, k_k\}$, the disjunctive blocking scheme is $k_1 \cup \dots \cup k_j$, the conjunctive blocking scheme is $k_1 \cap \dots \cap k_j$, and the disjunctive normal form blocking scheme is $(k_1 \cap \dots \cap k_j) \cup \dots \cup (k_1 \cap \dots \cap k_j)$. Figure 3c is a disjunctive blocking scheme formed by <Nsuspects, 2> ∪ <Wthreat, Knife threat> ∪ <Wproperty, Snatched>.

The blocking problem for crime series detection can be described as follows. Given a set of BHKs, the optimization goal is to determine the optimal blocking scheme so that all or nearly all serial crime pairs can be covered and the generated nonserial crime pairs can be as few as possible. Formally, this objective can be expressed by Eq. (2):

$$\Gamma^* = \arg\min_{\Gamma} |\Gamma(p)| \quad \text{s.t.} \quad \sum_{p_{ij} \in \Gamma(p)} y_{ij} \ge \varepsilon \tag{2}$$

where $\Gamma(p)$ means the case pairs covered by the blocking scheme $\Gamma$. $y_{ij} = 1$ means that $p_{ij}$ is a serial case pair ($y_{ij} \in \{0, 1\}$), so $\sum_{p_{ij} \in \Gamma(p)} y_{ij}$ is the number of serial case pairs in $\Gamma(p)$. $\varepsilon$ means that $\Gamma$ should cover at least $\varepsilon$ serial case pairs. This problem is NP-hard [46, 52], and it is generally recommended to use approximate algorithms for blocking. A combined blocking method is proposed in Section 4.2.3. To cover as many serial case pairs as possible, this paper only discusses the disjunctive blocking scheme.

4.2.2 Metrics of blocking

Although blocking schemes can reduce comparison calculation, not all blocking schemes should be adopted or are equally good. For example, a blocking scheme can significantly reduce computation, but if serial crimes are divided
into different blocks, the result is serial crimes that cannot be detected. This is not a promising blocking scheme. Alternatively, if a blocking scheme can divide all serial crimes into the same block but does not contribute much to calculation reduction, it is also not an excellent blocking scheme. There is a trade-off between dividing serial crimes into the same block and reducing the number of comparisons. Let $\Psi$ be the number of candidate case pairs covered by the blocking scheme, $N$ the number of overall case pairs, $\Psi_s$ the number of serial case pairs covered by the blocking scheme, and $N_s$ the number of overall serial case pairs. Pairs completeness (PC) measures the proportion of serial crimes in the same block (Eq. (3)). The reduction ratio (RR) measures the amount of calculation that the blocking scheme can eliminate (Eq. (4)).

$$PC = \frac{\Psi_s}{N_s} \tag{3}$$

$$RR = 1 - \frac{\Psi}{N} \tag{4}$$

To express the trade-off between the above two metrics, their harmonic mean $F_{PC,RR}$ is computed. It can be expressed by Eq. (5):

$$F_{PC,RR} = \frac{2 \times RR \times PC}{RR + PC} \tag{5}$$

Pairs quality (PQ) is the proportion of serial case pairs among the candidate case pairs covered by the blocking scheme (Eq. (6)).

$$PQ = \frac{\Psi_s}{\Psi} \tag{6}$$

4.2.3 Combined blocking

According to the calculation formulas of PC, RR, and PQ, we know that greater PC means that more serial crimes are divided into the same block, greater RR means that more nonserial crime pairs are removed, and greater PQ means that the proportion of covered serial case pairs is higher. If the selected blocks are too small, they may still produce high PC, RR and PQ on the training data but fail to cover some serial case pairs in actual application. A large block size increases the likelihood that a block contains more crime series in actual use, so we also consider the block size (BS), that is, the number of case pairs in the block, which can make the blocking scheme more robust. $k_i(p)$ denotes the set of case pairs covered by BHK $k_i$, and $|k_i(p)|$ is its block size.

We do not integrate the above four criteria into a comprehensive criterion but instead design a combined blocking method. The process of CB is shown in Fig. 4. First, BHKs are generated from the crime set. Second, four weak blocking schemes are learned based on the above criteria, each composed of several BHKs. Finally, a strong blocking scheme is obtained by combining these weak blocking schemes. In the strong blocking scheme, if a BHK is included in every weak blocking scheme, it is a strong BHK; otherwise, it is a candidate BHK. A strong BHK is assigned a score of infinity and a candidate BHK is given a score based on its rarity. BHKs and their scores are used to determine whether a case pair needs to be compared.
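
The four selection criteria referred to above (PC, RR, their harmonic mean, PQ, and the number of candidate pairs) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the blocks are those of the disjunctive scheme in Fig. 3c, and the serial pairs p13, p25 and p67 are the ones the paper assumes later when discussing Table 2.

```python
from fractions import Fraction
from itertools import combinations


def candidate_pairs(blocks):
    """Union of all within-block case pairs (a disjunctive blocking scheme)."""
    pairs = set()
    for cases in blocks:
        pairs.update(combinations(sorted(cases), 2))
    return pairs


def blocking_metrics(blocks, n_cases, serial_pairs):
    """Compute PC, RR, F and PQ following Eqs. (3)-(6)."""
    cand = candidate_pairs(blocks)
    psi = len(cand)                          # candidate case pairs (Psi)
    psi_s = len(cand & serial_pairs)         # covered serial case pairs (Psi_s)
    n_pairs = n_cases * (n_cases - 1) // 2   # all case pairs (N)
    pc = Fraction(psi_s, len(serial_pairs))
    rr = 1 - Fraction(psi, n_pairs)
    f = 2 * rr * pc / (rr + pc) if rr + pc else Fraction(0)
    pq = Fraction(psi_s, psi) if psi else Fraction(0)
    return {"PC": pc, "RR": rr, "F": f, "PQ": pq, "candidates": psi}


# Blocks of Fig. 3c: <Nsuspects, 2>, <Wthreat, Knife threat>, <Wproperty, Snatched>.
blocks = [{1, 3}, {6, 7}, {1, 3, 8}]
# Assumed ground truth for illustration: p13, p25 and p67 are serial.
serial = {(1, 3), (2, 5), (6, 7)}
metrics = blocking_metrics(blocks, n_cases=8, serial_pairs=serial)
print(metrics)
```

In this toy setting the scheme keeps 4 of the 28 possible pairs and covers 2 of the 3 assumed serial pairs, which shows the trade-off between PC and RR discussed in Section 4.2.2.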


• Weak blocking scheme

Solved cases are used to learn weak blocking schemes. The pseudocode is described in Algorithm 1. First, discard the BHKs that do not cover serial case pairs, because serial crimes do not exhibit the characteristics of these BHKs. $r(k)$ denotes the serial case pairs covered by the BHK $k$. Second, we calculate the score of each BHK according to different blocking criteria and then sort them according to their scores. Third, $|r(\Gamma)|$ denotes the number of serial case pairs that a blocking scheme $\Gamma$ covers. Add BHKs to the weak blocking scheme $\Gamma$ until $|r(\Gamma)|$ is no less than $\varepsilon$ or the set of BHKs is empty. A large $\varepsilon$ specifies a blocking scheme to cover more serial case pairs.

• Strong blocking scheme

After the above steps, four weak blocking schemes are obtained, i.e., four sets of BHKs. The strong blocking scheme combines them. The BHKs included in all weak blocking schemes are strong because they are selected under different blocking criteria. They are denoted as $K'$ (Eq. (7)).

$$K' = \{k \mid k \in (\Gamma_{RR} \cap \Gamma_{PC} \cap \Gamma_{PQ} \cap \Gamma_{BS})\} \tag{7}$$

$\Gamma_{RR}$ represents the weak blocking scheme learned by the criterion RR. BS means the block size. The remaining BHKs in each blocking scheme are called candidate BHKs, and the set is expressed as $K''$ (Eq. (8)).

$$K'' = \{k \mid k \in (\Gamma_{RR} \cup \Gamma_{PC} \cup \Gamma_{PQ} \cup \Gamma_{BS}),\ k \notin K'\} \tag{8}$$

According to the BHKs in the strong blocking scheme, the calculation process to determine whether a case pair should be compared is as follows. The combined score of a case pair is the key and it is related to BHKs that cover this case pair. Each BHK is given a score, and the combined score of a case pair is the sum of the scores of all the BHKs that cover it.
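
The weak-scheme learning loop and the split into strong and candidate BHKs (Eqs. (7) and (8)) described above can be sketched in Python. This is a minimal sketch, not the paper's Algorithm 1: the function names, the per-BHK form of each criterion, the alphabetical tie-breaking, the choice ε = 3, and the assumed serial pairs (p13, p25, p67) are all illustrative assumptions. The blocks are those of the eight cases in Fig. 3a.

```python
from itertools import combinations


def block_pairs(cases):
    """All case pairs (i, j), i < j, formed inside one block."""
    return set(combinations(sorted(cases), 2))


def learn_weak_scheme(blocks, serial_pairs, criterion, epsilon):
    """Greedy weak-scheme learning for a single criterion (higher score is better)."""
    # Step 1: discard BHKs that cover no serial case pair.
    useful = {k: p for k, p in blocks.items() if p & serial_pairs}
    # Step 2: rank BHKs by the criterion (ties broken by BHK name; an assumption).
    ranked = sorted(useful, key=lambda k: (-criterion(useful[k], serial_pairs), k))
    # Step 3: add BHKs until epsilon serial pairs are covered or none remain.
    scheme, covered = set(), set()
    for k in ranked:
        if len(covered) >= epsilon:
            break
        scheme.add(k)
        covered |= useful[k] & serial_pairs
    return scheme


def combine(weak_schemes):
    """Strong BHKs appear in every weak scheme (Eq. 7); the rest are candidates (Eq. 8)."""
    strong = set.intersection(*weak_schemes)
    candidate = set.union(*weak_schemes) - strong
    return strong, candidate


cases_by_bhk = {
    ("Nsuspects", "1"): {2, 4, 5, 6, 7},
    ("Nsuspects", "2"): {1, 3},
    ("Nsuspects", "3"): {8},
    ("Wthreat", "Violence threat"): {1, 2, 3, 4},
    ("Wthreat", "Knife threat"): {6, 7},
    ("Wthreat", "Speech threat"): {5, 8},
    ("Wproperty", "Snatched"): {1, 3, 8},
    ("Wproperty", "Searched"): {2, 5},
    ("Wproperty", "Asked for"): {4, 6, 7},
}
blocks = {k: block_pairs(c) for k, c in cases_by_bhk.items()}
serial = {(1, 3), (2, 5), (6, 7)}   # assumed serial pairs
N = 8 * 7 // 2                      # all case pairs among eight cases

criteria = {                        # assumed per-BHK versions of the four criteria
    "PC": lambda p, s: len(p & s) / len(s),
    "RR": lambda p, s: 1 - len(p) / N,
    "PQ": lambda p, s: len(p & s) / len(p),
    "BS": lambda p, s: len(p),
}
weak = {name: learn_weak_scheme(blocks, serial, f, epsilon=3)
        for name, f in criteria.items()}
strong, candidate = combine(list(weak.values()))
print(weak, strong, candidate)
```

With these toy inputs the four criteria select different BHKs (RR and PQ favour small, pure blocks, while BS favours large ones), so no BHK is common to all four weak schemes and every selected BHK ends up as a candidate BHK.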


Fig. 4  The process of combined blocking

Because BHKs in $K'$ are strong BHKs, we give them a score of infinity. The scores of BHKs in $K''$ are determined based on the number of case pairs they cover. If a BHK covers some cases, these cases use this type of behaviour. The fewer case pairs a BHK covers, the rarer the behaviour pattern is. Research has shown that when some crimes have the same behaviours, they are more likely to be serial crimes if these behaviours are rare [20]. Therefore, if a BHK covers fewer case pairs, it indicates that these case pairs are more likely to be serial case pairs, and the BHK corresponds to a higher score. The formula is shown as Eq. (9):

$$\mu(k) = \begin{cases} +\infty, & k \in K' \\ \dfrac{1}{|k(p)|}, & k \in K'' \end{cases} \tag{9}$$

$|k(p)|$ represents the number of case pairs covered by the BHK $k$. Based on the cases in Fig. 3, the BHKs are obtained in Fig. 5 and their scores are calculated based on the number of case pairs they cover. For example, the BHK "Nsuspects-1" covers 10 case pairs, so its score is 1/10 (Fig. 5b).

Fig. 5 An example of BHKs and their score. a Cases covered by each BHK. b Case pairs contained in each block

(a) Cases covered by each BHK

Nsuspects 1:              c2, c4, c5, c6, c7
Nsuspects 2:              c1, c3
Nsuspects 3:              c8
Wthreat Violence threat:  c1, c2, c3, c4
Wthreat Knife threat:     c6, c7
Wthreat Speech threat:    c5, c8
Wproperty Snatched:       c1, c3, c8
Wproperty Searched:       c2, c5
Wproperty Asked for:      c4, c6, c7

(b) Case pairs contained in each block

Nsuspects 1:              p24, p25, p26, p27, p45, p46, p47, p56, p57, p67
Nsuspects 2:              p13
Wthreat Violence threat:  p12, p13, p14, p23, p24, p34
Wthreat Knife threat:     p67
Wthreat Speech threat:    p58
Wproperty Snatched:       p13, p18, p38
Wproperty Searched:       p25
Wproperty Asked for:      p46, p47, p67

A case pair may be covered by multiple BHKs, indicating that there are multiple identical behaviours, so the pair is likely to be serial. Therefore, the combined score of a case pair is the sum of scores of BHKs covering it. If a case pair is covered by BHKs with a high combined score, indicating that the shared behaviour is rare, the case pair is more likely to be serial. The combined score calculation of a case pair is shown in Eq. (10).

$$\pi(p_{ij}) = \sum_{k_i \in K(p_{ij})} \mu(k_i) \tag{10}$$

$K(p_{ij})$ indicates the set of BHKs covering the case pair $p_{ij}$. According to the example in Fig. 5, the combined score of case pair $p_{13}$ is $1 + \frac{1}{6} + \frac{1}{3} = \frac{3}{2}$. The higher the combined
score of a case pair is, the more likely the case pair is serial. In order to include more serial case pairs, we use the minimum score among serial case pairs as the threshold θ, and a strong blocking scheme would only cover the case pairs whose scores are larger than this threshold. Table 2 is the combined scores for all case pairs mentioned in Fig. 5b. Among all the case pairs, suppose p13, p25 and p67 are serial crime pairs, and their combined scores are 3/2, 11/10, 43/30 respectively, so the threshold θ is 11/10.

The details of learning the strong blocking scheme are shown in Algorithm 2. First, four weak blocking schemes are obtained, and strong BHKs $K'$ and candidate BHKs $K''$ are found. Second, the score of each candidate BHK $\mu(k)$ is calculated, as well as the minimum score of serial crime pairs $r(K'')$ covered by candidate BHKs $K''$.


Table 2  The combined score for all case pairs

Case pair       p12   p13   p14   p18    p23    p24   p25    p26   p27
Combined score  1/6   3/2   1/6   1/3    1/6    4/15  11/10  1/10  1/10

Case pair       p34   p38   p45   p46    p47    p56   p57    p58   p67
Combined score  1/6   1/3   1/10  13/30  13/30  1/10  1/10   1     43/30
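
The worked example of Fig. 5 and Table 2 can be reproduced with a few lines of Python. The following is a minimal sketch, not the authors' implementation: it assumes that every BHK in the example is a candidate BHK (none is strong), and it assumes that a pair scoring exactly θ is kept, so that the lowest-scoring serial pair is retained. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Cases covered by each BHK, transcribed from Fig. 5a (case ci is written as i).
bhk_cases = {
    "Nsuspects=1": [2, 4, 5, 6, 7],
    "Nsuspects=2": [1, 3],
    "Nsuspects=3": [8],
    "Wthreat=Violence threat": [1, 2, 3, 4],
    "Wthreat=Knife threat": [6, 7],
    "Wthreat=Speech threat": [5, 8],
    "Wproperty=Snatched": [1, 3, 8],
    "Wproperty=Searched": [2, 5],
    "Wproperty=Asked for": [4, 6, 7],
}

# The block of a BHK holds all unordered pairs of the cases it covers (Fig. 5b).
blocks = {k: set(combinations(sorted(c), 2)) for k, c in bhk_cases.items()}

# Eq. (9), candidate branch: score = 1 / number of covered case pairs.
# A BHK covering a single case forms no pair and receives no score here.
mu = {k: Fraction(1, len(pairs)) for k, pairs in blocks.items() if pairs}

# Eq. (10): combined score = sum of the scores of all BHKs covering the pair.
combined = {}
for k, pairs in blocks.items():
    for pair in pairs:
        combined[pair] = combined.get(pair, 0) + mu[k]

# Serial pairs assumed in the text: p13, p25 and p67.
serial_pairs = [(1, 3), (2, 5), (6, 7)]
theta = min(combined[p] for p in serial_pairs)

# Assumption: ties with theta are kept (>=), see the lead-in.
selected = sorted(p for p, s in combined.items() if s >= theta)
print(theta, selected)  # 11/10 [(1, 3), (2, 5), (6, 7)]
```

In this toy example the threshold removes 15 of the 18 candidate pairs while keeping all three serial pairs, which is the behaviour the combined score is designed to achieve.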

Table 3  Information on attributes and similarity measures

Attribute  Description                                       Type         Measure
ON         The number of offenders                           Numeric      Absolute
WT         What tools did the offender use, e.g., a knife    Categorical  Jaccard
HD         How the offender deceived the victim              Categorical  Jaccard
HH         How the offender harmed the victim                Categorical  Jaccard
HR         How the offender robbed goods                     Categorical  Jaccard
HT         How the offender threatened the victim            Categorical  Jaccard
WH         Which part of the victim was harmed               Categorical  Jaccard
WR         What was robbed by offenders                      Categorical  Jaccard
HC         How the offender controlled the victim            Keywords     word2vec
HB         How the offender broke through obstacles          Keywords     word2vec
WA         What actions did the offender take                Keywords     word2vec

4.3 Applying blocking schemes

An unlabelled serial case pair may be formed by two unsolved cases or by a solved case and an unsolved case. In the former situation, crime analysts can treat the two crimes as one crime. The latter situation indicates that the offender of the unsolved case had previously committed a crime, and further investigation of the suspect can be conducted.

The blocking scheme application can be divided into two parts: the generation of labelled case pairs and unlabelled case pairs. The labelled case pairs are used to train a classifier. The unlabelled case pairs are obtained in the detection phase. For labelled case pairs (Algorithm 3), the two cases of one case pair are both from solved cases Cs. Case pairs covered by strong BHKs K′ and their corresponding labels (pij, yij) are directly added to the set of labelled case pairs T. The combined scores of the remaining case pairs are calculated according to the scores of candidate BHKs K″, and only those that exceed the threshold θ are added to T. In this phase, a strong blocking scheme is learned. For unlabelled case pairs (Algorithm 4), at least one of the two cases in a case pair belongs to the unsolved case set Cu, and the label of the case pair is unknown. The selection of unlabelled case pairs is similar to that of labelled case pairs.

After obtaining the case pairs for comparison, their similarity vectors will be calculated according to similarity measures, as shown in Section 4.4.

Table 4  Confusion matrix

                       Predicted label
                       Positive  Negative  Total
Real label  Positive   TP        FN        P
            Negative   FP        TN        N
            Total      P′        N′        P + N

• Similarity measures

There are three types of crime attributes: numeric attributes, categorical attributes, and keyword attributes. Each attribute type corresponds to a similarity measure method. Absolute distance is applied to measure the similarity of numeric values, Jaccard’s coefficient is applied for categorical attributes, and word2vec is used to find the difference between keywords. Our previous work discussed the detailed calculation process of these similarity measures [53]. The basic formulations are as follows.

• The similarity of numeric attributes

The similarity measure of numeric attributes is shown as Eq. (11).


\[
Sim^{m}_{ab\_dist}\left(a^{m}_{i}, a^{m}_{j}\right) = 1 - \frac{\left|a^{m}_{i} - a^{m}_{j}\right|}{a^{m}_{max} - a^{m}_{min}}
\tag{11}
\]

where $a^{m}_{i}$ is the value of the case $c_i$ on the attribute $a^{m}$, $a^{m}_{max}$ denotes the maximum value of the attribute $a^{m}$ and $a^{m}_{min}$ denotes the minimum value.

• The similarity of categorical attributes

The similarity of categorical attributes is shown as Eq. (12). It is a modified Jaccard’s coefficient.

\[
Sim^{m}_{Jaccard}\left(a^{m}_{i}, a^{m}_{j}\right) =
\begin{cases}
\dfrac{\sum_{f \in q} \omega^{m}_{f}}{|q| + |r| + |s|} \\
\omega^{m}_{NULL}, & a^{m}_{i} = a^{m}_{j} = NULL \\
0, & a^{m}_{i} \text{ or } a^{m}_{j} \text{ is } NULL
\end{cases}
\tag{12}
\]

\[
\omega^{m}_{f} = 1 - \frac{n^{m}_{f}}{n}
\tag{13}
\]

where q represents the set of attribute values that are present in both cases, and r and s represent the sets of attribute values that are present in one case but absent in the other case. In the traditional Jaccard’s coefficient, the similarity of two sets is calculated as |q|/(|q| + |r| + |s|). We change its numerator by assigning weights $\omega^{m}_{f}$ to different attribute values. $n^{m}_{f}$ denotes the number of cases with the attribute value f on the attribute $a^{m}$ and n denotes the total number of cases, so $\omega^{m}_{f} \in [0, 1]$ and $\sum_{f \in q} \omega^{m}_{f}$ is equal to $|q| - \sum_{f \in q} n^{m}_{f}/n$. If the value is missing on the attribute $a^{m}$ in both cases, the similarity of the case pair on this attribute is $\omega^{m}_{NULL}$ (calculated according to Eq. (13)). When one of the two cases is missing on attribute $a^{m}$ but the other case is not, the similarity is set to 0.

• The similarity of keyword attributes

The similarity measure of keyword attributes is shown as Eq. (14). The similarity of two words w1 and w2 is measured by word vectors [34], which is expressed as W2V(w1, w2).

\[
Sim^{m}_{word2vec}\left(a^{m}_{i}, a^{m}_{j}\right) =
\begin{cases}
\max_{w} W2V\left(a^{m}_{i,w}, a^{m}_{j,w}\right), & a^{m}_{i,w} \in a^{m}_{i},\ a^{m}_{j,w} \in a^{m}_{j} \\
\omega^{m}_{NULL}, & a^{m}_{i} = a^{m}_{j} = NULL \\
0, & a^{m}_{i} \text{ or } a^{m}_{j} \text{ is } NULL
\end{cases}
\tag{14}
\]

In Eq. (14), when a case ci has multiple keywords on the attribute $a^{m}$, each keyword is represented as $a^{m}_{i,w}$ ($a^{m}_{i,w} \in a^{m}_{i}$). The similarity of the keyword attribute is the maximum value of word similarity among all keywords. When the two cases are both missing on the attribute $a^{m}$, the similarity of the case pair on this attribute is $\omega^{m}_{NULL}$, which is calculated in the same way as Eq. (13).

5 Experiments

5.1 Datasets

The crime dataset in this work was collected from judicial documents on the OpenLaw website. The dataset covers solved robbery crimes from January 2013 to October 2018 in Zhengzhou City, Henan Province, China.

The case set contains 364 cases involving 292 offenders, 111 of which are serial crimes. Among crime series, the maximum number of cases committed by a single offender was 9. After pairing, a total of 66,066 case pairs were formed, of which 158 were serial case pairs and 65,908 were nonserial ones. The M.O. of each crime is represented by 11 attributes based on previous studies. The attribute values are manually obtained. Table 3 describes the attribute information and similarity measures.

5.2 Evaluation criteria

Table 4 shows the confusion matrix for detecting serial crimes. We evaluate the detection performance using the indexes of this matrix. ‘Positive’ indicates that the case pair is serial, and ‘Negative’ indicates that the case pair is nonserial. P is the number of actual positive samples, P′ is the number of predicted positive samples, N is the number of actual negative samples and N′ is the number of predicted negative samples.

According to the above parameters, the metrics precision, recall, F-measure (FM), and G-mean (GM) are defined. For FM, F1 (b = 1) is used to evaluate the detection effect.

\[
Precision = \frac{TP}{TP + FP}
\tag{15}
\]

\[
Recall = \frac{TP}{TP + FN}
\tag{16}
\]

\[
FM = \frac{\left(1 + b^{2}\right) \times Precision \times Recall}{b^{2} \times Precision + Recall}
\tag{17}
\]

\[
GM = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}
\tag{18}
\]

We also evaluate the predictive performance of classifiers. The area under the receiver operating characteristics curve (AUC) [54] is independent of the choice of classification threshold, because it considers the predicted probabilities of all samples. The equivalent calculation for AUC is shown as Eq. (19), where $rank_i$ is the rank value of the sample i, P is the number of positive samples and N is the number of negative samples. The Lift measures how much better the predictive ability of the model is compared to random selection (see Eq. (20)), and it is often used in practical problems [55]. Because the proportion of serial crime pairs is small, we use the top 5-th percentile lift, i.e., the proportion of serial crime pairs in the top 5% is compared with the proportion of serial crime pairs in the whole crime set.

\[
AUC = \frac{\sum_{i \in positiveClass} rank_{i} - \frac{P \times (1 + P)}{2}}{N \times P}
\tag{19}
\]

\[
Lift = \frac{TP / (TP + FP)}{P / (P + N)}
\tag{20}
\]
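
The evaluation criteria of Eqs. (15) to (20) can be written compactly in Python. The sketch below is illustrative only: it uses the standard F-measure, it assumes that ranks in Eq. (19) are assigned in ascending order of predicted score with tied scores sharing their average rank, and the function names and toy inputs are not from the paper.

```python
from math import sqrt

def precision(tp, fp):
    return tp / (tp + fp)                                   # Eq. (15)

def recall(tp, fn):
    return tp / (tp + fn)                                   # Eq. (16)

def f_measure(prec, rec, b=1.0):
    return (1 + b * b) * prec * rec / (b * b * prec + rec)  # Eq. (17)

def g_mean(tp, fn, tn, fp):
    return sqrt(tp / (tp + fn) * tn / (tn + fp))            # Eq. (18)

def auc_by_rank(scores, labels):
    """Eq. (19): rank-sum form of the AUC (labels are 1 for serial, 0 otherwise)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of samples tied with sample order[i].
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1      # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    p = sum(labels)
    n = len(labels) - p
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - p * (p + 1) / 2) / (n * p)

def lift(tp, fp, p, n):
    return (tp / (tp + fp)) / (p / (p + n))                 # Eq. (20)

print(auc_by_rank([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

For the top 5-th percentile lift described above, `tp` and `fp` would be counted only among the 5% of case pairs with the highest predicted probability.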


5.3 Results and discussion

In this section, three experiments are discussed. First, we study the effect of the parameter ε (which specifies the number of serial case pairs that need to be covered) on the number of BHKs and BHK weights. Second, the CB method is compared with the other blocking techniques, and the performance with different ε parameter values is analysed. Third, several popular classification algorithms are used to evaluate the detection effect after using CB. The experiment uses 10-fold cross-validation and is repeated 20 times, and the final average value is used as the performance result. In each fold, part of the case set is used for training, and the rest is used for testing. The training set generates labelled case pairs to learn the blocking scheme and classifier, and cases of the test set are treated as unsolved cases to form unlabelled case pairs.

5.3.1 Experiment 1: The effects of parameters

This section studies the changes in the number and weight of BHKs with different values of the parameter ε. We change the value of ε by setting ε = ρ ∙ α, where α is the number of serial case pairs in the training fold. The value of ρ varies from 0.9 to 1.0, with ρ = 0.9 meaning that the blocking scheme needs to cover 90% of serial case pairs.

We analyse the number and weight of BHKs for the dataset containing all cases (Fig. 6a). With the growth of ρ, the number of BHKs gradually increases. This is because, to cover more serial case pairs, more BHKs are needed. When the number of BHKs reaches 59, all serial case pairs have been covered. In addition, there is no strong BHK at the beginning. This is because no BHK is included by all weak blocking schemes. As the number of BHKs increases, strong BHKs appear at ρ = 0.99.

The weight of each BHK ranges from 0 to 1. Figure 6b presents the weight distribution of BHKs generated by different values of ρ. The X-axis indicates the weight, and the Y-axis indicates the BHK frequency under the weight interval. Each curve is a normal distribution curve generated by the mean and standard deviation of the BHK weights. Because some ρ values produce the same BHKs, the number of curves is not eleven. For example, ρ = 0.92 and ρ = 0.93 have the same BHKs, so only one distribution curve is drawn. In Fig. 6b, most BHK weights are small, and only a few have relatively large weights. The weight distributions of BHKs are similar with different parameters, but as ρ increases, the number of BHKs with small weights also increases. This is because, to cover more serial case pairs, more BHKs are added to the blocking scheme, and then the number of BHKs with small weights grows.

5.3.2 Experiment 2: The performance of the blocking methods

We compare CB with the blocking methods Adaptive [45], Fisher [46], BL [48], ICSKD [51] and MCBS [47]. The descriptions of these methods are as follows:

Adaptive: It uses PQ to evaluate blocks, sorts the PQ values of each block in descending order, and adds them to the blocking scheme one by one until the covered matching pairs exceed the threshold.

Fig. 6  Information on BHKs with different values of ρ. a The number of BHKs with different ρ; ‘strong’ is the strong BHKs, ‘candidates’ is the candidate BHKs, and ‘total’ is the total number of BHKs. b The weight distributions of BHKs with different values of ρ (x-axis: weight; y-axis: number of BHKs)


Fisher: It designs a Fisher score to sort eligible blocks and find the excellent blocks. The algorithm terminates when the covered matching pairs in the blocking scheme exceed a certain percentage.

BL: It assigns the FM of a block as its weight and selects the top 3 features as a blocking filter. A record will be assigned to an existing block if it shares the same value with the representative record on the top 3 features. The first record is a representative record, and records that do not match existing representative records will become new representative records. Blocks are formed based on representative records.

ICSKD: It iteratively discovers blocks with high discriminability, and the block needs to have a high value in the comprehensive indicator (FL) of discriminability and coverage. High discriminability means that few instances have the same value on this feature and it can get a suitable reduction ratio. High coverage means that many instances have a value on this feature. There are two thresholds, η for discriminability and ψ for FL. We test this method with η values {0.02, 0.04, 0.06, 0.08, 0.1} and ψ values {0.05, 0.1, 0.15, 0.2, 0.25, 0.3}.

MCBS: It exploits the co-occurrence of entities among the generated blocks for pruning, splitting and merging blocks to control the size of blocks. It has two different weighting schemes (ARCS and ECBS) to calculate the co-occurrence of entities. Its default algorithm can produce the highest PC results, so we adopt it in the experiment. There are two important parameters, Smin and Smax. We evaluate the ARCS and ECBS weighting schemes with Smin values {5, 10, 15, 20, 25, 30} and Smax values {40, 50, 60, 70, 80, 90, 100}.

Table 5  Comparison of blocking methods

Method    PC     RR     F_PC,RR  PQ
Fisher    .9747  .3963  .5635    .0042
Adaptive  .9843  .4507  .6183    .0046
BL        .9715  .1652  .2810    .0028
ICSKD     .8286  .6208  .7051    .0053
ARCS      .9933  .5495  .7074    .0053
ECBS      .9933  .5702  .7243    .0055
CB        .9820  .6487  .7785    .0071

Fig. 7  Box plot of blocking methods (Fisher, Adaptive, ICSKD, BL, ARCS, ECBS and CB) on PC, RR and F_PC,RR

The effects of these blocking strategies are shown in Table 5. To ensure that more serial crime pairs are covered, the table presents the results when PC reaches the highest in these methods. The PC of CB is lower than that of Adaptive, ARCS and ECBS, but its RR, F_PC,RR and PQ have the best performance. We have drawn a box plot of these blocking methods for the 20-repetition 10-fold cross-validation test

Fig. 8  Performance of blocking methods (CB, Fisher and Adaptive) with different values of ρ. a RR versus PC. b F_PC,RR. c PQ


Table 6  Information on 26 crime sets. Nc represents the number of crimes. Nserial is the number of serial crime pairs and Nnonserial is that of the nonserial crime pairs. IR indicates the imbalance ratio (Nnonserial / Nserial)

Nc    Nserial  Nnonserial  IR        Nc    Nserial  Nnonserial  IR
c120  158      6982        44.19     c250  158      30,967      195.99
c130  158      8227        52.07     c260  158      33,512      212.10
c140  158      9572        60.58     c270  158      36,157      228.84
c150  158      11,017      69.73     c280  158      38,902      246.22
c160  158      12,562      79.51     c290  158      41,747      264.22
c170  158      14,207      89.92     c300  158      44,692      282.86
c180  158      15,952      100.96    c310  158      47,737      302.13
c190  158      17,797      112.64    c320  158      50,882      322.04
c200  158      19,742      124.95    c330  158      54,127      342.58
c210  158      21,787      137.89    c340  158      57,472      363.75
c220  158      23,932      151.47    c350  158      60,917      385.55
c230  158      26,177      165.68    c360  158      64,462      407.99
c240  158      28,522      180.52    c364  158      65,908      417.14
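
The entries of Table 6 follow directly from the number of cases, because every crime set keeps the same 158 serial case pairs and the remaining pairs are nonserial. A minimal sketch of this arithmetic is given below; the function name is illustrative and not part of the paper.

```python
from math import comb

N_SERIAL = 158  # number of serial case pairs, identical in every crime set

def crime_set_stats(n_cases, n_serial=N_SERIAL):
    """Return (Nnonserial, IR) for a crime set with n_cases cases."""
    n_pairs = comb(n_cases, 2)          # all unordered case pairs
    n_nonserial = n_pairs - n_serial
    return n_nonserial, round(n_nonserial / n_serial, 2)

for nc in (120, 300, 364):
    print(f"c{nc}", *crime_set_stats(nc))
```

The quadratic growth of `comb(n_cases, 2)` against a fixed number of serial pairs is what drives the imbalance ratio from 44.19 at 120 cases to 417.14 at 364 cases.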

(Fig. 7). ARCS and ECBS present a similar situation: they perform better than other methods on PC but perform poorly on RR. The ICSKD is similar to CB on RR, but it is lower on PC. Among Adaptive, Fisher and CB, the distribution of them on PC is approximately the same. For the RR and F_PC,RR indicators, the CB is more concentrated and higher. Overall, CB sacrifices a little ability to cover serial case pairs but dramatically reduces the number of nonserial case pairs. It can be applied to large-scale non-serious cases, such as minor robbery and theft, because the detection of these large-scale case series requires blocking technology.

Because CB, Adaptive, and Fisher have the parameter that controls the proportion of serial crime pairs, we analyze their performance of PC, RR, F_PC,RR, and PQ with different values of ρ, as shown in Fig. 8. In Fig. 8a, the PC versus RR curves of the three methods have a similar shape. When the PC becomes high, the corresponding RR will decrease. This is because a higher PC makes the blocking scheme cover more serial case pairs, and fewer case pairs are excluded. However, the curve of CB is above those of the other two methods. For the same PC, CB can obtain a higher RR. With the increase in ρ, the three methods gradually decrease in F_PC,RR and PQ (Fig. 8b, c). With the increase in ρ, the number of serial case pairs covered by the blocking scheme increases, and more nonserial case pairs are also covered. However, the curve of CB is above those of the other two methods.

Fig. 9  Imbalance ratio (IR) of using CB and not using CB in the training set and the test set, plotted against Nc. ‘train set-Normal’ means the imbalance ratio of not using CB in the train set. ‘train set-CB’ means the imbalance ratio of using CB in the train set. ‘test set-Normal’ means the imbalance ratio of not using CB in the test set. ‘test set-CB’ means the imbalance ratio of using CB in the test set

5.3.3 Experiment 3: The performance of serial crime detection

The CB can effectively reduce the comparison of case pairs, but the detection effect is more important. To evaluate the effect of CB on serial crime detection, we constructed various crime sets based on the original case set. These crime sets are subsets of the original case set. They contain all serial crimes and a portion of the nonserial crimes. When forming case pairs, the number of serial case pairs in each case set is the same, and the number of nonserial case pairs increases with the number of cases. The details of these crime sets are shown in Table 6; for example, c300 means that the case set contains 300 cases. For the detection algorithm, we selected five popular machine learning classification methods and compared the classification performance of using CB and not using CB. The algorithms were developed with the Python package scikit-learn [56]. We tested several configurations of algorithms and selected the best settings, and they are briefly described as follows:


Fig. 10  Precision comparison of using and not using CB with different algorithms (panels: RF, KNN, GDBT, NN, LR, Stack; x-axis: Nc). ‘Normal’ means the result of not using CB; ‘CB’ means the result of using combined blocking

Fig. 11  Recall comparison of using and not using CB with different algorithms (panels: RF, KNN, GDBT, NN, LR, Stack; x-axis: Nc). ‘Normal’ means the result of not using CB; ‘CB’ means the result of using combined blocking


Fig. 12  FM comparison of using and not using CB with different algorithms (panels: RF, KNN, GDBT, NN, LR, Stack; x-axis: Nc). ‘Normal’ means the result of not using CB; ‘CB’ means the result of using combined blocking

Fig. 13  GM comparison of using and not using CB with different algorithms (panels: RF, KNN, GDBT, NN, LR, Stack; x-axis: Nc). ‘Normal’ means the result of not using CB; ‘CB’ means the result of using combined blocking


Table 7  AUC comparison of using and not using CB with different algorithms. ‘Normal’ means the result of not using CB; ‘CB’ means the result of using combined blocking; the better results in ‘Normal’ and ‘CB’ are emphasized in bold

Nc RF-Normal RF-CB KNN-Normal KNN-CB GDBT-Normal GDBT-CB NN-Normal NN-CB LR-Normal LR-CB Stack-Normal Stack-CB
120 .9573 .9625 .8793 .8910 .9695 .9725 .9584 .9347 .9363 .9365 .9696 .9705
130 .9568 .9603 .8781 .8934 .9708 .9717 .9608 .9407 .9389 .9400 .9691 .9695
140 .9569 .9590 .8769 .8867 .9724 .9737 .9636 .9455 .9416 .9417 .9725 .9733
150 .9571 .9645 .8743 .8881 .9719 .9747 .9652 .9513 .9450 .9453 .9741 .9746
160 .9602 .9632 .8783 .8888 .9703 .9751 .9667 .9527 .9479 .9480 .9755 .9761
170 .9563 .9628 .8736 .8878 .9702 .9760 .9672 .9570 .9496 .9507 .9753 .9764
180 .9559 .9607 .8707 .8828 .9666 .9725 .9657 .9535 .9466 .9497 .9737 .9743
190 .9546 .9652 .8630 .8882 .9686 .9733 .9670 .9537 .9471 .9499 .9754 .9763
200 .9539 .9638 .8658 .8876 .9704 .9769 .9716 .9615 .9538 .9568 .9776 .9788
210 .9497 .9607 .8641 .8817 .9681 .9743 .9716 .9624 .9517 .9546 .9770 .9776
220 .9467 .9610 .8644 .8832 .9331 .9753 .9727 .9631 .9521 .9550 .9768 .9783
230 .9473 .9573 .8642 .8803 .9452 .9732 .9707 .9599 .9518 .9519 .9774 .9773
240 .9476 .9613 .8617 .8805 .9521 .9736 .9757 .9655 .9570 .9585 .9801 .9804
250 .9470 .9584 .8590 .8777 .9489 .9752 .9754 .9660 .9600 .9620 .9799 .9802
260 .9443 .9563 .8517 .8758 .9467 .9747 .9756 .9670 .9596 .9604 .9798 .9808
270 .9402 .9542 .8476 .8628 .9401 .9712 .9726 .9624 .9561 .9552 .9778 .9778
280 .9397 .9529 .8450 .8609 .9125 .9742 .9741 .9643 .9564 .9555 .9792 .9792
290 .9383 .9510 .8433 .8679 .9001 .9668 .9736 .9710 .9582 .9579 .9780 .9781
300 .9383 .9511 .8455 .8661 .8816 .9707 .9758 .9757 .9582 .9584 .9798 .9797
310 .9372 .9542 .8426 .8635 .8897 .9694 .9774 .9760 .9624 .9634 .9816 .9813
320 .9372 .9490 .8419 .8690 .8811 .9751 .9772 .9766 .9623 .9638 .9823 .9819
330 .9363 .9474 .8368 .8598 .8480 .9736 .9747 .9759 .9629 .9635 .9809 .9808
340 .9329 .9427 .8392 .8585 .8292 .9642 .9714 .9703 .9564 .9557 .9773 .9773
350 .9379 .9504 .8417 .8558 .8317 .9683 .9775 .9764 .9607 .9628 .9823 .9820
360 .9345 .9424 .8382 .8559 .8076 .9363 .9733 .9714 .9558 .9545 .9797 .9792
364 .9278 .9293 .8304 .8339 .7841 .9554 .9597 .9601 .9237 .9218 .9652 .9690

• Random forest (RF): RF consists of several decision trees, and random attribute selection is designed for the formation of decision trees. In our experiment, the number of trees is 200.
• K-Nearest Neighbour (KNN): The idea is to set the class label of a sample to the majority class of its k nearest neighbours. In our experiment, k = 5.
• Gradient Boosting Decision Tree (GDBT): It expands and enhances classification trees based on gradient boosting. The residual of a decision tree is the basis for training the next tree. In our experiment, the number of trees was 200.
• Neural Network (NN): NN processes information by adjusting the network connection parameters. We set the number of hidden layers of the network to 100.
• Logistic Regression (LR): LR uses logistic functions in statistical techniques to classify samples. In our experiments, the l1 penalty was used.

In addition to the above five algorithms, we averaged the predicted probabilities of these five algorithms to form a stacked ensemble (Stack). We use Precision, Recall, FM, GM, AUC and Lift for comparison. It should be mentioned that the blocking method may reduce the number of serial case pairs. To enable fair comparison, the denominator of the recall formula is the actual number of serial case pairs before blocking (Fig. 9).

In this experiment, the parameter ρ of CB was set to 1.0. The results on Precision, Recall, FM, and GM are shown in Figs. 10, 11, 12, and 13. Algorithms after using CB outperform the results of not using CB on most data sets. The CB also has a positive effect on the Stack, and the Stack has the best performance on many crime sets in terms of Precision. There is a large gap between the number of samples of the minority and the majority in the imbalanced data set, which hinders the algorithm from learning minority patterns [57]. The minority is more likely to be misclassified. The imbalance of case pairs can easily lead to the misclassification of serial case pairs. As presented in Figs. 10, 11, 12, and 13, with the growth of cases, the IR is increasing (see Fig. 9), and the classification performance shows a downward trend after applying CB. It shows that the data imbalance harms classification.

Alternately, the use of CB improves the detection effect. In different datasets, the performance is relatively stable, and


Table 8  Lift comparison of using and not using CB with different algorithms. ‘Normal’ means the result of not using CB; ‘CB’ means the result of using combined blocking; the better results in ‘Normal’ and ‘CB’ are emphasized in bold

Nc RF-Normal RF-CB KNN-Normal KNN-CB GDBT-Normal GDBT-CB NN-Normal NN-CB LR-Normal LR-CB Stack-Normal Stack-CB


120 13.793 13.791 10.417 10.361 13.448 13.593 12.838 11.922 11.458 11.514 13.718 13.685
130 14.010 13.999 10.947 10.854 13.792 13.957 13.335 12.372 11.896 11.861 13.974 13.876
140 15.243 15.291 12.073 11.951 15.208 15.293 14.595 13.467 12.427 12.507 15.222 15.270
150 15.540 15.533 12.525 12.266 15.355 15.754 14.897 14.278 12.796 12.949 15.572 15.623
160 16.468 16.506 13.311 13.260 15.998 16.336 15.780 15.011 13.351 13.399 16.390 16.393
170 16.891 16.881 13.648 13.576 16.379 16.789 15.971 15.926 13.821 13.906 16.702 16.923
180 17.048 17.096 13.785 13.753 16.522 17.085 16.228 16.241 13.866 13.951 16.917 17.051
190 16.786 16.796 13.334 13.408 16.206 16.584 15.886 15.675 13.988 13.947 16.601 16.699
200 17.491 17.469 14.260 14.526 17.021 17.515 16.730 16.757 14.733 14.746 17.403 17.563
210 17.612 17.642 14.367 14.430 17.125 17.470 16.900 17.024 14.808 14.959 17.597 17.688
220 17.897 17.881 14.626 15.252 16.575 17.959 17.097 17.221 15.215 15.129 18.044 18.081
230 17.747 17.707 14.535 14.735 16.582 17.456 16.735 16.747 14.481 14.625 17.608 17.643
240 17.952 17.972 14.550 15.185 16.944 17.673 17.268 17.257 15.332 15.296 18.073 18.144
250 18.073 18.090 14.444 15.244 17.179 17.998 17.354 17.363 16.005 16.236 18.247 18.260
260 18.011 18.095 14.160 15.130 17.257 17.923 17.477 17.535 16.026 16.091 18.324 18.328
270 17.754 17.892 14.003 14.512 16.827 17.685 17.083 17.265 15.471 15.317 17.952 17.975
280 17.789 17.914 13.881 14.553 16.375 17.591 17.206 17.175 15.568 15.507 17.996 18.117
290 17.854 17.890 13.805 14.831 16.107 17.461 17.454 17.304 15.960 16.080 18.247 18.171
300 17.981 17.996 13.893 14.766 15.824 17.795 17.522 17.394 16.020 15.881 18.322 18.200
310 18.056 18.231 13.760 14.641 16.171 17.982 17.716 17.641 16.477 16.569 18.505 18.506
320 18.068 18.019 13.728 14.847 16.163 18.170 17.599 17.616 16.366 16.534 18.440 18.384
330 17.995 18.082 13.524 14.491 15.402 17.987 17.556 17.630 16.541 16.493 18.418 18.421
340 17.856 17.866 13.622 14.442 14.770 17.505 17.442 17.391 15.966 15.862 18.315 18.249
350 18.072 18.263 13.710 14.314 15.019 18.006 17.805 17.740 16.365 16.447 18.568 18.568
360 18.004 18.034 13.579 14.324 14.425 16.926 17.579 17.453 15.834 15.849 18.430 18.414
364 17.468 17.400 13.325 13.484 13.441 16.630 16.732 16.520 13.976 13.869 17.619 17.801

the downward trend has also been improved. In Fig. 9, the IR of the training set and test set decreased after using CB. With the increase in cases, the reduction in IR has soared. In general, the IR change of the data set using CB is even smaller than that without CB. Because of the reduction trend of IR, the classification performance can be improved and is relatively stable. The improvement of Precision and Recall shows that after blocking, the algorithm can find more serial case pairs correctly. This is because when applying the CB method, the imbalance problem is alleviated.

Tables 7 and 8 present the results of AUC and Lift. Except for NN, after using CB, the results of RF, KNN, GDBT, and LR are improved on most crime sets, and the predictive performance of the final Stack algorithm is also improved. On many crime sets, the results of Stack using CB are also superior to other algorithms in terms of AUC and Lift. According to Eq. (20), the denominator of Lift is the proportion of positive samples, which is small in crime sets because of the class imbalance. As the number of cases increases, the imbalance ratio gradually increases and the Lift of algorithms also becomes larger.

We also draw the classification difference between using and not using CB in Fig. 14. For different algorithms, most of the classification differences are positive, which means that the use of CB can improve the classification effect. In addition, the more cases there are in the dataset, the more significant the improvement is. This is because the imbalance ratio is larger if the dataset contains more cases. As the number of cases increases, CB can eliminate more case pairs and significantly diminish IR (Fig. 9), and then the improvement becomes prominent.

To show the significant difference between using CB and not using CB, the Wilcoxon signed-rank test [58] is applied to test the precision, recall, FM and GM in different classification algorithms. As shown in Table 9, the p values are less than 0.05, which means that indicators after using CB are significantly better than those without CB.


Fig. 14  Difference between using and not using CB with different algorithms (panels: Precision, Recall, FM, GM; curves: RF, KNN, GDBT, NN, LR, Stack; x-axis: Nc; y-axis: Difference). Taking GM as an example, the value of the difference is calculated by the formula: Difference = GM_CB - GM_Normal. ‘GM_Normal’ means the result of not using CB. ‘GM_CB’ means the result of using combined blocking

6 Conclusion

In this work, we propose a combined blocking method to efficiently identify serial cases and reduce comparisons of case pairs. It defines criminal behaviours as BHKs and consists of four weak blocking schemes, i.e. four subsets of all BHKs. The blocking criteria PC, RR, PQ and block size are used to form each weak blocking scheme. In the final strong blocking scheme, each BHK is assigned a score according to its rarity (which is measured by the number of case pairs it covers), and whether to compare a case pair or not is determined by the sum of the scores of the BHKs covering it. The CB method uses multiple blocking criteria, but it does not integrate them into a comprehensive index, which provides a new idea of multi-criteria blocking.

The practical implication of the CB method is that it takes the rarity of criminal behaviours into account, and it is able to greatly reduce the number of pairwise calculations and improve the efficiency of large-scale serial crime detection. A real-world robbery case set was used to evaluate the experimental effect. In the experiments, 98.2% of the serial case pairs can be retained and 64.87% of the case pairs are eliminated. This result is better than our comparison methods Fisher and Adaptive. On the other hand, the CB is embedded in a supervised machine learning framework, with RF, KNN, GDBT, NN and LR as the baseline classification algorithms. Based on the predictions of these algorithms, we also build a stacked ensemble method. Some cases were selected randomly from the original case set to form 26 case sets with different numbers of cases. They are used to evaluate the performance of serial crime detection. Compared with the classification without CB, the imbalance ratio of data after using CB is greatly alleviated, and the reduction becomes more and more obvious as the number of cases increases. After using CB, the detection effect is better on precision, recall, FM and GM. Precision is increased by 30% at most, and the maximum improvement of recall, FM and GM is 20%. In the predictive performance, the algorithms except NN have improved AUC and Lift after using CB.

In our future research, we will attempt to design a blocking criterion that is more suitable for crime series detection.


While reducing pairwise comparisons, it will also aim to improve the separability of data so that the algorithm can find more crime series.

Table 9  The Wilcoxon signed-rank test of Precision, Recall, FM, and GM in different algorithms

Measure    Method  R+   R−   p value   Hypothesis (0.05)
Precision  RF      350  1    9.34E-06  Rejected
Precision  KNN     340  11   2.94E-05  Rejected
Precision  GDBT    351  0    8.30E-06  Rejected
Precision  NN      311  40   5.79E-04  Rejected
Precision  LR      340  11   2.94E-05  Rejected
Precision  Stack   342  9    2.35E-05  Rejected
Recall     RF      351  0    8.30E-06  Rejected
Recall     KNN     351  0    8.30E-06  Rejected
Recall     GDBT    351  0    8.30E-06  Rejected
Recall     NN      272  79   1.42E-02  Rejected
Recall     LR      350  1    9.34E-06  Rejected
Recall     Stack   342  9    2.35E-05  Rejected
FM         RF      351  0    8.30E-06  Rejected
FM         KNN     351  0    8.30E-06  Rejected
FM         GDBT    351  0    8.30E-06  Rejected
FM         NN      295  56   2.40E-03  Rejected
FM         LR      350  1    9.34E-06  Rejected
FM         Stack   345  6    1.67E-05  Rejected
GM         RF      351  0    8.30E-06  Rejected
GM         KNN     351  0    8.30E-06  Rejected
GM         GDBT    351  0    8.30E-06  Rejected
GM         NN      277  74   9.94E-03  Rejected
GM         LR      350  1    9.34E-06  Rejected
GM         Stack   342  9    2.35E-05  Rejected

References

1. Tonkin M et al (2017) Using offender crime scene behavior to link stranger sexual assaults: a comparison of three statistical approaches. J Crim Just 50:19–28. https://doi.org/10.1016/j.jcrimjus.2017.04.002
2. C. M. d. M. Mota, C. J. J. d. Figueiredo, D. V. e. S. Pereira (2020) Identifying areas vulnerable to homicide using multiple criteria analysis and spatial analysis. Omega 102211. https://doi.org/10.1016/j.omega.2020.102211
3. Chohlas-Wood A, Levine ES (2019) A recommendation engine to aid in identifying crime patterns. INFORMS Journal on Applied Analytics. https://doi.org/10.1287/inte.2019.0985
4. Isafiade OE, Bagula AB (2020) Series mining for public safety advancement in emerging smart cities. Future Generation Computer Systems 108:777–802. https://doi.org/10.1016/j.future.2020.03.002
5. Porter MD (2016) A statistical approach to crime linkage. Am Stat 70(2):152–165. https://doi.org/10.1080/00031305.2015.1123185
6. Hazelwood RR, Warren JI (2004) Linkage analysis: modus operandi, ritual, and signature in serial sexual crime. Aggress Violent Behav 9(3):307–318. https://doi.org/10.1016/j.avb.2004.02.002
7. Woodhams J et al (2018) Linking serial sexual offences: Moving towards an ecologically valid test of the principles of crime linkage. Legal and Criminological Psychology 24:12S–140S
8. Canter D, Hammond L A comparison of the efficacy of different decay functions in geographical profiling for a sample of US serial killers. Journal of Investigative Psychology and Offender Profiling 3(2):91–103. https://doi.org/10.1002/jip.45
9. Wang T, Rudin C, Wagner D, Sevieri R (2015) Finding patterns with a rotten core: data mining for crime series with cores. Big Data 3(1):3–21. https://doi.org/10.1089/big.2014.0021
10. Markson L, Woodhams J, Bond JW (2010) Linking serial residential burglary: comparing the utility of modus operandi behaviours, geographical proximity, and temporal proximity. Journal of Investigative Psychology and Offender Profiling. https://doi.org/10.1002/jip.120
11. Woodhams J, Hollin CR, Bull R (2007) The psychology of linking crimes: a review of the evidence. Leg Criminol Psychol 12(2):233–249. https://doi.org/10.1348/135532506x118631
12. Burrell A, Bull R, Bond J (2012) Linking personal robbery offences using offender behaviour. J Investig Psychol Offender Profiling 9(3):201–222. https://doi.org/10.1002/jip.1365
13. Chi H, Lin Z, Jin H, Xu B, Qi M (2017) A decision support system for detecting serial crimes. Knowl-Based Syst 123:88–101. https://doi.org/10.1016/j.knosys.2017.02.017
14. Phua C, Gayler R, Lee V, Smith-Miles K (2009) On the communal analysis suspicion scoring for identity crime in streaming credit applications. Eur J Oper Res 195(2):595–612. https://doi.org/10.1016/j.ejor.2008.02.015
15. Gee D, Belofastov A (2007) Profiling sexual fantasy. In: Kocsis RN (ed) Criminal profiling: international theory, research, and practice. Humana Press, Totowa, NJ, pp 49–71. https://doi.org/10.1007/978-1-60327-146-2_3
16. Borg A, Boldt M, Lavesson N, Melander U, Boeva V (2014) Detecting serial residential burglaries using clustering. Expert Syst Appl 41(11):5252–5266. https://doi.org/10.1016/j.eswa.2014.02.035
17. Chen L, Gu W, Tian X, Chen G (2019) AHAB: aligning heterogeneous knowledge bases via iterative blocking. Inf Process Manag 56(1):1–13. https://doi.org/10.1016/j.ipm.2018.08.006
18. O'Hare K, Jurek A, de Campos C (2018) A new technique of selecting an optimal blocking method for better record linkage. Inf Syst 77:151–166. https://doi.org/10.1016/j.is.2018.06.006
19. Christen P (2011) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555
20. Lin S, Brown DE (2006) An outlier-based data association method for linking criminal incidents. Decis Support Syst 41(3):604–615. https://doi.org/10.1016/j.dss.2004.06.005
21. Borg A, Boldt M (2016) Clustering residential burglaries using modus operandi and spatiotemporal information. International Journal of Information Technology & Decision Making 15(01):23–42. https://doi.org/10.1142/s0219622015500339
22. Zhu S, Xie Y (2019) Crime event embedding with unsupervised feature selection. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp 3922–3926
23. Bennell C, Canter DV (2002) Linking commercial burglaries by modus operandi: tests using regression and ROC analysis. Sci Justice 42(3):153–164. https://doi.org/10.1016/s1355-0306(02)71820-0
24. Tonkin M, Grant T, Bond JW (2008) To link or not to link: a test of the case linkage principles using serial car theft data. 5(1–2):59–77. https://doi.org/10.1002/jip.74
25. Tonkin M, Woodhams J, Bull R, Bond JW, Santtila P (2012) A comparison of logistic regression and classification tree analysis


for Behavioural case linkage. J Investig Psychol Offender Profiling 9(3):235–258. https://doi.org/10.1002/jip.1367
26. Ku C-H, Leroy G (2014) A decision support system: automated crime report analysis and classification for e-government. Gov Inf Q 31(4):534–544. https://doi.org/10.1016/j.giq.2014.08.003
27. Reich BJ, Porter MD (2015) Partially supervised spatiotemporal clustering for burglary crime series identification. Journal of the Royal Statistical Society Series A (Statistics in Society) 178(2):465–480. https://doi.org/10.1111/RSSA.12076
28. Goala S, Dutta P (2018) A fuzzy multicriteria decision-making approach to crime linkage. International Journal of Information Technologies and Systems Approach 11(2):31–50. https://doi.org/10.4018/ijitsa.2018070103
29. Albertetti F, Cotofrei P, Grossrieder L, Ribaux O, Stoffel K (2013) The CriLiM methodology: crime linkage with a fuzzy MCDM approach. In: Proceedings - 2013 European intelligence and security informatics conference, EISIC, vol 2013, pp 67–74. https://doi.org/10.1109/EISIC.2013.17
30. Qazi N, Wong BLW (2019) An interactive human centered data science approach towards crime pattern analysis. Information Processing & Management 56(6):102066. https://doi.org/10.1016/j.ipm.2019.102066
31. Brown DE, Hagen S (2003) Data association methods with applications to law enforcement. Decis Support Syst 34(3):369–378
32. Boriah S, Chandola V, Kumar V (2008) Similarity Measures for Categorical Data: A Comparative Evaluation. In: Proceedings of the 2008 SIAM International Conference on Data Mining, pp 243–254. https://doi.org/10.1137/1.9781611972788.22
33. Bennell C, Jones NJ, Melnyk T (2009) Addressing problems with traditional crime linking methods using receiver operating characteristic analysis. Leg Criminol Psychol 14(2):293–310. https://doi.org/10.1348/135532508x349336
34. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient Estimation of Word Representations in Vector Space. arXiv e-prints. https://ui.adsabs.harvard.edu/#abs/2013arXiv1301.3781M
35. Tonkin M, Lemeire J, Santtila P, Winter JM (2019) Linking property crime using offender crime scene behaviour: A comparison of methods. Journal of Investigative Psychology and Offender Profiling. https://doi.org/10.1002/jip.1525
36. Papadakis G, Skoutas D, Thanos E, Palpanas T (2020) Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Computing Surveys 53(2):1–42. https://doi.org/10.1145/3377455
37. Fellegi I, Sunter A (1969) A Theory for Record Linkage. Journal of the American Statistical Association 64:1183–1210. https://doi.org/10.1080/01621459.1969.10501049
38. Whang SE, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H (2009) Entity resolution with iterative blocking. In: Presented at the international conference on Management of Data. https://doi.org/10.1145/1559845.1559870
39. Gravano L (2001) Approximate string joins in a database (almost) for free. In: VLDB 01: international conference on very large data bases
40. Jin L, Li C, Mehrotra S (2003) Efficient record linkage in large data sets. In: Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings, pp 137–146
41. Hernández MA, Stolfo SJ (1995) The merge/purge problem for large databases. ACM SIGMOD Rec 24(2):127–138
42. Aizawa A, Oyama K (2005) A fast linkage detection scheme for multi-source information integration. In: International Workshop on Challenges in Web Information Retrieval and Integration, pp 30–39
43. Allam A, Skiadopoulos S, Kalnis P (2018) Improved suffix blocking for record linkage and entity resolution. Data Knowl Eng 117:98–113. https://doi.org/10.1016/j.datak.2018.07.005
44. O'Hare K, Jurek-Loughrey A, Campos C (2019) A Review of Unsupervised and Semi-supervised Blocking Methods for Record Linkage. In: Linking and Mining Heterogeneous and Multi-view Data. Springer, pp 79–105. https://doi.org/10.1007/978-3-030-01872-6_4
45. Bilenko M, Kamath B, Mooney RJ (2006) Adaptive blocking: Learning to scale up record linkage. In: Sixth International Conference on Data Mining (ICDM'06). IEEE, pp 87–96
46. Kejriwal M, Miranker DP (2013) An unsupervised algorithm for learning blocking schemes. In: 2013 IEEE 13th International Conference on Data Mining. IEEE, pp 340–349
47. Nascimento DC, Pires CES, Mestre DG (2019) Exploiting block co-occurrence to control block sizes for entity resolution. Knowl Inf Syst 62(1):359–400. https://doi.org/10.1007/s10115-019-01347-0
48. O'Hare K, Jurek-Loughrey A, de Campos C (2019) An unsupervised blocking technique for more efficient record linkage. Data Knowl Eng 122:181–195
49. Michelson M, Knoblock CA (2006) Learning blocking schemes for record linkage. In: AAAI, vol 6, pp 440–445
50. Ramadan B, Christen P (2015) Unsupervised blocking key selection for real-time entity resolution. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp 574–585
51. Song D, Luo Y, Heflin J (2017) Linking heterogeneous data in the semantic web using scalable and domain-independent candidate selection. IEEE Trans Knowl Data Eng 29(1):143–156. https://doi.org/10.1109/tkde.2016.2606399
52. Carr RD, Doddi S, Konjevod G, Marathe M (2000) On the red-blue set cover problem. In: Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pp 345–353
53. Li Y-S, Qi M-L (2019) An approach for understanding offender modus operandi to detect serial robbery crimes. Journal of Computational Science 36:101024. https://doi.org/10.1016/j.jocs.2019.101024
54. Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186
55. De Caigny A, Coussement K, De Bock KW (2018) A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees. Eur J Oper Res 269(2):760–772. https://doi.org/10.1016/j.ejor.2018.02.009
56. Pedregosa F et al (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12(85):2825–2830
57. Su C, Ju S, Liu Y, Yu Z (2015) Improving random Forest and rotation Forest for highly imbalanced datasets. Intelligent Data Analysis 19(6):1409–1432. https://doi.org/10.3233/ida-150789
58. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

Publisher's note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
