

Information Sciences 546 (2021) 977–995


Estimating the number of clusters in a ranking data context


Wilson Calmon *, Mariana Albi
Institute of Mathematics and Statistics, Fluminense Federal University, Niteroi 24210201, Brazil

Article history: Received 19 December 2019; Received in revised form 26 August 2020; Accepted 24 September 2020; Available online 1 October 2020.

Keywords: Number of clusters; Ranking data; Plackett–Luce; Clustering; Ordinal classification

Abstract

This study introduces two methods for estimating the number of clusters specially designed to identify the number of groups in a finite population of objects or items ranked by several judges, under the assumption that these judges belong to a homogeneous population. The proposed methods are both based on a hierarchical version of the classical Plackett–Luce model in which the number of clusters is set as an additional parameter. These methods do not require continuous score data to be available or restrict the number of clusters to be greater than one or less than the total number of objects, thereby enabling their application in a wide range of scenarios. The results of a large simulation study suggest that the proposed methods outperform well-established methodologies (Calinski & Harabasz, gap, Hartigan, Krzanowski & Lai, jump, and silhouette) as well as some recently proposed approaches (instability, quantization error modeling, slope, and utility). They realize the highest percentages of correct estimates of the number of clusters and the smallest errors compared with these well-established methodologies. We illustrate the proposed methods by analyzing a ranking dataset obtained from Formula One motor racing.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

In many areas such as medicine, psychology, economics, law, and engineering, cluster analysis has been used to partition
a set of k objects (e.g., people’s feelings, financial assets, judicial decisions, production processes, people, and inanimate
objects) into g groups or clusters [20,14,25]. Since the number of clusters g is typically required as an input for clustering
algorithms (see [36]), various methods for estimating g have been proposed in the last five decades, including several clas-
sical methods [5,22,41,28,45,43], along with more recent approaches [31,15,29,19,10,26,32,48,36,49]. However, most meth-
ods were originally designed to analyze continuous data, and few such approaches can accurately estimate the number of
clusters in a dataset if the data context differs from that originally postulated, especially because there is no universally
accepted definition of a cluster [19,10]. Based on the foregoing, we introduce a novel method for estimating the number
of clusters for ranking data (RD). Significant improvements have been realized by using methods specifically developed
for addressing RD [18,35,1]; however, to the best of our knowledge, no approach for estimating the number of clusters
has thus far been proposed for RD.
RD are plentiful and typically arise from the desire to rank a set of objects (items or individuals) [1]. They are produced
and analyzed in many fields, including sporting competitions and economics/finance [16,23,42,30,39]. However, although

* Corresponding author.
E-mail addresses: calmonwilson@id.uff.br (W. Calmon), malbi@id.uff.br (M. Albi).

https://doi.org/10.1016/j.ins.2020.09.056

clustering techniques can be useful for identifying group patterns among the entities ranking the objects (i.e., judges) [39],
we are interested in the opposite case in this study, namely, when only objects should be grouped [16].
A well-known model for RD [18,1] is the Plackett–Luce (PL) model [33,40]. In this model, a positive scalar parameter is
associated with each object and the probability that an object is ranked first is proportional to its own parameter. This model
remains a benchmark for analyzing RD [6,11,23]. In this study, we propose a hierarchical version of the PL model in which
the number of clusters is set as an additional parameter. Following a Bayesian approach [3,8], the number of clusters is esti-
mated as the mode of its marginal posterior distribution. In contrast to several alternatives [41,15,19], the number of clusters
is allowed to take any integer value between one and the total number of objects. We use intermediate stages to link the
likelihood function with the number of clusters, in which intuitive parameters (a classification vector, the frequency distri-
bution of the objects across clusters, and a scalar parameter for controlling the degree of separation between neighboring
clusters) are introduced along with prior distributions for them. We suggest two distinct methods for estimating the number
of clusters: hierarchical PL (HPL) and fast HPL (FHPL). The former is directly based on the HPL model, whereas the latter
derives from a restricted version of that model. The prior distribution considered in the FHPL method attributes zero prob-
ability to several feasible classification vectors. One of the main advantages of the FHPL method is that it is computationally
faster than the HPL method.
We conduct a large simulation study to assess the performances of our proposed procedures against 10 alternative meth-
ods: Calinski and Harabasz [5], gap [45], Hartigan [22], instability [15], jump [43], Krzanowski and Lai [28], quantization
error modeling [26], silhouette [41], slope [19], and utility [31]. We use four classes of data-generating processes (dgps)
to simulate several score data (SD) matrices, which are, in turn, transformed into RD matrices. While the alternative methods
are employed to analyze the scores, rankings are analyzed using the HPL and FHPL approaches. We compare the methods in
terms of their success rate (the percentage of results in which the true number of clusters has been identified) and accuracy
(using the absolute mean deviation and its normalized version). Since HPL is a highly time-consuming method, it is
employed here only in cases in which the number of objects is small (four or five). The simulation results demonstrate that
both the HPL and the FHPL methods satisfactorily estimate the number of clusters in the considered contexts. Indeed, HPL
and FHPL both outperform the alternative methods regardless of which comparison criteria are adopted.
We conclude our analysis with an example. The HPL and FHPL methods are used to estimate the number of clusters in a
real dataset that contains RD obtained from Formula One motor racing. We show that these two proposed methods enable us
to identify the number of clusters, which is useful for clustering or classifying the drivers. To demonstrate another advantage
of the proposed methods, we also present statistical classifications of the drivers for several numbers of clusters, which are
derived naturally using these methods.
The remainder of the paper is organized as follows. Section 2 introduces the basic definitions and notations. Section 3
reviews related studies. Section 4 presents the HPL and FHPL methods for estimating the number of clusters. The simulation
results are summarized in Section 5, while the main results of the illustrative example with data from Formula One are pre-
sented in Section 6. Section 7 concludes.

2. Background

2.1. RD and scores

We consider a fixed set of $k$ objects $\{\text{Object } 1, \ldots, \text{Object } k\}$ and, similar to [35], re-express this set as $N_k \equiv \{1, 2, \ldots, k\}$; Object $i$ is identified by label $i$. A ranking is typically expressed by a rank vector $R = (R(1), \ldots, R(k))$, where $R(i)$ denotes the rank of object $i$; if $R(i) = m$, then Object $i$ is the $m$-th best object. We assume no ties. Therefore, the rank vector $R$ takes values in the set $\mathcal{P}_k \equiv \{\text{permutations of } (1, \ldots, k)\}$; for convenience, let $R^{-1}(m)$ denote the object of rank $m$.

The objects are ranked by entities called judges [18]. RD are identified by $n$ rank vectors $R_1, \ldots, R_n$, where $R_j = (R_j(1), \ldots, R_j(k))$ is the rank vector of the $j$-th judge. RD can be arranged in a $k \times n$ matrix $R^{k \times n}$, which can be expressed in terms of its columns as $[R_1 \ \ldots \ R_n]$. In this paper, we do not use distinct notations for the random rank vector and its realization. The $i$-th row of $R^{k \times n}$ is the rank vector of object $i$, denoted here as $R^i = (R_1(i), \ldots, R_n(i))$.

Some rankings are based on performance indicators or scores [1]. Let $S_j(i)$ be the score of the $i$-th object when it is evaluated by the $j$-th judge. Analogously, we consider the SD matrix $S^{k \times n}$, in which the $j$-th column $S_j = (S_j(1), \ldots, S_j(k))$ and the $i$-th row $S^i = (S_1(i), \ldots, S_n(i))$ are the score vector of the $j$-th judge and of the $i$-th object, respectively. Ranks are often uniquely determined by scores according to the following equivalence:

$$ R_j(i) < R_j(i') \iff S_j(i) > S_j(i'), \quad \text{for all distinct objects } i \text{ and } i'. \qquad (1) $$
In practice, several RD matrices cannot be determined from SD matrices. Additionally, scores that induce the observed
ranks can be unobservable. Nevertheless, there are at least two advantages of associating ranks and scores. First, a score
can be modeled as a continuous variable, while ranks assume integer values. Second, the ranks assigned to distinct objects
are strongly dependent, whereas scores may be assumed to be independent in various scenarios; this assumption is made in
the PL model and other available models for RD [1].
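As a small illustration of Eq. (1), the following R sketch converts a hypothetical score matrix into the corresponding rank matrix; the data and dimensions are arbitrary and serve only the example.

```r
# Convert a score matrix S (k objects x n judges) into the rank matrix R of Eq. (1):
# the object with the highest score in a column receives rank 1 (the best rank).
set.seed(123)
k <- 4; n <- 3
S <- matrix(rnorm(k * n), nrow = k)   # hypothetical continuous scores
R <- apply(-S, 2, rank)               # higher score -> smaller (better) rank
# Column j of R is the rank vector assigned by judge j; ties do not occur
# because the scores are continuous.
```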

2.2. Clustering

Clustering techniques aim to summarize a set of objects into a small (well-grounded) number of clusters: nonempty subsets that partition $N_k$, with each ideally containing only similar objects [20]. For a specified number of clusters $g$, these procedures result in a cluster partition $\mathcal{C}^g \equiv \{C^g_1, \ldots, C^g_g\}$. Objects $i$ and $i'$ should be allocated to the same cluster $C^g_\ell$ ($C^g_\ell \in \mathcal{C}^g$) if they present similar features. Analogous to the RD and SD matrices, we consider a feature data (FD) matrix $Y^{k \times n}$, in which the $i$-th row $Y^i = (Y_1(i), \ldots, Y_n(i))$ is the feature vector of object $i$.

The dissimilarity between objects $i$ and $i'$ is defined in terms of their feature vectors ($Y^i$ and $Y^{i'}$, respectively). If the features are continuous, a typical choice for the dissimilarity is the Euclidean distance $d(Y^i, Y^{i'}) = \left[\sum_{j=1}^{n} \left(Y_j(i) - Y_j(i')\right)^2\right]^{1/2}$ [45]. For noncontinuous features, other metrics should be adopted [31,7]. Let the average of the squared distances in cluster $C^g_\ell$ and the overall average of the squared distances be, respectively,

$$ \bar{d}^2_\ell \equiv \frac{1}{(\# C^g_\ell)^2} \sum_{i \in C^g_\ell} \sum_{i' \in C^g_\ell} d^2(Y^i, Y^{i'}) \quad \text{and} \quad \bar{d}^2 \equiv \frac{1}{k^2} \sum_{i=1}^{k} \sum_{i'=1}^{k} d^2(Y^i, Y^{i'}). $$

A well-known metric for measuring the variability associated with the cluster partition $\mathcal{C}^g$ is the within-cluster sum of squares (WCSS) [26,48], which is denoted here by $W(g)$ and is contrasted with the between-cluster sum of squares, namely, $B(g)$. These are defined as

$$ W(g) \equiv \frac{1}{2} \sum_{\ell=1}^{g} (\# C^g_\ell)\, \bar{d}^2_\ell \quad \text{and} \quad B(g) \equiv \frac{k}{2}\, \bar{d}^2 - W(g). \qquad (2) $$

Several methods for estimating the number of clusters use the WCSS as a cluster validity index (CVI) [28,45,48]; some of
these methods depend on both the WCSS and the between-cluster sum of squares [5].
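As a quick reference, the following R sketch evaluates $W(g)$ and $B(g)$ exactly as in Eq. (2) for an arbitrary feature matrix and label vector; the function name and the toy data are ours, for illustration only.

```r
# Compute W(g) and B(g) of Eq. (2) from a feature matrix Y (k x n) and cluster labels.
wcss_bcss <- function(Y, labels) {
  D2 <- as.matrix(dist(Y))^2                             # squared Euclidean distances
  k  <- nrow(Y)
  W  <- sum(sapply(unique(labels), function(l) {
    idx <- which(labels == l)
    (length(idx) / 2) * mean(D2[idx, idx, drop = FALSE]) # (#C_l / 2) * dbar^2_l
  }))
  c(WCSS = W, BCSS = (k / 2) * mean(D2) - W)
}

# Example: wcss_bcss(matrix(rnorm(20), nrow = 10), rep(1:2, each = 5))
```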
We intentionally adopt the same notation ($n$) for both the number of judges and the number of features. Although cluster analysis is useful in broader contexts, it seems reasonable to use such techniques to address SD ($Y_j(i) = S_j(i)$, $\forall i, j$) and RD ($Y_j(i) = R_j(i)$, $\forall i, j$). In the first case, no major difficulties are encountered in adopting typical clustering procedures, since the scores are continuous features. When SD or RD are grouped, the resulting clusters can be identified as distinct performance (or quality) groups [16].

2.3. Clustering algorithms

 
Consider an observed FD matrix $Y^{k \times n}$ with $k$ feature vectors $Y^1, \ldots, Y^k$. A clustering algorithm enables the generation of a cluster partition $\mathcal{C}^g \equiv \{C^g_1, \ldots, C^g_g\}$ of $N_k$ from $Y^{k \times n}$. Two widely known approaches are the K-means [2] and hierarchical clustering [47] algorithms. For a fixed value of $g$, the former aims to identify a cluster partition that minimizes the WCSS, whereas the latter starts from a cluster partition in which the clusters are unit sets and recursively merges the two nearest clusters such that the number of groups is reduced by one from iteration to iteration. In this study, hierarchical clustering is based on the bottom-up or agglomerative approach. Additional details are available in [14,25].

The K-means algorithm is also a recursive algorithm. At each iteration, it generates a new cluster partition from the old partition. The new partition is obtained by associating each object with the nearest cluster (in the old partition) or, equivalently, the nearest centroid. For $\ell = 1, \ldots, g$, the $\ell$-th centroid and the overall average feature vector are, respectively, defined as

$$ \bar{Y}(\ell) \equiv \left(\bar{Y}_1(\ell), \ldots, \bar{Y}_n(\ell)\right), \quad \text{where } \bar{Y}_j(\ell) \equiv \frac{1}{\# C^g_\ell} \sum_{i \in C^g_\ell} Y_j(i), \text{ for } j = 1, \ldots, n; \text{ and} $$

$$ \bar{Y} \equiv \left(\bar{Y}_1, \ldots, \bar{Y}_n\right), \quad \text{where } \bar{Y}_j \equiv \frac{1}{k} \sum_{i=1}^{k} Y_j(i), \text{ for } j = 1, \ldots, n. \qquad (3) $$

The hierarchical clustering algorithm is based on a distance measure between clusters (intercluster dissimilarity). For any pair of clusters $C^g_\ell$ and $C^g_{\ell'}$, the intercluster dissimilarity, denoted by $D(\ell, \ell')$, is computed from the collection of pairwise dissimilarities $d(Y^i, Y^{i'})$, where objects $i$ and $i'$ belong to $C^g_\ell$ and $C^g_{\ell'}$, respectively. On the many possible choices for the intercluster dissimilarity, see [20]. In this study, we adopt the complete linkage:

$$ D(\ell, \ell') \equiv \max_{i \in C^g_\ell,\ i' \in C^g_{\ell'}} d(Y^i, Y^{i'}). \qquad (4) $$

In addition to being useful for formally describing the K-means and hierarchical clustering algorithms, the centroid def-
inition and intercluster dissimilarity concept also facilitate the description of a new hierarchical clustering algorithm, as pro-
posed in Section 4.4.
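Both algorithms are available in base R; the short sketch below (with arbitrary toy data) shows the two calls used conceptually throughout this study: K-means for a fixed $g$, and agglomerative hierarchical clustering with the complete linkage of Eq. (4).

```r
set.seed(1)
Y <- matrix(rnorm(40), nrow = 10)           # k = 10 objects, n = 4 features
g <- 3

km <- kmeans(Y, centers = g, nstart = 25)   # K-means: minimizes the WCSS for fixed g
km$cluster                                  # cluster labels
km$centers                                  # centroids, as in Eq. (3)

hc <- hclust(dist(Y), method = "complete")  # agglomerative clustering, Eq. (4) linkage
cutree(hc, k = g)                           # cluster partition with g clusters
```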

3. Related works

Several well-known approaches are available for estimating the number of clusters. In Section 5, we assess the perfor-
mances of the methods proposed in this study against 10 alternative methods: six classical methods (Calinski & Harabasz
[5]; gap [45]; Hartigan [22]; jump [43]; Krzanowski & Lai [28]; and silhouette [41]) and four newer methods (instability
[15]; quantization error modeling [26]; slope [19]; and utility [31]). Despite their long history, the classical methods remain
useful for estimating the number of clusters [15,29,19,10,26,48].
These methods usually compare the cluster partitions obtained from a specified clustering algorithm [5,45,31,19]. They
adopt a known CVI and define a transformation of it (a cost function), which is used to evaluate the partitions. Then, the number of clusters is estimated as the cardinality of the optimal cluster partition, that is, the partition at which the cost function is optimized or exhibits a sharp change [26].
Many classical methods use the WCSS as the CVI [5,22,28,45]. However, other metrics have been considered and gener-
alizations of the WCSS have been proposed. For example, the Euclidean distance can be replaced by the Mahalanobis [43] or
Minkowski [10] metric. Another well-known CVI is the silhouette statistic [41,19], which evaluates the distance between the
object and its own cluster as well as its nearest distinct cluster. Alternatively, a cluster instability measure [46] can be a use-
ful CVI. The cluster instability paradigm is based on the following: if an incorrect number of clusters is used to partition a set
of objects, the obtained cluster partitions tend to vary from dataset to dataset even if the datasets were drawn from the same
population, whereas the variability should be lower if the correct number of clusters is specified.
An evaluation graph expresses the association between the CVI and number of clusters, and several methods maximize
the curvature of an evaluation graph or a smoothed version of it [26,48]. In the latter case, the smoothed evaluation graph—
estimated from the observed evaluation graph—can be regarded as an estimate of the (true and unobserved) expected eval-
uation graph. A similar strategy is used in the gap [45] and jump [43] methods. The former focuses on the deviation of the
WCSS from its expected value, while the latter prioritizes the variation of the evaluation graph when the number of clusters
increases by one. Another procedure that operates similarly to the gap method was proposed by [27]. The difference is that
the latter replaces the WCSS with the pooled within-group scatter matrix.
Although several approaches are based on distinct choices for the CVI and evaluation graph, other factors have also been
considered in newer methods. Since various features may not be of equal importance to the clustering process, strategies for
correcting data if irrelevant features are present can improve the accuracy of traditional methods [10]. In another recent
approach, clusters are associated with hills in some terrain, and the proposed method mimics a human observing the total
number of peaks (an estimate of the number of clusters) [36]. Additionally, the outputs of various methods can be combined
using a consensus measure [49].
Section 3.1 reviews the 10 alternative methods considered in this study and Section 3.2 briefly reviews the models and
clustering algorithms for RD.

3.1. Alternative methods

As before, let $Y^{k \times n}$ be the FD matrix. Suppose that a clustering algorithm (e.g., K-means or hierarchical clustering) is used to analyze $Y^{k \times n}$. Then, for each $g$, which varies from $g_{\min}$ to $g_{\max}$ (integers that satisfy $1 \leq g_{\min} < g_{\max} \leq k$), cluster partitions are obtained: $\mathcal{C}^{g_{\min}}, \ldots, \mathcal{C}^{g_{\max}}$, where $\mathcal{C}^g \equiv \{C^g_1, \ldots, C^g_g\}$. Let $W(g)$ be the associated WCSS.

Five classical methods (Calinski & Harabasz [5]; gap [45]; Hartigan [22]; jump [43]; and Krzanowski & Lai [28]) estimate $g$ as the argument that optimizes a function of $W(g)$. The arbitrary constants $g_{\min}$ and $g_{\max}$ must belong to the domains of these functions, which vary among methods. The sixth, namely, the silhouette method [41], is instead based on the silhouette statistic defined in Section 3.1.3 (Eq. (5)). Since these six classical methods are well-known approaches, we only briefly list their attributes in Table 1.

Various methods cannot estimate $g$ if its true value is equal to one (equal to $k$; greater than or equal to $k-1$); in such cases, $g_{\min}$ must be at least two ($g_{\max}$ must be at most $k-1$; $g_{\max}$ must be at most $k-2$), as specified in the columns "$g_{\min} \geq$" and "$g_{\max} \leq$" of Table 1. From a purely statistical perspective, however, there is no reason to rule out any integer between one and the total number of objects as the true number of clusters. This limitation is shared by all six classical methods as well as by the newer alternative methods (as we discuss in Sections 3.1.1–3.1.4). Table 1 also indicates that
Table 1
Classical methods: A brief summary.

Method                 $g_{\min} \geq$   $g_{\max} \leq$   Recursive?   Fast?
Calinski & Harabasz    2                 k - 1             NO           YES
gap                    1                 k - 1             YES          NO
Hartigan               1                 k - 2             YES          YES
jump                   1                 k - 1             NO           YES
Krzanowski & Lai       2                 k - 2             NO           YES
silhouette             2                 k - 1             NO           YES

except for the gap method, the classical approaches are not time-consuming. Additionally, both the Hartigan and the gap
approaches are recursive in the sense that they do not require that all the cluster partitions have been obtained previously.
Sections 3.1.1–3.1.4 summarize the four newer procedures considered in this study: instability [15]; quantization error mod-
eling [26]; slope [19]; and utility [31].

3.1.1. Instability
Cluster instability [46] is based on the notion that if several random samples are drawn from the same population and if the true number of clusters is used to partition the set of objects, then the cluster partitions obtained across the samples should not vary excessively. Formally, suppose that $Y^{k \times n}$ and $\tilde{Y}^{k \times n}$ are two FD matrices with rows $Y^1, \ldots, Y^k$ and $\tilde{Y}^1, \ldots, \tilde{Y}^k$, respectively. Let us assume that $Y, Y^1, \ldots, Y^k, \tilde{Y}, \tilde{Y}^1, \ldots, \tilde{Y}^k$ are independent and identically distributed (iid) $n$-dimensional random vectors with a common distribution $P$. Typically, for an arbitrary number of groups $g$, a clustering $\psi^g$ can be regarded as a mapping $\psi^g: \mathbb{R}^n \to \{1, \ldots, g\}$. For a specified value of $g$, a clustering algorithm $\Psi$ associates with the FD matrices $Y^{k \times n}$ and $\tilde{Y}^{k \times n}$ two mappings, namely, $\psi^g_{Y^{k \times n}}$ and $\psi^g_{\tilde{Y}^{k \times n}}$, respectively. If $I$ denotes the indicator function and $E_P$ denotes the expectation under the distribution $P$, then it is possible to define the cluster instability [46] of a clustering algorithm $\Psi$ by

$$ \mathcal{I}(g) \equiv E_P\left\{ D\!\left(\psi^g_{Y^{k \times n}}, \psi^g_{\tilde{Y}^{k \times n}}\right) \right\}, \quad \text{for } g = 2, \ldots, k-1, \quad \text{where} $$

$$ D\!\left(\psi^g_{Y^{k \times n}}, \psi^g_{\tilde{Y}^{k \times n}}\right) \equiv E_P\left\{ \left| I\!\left[\psi^g_{Y^{k \times n}}(Y) = \psi^g_{Y^{k \times n}}(\tilde{Y})\right] - I\!\left[\psi^g_{\tilde{Y}^{k \times n}}(Y) = \psi^g_{\tilde{Y}^{k \times n}}(\tilde{Y})\right] \right| \right\}. $$

Since $P$ is typically unknown, $\mathcal{I}(g)$ should be replaced by a bootstrap estimate [15]. The instability method leads us to minimize

$$ \widehat{\mathcal{I}}(g) \equiv \frac{1}{B} \sum_{b=1}^{B} \left\{ \frac{1}{k^2} \sum_{i=1}^{k} \sum_{i'=1}^{k} \left| I\!\left[\psi^g_{Y_b^{k \times n}}(Y^i) = \psi^g_{Y_b^{k \times n}}(Y^{i'})\right] - I\!\left[\psi^g_{\tilde{Y}_b^{k \times n}}(Y^i) = \psi^g_{\tilde{Y}_b^{k \times n}}(Y^{i'})\right] \right| \right\}, $$

where $(Y_1^{k \times n}, \tilde{Y}_1^{k \times n}), \ldots, (Y_B^{k \times n}, \tilde{Y}_B^{k \times n})$ are pairs of bootstrap samples; $\widehat{\mathcal{I}}(g)$ is defined for $g = 2, \ldots, k-1$.
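A rough R sketch of the bootstrap instability estimate is given below. It assumes K-means as the clustering algorithm and realizes the mapping $\psi^g$ by assigning every original object to the nearest centroid fitted on a bootstrap sample; this is one simple reading of the estimator above, and the reference implementation in [15] may differ in its details.

```r
# Bootstrap estimate of the cluster instability for a given g (sketch).
instability_hat <- function(Y, g, B = 20) {
  k <- nrow(Y)
  assign_nearest <- function(Y, centers) {       # psi^g: map each row to nearest centroid
    apply(Y, 1, function(y) which.min(colSums((t(centers) - y)^2)))
  }
  mean(replicate(B, {
    lab1 <- assign_nearest(Y, kmeans(Y[sample(k, replace = TRUE), ], g)$centers)
    lab2 <- assign_nearest(Y, kmeans(Y[sample(k, replace = TRUE), ], g)$centers)
    co1 <- outer(lab1, lab1, "==")               # co-membership under first clustering
    co2 <- outer(lab2, lab2, "==")               # co-membership under second clustering
    mean(abs(co1 - co2))                         # (1/k^2) sum of disagreement indicators
  }))
}

# The number of clusters is then estimated as the g in {2, ..., k - 1} that
# minimizes instability_hat(Y, g).
```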

3.1.2. Quantization error modeling


The quantization error modeling method [26] minimizes the following parametrized cost function: $PCF(g) \equiv \frac{1}{k}\, W(g)\, g^{\hat{b}}$, for $g = 1, \ldots, k-1$, where $\hat{b}$ is the ordinary least squares estimate of $b$, the slope coefficient in the linear regression of $\ln W(g)$ on $\ln(g)$.
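A minimal R sketch of this cost function is shown below; it takes a vector W with W[g] equal to the WCSS of the partition into g clusters and follows the expression exactly as written above (the form of the exponent is taken from that expression and should be checked against [26] before serious use).

```r
# Parametrized cost function of the quantization error modeling method (sketch).
qem_estimate <- function(W, k) {
  G    <- length(W)                        # W[g] = WCSS for g clusters, g = 1, ..., G
  bhat <- coef(lm(log(W) ~ log(1:G)))[2]   # OLS slope of ln W(g) on ln(g)
  PCF  <- (W / k) * (1:G)^bhat             # PCF(g) as written above
  which.min(PCF)                           # estimated number of clusters
}
```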

3.1.3. Slope
The silhouette statistic $S(g)$ [41] is an alternative CVI defined as

$$ S(g) \equiv \frac{1}{k} \sum_{i=1}^{k} s^g_i, \quad \text{for } g = 2, \ldots, k-1, \quad \text{where} \qquad (5) $$

$$ s^g_i \equiv \begin{cases} \dfrac{NC^g_i - OC^g_i}{\max\left(OC^g_i, NC^g_i\right)}, & \text{if } \# C^g_{\ell(i)} > 1, \\[2mm] 0, & \text{if } \# C^g_{\ell(i)} = 1. \end{cases} $$

$OC^g_i$ is the distance between object $i$ and its own cluster $C^g_{\ell(i)}$, and $NC^g_i$ is the distance between object $i$ and its nearest distinct cluster. For each $i = 1, \ldots, k$, $OC^g_i$ and $NC^g_i$ are defined as

$$ OC^g_i \equiv \frac{\sum_{i' \in C^g_{\ell(i)}} d(Y^i, Y^{i'})}{\# C^g_{\ell(i)} - 1} \quad \text{and} \quad NC^g_i \equiv \min_{\ell:\ 1 \leq \ell \leq g,\ \ell \neq \ell(i)} \frac{\sum_{i' \in C^g_\ell} d(Y^i, Y^{i'})}{\# C^g_\ell}. $$

The term $OC^g_i$ is not defined if $\# C^g_{\ell(i)} = 1$. The slope method [19] maximizes the slope statistic $S^*(g)$, which is expressed as follows:

$$ S^*(g) = -\left[S(g+1) - S(g)\right] S(g)^p, \quad \text{for } g = 2, \ldots, k-1, $$

where $S(k) \equiv 0$ and $p$ is a positive integer parameter that controls the trade-off between the silhouette level (global aspect) and silhouette variation (local aspect). The number of clusters is estimated as one if Pearson's correlation between the silhouette and the number of clusters is positive.
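The R sketch below computes the silhouette statistic of Eq. (5) and the slope statistic for partitions produced by complete-linkage hierarchical clustering; the function names are ours, and the sketch assumes at least four objects.

```r
# Silhouette statistic S(g) of Eq. (5) for a given partition (labels).
silhouette_stat <- function(Y, labels) {
  D <- as.matrix(dist(Y)); k <- nrow(Y)
  mean(sapply(1:k, function(i) {
    own <- which(labels == labels[i])
    if (length(own) == 1) return(0)                        # s_i^g = 0 for singletons
    OC <- sum(D[i, setdiff(own, i)]) / (length(own) - 1)   # distance to own cluster
    NC <- min(sapply(setdiff(unique(labels), labels[i]),
                     function(l) mean(D[i, labels == l]))) # nearest distinct cluster
    (NC - OC) / max(OC, NC)
  }))
}

# Slope method: maximize S*(g) = -[S(g+1) - S(g)] S(g)^p, with S(k) = 0.
slope_estimate <- function(Y, p = 1) {
  k  <- nrow(Y)
  hc <- hclust(dist(Y), method = "complete")
  S  <- c(sapply(2:(k - 1), function(g) silhouette_stat(Y, cutree(hc, g))), 0)
  Sg <- S[-length(S)]                                      # S(2), ..., S(k - 1)
  Sstar <- -(S[-1] - Sg) * Sg^p
  if (cor(Sg, 2:(k - 1)) > 0) 1 else (2:(k - 1))[which.max(Sstar)]
}
```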

3.1.4. Utility
The method proposed by [31] addresses data that contain both continuous and categorical features (i.e., mixed data). Since the alternative methods in this study are used to analyze only SD matrices (the continuous case), we consider the particular case of their approach in which all the features are continuous variables. In this case, the objective is to minimize the utility function (we have named this approach the utility method), which is expressed as follows:

$$ U(g) \equiv \frac{1}{g \cdot k} \sum_{j=1}^{n} \sum_{\ell=1}^{g} (\# C^g_\ell) \cdot \left[ \bar{d}^2_j - \bar{d}^2_{\ell j} \right], \quad \text{for } g = 1, \ldots, k-1, $$

where $\bar{d}^2_{\ell j} = \sum_{i \in C^g_\ell} \left(Y_j(i) - \bar{Y}_j(\ell)\right)^2 / (\# C^g_\ell)$ and $\bar{d}^2_j = \sum_{i=1}^{k} \left(Y_j(i) - \bar{Y}_j\right)^2 / k$ are the within-class variance and overall variance (of the $j$-th feature), respectively, and $\bar{Y}_j(\ell)$ and $\bar{Y}_j$ are defined in Eq. (3).
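For completeness, a short R sketch evaluating $U(g)$ for a candidate partition is given below; the function name is ours, and the expression follows the continuous-feature special case stated above.

```r
# Utility function U(g) for purely continuous features (sketch).
utility_stat <- function(Y, labels) {
  k <- nrow(Y); g <- length(unique(labels))
  sum(sapply(seq_len(ncol(Y)), function(j) {
    d2j <- mean((Y[, j] - mean(Y[, j]))^2)              # overall variance of feature j
    sum(sapply(unique(labels), function(l) {
      idx  <- which(labels == l)
      d2lj <- mean((Y[idx, j] - mean(Y[idx, j]))^2)     # within-class variance
      length(idx) * (d2j - d2lj)
    }))
  })) / (g * k)
}
```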

3.2. RD: Models and clustering

For the RD models, the methods introduced herein (cf. Section 4) are based exclusively on the classical PL approach
[33,40]. However, many other well-known RD models exist. According to [35], the pioneering model introduced by [44] is
similar to the PL model in the sense that the ranking can be understood as induced from the score ordering. The PL model
is also known for multiple comparisons [24], an extension of the Bradley–Terry model [4] for paired comparisons. As any
rank vector induces an order relationship between any pair of objects, paired comparison models are constructed by directly
modeling these paired order relationships.
An alternative motivation for the PL model lies in associating the ranking process with a recursive choice process, where
the judge starts by choosing the best object. Thereafter, the judge must choose the second-best (the best of the remaining),
the third-best, and so on. Hence, the PL model is also a multistage model. Unlike the PL proposal, the (relative) probabilities
can vary from stage to stage in other multistage models [17].
Another strategy defines the essence of so-called distance-based models. Such models postulate that the probability of
observing a specific rank vector can be written in terms of the distance to a certain (not necessarily known) modal rank. This
is the case of the extremely popular Mallows model [34], for example. For a detailed description of the aforementioned mod-
els as well as other interesting approaches not mentioned here, see [18,35,1].
Several clustering techniques have recently been proposed to improve the clustering analysis of RD. Most clustering
methods designed to handle these kinds of data have focused on identifying group structures in the population of judges.
For example, a mixture of distance-based models is proposed by [38], where each specific population can be understood
as a distinct cluster. This approach is extended by replacing distances with weighted distances in [30]. Mixtures of paired
comparison models are considered in [9]. A new clustering algorithm is introduced in [12], which mimics the K-means pro-
cedure (cf. Section 2.3). The WCSS is replaced by a weighted average of the distances between all the pairs formed by each
ranking vector and each cluster center. More examples can be found in [6,39].
The recent proposal presented by [13] aims to estimate a median tied ranking. The ties induce a natural ordered cluster
partition; however, the number of clusters must be previously defined. Instead of clustering the judges, the authors aim to
generate a cluster partition of the set of objects (or items). Although the same idea is considered in our study, we focus on the
problem of estimating the number of clusters. To the best of our knowledge, no such method has been specially proposed for RD, which motivated us to investigate this topic. The estimated numbers of clusters provided by our methods could be used as inputs for
the method proposed by [13].

4. HPL approach

In this study, we are interested in identifying clustering structures in the object universe rather than modeling eventual grouping structures in the population of judges or rankers (see [35]). As discussed in Section 2, the RD matrix $R^{k \times n}$ has $n$ columns $R_1, \ldots, R_n$, where $R_j = (R_j(1), \ldots, R_j(k))$ denotes the rank vector of the $j$-th judge. Formally, as in several RD models [18,1], we assume the following: (i) $R_1, \ldots, R_n$ are iid random vectors that all have the same distribution as the representative rank vector $R$, and (ii) for each judge $j$ ($j = 1, \ldots, n$), there are no (observed) tied ranks. $R$ is a discrete random vector with probability mass function $p: \mathcal{P}_k \to \mathbb{R}$, which is typically unknown. The most general model for $R$ is the class of all probability mass functions defined over $\mathcal{P}_k$, for which the parameter space is the $(k! - 1)$-simplex. Since the dimension of this parameter space grows rapidly with the number of objects $k$, several alternative parsimonious models have been proposed, such as the PL model [35].

4.1. Original PL model

Order statistics models (OSMs) are popular models for RD [1]. An OSM assumes that rankings are induced by scores. As defined in Section 2.1, let $S_j = (S_j(1), \ldots, S_j(k))$ be the $j$-th score vector. Then, for any generic nonrandom vector $\pi$ (i.e., $\forall \pi = (\pi(1), \ldots, \pi(k)) \in \mathcal{P}_k$), a generic OSM assumes that

$$ \Pr(R_j = \pi) = \Pr\!\left(S_j(\pi^{-1}(1)) > S_j(\pi^{-1}(2)) > \cdots > S_j(\pi^{-1}(k))\right). $$

Observe that $\pi^{-1}(m) = i$ if and only if $\pi(i) = m$. Typically, an OSM is implicitly defined in terms of a probabilistic model for the score vector $S$. A traditional OSM is the Thurstone model [44], which postulates that $S$ is normally distributed.

The PL model, another well-established OSM [6,23], has been applied in many fields such as sporting competitions, psychology/preference modeling, and politics/voting, as shown by [24,11,23,6,21]. This model assumes that $S_1, \ldots, S_n$ are iid random vectors and associates a positive scalar parameter $\theta(i)$ with each object $i = 1, \ldots, k$. The parameter $\theta = (\theta(1), \ldots, \theta(k))$ is a $k$-dimensional vector that takes values in $\Theta = (0, +\infty)^k$ (the parameter space). The PL model also postulates that $S_j(1), \ldots, S_j(k)$ are independent random variables, where for $i = 1, \ldots, k$ and $j = 1, \ldots, n$, the difference $S_j(i) - \ln(\theta(i))$ has a standard Gumbel distribution, in which the location parameter is zero and the scale parameter is one [18]. We denote $S_j(i) - \ln(\theta(i)) \sim \text{Gumbel}(0, 1)$.

For arbitrary objects $i_1, \ldots, i_{k'}$ ($1 \leq i_1 < i_2 < \cdots < i_{k'} \leq k$), the probability that object $i$, with $i \in \{i_1, \ldots, i_{k'}\}$, is ranked first is easily expressed in terms of the coordinates of the parameter $\theta$ [1]. Formally, the PL model assumptions imply [35] that

$$ \Pr\!\left(R_j(i) < R_j(i'),\ \forall i' = i_1, \ldots, i_{k'},\ i' \neq i\right) = \Pr\!\left(S_j(i) = \max\{S_j(i_1), \ldots, S_j(i_{k'})\}\right) = \frac{\theta(i)}{\theta(i_1) + \cdots + \theta(i_{k'})}, \quad \forall j = 1, \ldots, n. $$

Consequently, ceteris paribus, the higher the value of the parameter $\theta(i)$, the higher the probability that object $i$ is ranked as the best object. If we compare two generic objects, say $i$ and $i'$, then $\Pr(R_j(i) < R_j(i')) = \theta(i) / [\theta(i) + \theta(i')]$. From a probabilistic perspective, they are indistinguishable from each other if $\theta(i) = \theta(i')$; in this case, $\Pr(R_j(i) < R_j(i')) = \Pr(R_j(i) > R_j(i')) = 50\%$.

In the PL model, the probability mass function of the representative rank vector $R$ has the following closed form [18]:

$$ p(\pi) = \Pr(R = \pi) = \prod_{i=1}^{k-1} \frac{\theta\!\left(\pi^{-1}(i)\right)}{\sum_{l=i}^{k} \theta\!\left(\pi^{-1}(l)\right)}, \quad \forall \pi \in \mathcal{P}_k. $$

Let $R_1, \ldots, R_n$ be the observed rank vectors (cf. Section 2.1). The log-likelihood is

$$ \ell(\theta) \equiv \sum_{j=1}^{n} \sum_{i=1}^{k-1} \ln\!\left[ \frac{\theta\!\left(R_j^{-1}(i)\right)}{\sum_{l=i}^{k} \theta\!\left(R_j^{-1}(l)\right)} \right], \quad \forall \theta \in \Theta. \qquad (6) $$

Since maximizing this likelihood function is nontrivial, an iterative minorization–maximization algorithm has been pro-
posed to approximate the maximum likelihood estimate [24]. Further, another estimation procedure, namely, the Weaver
algorithm [11], has recently been introduced.
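As an illustration, the R sketch below evaluates the log-likelihood of Eq. (6) and obtains an approximate maximum likelihood estimate with a generic numerical optimizer; this is only a didactic alternative to the minorization–maximization algorithm of [24] and the Weaver algorithm of [11], and the function names are ours.

```r
# Log-likelihood of Eq. (6); R is a k x n rank matrix (rows = objects, cols = judges).
pl_loglik <- function(theta, R) {
  sum(apply(R, 2, function(r) {
    ord <- order(r)                                   # objects from best to worst
    sum(log(theta[ord[-length(ord)]]) -
        log(rev(cumsum(rev(theta[ord])))[-length(ord)]))
  }))
}

# Approximate MLE via a generic optimizer, fixing theta(k) = 1 for identifiability.
pl_mle <- function(R) {
  k <- nrow(R)
  fit <- optim(rep(0, k - 1),
               function(beta) -pl_loglik(c(exp(beta), 1), R),
               method = "BFGS")
  c(exp(fit$par), 1)
}

# Example: simulate n = 100 rankings of k = 4 objects from theta = (4, 2, 1, 1)
# via Gumbel scores (Section 4.1), then refit:
# set.seed(1); theta <- c(4, 2, 1, 1)
# scores <- matrix(log(theta) - log(-log(runif(4 * 100))), nrow = 4)
# R <- apply(-scores, 2, rank); round(pl_mle(R), 2)
```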

4.2. HPL

Based on the PL model, we suggest the following highly intuitive classification rules:

1. If $\theta$ has exactly $g$ distinct coordinates, then $g$ should be the true number of clusters.
2. Objects $i$ and $i'$ should be classified in the same cluster if and only if $\theta(i) = \theta(i')$.
3. The true (ordered) classification vector $c = (c(1), \ldots, c(k))$ should be defined as follows:
   (a) $c(i)$ is an integer between 1 and $g$, $\forall i = 1, \ldots, k$;
   (b) $c(i) < c(i') \iff \theta(i) > \theta(i')$, for every $1 \leq i, i' \leq k$.
4. The true cluster partition $\mathcal{C}^g \equiv \{C^g_1, \ldots, C^g_g\}$ satisfies the following: for every $\ell = 1, \ldots, g$ and arbitrary objects $i$ and $i'$, $i \in C^g_\ell$ and $i' \in C^g_\ell$ if and only if $c(i) = c(i')$.
5. Finally, we should also obtain a distribution vector $d = (d(1), \ldots, d(g))$, where $d(\ell) = \#\{i;\ c(i) = \ell\}$, for $\ell = 1, \ldots, g$.

In these classification rules, the parameter $\theta$ determines all the other parameters. Unfortunately, this parameter is typically unknown, along with the other parameters: the number of clusters $g$, the classification vector $c$, the cluster partition $\mathcal{C}^g$, and the distribution vector $d$. In this study, we propose the HPL model, in which $g$ can be directly estimated from the RD. The relation between the distribution of the rank vector and the parameter $g$ is established by introducing intermediate stages in which the classification vector $c$ and the distribution vector $d$ are set as parameters. In addition, we use a scale parameter $s$ to control the degree of separation between distinct clusters.

Let $\alpha = (\theta, c, d, s, g)$. As is typical in Bayesian statistical models [3], we denote the likelihood function evaluated at $\alpha$ by $p(R^{k \times n} \mid \alpha) = p(R^{k \times n} \mid \theta, c, d, s, g)$. Following Eq. (6), we express

$$ p(R^{k \times n} \mid \alpha) = p(R^{k \times n} \mid \theta, c, d, s, g) = p(R^{k \times n} \mid \theta) \equiv \prod_{j=1}^{n} \prod_{i=1}^{k-1} \left[ \frac{\theta\!\left(R_j^{-1}(i)\right)}{\sum_{l=i}^{k} \theta\!\left(R_j^{-1}(l)\right)} \right]. \qquad (7) $$

In the original PL model, the likelihood function does not depend on $g$. We introduce five intermediate stages, using a hierarchical approach [8], for associating $\theta$ with the integer $g$. These stages are explained in sequence.

In the first stage, we adopt a prior distribution for $\theta$ that depends on both $c$ and $s$: $p(\theta \mid c, s)$. For the specified vector $c = (c(1), \ldots, c(k))$, this prior distribution should satisfy $c(i) < c(i') \Rightarrow \theta(i) > \theta(i')$. Insofar as $c$ determines the ordering among the coordinates of $\theta$, the parameter $s$ is introduced to control their dispersion, as we discuss in Section 4.3.

For the specified classification vector $c = (c(1), \ldots, c(k))$ and for $g = \max\{c(1), \ldots, c(k)\}$, there is a unique distribution vector $d = (d(1), \ldots, d(g))$, where $d(\ell) = \#\{i;\ c(i) = \ell\}$ is the total number of objects in the $\ell$-th best group. Observe that $d(\ell) \in \{1, \ldots, k\}$, $\forall \ell = 1, \ldots, g$, and $d(1) + \cdots + d(g) = k$. Additionally, $c$ is not necessarily determined by $d$. In the second stage, we recommend using a prior distribution for the classification vector $c$ that depends on the distribution vector $d$: $p(c \mid d)$. For a fixed $d = (d(1), \ldots, d(g))$, the domain of the prior distribution $p(c \mid d)$ is the finite set $C_d$, which is formally described as

$$ C_d = \left\{(c(1), \ldots, c(k));\ c(i) \in N_g,\ i = 1, \ldots, k,\ \text{and } \#\{i;\ c(i) = \ell\} = d(\ell),\ \ell = 1, \ldots, g\right\}. \qquad (8) $$

If the distribution vector is a $g$-dimensional vector $d = (d(1), \ldots, d(g))$, then the number of clusters is $g$. In the third stage, one should select a prior distribution for the distribution vector $d$ that depends on $g$: $p(d \mid g)$. If $g$ ($g \in N_k$) is known, the domain of the prior distribution $p(d \mid g)$ is the finite set $D_g$ that contains all the distribution vectors that are compatible with $g$:

$$ D_g = \left\{(d(1), \ldots, d(g));\ d(\ell) \in N_k,\ \ell = 1, \ldots, g,\ \text{and } d(1) + \cdots + d(g) = k\right\}. \qquad (9) $$

These three stages establish the relation between the rank vector distribution and the parameter $g$. We also suggest selecting (independent) prior distributions for $g$ (fourth stage) and $s$ (fifth stage): $p(g)$ and $p(s)$, respectively. The former has domain $N_k$, whereas the latter has domain $(0, +\infty)$. Finally, from Eqs. (7)–(9) and the prior distributions $p(s)$ and $p(g)$, the (joint) posterior distribution of $\alpha = (\theta, c, d, s, g)$ is expressed as

$$ p(\alpha \mid R^{k \times n}) \equiv p(\theta, c, d, s, g \mid R^{k \times n}) \propto p(R^{k \times n} \mid \theta) \cdot p(\theta \mid c, s) \cdot p(c \mid d) \cdot p(d \mid g) \cdot p(g) \cdot p(s). \qquad (10) $$

If $\phi \equiv (\theta, c, d, s)$, then $\alpha = (\phi, g)$ and the posterior distribution may be reformulated as $p(\alpha \mid R^{k \times n}) = p(\phi, g \mid R^{k \times n})$. The (marginal) posterior distribution of $g$ can be obtained from

$$ p(g \mid R^{k \times n}) = \int p(\phi, g \mid R^{k \times n})\, \partial\phi. \qquad (11) $$

In our approach, the number of clusters is estimated as the mode of this posterior distribution:

$$ \hat{g} = \arg\max_{1 \leq g \leq k} p(g \mid R^{k \times n}). \qquad (12) $$

Further, the model proposed in this section can provide additional useful information about the analyzed objects. For example, we can determine the (marginal) posterior distribution of the pair $(c, g)$, namely $p(c, g \mid R^{k \times n})$, by integrating Eq. (10) with respect to the remaining parameters, along with the conditional posterior distribution of the classification vector $c$ for a specified value of $g$: $p(c \mid g, R^{k \times n}) = p(c, g \mid R^{k \times n}) / p(g \mid R^{k \times n})$. The mode of $p(c \mid g, R^{k \times n})$ can be used as an estimate of the classification vector conditional on a pre-specified number of clusters $g$, thereby resulting in a cluster partition that contains $g$ ordered clusters. Using a similar approach, it is possible to obtain an unconditional estimate of the classification vector as the mode of the marginal posterior distribution of $c$, which is obtained by integrating $p(c, g \mid R^{k \times n})$ with respect to $g$.

4.3. Practical issue: Prior distribution

The practical application of the HPL model requires the specification of the (conditional) prior distributions in all five stages (cf. Section 4.2). This section introduces the prior distributions selected for analyzing both the simulated RD (Section 5) and a real dataset obtained from Formula One motor racing (Section 6). Simple and noninformative prior distributions [37] are adopted in stages 2, 3, and 4. We assume that $c$ (conditional on $d$), $d$ (conditional on $g$), and $g$ are uniformly distributed over the sets $C_d$, $D_g$, and $N_k$, respectively. The conditional prior distributions in the second, third, and fourth stages are, respectively,

 
$$ p(c \mid d) = \frac{I(c \in C_d)}{\# C_d}, \quad \text{where } \# C_d = \binom{k}{d(1), \ldots, d(g)} = \frac{k!}{d(1)! \cdots d(g)!}, \qquad (13) $$

$$ p(d \mid g) = \frac{I(d \in D_g)}{\# D_g}, \quad \text{where } \# D_g = \binom{k-1}{k-g}, \qquad (14) $$

$$ p(g) = \frac{I(g \in N_k)}{k}. \qquad (15) $$

In contrast to the vague prior distributions specified in the second, third, and fourth stages, we impose restrictive assumptions in the first and fifth stages. In the first stage, we assume that, for a specified classification vector $c$, the probability that object $i$ is ranked better than any other object $i'$ does not depend on their classification levels ($c(i)$ and $c(i')$) if these objects are classified in consecutive groups; we postulate that this probability is a decreasing function of the scalar parameter $s$ ($s > 0$). Our hypothesis can be precisely expressed as

$$ \frac{\theta(i)}{\theta(i')} = 1 + \frac{1}{s}, \quad \text{if } c(i) = c(i') - 1. \qquad (16) $$

The parameter $s$ is introduced to control the proximity between distinct clusters. The higher the value of $s$, the more difficult it is to distinguish objects from distinct clusters. The assumption in Eq. (16), in turn, leads us to formulate the following degenerate prior distribution for $\theta$ conditional on the classification vector $c = (c(1), \ldots, c(k))$ and $s$ ($s > 0$) (first stage):

$$ p(\theta \mid c, s) = I\!\left(\theta = \left(\theta^{\star}(c(1)), \ldots, \theta^{\star}(c(k))\right)\right), \qquad (17) $$

where $\theta^{\star} = \left(\left(1 + \tfrac{1}{s}\right)^{g-1}, \left(1 + \tfrac{1}{s}\right)^{g-2}, \ldots, \left(1 + \tfrac{1}{s}\right), 1\right)$. The coordinates of the $g$-dimensional vector $\theta^{\star} = (\theta^{\star}(1), \ldots, \theta^{\star}(g))$ form a decreasing geometric progression. In addition, $c$ and $s$ determine $\theta$.

From Eq. (16), $\Pr(R(i') < R(i)) = 1/[2 + (1/s)]$ when $c(i) = c(i') - 1$. This probability increases rapidly from 0 to $1/3$ for $0 < s < 1$ (and slowly from $1/3$ to $1/2$ for $s > 1$). The prior distribution selected for $s$ is a uniform distribution over the set $S = \{0.1, 0.2, \ldots, 0.9, 1, 2, \ldots, 7\}$ (fifth stage):

$$ p(s) = \frac{I(s \in S)}{\# S}. \qquad (18) $$

Hereafter, the HPL method maximizes the (marginal) posterior distribution of the number of clusters (cf. Eq. (12)) under the prior distributions specified in Eqs. (13)–(15), (17), and (18). Other prior distributions can be adopted for some stages without changing the basic hierarchical structure that defines the HPL method.
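For a small number of objects, the marginal posterior $p(g \mid R^{k \times n})$ under these priors can be computed by brute-force enumeration of all classification vectors and values of $s$. The R sketch below does exactly that; the function names are ours, pl_loglik is the Eq. (6) log-likelihood repeated from the Section 4.1 sketch, and the code is meant only to make the hierarchical structure concrete, not to reproduce the authors' implementation.

```r
# Log-likelihood of Eq. (6) (same as the Section 4.1 sketch).
pl_loglik <- function(theta, R) {
  sum(apply(R, 2, function(r) {
    ord <- order(r)
    sum(log(theta[ord[-length(ord)]]) -
        log(rev(cumsum(rev(theta[ord])))[-length(ord)]))
  }))
}

# Brute-force marginal posterior p(g | R) for small k, under Eqs. (13)-(18).
hpl_posterior_g <- function(R, S = c(seq(0.1, 0.9, by = 0.1), 1:7)) {
  k <- nrow(R)
  logw <- vector("list", k)                      # log posterior weights, grouped by g
  for (g in 1:k) {
    cs <- as.matrix(expand.grid(rep(list(1:g), k)))
    cs <- cs[apply(cs, 1, function(cc) length(unique(cc)) == g), , drop = FALSE]
    w <- c()
    for (r in seq_len(nrow(cs))) {
      cc <- cs[r, ]
      d  <- tabulate(cc, nbins = g)              # distribution vector implied by c
      lp <- sum(lfactorial(d)) - lfactorial(k) - # log p(c | d) = -log #C_d, Eq. (13)
            lchoose(k - 1, k - g) -              # log p(d | g) = -log #D_g, Eq. (14)
            log(k) - log(length(S))              # log p(g) + log p(s), Eqs. (15), (18)
      for (s in S) {
        theta <- ((1 + 1 / s)^((g - 1):0))[cc]   # Eq. (17): theta determined by c and s
        w <- c(w, pl_loglik(theta, R) + lp)
      }
    }
    logw[[g]] <- w
  }
  m <- max(unlist(logw))                         # log-sum-exp for numerical stability
  post <- sapply(logw, function(w) sum(exp(w - m)))
  post / sum(post)
}

# The HPL estimate of Eq. (12) is then which.max(hpl_posterior_g(R)).
```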

4.4. FHPL method: A faster version of the HPL method

The FHPL procedure reduces the computation time by restricting the parameter space. While the HPL method postulates that the classification vector $c$ belongs to the set $C^{ALL} \equiv \bigcup_{g \in N_k} \bigcup_{d \in D_g} C_d$, the FHPL method requires that $c$ be in a proper subset $C^{FAST}$ of $C^{ALL}$. The FHPL method estimates $g$ as

$$ \hat{g} = \arg\max_{1 \leq g \leq k} \tilde{p}(g \mid R^{k \times n}), \quad \text{where } \tilde{p}(g \mid R^{k \times n}) \equiv \int \tilde{p}(\phi, g \mid R^{k \times n})\, \partial\phi, \qquad (19) $$

in which $\phi = (\theta, c, d, s)$, and the new posterior distribution, $\tilde{p}(\phi, g \mid R^{k \times n})$, is expressed as

$$ \tilde{p}(\alpha \mid R^{k \times n}) \equiv \tilde{p}(\phi, g \mid R^{k \times n}) \propto p(R^{k \times n} \mid \theta) \cdot p(\theta \mid c, s) \cdot p(c \mid d) \cdot I(c \in C^{FAST}) \cdot p(d \mid g) \cdot p(g) \cdot p(s). \qquad (20) $$

According to Eqs. (10) and (20), if $\alpha$ and $\alpha'$ belong to the support of $\tilde{p}$, then the posterior ratio $\tilde{p}(\alpha \mid R^{k \times n}) / \tilde{p}(\alpha' \mid R^{k \times n})$ is equal to the posterior ratio $p(\alpha \mid R^{k \times n}) / p(\alpha' \mid R^{k \times n})$.
Algorithm 1. Hierarchical ordinal classification algorithm

1. Choose integers $g_{\min}$ and $g_{\max}$ ($1 \leq g_{\min} < g_{\max} \leq k$).
2. For each object $i$ ($i = 1, \ldots, k$), let $Y^i = (Y_1(i))$ be its unidimensional feature vector, which contains its mean rank: $Y_1(i) \equiv \frac{1}{n} \sum_{j=1}^{n} R_j(i)$.
3. Without loss of generality (relabel the objects if necessary), assume that $Y_1(1) < Y_1(2) < \cdots < Y_1(k)$ and let the $k$-th (ordinal) classification vector be $c^k = (c^k(1), \ldots, c^k(k)) = (1, \ldots, k)$. Then, start with the cluster partition $\mathcal{C}^k \equiv \{C^k_1, \ldots, C^k_k\}$ of $N_k$, in which each cluster $C^k_\ell = \{\ell\}$ is a unit set.
4. For $h = k, k-1, \ldots, g_{\min} + 1$:
   (a) For the current cluster partition $\mathcal{C}^h \equiv \{C^h_1, \ldots, C^h_h\}$ of $N_k$ into $h$ clusters, evaluate all $h - 1$ pairwise intercluster dissimilarities $D(\ell, \ell+1)$, $1 \leq \ell \leq h - 1$. Then, identify the most similar pair of consecutive clusters, $C^h_{\ell^*}$ and $C^h_{\ell^*+1}$.
   (b) Obtain the new cluster partition $\mathcal{C}^{h-1} \equiv \{C^{h-1}_1, \ldots, C^{h-1}_{h-1}\}$ of $N_k$, where $C^{h-1}_\ell = C^h_\ell$ if $\ell = 1, \ldots, \ell^* - 1$; $C^{h-1}_{\ell^*} = C^h_{\ell^*} \cup C^h_{\ell^*+1}$; and $C^{h-1}_{\ell-1} = C^h_\ell$ if $\ell = \ell^* + 2, \ldots, h$.
   (c) Let the $(h-1)$-th (ordinal) classification vector be $c^{h-1} = (c^{h-1}(1), \ldots, c^{h-1}(k))$, where $c^{h-1}(i) = \ell \iff i \in C^{h-1}_\ell$.
5. Define $C^{FAST} \equiv \{c^{g_{\min}}, \ldots, c^{g_{\max}}\}$.

We also present a strategy for obtaining the set $C^{FAST}$ from the RD matrix $R^{k \times n}$. Analogously to several alternative methods (cf. Section 3.1), we associate a unique classification vector $c^g$ with each $g$ from $g_{\min}$ to $g_{\max}$. This procedure is described in Algorithm 1. The approach is based on hierarchical clustering (see Section 2.3). As before, for a specified cluster partition $\mathcal{C}^g \equiv \{C^g_1, \ldots, C^g_g\}$, the (pairwise intercluster) dissimilarity between clusters $C^g_\ell$ and $C^g_{\ell'}$ is denoted by $D(\ell, \ell')$. Here, we have adopted the complete linkage (cf. Eq. (4)). Additionally, $\bar{Y}_j(\ell)$ is the average of the $j$-th feature of cluster $C^g_\ell$ (cf. Eq. (3)). Algorithm 1 generates estimates of the classification vector conditional on a pre-specified number of clusters $g$ for all $g = g_{\min}, \ldots, g_{\max}$.
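A compact R sketch of Algorithm 1 is given below; it builds the ordinal classification vectors that define $C^{FAST}$ from the mean ranks, merging only consecutive clusters under complete linkage. The function name is ours, and the sketch is a plain reading of the steps above rather than the authors' code.

```r
# Algorithm 1 (sketch): ordinal classification vectors c^{g_min}, ..., c^{g_max}
# obtained from the mean ranks of a k x n rank matrix R.
ordinal_classifications <- function(R, g_min = 1, g_max = nrow(R)) {
  k <- nrow(R)
  mean_rank <- rowMeans(R)                     # unidimensional feature Y_1(i)
  clusters  <- as.list(order(mean_rank))       # step 3: unit clusters, best object first
  out <- list()
  store <- function(cl) {                      # classification vector from a partition
    cc <- integer(k)
    for (l in seq_along(cl)) cc[cl[[l]]] <- l
    cc
  }
  if (k <= g_max) out[[as.character(k)]] <- store(clusters)
  for (h in k:(g_min + 1)) {
    # step 4(a): complete-linkage dissimilarity between consecutive clusters
    D <- sapply(seq_len(h - 1), function(l) {
      max(abs(outer(mean_rank[clusters[[l]]], mean_rank[clusters[[l + 1]]], "-")))
    })
    l_star <- which.min(D)                     # most similar consecutive pair
    clusters[[l_star]] <- c(clusters[[l_star]], clusters[[l_star + 1]])  # step 4(b)
    clusters[[l_star + 1]] <- NULL
    if (h - 1 <= g_max) out[[as.character(h - 1)]] <- store(clusters)    # step 4(c)
  }
  out                                          # named list: element "g" is c^g
}
```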
The HPL and FHPL methods are applied to several artificial datasets in Section 5 (a real dataset is also investigated in Section 6). All the procedures are performed on one of the cores of an Intel i7-5500U in a 64-bit computer with 8 GB of RAM running Windows 10, v. 1803. We utilize the R (3.4.0) language.

5. Simulations

To assess the performances of the HPL and FHPL for estimating the number of clusters, we conducted a large simulation study in which 25,200 artificial RD matrices were analyzed, each of which was generated from a specified simulated SD matrix (cf. Eq. (1)). The 25,200 synthetic SD matrices were also analyzed using the 10 alternative methods previously discussed: Calinski & Harabasz (CH), gap, Hartigan (Har), instability (ins), jump (jum), Krzanowski & Lai (KL), quantization error modeling (qem), silhouette (sil), slope (sl0, sl1, sl2, sl3, and sl4 denote the slope method with $p = 0, 1, 2, 3$, and 4, respectively), and utility (uti). Section 5.1 describes the scenarios and evaluation metrics adopted. Sections 5.2 and 5.3 present the main results obtained for scenarios with few objects ($k = 4$ and 5) and with many objects ($k = 20, 50, 100, 200, 500$, and 1000), respectively.

5.1. Scenarios

To simulate the SD matrices, we considered four classes of dgps: PL (PL), PL with noise (PLN), Thurstone (T), and Thurstone with noise (TN). We specified distinct values for the number of objects $k$ ($k = 4, 5, 20, 50, 100, 200, 500$, and 1000), the scalar parameter $s$ ($s = 1/8, 1/3$, and $1/2$), the sample size $n$ ($n = 50$ and 200), and the number of clusters $g$. If the number of objects is small ($k = 4$ or 5), $g$ varies from 1 to $k$; otherwise, $g$ varies from 1 to 10. A generic configuration or setup corresponds to a 5-tuple $(dgp, k, s, n, g)$. In total, we analyzed 1,656 distinct configurations (216 setups in the small-$k$ case and 1,440 setups in the large-$k$ case).

For each setup $(dgp, k, s, n, g)$, we generated 50 (10) replications of the SD matrices when $k$ was small (large). We assumed that the scores $S_j(i)$, $i = 1, \ldots, k$, $j = 1, \ldots, n$, were (realizations of) independent random variables. Before obtaining the scores, we simulated a distribution vector $d = (d(1), \ldots, d(g))$, a compatible classification vector $c = (c(1), \ldots, c(k))$, and a compatible vector $\theta = (\theta(1), \ldots, \theta(k))$ (cf. Section 4). Then, for every $(i, j)$, we defined the score by $S_j(i) = \ln(\theta(i)) + \epsilon_{ij}$, where the errors $\epsilon_{ij}$ are iid. Under the PL dgp, $\epsilon_{ij} \sim \text{Gumbel}(0, 1)$, whereas $\epsilon_{ij} \sim N(0, \pi^2/6)$ (a zero-mean normal distribution) under the Thurstone dgp. Finally, under the remaining dgps (PLN and TN), we alternatively assumed that $\epsilon_{ij} = \xi_{ij} + \sqrt{2} \cdot \zeta_{ij} \cdot \omega_{ij}$, where $\xi_{ij}$, $\zeta_{ij}$, and $\omega_{ij}$ are jointly independent and distributed as follows (a simulation sketch is given after the list):

(i) $\omega_{ij} \sim$ Ber(0.5);

(ii) $\xi_{ij} \sim \text{Gumbel}(0, 1)$ in the PLN dgp, whereas $\xi_{ij} \sim N(0, \pi^2/6)$ in the TN dgp;

(iii) $\zeta_{ij}$ and $\xi_{ij} - E[\xi_{ij}]$ are equally distributed (where $E[\xi_{ij}]$ is the expected value of $\xi_{ij}$).
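The R sketch below generates scores under the four dgps exactly as described, assuming $\theta$ has already been built from $(d, c, s)$ as in Section 4; the function name is ours.

```r
# Score-generating processes used in the simulations (sketch).
simulate_scores <- function(theta, n, dgp = c("PL", "PLN", "T", "TN")) {
  dgp <- match.arg(dgp)
  k <- length(theta)
  rgumbel <- function(m) -log(-log(runif(m)))                  # Gumbel(0, 1) draws
  base <- if (dgp %in% c("PL", "PLN")) rgumbel else
          function(m) rnorm(m, 0, pi / sqrt(6))                # N(0, pi^2 / 6)
  eps <- base(k * n)
  if (dgp %in% c("PLN", "TN")) {                               # add the noise component
    centered <- if (dgp == "PLN") rgumbel(k * n) + digamma(1)  # subtract Gumbel mean
                else rnorm(k * n, 0, pi / sqrt(6))             # already mean zero
    eps <- eps + sqrt(2) * centered * rbinom(k * n, 1, 0.5)
  }
  matrix(log(theta) + eps, nrow = k)             # S_j(i) = ln(theta(i)) + e_ij
}

# Rankings are then obtained via Eq. (1):
# R <- apply(-simulate_scores(theta, n, "PLN"), 2, rank)
```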

To evaluate the performance of each method in a specified setup, the following evaluation criteria were computed across the replications: (i) the success rate, namely, the percentage of results in which the true number of clusters was identified; (ii) the mean absolute deviation (MAD); and (iii) the normalized mean absolute deviation (NMAD), which is defined here as NMAD $\equiv$ MAD$/(g_{\max} - 1)$, where $g_{\max} = k$ in the small-$k$ case and $g_{\max} = 10$ in the large-$k$ case. Since there are many setups, we averaged the evaluation criteria across all setups (ALL) to summarize them. Additionally, the averages across various subsets of setups (e.g., the set of all setups for which $n = 200$) were obtained.
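For reference, a tiny R helper computing the three criteria for one setup is shown below, assuming g_hat is the vector of estimates across replications and g_true the common true number of clusters.

```r
# Success rate (%), MAD, and NMAD for a single setup (sketch).
evaluate_setup <- function(g_hat, g_true, g_max) {
  mad <- mean(abs(g_hat - g_true))
  c(success_rate = 100 * mean(g_hat == g_true),
    MAD  = mad,
    NMAD = mad / (g_max - 1))
}
```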

5.2. Results for the small-k case

This section reports the main results obtained in the small-$k$ case ($k = 4$ or 5) for the HPL method and all the alternative procedures. Here, the setups cover all the possible choices for the number of clusters ($g$ varies from 1 to $k$). In total, for every pair $(g, k)$, there were 24 distinct setups, and we considered 50 artificial SD/RD matrices per setup.

Table 2 presents the averages of MAD and NMAD. The column "ALL" displays the averages across all the setups, while in the other columns, we restricted the setups. The HPL method realizes the smallest deviations on average regardless of the selected dgp, scale parameter $s$, or number of judges $n$. The HPL deviations decrease as $n$ increases and as $s$ decreases, as expected. Similarly, the HPL method performs better in non-noisy dgps (PL and T).

Table 3 displays, for $k = 4$ and $k = 5$, the average success rates obtained across the setups with a common number of clusters $g$ ($g = 1, \ldots, k$). It is not possible to estimate the true number of clusters when $g = k$ using any alternative method, and there are additional cases in which the number of clusters cannot be correctly estimated. For such cases, the success rates are not available.

To assess the methods, we also computed the success rates averaged over $g$ from $g_{\min}$ to $g_{\max}$ (denoted Av$g_{\min}$to$g_{\max}$). Table 3 also reports these global success rates; similarly, they are not available in various cases. For any $k$ and $g$, if the HPL method is employed, then the percentage of results in which the true number of clusters is identified exceeds 90%.

The HPL success rates often exceed those realized by the alternative methods, with some exceptions (e.g., the gap success rate is 100% when $k = 4$ and $g = 1$). However, for every alternative method and a fixed number of objects, the HPL success rate is exceeded at most once. In relation to the global success rate Av$g_{\min}$to$g_{\max}$, HPL outperforms the remaining methods regardless of the selected values for $g_{\min}$ and $g_{\max}$.

Since similar results were obtained when $k = 4$ and $k = 5$, we focus on the latter case ($k = 5$). Table 4 presents the success rates realized in the least favorable setups ($s = 1/2$ and $n = 50$), namely, the cases in which the HPL method provides less accurate estimates for the number of clusters. Although the HPL success rates decrease (compared with those in Table 3), the HPL method remains unbeaten. Moving forward, we computed Av1to4, Av2to4, and Av2to3 for all 24 combinations of

Table 2
MAD and NMAD in the small-k case: averages/standard errors.

All  dgp = PL  dgp = PLN  dgp = T  dgp = TN  s = 1/8  s = 1/3  s = 1/2  n = 50  n = 200
MAD
HPL 0.09/0.13 0.05/0.07 0.09/0.13 0.07/0.10 0.14/0.17 0.04/0.06 0.08/0.11 0.14/0.17 0.15/0.15 0.02/0.04
CH 0.75/0.60 0.64/0.59 0.88/0.56 0.64/0.64 0.82/0.61 0.52/0.56 0.82/0.62 0.90/0.56 0.78/0.53 0.71/0.67
gap 1.12/0.88 1.09/0.83 1.17/0.93 1.06/0.83 1.17/0.94 0.98/0.73 1.16/0.90 1.23/0.98 1.19/0.93 1.06/0.82
Har 0.80/0.75 0.80/0.76 0.80/0.76 0.80/0.76 0.80/0.76 0.80/0.76 0.80/0.76 0.80/0.76 0.80/0.75 0.80/0.75
ins 0.77/0.71 0.74/0.70 0.79/0.75 0.75/0.70 0.79/0.74 0.71/0.66 0.79/0.74 0.80/0.75 0.78/0.71 0.76/0.72
jum 1.29/1.03 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04
KL 0.46/0.36 0.38/0.35 0.60/0.37 0.36/0.36 0.50/0.35 0.27/0.31 0.49/0.37 0.61/0.34 0.55/0.33 0.37/0.38
qem 1.29/1.03 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04 1.29/1.04
sil 0.81/0.74 0.81/0.75 0.81/0.74 0.80/0.76 0.81/0.75 0.80/0.76 0.81/0.74 0.81/0.74 0.81/0.73 0.80/0.75
sl0 0.89/0.61 0.93/0.61 0.94/0.57 0.84/0.66 0.85/0.61 0.90/0.68 0.88/0.59 0.89/0.56 0.92/0.60 0.86/0.61
sl1 0.85/0.59 0.86/0.61 0.89/0.56 0.80/0.62 0.84/0.59 0.82/0.64 0.84/0.59 0.87/0.56 0.86/0.57 0.83/0.61
sl2 0.84/0.60 0.83/0.62 0.89/0.58 0.79/0.63 0.84/0.60 0.81/0.64 0.84/0.60 0.87/0.57 0.85/0.59 0.82/0.62
sl3 0.84/0.61 0.83/0.62 0.88/0.59 0.80/0.63 0.86/0.61 0.82/0.65 0.84/0.61 0.87/0.58 0.85/0.60 0.83/0.63
sl4 0.85/0.61 0.83/0.63 0.88/0.59 0.82/0.64 0.86/0.61 0.83/0.64 0.84/0.62 0.87/0.59 0.85/0.60 0.84/0.63
uti 1.37/0.89 1.35/0.90 1.39/0.87 1.37/0.92 1.39/0.89 1.27/0.96 1.43/0.84 1.42/0.87 1.34/0.84 1.41/0.94
NMAD
HPL 0.02/0.04 0.01/0.02 0.02/0.04 0.02/0.03 0.04/0.05 0.01/0.02 0.02/0.03 0.04/0.05 0.04/0.04 0.01/0.01
CH 0.20/0.15 0.17/0.15 0.24/0.13 0.17/0.16 0.22/0.15 0.14/0.14 0.22/0.15 0.24/0.13 0.21/0.13 0.19/0.17
gap 0.31/0.23 0.30/0.22 0.33/0.25 0.29/0.22 0.33/0.25 0.27/0.19 0.32/0.24 0.34/0.26 0.33/0.25 0.29/0.22
Har 0.22/0.20 0.22/0.20 0.22/0.20 0.22/0.20 0.22/0.20 0.22/0.20 0.22/0.20 0.22/0.20 0.22/0.20 0.22/0.20
ins 0.21/0.18 0.20/0.18 0.21/0.19 0.20/0.18 0.21/0.19 0.19/0.17 0.21/0.19 0.22/0.19 0.21/0.18 0.20/0.18
jum 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28
KL 0.12/0.09 0.10/0.09 0.16/0.09 0.09/0.09 0.13/0.08 0.07/0.08 0.13/0.09 0.16/0.08 0.15/0.08 0.09/0.09
qem 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28 0.36/0.28
sil 0.22/0.19 0.22/0.20 0.22/0.19 0.22/0.20 0.22/0.19 0.22/0.20 0.22/0.19 0.22/0.19 0.22/0.19 0.22/0.19
sl0 0.25/0.16 0.26/0.16 0.26/0.15 0.24/0.18 0.24/0.16 0.25/0.18 0.24/0.16 0.25/0.15 0.26/0.16 0.24/0.16
sl1 0.24/0.16 0.24/0.16 0.25/0.15 0.22/0.17 0.23/0.15 0.23/0.17 0.23/0.15 0.24/0.14 0.24/0.15 0.23/0.16
sl2 0.23/0.16 0.23/0.16 0.25/0.15 0.22/0.17 0.23/0.16 0.23/0.17 0.23/0.16 0.24/0.15 0.24/0.15 0.23/0.16
sl3 0.23/0.16 0.23/0.16 0.24/0.15 0.22/0.17 0.24/0.16 0.23/0.17 0.23/0.16 0.24/0.15 0.24/0.16 0.23/0.17
sl4 0.23/0.16 0.23/0.17 0.24/0.15 0.23/0.17 0.24/0.16 0.23/0.17 0.23/0.16 0.24/0.15 0.24/0.16 0.23/0.17
uti 0.38/0.24 0.37/0.24 0.38/0.23 0.38/0.25 0.39/0.24 0.35/0.26 0.40/0.22 0.39/0.23 0.37/0.23 0.39/0.25

987
W. Calmon and M. Albi Information Sciences 546 (2021) 977–995

Table 3
Average success rates (%) in the small-k case: Setups grouped by the number of clusters.

HPL CH gap Har ins jum KL qem sil sl0 sl1 sl2 sl3 sl4 uti

k = 4
g = 1 94 – 100 100 – 0 – 0 – 53 34 29 25 20 0
g = 2 91 77 0 0 13 0 80 0 98 65 85 90 92 93 31
g = 3 91 41 0 – 97 100 – 100 0 0 0 0 0 0 55
g = 4 97 – – – – – – – – – – – – – –
Av1to4 93 – – – – – – – – – – – – – –
Av2to4 93 – – – – – – – – – – – – – –
Av1to3 92 – 33 – – 33 – 33 – 39 40 40 39 38 29
Av2to3 91 59 0 – 55 50 – 50 49 32 42 45 46 47 43
k = 5
g = 1 94 – 100 100 – 0 – 0 – 55 32 26 20 19 0
g = 2 92 76 0 0 4 0 76 0 97 64 80 86 89 90 31
g = 3 92 23 0 0 0 0 20 0 1 15 13 12 10 8 36
g = 4 93 12 0 – 99 100 – 100 0 0 0 0 0 0 1
g = 5 97 – – – – – – – – – – – – – –
Av1to5 94 – – – – – – – – – – – – – –
Av2to5 94 – – – – – – – – – – – – – –
Av1to4 93 – 25 – – 25 – 25 – 33 31 31 30 29 17
Av2to4 92 37 0 – 34 33 – 33 32 26 31 32 33 33 23
Av2to3 92 50 0 0 2 0 48 0 49 39 47 49 49 49 34

s, n, and dgp for k = 5. The boxplots in Fig. 1 summarize the global success rates realized by each method, showing that HPL outperforms the other methods regardless of the setup.
To analyze the impacts of the setup variables (dgp, s, and n) on the performance of HPL, we averaged the success rates
across all the setups in which a single setup variable had a fixed value, while the remaining setup variables were varied
(see Table 5). Both the number of judges n and the scalar parameter s affect HPL performance, as expected. The larger n,
the higher is the average success rate. Analogously, the lower s, the higher is the average success rate. With respect to
the selected dgp, the HPL method tends to perform better with non-noisy dgps (PL and T). The average success rates realized
with the Thurstone dgp are comparable to those obtained with the PL dgp. Regarding noisy dgps, namely, PLN and TN, the
HPL method performs better in the former; however, the difference between the average success rates obtained for these
two dgps is almost always less than 4 percentage points.
To conclude the small-k case, the FHPL method was also employed. Its success rates coincided with those realized using the HPL method in 146 of the 216 setups (70%); in each of the remaining setups, the absolute difference between their success rates was less than 3.5 percentage points. Since the performance of HPL and FHPL was approximately the same, we omitted the latter from the reported results. From a theoretical standpoint, the HPL method is ideal because it does not require a clustering algorithm to have been previously run. However, FHPL is substantially faster than HPL. Table 6 presents the average elapsed time (in seconds) for estimating the number of clusters on a modest processing platform (cf. Section 4.4) in the small-k case. Thus, we consider only the FHPL method in the large-k case.

5.3. Results for the large-k case

In the large-$k$ case ($k = 20, 50, 100, 200, 500$, and 1000), the number of clusters $g$ varies from 1 to 10 regardless of the number of objects $k$. In total, there are 60 distinct pairs $(g, k)$, 24 distinct setups per pair $(g, k)$, and 10 artificial SD/RD matrices per setup. We restricted the estimated number of clusters to 10 or less ($g_{\max} = 10$). In our proposed approach, this is easily realized by redefining the prior distribution of $g$ (cf. Eq. (15)) as the uniform distribution over $N_{10}$. As discussed in Section 5.2, we assessed only the alternative methods and the FHPL procedure in the large-$k$ case. Table 7 presents the FHPL running times. Even when $k = 1000$ and $n = 200$, FHPL shows competitive computation times; see, for example, [10,36].
First, we analyze the mean absolute deviations. In the large-k case, NMAD is proportional to MAD; hence, we computed
only the former. Table 8 presents the NMAD averages for the FHPL method and all the alternative methods. The FHPL devi-
ations are, on average, substantially smaller than those exhibited by the best alternative methods under the same criterion
(instability and Krzanowski & Lai).
Table 9 presents the average success rates realized across the setups with a common number of clusters $g$ (for $g = 1, \ldots, 10$). We averaged the success rates across all the setups (Av1to10). As discussed in Section 5.2, it is impossible to estimate the true number of clusters if $g = 1$ using certain methods. Thus, we also present the average across the setups
that contain two or more clusters (Av2to10). The FHPL estimates are equal to the true number of clusters in more than 90%
of cases regardless of the number of clusters. FHPL outperforms the alternative methods in terms of global success rates
(Av1to10 and Av2to10).
For each combination of dgp, $s$, $n$, and $k$, we evaluated the overall averages Av1to10 and Av2to10 (varying only the num-
ber of clusters). The boxplots in Fig. 2 summarize these average success rates. The Krzanowski & Lai method is the best alter-

Table 4
Success rates and averages (%) in the least favorable scenarios (s = 1/2 and n = 50) when k = 5.

HPL CH gap Har ins jum KL qem sil sl0 sl1 sl2 sl3 sl4 uti
dgp: PL
g = 1   92 – 100 100 – 0 – 0 – 56 34 30 26 22 0
g = 2   80 52 0 0 2 0 44 0 94 46 64 68 70 76 6
g = 3   78 12 0 0 0 0 10 0 2 12 2 2 0 2 52
g = 4   86 12 0 – 100 100 – 100 0 0 0 0 0 0 0
g = 5   100 – – – – – – – – – – – – – –
Av1to5 87 – – – – – – – – – – – – – –
Av2to5 86 – – – – – – – – – – – – – –
Av1to4 84 – 25 – – 25 – 25 – 29 25 25 24 25 15
Av2to4 81 25 0 – 34 33 – 33 32 19 22 23 23 26 19
Av2to3 79 32 0 0 1 0 27 0 48 29 33 35 35 39 29
dgp: PL with Noise
g = 1   96 – 100 100 – 0 – 0 – 46 24 14 12 12 0
g = 2   74 24 0 0 0 0 24 0 82 28 50 58 64 68 0
g = 3   68 20 0 0 0 0 22 0 8 8 12 8 8 4 60
g = 4   68 10 0 – 100 100 – 100 0 0 0 0 0 0 4
g = 5   94 – – – – – – – – – – – – – –
Av1to5 80 – – – – – – – – – – – – – –
Av2to5 76 – – – – – – – – – – – – – –
Av1to4 76 – 25 – – 25 – 25 – 20 22 20 21 21 16
Av2to4 70 18 0 – 33 33 – 33 30 12 21 22 24 24 21
Av2to3 71 22 0 0 0 0 23 0 45 18 31 33 36 36 30
dgp: Thurstone
g = 1   92 – 100 100 – 0 – 0 – 58 38 32 28 26 0
g = 2   84 62 0 0 0 0 62 0 96 52 66 78 84 84 6
g = 3   88 10 0 0 0 0 10 0 0 10 8 6 4 0 44
g = 4   94 2 0 – 98 100 – 100 0 0 0 0 0 0 0
g = 5   98 – – – – – – – – – – – – – –
Av1to5 91 – – – – – – – – – – – – – –
Av2to5 91 – – – – – – – – – – – – – –
Av1to4 90 – 25 – – 25 – 25 – 30 28 29 29 28 12
Av2to4 89 25 0 – 33 33 – 33 32 21 25 28 29 28 17
Av2to3 86 36 0 0 0 0 36 0 48 31 37 42 44 42 25
dgp: Thurstone with Noise
g = 1   88 – 100 100 – 0 – 0 – 72 48 36 28 28 0
g = 2   66 48 0 0 0 0 38 0 86 30 56 66 72 76 2
g = 3   66 16 0 0 0 0 18 0 4 10 10 8 4 4 50
g = 4   56 10 0 – 100 100 – 100 0 0 0 0 0 0 14
g = 5   62 – – – – – – – – – – – – – –
Av1to5 68 – – – – – – – – – – – – – –
Av2to5 62 – – – – – – – – – – – – – –
Av1to4 69 – 25 – – 25 – 25 – 28 29 28 26 27 16
Av2to4 63 25 0 – 33 33 – 33 30 13 22 25 25 27 22
Av2to3 66 32 0 0 0 0 28 0 45 20 33 37 38 40 26

native method when g ≠ 1, followed by the quantization error modeling method, which realizes an impressive average success rate when g = 1. Similarly, high success rates are realized by the jump method in various cases. Whereas the average success rates of the alternative methods are commonly below 80%, the average success rates realized by the FHPL method fall below this threshold in only a few cases.
Finally, we computed, for the FHPL method only, the percentage of cases in which the difference between the estimated and true number of clusters (error size) equals a specified value. Since both the estimates and the true values were restricted to vary from 1 to 10, the error size can assume only integer values between -9 and 9. However, we observed no error sizes outside the interval [-4, 7]. Table 10 presents the frequencies obtained in this restricted range. Regardless of the restriction imposed on the setups, the error is typically zero. Moreover, in the cases in which FHPL fails to estimate the number of clusters, the error tends to be -1 or 1. The percentages decrease rapidly as the absolute value of the error size increases.
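The error-size frequencies reported in Table 10 can be tabulated with a few lines of code; the sketch below is illustrative only and uses made-up values.

```python
from collections import Counter

def error_size_distribution(g_true, g_est):
    """Percentage of cases in which (estimate - truth) equals each integer error size."""
    errors = [e - t for e, t in zip(g_est, g_true)]
    counts = Counter(errors)
    n = len(errors)
    return {size: 100.0 * counts[size] / n for size in sorted(counts)}

# Toy example: most estimates correct, occasional errors of -1 or +1.
print(error_size_distribution(g_true=[3, 3, 5, 5, 7, 7, 9, 9],
                              g_est=[3, 2, 5, 5, 7, 8, 9, 9]))
# {-1: 12.5, 0: 75.0, 1: 12.5}
```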

6. Estimating the number of clusters in Formula One data

We applied the HPL and FHPL methods to a real RD dataset from Formula One motor racing. The drivers are the objects to be ranked, and each Grand Prix (GP) corresponds to a distinct judge. Similar RD have been analyzed in the literature, including using the PL model [24,11,23].

Fig. 1. Average success rates (%) across the setups with distinct numbers of clusters when k = 5.

Table 5
HPL average success rates (%) when k = 5: fixed dgp, fixed s, and fixed n.

g = 1   g = 2   g = 3   g = 4   g = 5   Av1to5   Av2to5   Av1to4   Av2to4   Av2to3
varying the dgp
PL 94 94 95 96 100 96 96 95 95 95
PLN 94 92 89 91 99 93 93 92 91 91
T 95 95 95 97 99 96 97 96 96 95
TN 91 87 88 87 91 89 88 88 87 88
varying the scale parameter s
s = 1/8   94 98 97 98 100 97 98 96 98 97
s = 1/3   94 92 92 94 98 94 94 93 93 92
s = 1/2   94 86 87 86 94 89 88 88 87 87
varying the number of judges n
n = 50    92 86 84 86 95 89 88 87 85 85
n = 200   95 99 100 99 100 98 99 98 99 99

Table 6
Running time (in seconds) to estimate the number of clusters using a k × n data matrix.

        (k = 4, n = 50)   (k = 4, n = 200)   (k = 5, n = 50)   (k = 5, n = 200)
HPL     7.489             28.862             52.008            238.954
FHPL    0.556             1.789              0.634             2.020

Table 7
Running time (in seconds) to estimate the number of clusters using a k × n data matrix—FHPL.

          k = 20   k = 50   k = 100   k = 200   k = 500   k = 1000
n = 50    1.264    1.318    1.401     1.605     2.131     2.978
n = 200   4.255    4.461    4.685     5.607     6.840     9.847

RD directly observed in Formula One races were analyzed by [23]. In contrast to their study, for each GP (e.g., the US GP
2019), which is indexed by location (United States) and season (2019), we rank the drivers according to their fastest lap


Table 8
NMAD in the large-k case: averages/standard errors.

All   dgp = PL   dgp = PLN   dgp = T   dgp = TN   s = 1/8   s = 1/3   s = 1/2   n = 50   n = 200
FHPL 0.00/0.02 0.00/0.01 0.01/0.01 0.00/0.01 0.01/0.03 0.00/0.00 0.00/0.01 0.01/0.03 0.01/0.02 0.00/0.00
CH 0.39/0.35 0.37/0.34 0.41/0.37 0.37/0.34 0.41/0.37 0.34/0.31 0.41/0.37 0.41/0.38 0.39/0.35 0.39/0.35
gap 0.33/0.35 0.33/0.35 0.33/0.35 0.33/0.35 0.33/0.35 0.33/0.35 0.33/0.35 0.33/0.35 0.33/0.35 0.33/0.35
Har 0.38/0.40 0.38/0.40 0.38/0.40 0.38/0.40 0.38/0.40 0.38/0.40 0.38/0.40 0.38/0.40 0.38/0.40 0.38/0.40
ins 0.18/0.19 0.13/0.14 0.21/0.21 0.14/0.15 0.22/0.22 0.12/0.11 0.17/0.17 0.24/0.24 0.22/0.22 0.13/0.13
jum 0.47/0.42 0.47/0.42 0.48/0.42 0.46/0.42 0.47/0.42 0.46/0.42 0.47/0.42 0.48/0.42 0.56/0.42 0.38/0.40
KL 0.28/0.34 0.26/0.33 0.30/0.36 0.26/0.33 0.29/0.36 0.22/0.30 0.29/0.35 0.32/0.37 0.29/0.35 0.26/0.34
qem 0.34/0.36 0.32/0.35 0.38/0.37 0.31/0.36 0.37/0.37 0.28/0.34 0.36/0.37 0.39/0.36 0.34/0.36 0.34/0.36
sil 0.42/0.38 0.42/0.38 0.42/0.38 0.42/0.38 0.42/0.38 0.42/0.38 0.42/0.38 0.42/0.38 0.42/0.38 0.42/0.38
sl0 0.36/0.41 0.36/0.41 0.36/0.41 0.36/0.41 0.36/0.41 0.35/0.42 0.36/0.41 0.36/0.40 0.37/0.40 0.35/0.41
sl1 0.37/0.33 0.36/0.33 0.37/0.33 0.36/0.33 0.38/0.33 0.36/0.33 0.37/0.33 0.38/0.33 0.37/0.33 0.37/0.33
sl2 0.37/0.33 0.36/0.33 0.37/0.33 0.37/0.33 0.38/0.33 0.36/0.33 0.37/0.33 0.37/0.33 0.37/0.33 0.37/0.33
sl3 0.37/0.33 0.36/0.34 0.37/0.33 0.37/0.33 0.37/0.34 0.36/0.33 0.37/0.33 0.37/0.33 0.37/0.33 0.37/0.34
sl4 0.36/0.33 0.36/0.34 0.37/0.33 0.37/0.34 0.36/0.34 0.36/0.33 0.37/0.34 0.37/0.33 0.36/0.33 0.37/0.34
uti 0.43/0.33 0.43/0.33 0.43/0.33 0.42/0.33 0.42/0.33 0.42/0.33 0.42/0.33 0.43/0.33 0.40/0.32 0.45/0.33

Table 9
Average success rates (%) in the large-k case: Setups grouped by the number of clusters.

FHPL CH gap Har ins jum KL qem sil sl0 sl1 sl2 sl3 sl4 uti
g = 1    100 – 0 100 – 50 – 77 – 99 24 25 26 27 0
g = 2    98 100 100 0 0 8 97 80 99 44 93 95 96 96 97
g = 3    98 4 0 0 30 12 83 83 0 14 17 16 15 14 1
g = 4    96 0 0 0 4 16 59 39 0 0 0 0 0 0 0
g = 5    95 0 0 0 4 16 60 13 0 0 0 0 0 0 0
g = 6    94 0 0 0 0 18 51 0 0 0 0 0 0 0 0
g = 7    92 0 0 0 0 19 39 0 0 0 0 0 0 0 0
g = 8    91 0 0 0 2 20 35 0 0 0 0 0 0 0 0
g = 9    91 0 0 0 9 14 14 0 0 0 0 0 0 0 0
g = 10   94 0 0 0 77 35 0 0 0 0 0 0 0 0 0
Av1to10 95 – 10 10 – 21 – 29 – 16 14 14 14 14 10
Av2to10 94 12 11 0 14 17 49 24 11 7 12 12 12 12 11

Fig. 2. Average success rates (%) across the setups with distinct numbers of clusters in the large-k case.

times in the associated qualifying sessions. The data were obtained from www.formula1.com/en/results.html in June 2019. Our analysis considered four Formula One seasons (2015–2018), and we restricted the investigation to drivers who participated in all four seasons without changing their constructor team. The selected drivers are Alonso (McLaren), Ericsson (Sauber), Hamilton (Mercedes), Perez (Force India), Raikkonen (Ferrari), Ricciardo (Red Bull), and Vettel (Ferrari). Although 81 GPs were held between 2015 and 2018, we considered only the 72 GPs in which all these drivers participated; however, missing values can be easily incorporated (in the HPL and FHPL methods) by slightly modifying the likelihood function in Eq. (7) to allow the set of objects to vary from judge to judge (cf. [24]).
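To illustrate how such an RD matrix can be assembled from raw timing data, the following sketch converts fastest qualifying lap times into per-GP rank vectors; the lap times shown are invented for illustration and are not the actual Formula One data.

```python
import numpy as np

# Illustrative only: fastest qualifying lap times (seconds) for the 7 selected
# drivers at two hypothetical GPs; a smaller time means a better rank.
drivers = ["Alonso", "Ericsson", "Hamilton", "Perez",
           "Raikkonen", "Ricciardo", "Vettel"]
lap_times = np.array([
    [92.1, 93.4],   # Alonso
    [93.0, 94.1],   # Ericsson
    [90.2, 91.0],   # Hamilton
    [91.8, 92.9],   # Perez
    [90.9, 91.6],   # Raikkonen
    [91.2, 92.0],   # Ricciardo
    [90.5, 91.3],   # Vettel
])

def times_to_ranks(times):
    # Double argsort turns each column (one judge/GP) into a rank vector: 1 = fastest driver.
    return times.argsort(axis=0).argsort(axis=0) + 1

rd_matrix = times_to_ranks(lap_times)       # k x n ranking-data matrix (here 7 x 2)
print(dict(zip(drivers, rd_matrix[:, 0])))  # ranks of the seven drivers at the first GP
```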
Since the number of objects (selected drivers) is seven, the true number of clusters g may assume any integer value
between one and seven. The number of judges (selected GPs) is 72, which exceeds the minimum value considered in the
simulation study. We estimated the number of clusters using the observed (7 × 72) RD matrix by applying both the HPL
and FHPL methods. Fig. 3 presents the logarithms of the marginal posterior distributions of g obtained using these two meth-


Table 10
FHPL error size distribution (%) in the large-k case.

error size   -4   -3   -2   -1   0   1   2   3   4   5   7

varying the dgp
PL – 0.0 0.0 0.8 98.5 0.4 0.1 0.1 0.0 0.0 –
PLN – 0.1 0.2 1.2 94.8 1.7 1.3 0.3 0.2 0.1 0.0
T – – 0.0 0.5 98.1 0.9 0.3 0.0 0.1 0.0 –
TN 0.1 0.2 0.8 2.8 87.5 6.1 2.1 0.3 0.1 0.0 –
varying the scale parameter s
s = 1/8   – – 0.0 0.2 99.7 0.1 – – – – –
s = 1/3   – – 0.0 1.3 96.2 1.5 0.7 0.1 0.1 0.0 0.0
s = 1/2   0.1 0.2 0.7 2.5 88.3 5.2 2.2 0.4 0.2 0.1 –
varying the number of judges n
n = 50    0.0 0.1 0.5 2.5 90.7 3.6 1.8 0.4 0.2 0.1 0.0
n = 200   – – – 0.2 98.8 0.9 0.1 0.0 – – –
varying the number of objects k
k = 20     0.1 0.1 0.7 3.3 92.3 2.7 0.7 0.0 0.0 – –
k = 50     – 0.2 0.3 1.8 94.2 2.9 0.5 0.2 – 0.0 –
k = 100    – – 0.2 1.0 95.7 2.6 0.5 0.0 0.0 – –
k = 200    0.0 0.0 0.0 0.5 96.2 2.3 0.7 0.1 0.0 – –
k = 500    – 0.0 0.2 0.7 96.0 1.5 1.2 0.3 0.1 0.0 –
k = 1000   – 0.0 0.1 0.6 94.1 1.8 2.3 0.5 0.4 0.2 0.0
varying the number of clusters g
g = 1    – – – – 100 – – – – – –
g = 2    – – – – 97.5 1.2 1.2 – 0.1 – –
g = 3    – – – – 97.6 1.4 0.6 0.2 0.2 – 0.1
g = 4    – – – – 95.8 3.3 0.6 – 0.1 0.1 –
g = 5    – – 0.1 1.0 94.9 1.5 1.5 0.4 0.3 0.3 –
g = 6    – – 0.1 1.6 93.6 2.5 1.4 0.5 0.3 – –
g = 7    – – 0.2 1.3 92.2 3.3 2.2 0.8 – – –
g = 8    – 0.1 0.1 2.3 91.2 4.0 2.4 – – – –
g = 9    0.1 0.1 0.6 2.6 91.0 5.5 – – – – –
g = 10   0.1 0.5 1.4 4.4 93.5 – – – – – –

Bold values indicate the average success rate across the restricted setups (fixed dgp, fixed s, fixed n, fixed k, and fixed g).

Fig. 3. Marginal posterior distributions (HPL and FHPL) of the number of clusters in Formula One.

ods. When we vary the number of clusters, the maximal absolute difference between the posterior probabilities obtained from these two curves is less than 5 × 10^-3. In both cases, the estimated number of clusters is seven, since seven is the mode of both posterior distributions (the posterior probability that the true number of clusters is seven is approximately 80%).
We conclude that there are no tied drivers.
The observed proximity between the two marginal posterior distributions (Fig. 3) is not necessarily expected for every
real dataset. To apply the FHPL method, we restricted the parameter space by pre-specifying seven classification vectors—
each associated with a specified value of the number of clusters g (g = 1, ..., 7)—using Algorithm 1. Part (b) of Fig. 4 plots these (estimated) classification vectors. By contrast, as discussed at the end of Section 4.2, it is possible to obtain for each g (g = 1, ..., 7) alternative estimates of the classification vector using the HPL approach (see part (a) of Fig. 4).
If we use either Algorithm 1 (in the FHPL method) or the conditional posterior distribution of the HPL method, we
obtain—for g = 7—the same "estimated" classification vector or statistical ranking: (1st) Hamilton (best driver), (2nd) Vettel,


Fig. 4. Best classification vectors obtained in Formula One for different numbers of clusters: HPL (a) and FHPL (b).

(3rd) Raikkonen, (4th) Ricciardo, (5th) Perez, (6th) Alonso, and (7th) Ericsson. The same classification vector maximizes the
unconditional marginal posterior distribution obtained in the HPL method. When we choose a different value for g (g < 7),
the best classification vectors differ from the specified statistical ranking since ties must be included. In the FHPL method, the
best classification vectors must satisfy an order condition: the classifications of any pair of objects can never be inverted
when we vary g. For example, regardless of the number of clusters, Hamilton is always in the best group, whereas Ericsson
is always in the worst. The same order restriction was observed in the HPL results. Since the posterior probability that the
true number of clusters is at least five is greater than 99.9%, we found strong evidence that Hamilton is the single best dri-
ver. Although there are subtle differences between the best classification vectors obtained using the HPL and FHPL methods
(for g = 3, 4, 5, 6), the estimated number of clusters is the same in both cases, as is the estimated classification vector asso-
ciated with it.
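The order condition can also be verified programmatically. The sketch below checks, for a collection of classification vectors indexed by g, that no pair of objects ever inverts its relative classification; it assumes, for illustration, that group labels run from 1 (best group) to g (worst group).

```python
import numpy as np

def order_condition_holds(classification_vectors):
    """Check that no pair of objects ever inverts its relative classification.

    `classification_vectors` maps each value of g to a vector of group labels
    (assumed here to be 1 = best group, ..., g = worst group), one per object.
    """
    vectors = [np.asarray(v) for v in classification_vectors.values()]
    k = len(vectors[0])
    for i in range(k):
        for j in range(i + 1, k):
            # Sign of the label difference for objects i and j under each g.
            signs = {int(np.sign(v[i] - v[j])) for v in vectors}
            # An inversion occurs only if i is strictly better than j under one g
            # and strictly worse under another.
            if 1 in signs and -1 in signs:
                return False
    return True

# Toy example with 4 objects and nested classifications for g = 1, 2, 3:
print(order_condition_holds({1: [1, 1, 1, 1],
                             2: [1, 1, 2, 2],
                             3: [1, 2, 3, 3]}))  # True
```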

7. Conclusions

This study introduced a novel strategy for estimating the number of clusters, designed to identify the number of groups in a finite population of objects ranked by several distinct judges. This strategy is based on a hierarchical version of the classical PL model in which the number of clusters and other interpretable parameters are introduced. Two distinct methods are derived: HPL is based on less restrictive assumptions, whereas FHPL is much faster since it operates on a restricted parameter space (with a small number of classification vectors to be pre-specified), in a manner analogous to several alternative methods [28,45,31,15,19,26].
Our methods were assessed in a large simulation study. In the small-k case, 10,800 artificial datasets were analyzed using
HPL, FHPL, and all 10 alternative methods. By contrast, in the large-k case (where 14,400 artificial datasets were analyzed),
only FHPL and the 10 alternative methods were employed. Even for dgps with assumptions that differed from the classical PL
model, the proposed methods correctly estimated the number of clusters in most instances. When they failed to identify the
true number of clusters, the difference between the true number and estimated value was typically small. In comparison
with various well-established methods (Calinski & Harabasz, gap, Hartigan, instability, jump, Krzanowski & Lai, quantization
error modeling, silhouette, slope, and utility), our methods realized the highest success rates and smallest errors. Based on these considerations, and despite its simplicity, we thus recommend the FHPL method for datasets with a large number of objects. Indeed, both HPL and FHPL could be equally useful when the number of objects is small.
To demonstrate the methodologies presented in this study, we applied them to a real Formula One dataset that contained RD (derived from the fastest lap times in the associated qualifying sessions) from the seven drivers who participated in all four Formula One seasons (2015–2018) without changing their constructor team. For each possible number of clusters g, we presented estimates of the statistical clas-
sifications of the selected drivers conditional on g. Hamilton (Mercedes) was the best driver in all the cases. The HPL and
FHPL methods provided the same estimate for the number of clusters: seven (indicating no statistical tie between any
two drivers).
Typically, analyses of RD have focused on the heterogeneity of the population of judges [18,35]. In this study, however, we assumed that the rank vectors were iid (homogeneity in the population of judges), similarly to [24]. In future work, it would be desirable to relax this assumption (e.g., the parameters of the PL model may depend on both the judges and the objects; see [1]). Additionally, although specific prior distributions were used in this study, other choices could be made in future studies (e.g., relaxing the assumptions in Eqs. (16)–(18)).

To reduce the computation time of HPL, simulation methods commonly used in Bayesian inference to approximate the
posterior distribution (e.g., [37,8]) should be adopted in future research. Another possibility to be explored further is the
combination of the HPL model proposed here with multistage models [35,1]. Finally, as discussed in Section 4, both the FHPL
method (Algorithm 1) and the HPL method can provide ordered clusters; thus, future researchers could compare the clus-
tering partitions obtained by employing the alternative method presented by [13] with those obtained using our methods.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

CRediT authorship contribution statement

Wilson Calmon: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Data curation, Writing -
original draft, Writing - review & editing, Visualization. Mariana Albi: Conceptualization, Methodology, Validation, Formal
analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgements

We thank the Editor, Witold Pedrycz, and the Referees, whose comments, suggestions, and requests greatly contributed to improving the paper. This manuscript was edited by Elsevier Language Editing services (webshop.elsevier.com).

References

[1] M. Alvo, P. Yu, Statistical Methods for Ranking Data, Springer, New York, 2014.
[2] G. Ball, D. Hall, A clustering technique for summarizing multivariate data, Behavioral Science 12 (2) (1967) 153–155.
[3] J. Bernardo, A. Smith, Bayesian Theory, Wiley, Chichester, UL, 1994.
[4] R. Bradley, M. Terry, Rank analysis of incomplete block designs: I. The method of paired comparisons, Biometrika 39 (3/4) (1952) 324–345.
[5] T. Calinski, J. Harabasz, A dendrite method for cluster analysis, Communications in Statistics-Simulation and Computation 3 (1) (1974) 1–27.
[6] F. Caron, Y. Teh, T. Murphy, Bayesian nonparametric Plackett–Luce models for the analysis of preferences for college degree programmes, The Annals of
Applied Statistics 8 (2) (2014) 1145–1181.
[7] Y.-M. Cheung, H. Jia, Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number, Pattern
Recognition 46 (8) (2013) 2228–2238.
[8] P. Congdon, Applied Bayesian Hierarchical Methods, CRC Press, Boca Raton, FL, 2010.
[9] M. Crispino, E. Arjas, V. Vitelli, N. Barret, A. Frigessi, A Bayesian Mallows approach to nontransitive pair comparison data: how human are sounds?, The Annals of Applied Statistics 13 (1) (2019) 492–519.
[10] R. De Amorim, C. Hennig, Recovering the number of clusters in data sets with noise features using feature rescaling factors, Information Sciences 324
(2015) 126–145.
[11] F. Dong, G. Yin, Maximum likelihood estimation for incomplete multinomial data via the weaver algorithm, Statistics and Computing 28 (5) (2018)
1095–1117.
[12] A. D’Ambrosio, W. Heiser, A distribution-free soft-clustering method for preference rankings, Behaviormetrika 46 (2018) 333–351.
[13] A. D’Ambrosio, C. Iorio, M. Staiano, R. Siciliano, Median constrained bucket order rank aggregation, Computational Statistics 34 (2) (2019) 787–802.
[14] B. Everitt, S. Landau, M. Leese, D. Stahl, Cluster Analysis, fifth ed., Wiley, New York, 2011.
[15] Y. Fang, J. Wang, Selection of the number of clusters via the bootstrap method, Computational Statistics and Data Analysis 56 (3) (2012) 468–477.
[16] A. Filipcic, A. Panjan, N. Sarabon, Classification of top male tennis players, International Journal of Computer Science in Sport 13 (1) (2014) 36–42.
[17] M. Fligner, J. Verducci, Multistage ranking models, Journal of the American Statistical Association 83 (403) (1988) 892–901.
[18] M. Fligner, J. Verducci, Probability Models and Statistical Analyses for Ranking Data, Springer, New York, 1993.
[19] A. Fujita, D. Takahashi, A. Patriota, A non-parametric method to estimate the number of clusters, Computational Statistics and Data Analysis 73 (2014)
27–39.
[20] A. Gordon, Classification, second ed., Chapman and Hall/CRC, 1999.
[21] I. Gormley, T. Murphy, Exploring voting blocs within the Irish electorate: a mixture modeling approach, Journal of the American Statistical Association
103 (483) (2008) 1014–1027.
[22] J.A. Hartigan, Clustering Algorithms, Wiley, New York, 1975.
[23] D. Henderson, L. Kirrane, A comparison of truncated and time-weighted Plackett–Luce models for probabilistic forecasting of Formula One results,
Bayesian Analysis 13 (2) (2018) 335–358.
[24] D. Hunter, MM algorithms for generalized Bradley–Terry models, The Annals of Statistics 32 (1) (2004) 384–406.
[25] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning: With Applications in R, Springer, New York, 2014.
[26] A. Kolesnikov, E. Trichina, T. Kauranne, Estimating the number of clusters in a numerical data set via quantization error modeling, Pattern Recognition
48 (2015) 941–952.
[27] J. Kou, Estimating the number of clusters via the GUD statistic, Journal of Computational and Graphical Statistics 23 (2) (2014) 403–417.
[28] W. Krzanowski, Y. Lai, A criterion for determining the number of groups in a data set using sum-of-squares clustering, Biometrics 44 (1) (1988) 23–34.
[29] J.-S. Lee, S. Olafsson, A meta-learning approach for determining the number of clusters with consideration of nearest neighbours, Information Sciences
232 (2013) 208–224.
[30] P. Lee, P. Yu, Mixtures of weighted distance-based models for ranking data with applications in political studies, Computational Statistics & Data
Analysis 56 (8) (2012) 2486–2500.


[31] J. Liang, X. Zhao, D. Li, F. Cao, C. Dang, Determining the number of clusters using information entropy for mixed data, Pattern Recognition 45 (6) (2012)
2251–2265.
[32] E. Lord, M. Willems, F.-J. Lapointe, V. Makarenkov, Using the stability of objects to determine the number of clusters in datasets, Information Sciences
393 (2017) 29–46.
[33] R. Luce, Individual Choice Behavior: A Theoretical Analysis, Wiley, New York, 1959.
[34] C. Mallows, Non-null ranking models. I, Biometrika 44 (1/2) (1957) 114–130.
[35] J.I. Marden, Analyzing and Modeling Rank Data, Chapman and Hall/CRC, New York, 1995.
[36] M. Masud, J. Huang, C. Wei, J. Wang, I. Khan, M. Zhong, I-nice: A new approach for identifying the number of clusters and initial cluster centres,
Information Sciences 466 (2018) 129–151.
[37] H. Migon, D. Gamerman, F. Louzada-Neto, Statistical Inference: An Integrated Approach, second ed., Chapman and Hall/CRC, London, 2014.
[38] T. Murphy, D. Martin, Mixtures of distance-based models for ranking data, Computational Statistics & Data Analysis 41 (2003) 645–655.
[39] D. Müllensiefen, C. Hennig, H. Howells, Using clustering of rankings to explain brand preferences with personality and socio-demographic variables,
Journal of Applied Statistics 45 (6) (2018) 1009–1029.
[40] R. Plackett, The analysis of permutations, Journal of the Royal Statistical Society: Series C (Applied Statistics) 24 (2) (1975) 193–202.
[41] P. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics
20 (1987) 53–65.
[42] Y. Silva, A. Herthel, A. Subramanian, A multi-objective evolutionary algorithm for a class of mean-variance portfolio selection problems, Expert
Systems with Applications 133 (1) (2019) 225–241.
[43] C. Sugar, G. James, Finding the number of clusters in a dataset: an information-theoretic approach, Journal of the American Statistical Association 98
(463) (2003) 750–763.
[44] L. Thurstone, A law of comparative judgment, Psychological Review 34 (4) (1927) 273–286.
[45] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society: Series B
63 (2) (2001) 411–423.
[46] J. Wang, Consistent selection of the number of clusters via crossvalidation, Biometrika 97 (4) (2010) 893–904.
[47] J. Ward, Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association 58 (301) (1963) 236–244.
[48] Y. Zhang, J. Mandziuk, C.H. Quek, B.W. Goh, Curvature-based method for determining the number of clusters, Information Sciences 415–416 (2017)
414–428.
[49] R. Ünlü, P. Xanthopoulos, Estimating the number of clusters in a dataset via consensus clustering, Expert Systems with Applications 125 (1) (2019) 33–
39.

