
SYNOPSIS

Name of the Candidate Harish Guruprasad Ramaswamy


SR Number 04-04-00-16-12-11-1-08562
Title of the Thesis Design and Analysis of Consistent Algorithms
for Multiclass Learning Problems.
Research Supervisor Prof. Shivani Agarwal
Degree Registered Doctor of Philosophy
Department Computer Science and Automation
Institute Indian Institute of Science, Bangalore

1 Introduction
We consider the broad framework of supervised learning, where one gets examples of objects
together with some labels (such as tissue samples labeled as cancerous or non-cancerous, or
images of handwritten characters labeled with the correct character in a-z), and the goal
is to learn a prediction model which, given a new object, makes an accurate prediction.
The notion of accuracy depends on the learning problem under study and is measured by
an evaluation metric of interest. A supervised learning algorithm is said to be statistically
consistent or simply consistent if it returns an ‘optimal’ prediction model with respect to the
desired evaluation metric in the limit of infinite data. Statistical consistency is a fundamental
notion in supervised machine learning, and therefore the design of consistent algorithms for
various learning problems is an important question.
Consistency has been studied for certain specific multiclass learning problems and eval-
uation metrics, for example, binary classification with the 0-1 loss and F-measure [2, 29,
14, 11, 16]; multiclass classification with the 0-1 loss [28, 24]; ranking with pairwise dis-
agreement (PD) and normalized discounted cumulative gain (NDCG) losses [7, 3, 21, 5].
However, the unifying theme in these problems is largely unexplored. In this thesis, we give
the foundations of a unified framework for studying the problem of consistency for a gen-
eral multiclass learning problem, thereby generalizing many known past results for specific
learning problems.
A majority of multiclass learning problems in practice use an evaluation metric based on
a loss matrix (e.g. the fraction of wrongly classified instances is based on the 0-1 loss ma-
trix), and most prevalent algorithms for such problems are surrogate minimizing algorithms,
which are characterized by a surrogate loss (e.g. the support vector machine is a surrogate
minimizing algorithm with the hinge surrogate loss). If the surrogate loss is convex, then the
resulting surrogate minimizing algorithm can be framed as a convex optimization problem
and be solved efficiently.
In the first part of the thesis, we study calibrated surrogate losses, which lead to
consistent surrogate minimizing algorithms for a given loss matrix. This is the largest and
most important part of the thesis, and can be seen to have three distinct sections. The first
section gives necessary conditions and sufficient conditions under which calibration happens,
based on geometric properties of the surrogate loss and true loss. The second section defines
a new notion called the convex calibration dimension that characterizes the intrinsic difficulty
of achieving consistency for a learning problem. The third section gives a generic procedure
to construct convex calibrated surrogates, and considers two ways to weaken the notion
of consistency, which are applicable when the generic procedure does not yield an efficient
algorithm.
In the second part of the thesis, we study calibrated surrogates for a specific family of loss
matrices that arise in hierarchical classification. First, we reduce the problem of hierarchical
classification to the problem of multiclass classification with an abstain option. We then
derive novel convex calibrated surrogates for the problem of multiclass classification with an
abstain option, and use these as a sub-routine to construct novel convex calibrated surrogates
for hierarchical classification.
Several evaluation metrics, like the F-measure used in information retrieval and Har-
monic mean measure used in classification problems with severe class imbalance, cannot be
expressed using a loss matrix and require a more general formulation.
In the third part of our work, we consider such evaluation metrics that are based on com-
plicated functions of the confusion matrix and design consistent algorithms. In particular,
we take an optimization viewpoint which shows that finding the classifier with the smallest
such error is equivalent to a finite dimensional optimization problem, and use this view-
point to construct efficient consistent algorithms for a large family of such general evaluation
metrics.

2 Setup
Let the instance space be X (e.g. the set of all possible handwritten character images), and
let the finite set in which the class labels lie, be identified with [n] = {1, 2, . . . , n} (e.g. in the
character recognition example n = 26). Let the set of possible predictions on an instance,
be identified with [k] = {1, 2, . . . , k}. For example, in the standard character recognition
case we have k = n = 26, but in a variant of this setting, where the character recognizer is
required to output (say) the two top matches for a given image, we have k = 26 × 25.
For most of the thesis, we shall consider evaluation metrics based on a loss matrix ` :
[n] × [k] → R_+ . For a distribution D over the instance-label space X × [n], the error metric
for a [k]-valued classifier h (which is simply a function h : X → [k]) is called the `-risk and is
given as

    er_D^`[h] = E_{(X,Y)∼D}[`(Y, h(X))] .

The value `(y, t) for some y ∈ [n], t ∈ [k] represents the loss on predicting t when the true
label is y. For example, the simple 0-1 loss is given as `(y, t) = 1(y ≠ t). For more examples,
see Figure 1. Any classifier that achieves the smallest possible `-risk is called an `-Bayes
classifier.

  (a) `_0-1            (b) `_abs            (c) `_?
       1  2  3              1  2  3              1  2  3   4
   1   0  1  1          1   0  1  2          1   0  1  1  1/2
   2   1  0  1          2   1  0  1          2   1  0  1  1/2
   3   1  1  0          3   2  1  0          3   1  1  0  1/2

Figure 1: Example loss matrices with rows representing the class labels and columns rep-
resenting predictions. (a) Zero-one loss with n = k = 3. (b) Absolute difference loss with
n = k = 3. (c) Abstain loss with n = 3 and k = 4.
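To make the loss-matrix formalism concrete, here is a minimal Python sketch (an illustrative addition, not taken from the thesis) that encodes the three matrices of Figure 1 as arrays and evaluates the empirical `-risk of a set of predictions; labels and predictions are 0-indexed in the code, whereas the text indexes them from 1.

```python
import numpy as np

# Loss matrices from Figure 1; rows are true labels y, columns are predictions t.
L_01 = np.array([[0, 1, 1],
                 [1, 0, 1],
                 [1, 1, 0]], dtype=float)            # zero-one loss, n = k = 3
L_abs = np.array([[0, 1, 2],
                  [1, 0, 1],
                  [2, 1, 0]], dtype=float)           # absolute difference loss
L_abstain = np.array([[0, 1, 1, 0.5],
                      [1, 0, 1, 0.5],
                      [1, 1, 0, 0.5]])               # abstain loss, n = 3, k = 4

def empirical_risk(L, y_true, y_pred):
    """Average loss L[y_i, t_i] over a sample: a plug-in estimate of the l-risk."""
    return float(np.mean([L[y, t] for y, t in zip(y_true, y_pred)]))

# A classifier that always abstains (column 3 of the abstain loss) has risk 1/2.
print(empirical_risk(L_abstain, y_true=[0, 1, 2, 1], y_pred=[3, 3, 3, 3]))   # 0.5
```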

The objective of a learning algorithm, in solving a learning problem given by a loss matrix
`, is simply to use a training set S = {(x1 , y1 ), . . . , (xm , ym )}, with each (xi , yi ) ∈ X × [n]
drawn i.i.d. from the distribution D, to return a classifier h with low `-risk. A learning
algorithm is said to be consistent with respect to ` (or simply `-consistent), if the `-risk of the
returned classifier approaches the minimum possible `-risk as the sample size m approaches
infinity.
The standard algorithm of `-empirical risk minimization (`-ERM), which minimizes the
`-risk over a set of classifiers H_m ⊆ [k]^X with the distribution D being the empirical
distribution induced by the sample S, can be shown to be consistent w.r.t. ` if the sequence
of classifier sets H_m is chosen appropriately. However, due to the discrete nature of the
optimization problem involved, the `-ERM algorithm is computationally hard to implement.
The prevalent way to circumvent the above issue is to minimize a continuous and convex
surrogate loss instead of the discrete true loss, for example the SVM algorithm for binary
classification minimizes the hinge loss instead of the 0-1 loss. Mathematically, the surrogate
ψ is a function given as ψ : [n] × R^d → R_+ , with surrogate dimension d ∈ Z_+ . The ψ-risk of
a function f : X → R^d is given as

    er_D^ψ[f] = E_{(X,Y)∼D}[ψ(Y, f(X))] .

Unlike the case of the `-ERM, the surrogate empirical risk minimization algorithm ψ-ERM,
which minimizes the empirical version of the ψ-risk over a function class F_m ⊆ (R^d)^X , can
be done efficiently if ψ is convex and F_m is chosen appropriately. The result of such a min-
imization, a function f : X → R^d , is converted into a classifier from X to [k] by composing
it with a predictor pred : R^d → [k]. For example, in the SVM algorithm for binary classi-
fication where d = 1, the predictor pred is simply based on the sign of the real value; in
an n-class classification problem most surrogate minimizing algorithms (e.g. One-vs-all [22],
Crammer-Singer [6]) have d = n, and the predictor of choice is simply the argmax function,
i.e. pred(u) = argmax_{i∈[n]} u_i for all u ∈ R^n . The key question then is the consistency w.r.t.
` of such a procedure.
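As an illustration of this pipeline, the following sketch trains a linear scorer f by minimizing a one-vs-all logistic surrogate (so d = n) with plain gradient descent on synthetic data, and then composes it with the argmax predictor. The data, the particular surrogate, the step size and the iteration count are illustrative assumptions of this sketch, not choices made in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-class data in R^2 (purely illustrative).
n, dim, m = 3, 2, 300
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
Y = rng.integers(0, n, size=m)
X = means[Y] + rng.normal(scale=0.7, size=(m, dim))

# One-vs-all logistic surrogate: psi(y, u) = sum_i log(1 + exp(-s_i u_i)),
# where s_i = +1 if i == y and -1 otherwise. Its empirical risk is minimized
# over linear functions f(x) = W x by gradient descent.
S = np.where(np.arange(n)[None, :] == Y[:, None], 1.0, -1.0)   # m x n sign matrix
W = np.zeros((n, dim))
lr = 0.5
for _ in range(500):
    U = X @ W.T                       # scores f(x_j) in R^n for every example
    G = -S / (1.0 + np.exp(S * U))    # gradient of the surrogate w.r.t. the scores
    W -= lr * (G.T @ X) / m

# The learned real-valued function is turned into a classifier with pred = argmax.
preds = np.argmax(X @ W.T, axis=1)
print("training 0-1 error:", np.mean(preds != Y))
```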
A surrogate ψ and predictor pred are said to be calibrated with respect to a loss ` (or
simply `-calibrated) if there exists a monotone increasing function ξ : R_+ → R_+ continuous
at 0 with ξ(0) = 0 such that for all f : X → R^d ,

    er_D^`[pred ◦ f] − inf_{h'} er_D^`[h'] ≤ ξ( er_D^ψ[f] − inf_{f'} er_D^ψ[f'] ) .        (1)

The surrogate ψ by itself is said to be `-calibrated if there exists some predictor pred along with
which it is `-calibrated. Such a bound is called an excess risk bound. In particular this
implies that, if the ψ-risk of the function fm approaches the minimum ψ-risk, the `-risk
of the classifier pred ◦ fm approaches the minimum `-risk. Thus, the surrogate minimizing
algorithm can be shown to be consistent with respect to `. Studying calibrated surrogates
for general loss matrices and certain specific loss matrices is the main focus in the first and
second parts of the thesis.
In the third and final part of the thesis, we consider more general evaluation metrics
than those which can be expressed using loss matrices. In particular, we consider evaluation
metrics that can be expressed as an arbitrary function of the confusion matrix of the classifier.
Consistency can be defined for such evaluation metrics as well in a natural way. In addition
to loss matrix based evaluation metrics, which arise as a special case if the evaluation metric
is a linear function of the confusion matrix, these general evaluation metrics also include
performance measures like the F-measure used in information retrieval and the harmonic
mean measure used in classification under settings of class imbalance.

3 Contributions of the Thesis


The main contributions of the thesis can be divided into three parts, the details of which are
given below.

3.1 Consistency and Calibration – Conditions, Difficulty and Construction

Consistency with respect to loss matrix based evaluation metrics for surrogate minimizing
algorithms essentially reduces to calibration of the surrogate w.r.t. the loss matrix. In the
first part of the thesis, we give several results on calibration for a general learning problem
given by an arbitrary loss matrix, as opposed to most past work, which gives results on cali-
bration for a particular learning problem/loss matrix. We also demonstrate the applicability
of these results by instantiating them to specific loss matrices in some practical learning
problems. This part of the thesis can be further divided into the following three sections.

3.1.1 Conditions for Calibration


The question

“When is a given surrogate calibrated w.r.t. a given loss matrix?”

has been studied for specific loss matrices, like the 0-1 loss in binary and multiclass classifica-
tion [2, 24] and the pairwise disagreement and NDCG loss in ranking [7, 21]. We answer this
question for a general loss matrix, by giving necessary conditions and sufficient conditions
for calibration [17].

We define a property of the loss matrix known as its trigger probability sets, which are simply
a collection of k subsets of the set of all distributions over [n], indicating which predictions
t ∈ [k] are optimal (w.r.t. `) under different conditional probability distributions over the
class labels (recall [n] is the set of classes, and [k] is the set of predictions). Formally, for
any t ∈ [k] the trigger probability set associated with it is
    Q_t^` = { p ∈ ∆_n : ⟨p, `_t⟩ = min_{t'∈[k]} ⟨p, `_{t'}⟩ } ,

where ∆_n is the set of all distributions over [n], and `_t = [`(1, t), . . . , `(n, t)]^T ∈ R_+^n is the
vector of losses for predicting t ∈ [k]; it can also be viewed as the t-th column of the loss
matrix `.
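For intuition, the trigger probability sets of a small loss matrix can be explored numerically: a distribution p lies in Q_t^` exactly when column t minimizes ⟨p, `_{t'}⟩ over t'. A minimal illustrative sketch, using the abstain loss of Figure 1 with 0-indexed columns (so column 3 is the abstain option):

```python
import numpy as np

def trigger_sets_containing(L, p, tol=1e-12):
    """Return every prediction t whose trigger probability set Q_t contains p,
    i.e. every column t of the n x k loss matrix L minimising <p, l_t>."""
    risks = p @ L                                    # <p, l_t> for each column t
    return {int(t) for t in np.flatnonzero(risks <= risks.min() + tol)}

# Abstain loss of Figure 1 (0-indexed; column 3 is the abstain option).
L_abstain = np.array([[0, 1, 1, 0.5],
                      [1, 0, 1, 0.5],
                      [1, 1, 0, 0.5]])
print(trigger_sets_containing(L_abstain, np.array([0.6, 0.3, 0.1])))    # {0}: predict label 0
print(trigger_sets_containing(L_abstain, np.array([0.4, 0.35, 0.25])))  # {3}: abstaining is optimal
```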
Analogous to the trigger probabilities of a loss matrix, one can define positive normals
[24] of a surrogate, which is also a collection of subsets of the set of all distributions over
[n], indicating what (surrogate) predictions u ∈ R^d are optimal (w.r.t. ψ) under different
conditional probability distributions over the class labels (recall d is the dimension of the
surrogate). Formally, for any u ∈ R^d the positive normal set associated with it is

    N_ψ(u) = { p ∈ ∆_n : ⟨p, ψ(u)⟩ = inf_{u'∈R^d} ⟨p, ψ(u')⟩ } ,

where ψ(u) = [ψ(1, u), . . . , ψ(n, u)]^T ∈ R_+^n is the vector of surrogate losses for predicting
u ∈ R^d .
We give necessary conditions and sufficient conditions for a surrogate ψ to be calibrated
w.r.t. a loss matrix ` based on the trigger probabilities of ` and positive normals of ψ.
This is covered in Chapter 3 of the thesis.

3.1.2 Convex Calibration Dimension


A natural question to ask is whether some learning problems are ‘easier’ than others; in
other words,
What is the difficulty of attaining consistency (using surrogate minimizing algorithms) for
the learning problem given by loss matrix `?
Optimizing a surrogate with dimension d requires computing d real valued functions over the
instance space X , and hence the difficulty of optimization in the surrogate ERM increases
with d. Also, given that only convex surrogates lead to convex optimization problems^1,
the smallest d such that there exists a convex `-calibrated surrogate with dimension d, is a
natural notion measuring the intrinsic difficulty of achieving consistency w.r.t. `. We call
this the convex calibration dimension of the loss matrix [17]. Formally, it is denoted as
CCdim(`) and given as

    CCdim(`) = min{ d ∈ Z_+ : ∃ a convex surrogate ψ : [n] × R^d → R_+ that is `-calibrated } .




^1 We really don’t have any other large class of optimization problems with efficient algorithms.

We give lower bounds for this object based on the geometric properties of the trigger
probabilities of the loss matrix, and an upper bound based on the rank of the loss matrix.

    CCdim(`) ≤ rank(`) ,
    CCdim(`) ≥ ||p||_0 − 1 − µ_{Q_t^`}(p)   ∀ t ∈ [k], p ∈ Q_t^` ,

where rank(`) is the linear algebraic rank of ` viewed as a matrix in R_+^{n×k} , ||p||_0 is the number
of non-zero entries in p, and µ_Q(p) for any set Q ⊆ R^n and p ∈ Q is simply the dimension
of the smallest face of Q containing p.
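The rank-based upper bound is easy to compute for concrete loss matrices; a small illustrative sketch (the lower bound, which needs the face dimensions of the trigger probability sets, is not computed here). These tiny examples all have full rank; the bound becomes more interesting for the low-rank ranking losses discussed below.

```python
import numpy as np

# Upper bound CCdim(l) <= rank(l) for the example loss matrices of Figure 1.
L_01 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
L_abs = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
L_abstain = np.array([[0, 1, 1, 0.5], [1, 0, 1, 0.5], [1, 1, 0, 0.5]])

for name, L in [("0-1", L_01), ("absolute", L_abs), ("abstain", L_abstain)]:
    print(name, "loss: rank upper bound on CCdim =", np.linalg.matrix_rank(L))
```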
Simply by definition of CCdim(`), we immediately have that upper bounds give existence
results, and lower bounds give impossibility results. We apply these bounds to several
label/subset ranking losses such as normalized discounted cumulative gain (NDCG), mean
average precision (MAP) and pairwise disagreement (PD) and get some interesting existence
and impossibility results that generalize several previously known results [5, 7, 3, 21].
This is covered in Chapter 4 of the thesis.

3.1.3 Generic Convex Calibrated Surrogates


The upper bound on CCdim(`) given above immediately shows the existence of a rank(`)-
dimensional convex `-calibrated surrogate, but it does not give an explicit construction. A
natural question in this case is

Can one construct an explicit rank(`)-dimensional convex `-calibrated surrogate ψ and
predictor pred?

We answer the above question affirmatively [18, 20]. We even give an excess risk bound
of the form in Equation (1) for the constructed surrogate ψ and predictor pred. Under an
appropriate setting, the surrogate ψ takes the form of a least-squares style surrogate, with
the predictor simply corresponding to a discrete optimization problem. More precisely, if
the loss matrix ` is of the form

    `(y, t) = ⟨a_y , b_t⟩ + c_y + d_t ,

for some vectors a_1 , a_2 , . . . , a_n ∈ [0, 1]^d and b_1 , b_2 , . . . , b_k ∈ R^d and scalars c_1 , c_2 , . . . , c_n ∈ R
and d_1 , d_2 , . . . , d_k ∈ R, then the surrogate ψ : [n] × R^d → R_+ and predictor pred : R^d → [k]
given by

    ψ(y, u) = Σ_{i=1}^{d} (u_i − a_{y,i})^2 ,
    pred(u) ∈ argmin_{t∈[k]} ⟨u, b_t⟩ + d_t ,

form an `-calibrated (surrogate, predictor) pair.
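The construction can be sketched directly in code. In the snippet below (an illustrative sketch with a randomly generated low-rank decomposition and 0-indexed labels), the least-squares surrogate regresses onto a_y, and the mechanism behind calibration is visible in the final check: at the pointwise ψ-risk minimizer u* = Σ_y p_y a_y, the predictor returns a prediction that also minimizes the conditional `-risk under p.

```python
import numpy as np

def psi(y, u, A):
    """Least-squares style surrogate: psi(y, u) = sum_i (u_i - a_{y,i})^2."""
    return float(np.sum((u - A[y]) ** 2))

def pred(u, B, d_vec):
    """Predictor: any t minimising <u, b_t> + d_t (a discrete optimisation over [k])."""
    return int(np.argmin(B @ u + d_vec))

# Illustrative low-rank loss: l(y, t) = <a_y, b_t> + c_y + d_t with a random decomposition.
rng = np.random.default_rng(1)
n, k, d = 4, 6, 2
A = rng.uniform(size=(n, d))              # a_y in [0, 1]^d
B = rng.normal(size=(k, d))               # b_t in R^d
c = rng.normal(size=n)
dv = rng.normal(size=k)
L = A @ B.T + c[:, None] + dv[None, :]    # the n x k loss matrix

# Sanity check of the calibration mechanism: at the pointwise psi-risk minimiser
# u* = sum_y p_y a_y, the predictor returns a minimiser of the conditional l-risk.
p = rng.dirichlet(np.ones(n))
u_star = p @ A
cond_risks = p @ L
assert np.isclose(cond_risks[pred(u_star, B, dv)], cond_risks.min())
print("pred at u*:", pred(u_star, B, dv), " conditional l-risks:", np.round(cond_risks, 3))
```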
We apply this surrogate and predictor to several ranking and multilabel prediction losses
which have large label and prediction spaces (i.e. large n and k) but small rank. In some cases
this yields efficient surrogates and predictors, but in some cases like the PD loss and MAP
loss in ranking, it gives an efficient surrogate but a complicated predictor whose complexity
scales badly with k, thus precluding an overall efficient algorithm.
In order to circumvent the above difficulty, we relax the requirement of consistency to a
weaker notion of consistency. We consider two weak notions of consistency called consistency
under noise conditions and approximate consistency, and show that in many cases including
the PD and MAP losses, one can get efficient surrogates and predictors, if the requirements
of consistency are relaxed to one of these weak notions of consistency.
This is covered in Chapters 5 and 6 of the thesis.

3.2 Applications to Hierarchical Classification


Hierarchical classification is an important learning problem in which there is a pre-defined
hierarchy over the class labels, and has been the subject of many studies [26, 4, 8, 1]. A
natural loss matrix in this case is simply based on the tree distance between the class labels.
The tree distance loss matrix `_H : [n] × [n] → R_+ , for a tree H over [n], is simply given as

    `_H(y, y') = path length in H between y and y' .
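For a concrete hierarchy, this loss matrix can be built by breadth-first search from every node; a minimal sketch (the 5-node hierarchy is a hypothetical example, with nodes 0-indexed):

```python
import numpy as np
from collections import deque

def tree_distance_loss(adj):
    """Loss matrix l_H(y, y') = path length in the tree H between nodes y and y'.

    adj is an adjacency list {node: [neighbours]} over nodes 0..n-1."""
    n = len(adj)
    L = np.zeros((n, n))
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:                      # BFS gives exact path lengths in a tree
            v = q.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
        L[src] = [dist[t] for t in range(n)]
    return L

# Hypothetical hierarchy: 0 is the root with children 1 and 2; 2 has children 3 and 4.
adj = {0: [1, 2], 1: [0], 2: [0, 3, 4], 3: [2], 4: [2]}
print(tree_distance_loss(adj))
```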

Despite the importance and popularity of hierarchical classification, the following ques-
tion has not been studied in past work.

Are there efficient convex calibrated surrogates for the tree distance loss?

In the second part of the thesis, we answer this question positively [19], by constructing a
family of efficient convex calibrated surrogates for the tree distance loss.
We show that the Bayes classifier for the tree distance loss is the classifier which predicts
the deepest node whose sub-tree has a conditional probability of at least half. More
precisely, a classifier g^* : X → [n] satisfying the conditions below for all x ∈ X is a Bayes
classifier for the tree distance loss.
    Σ_{i ∈ Des(g^*(x))} p_i(x) ≥ 1/2 ,
    Σ_{i ∈ Des(y)} p_i(x) ≤ 1/2   ∀ y ∈ Chd(g^*(x)) ,

where p_i(x) is the conditional probability, according to the distribution D, of drawing label
i ∈ [n] given that the instance is x ∈ X , and Des(y), Chd(y) are the descendants and
children respectively of a node y ∈ [n] in the tree H.
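The characterization above suggests a simple greedy computation of the Bayes classifier: start at the root and keep descending into a child whose subtree carries conditional probability more than 1/2 (at most one such child can exist), stopping when no child does. A minimal sketch of this computation, under the assumption that Des(y) denotes the subtree rooted at y including y itself, on the same hypothetical 5-node hierarchy with 0-indexed nodes:

```python
import numpy as np

def bayes_tree_distance(p, children, root=0):
    """Bayes classifier for the tree distance loss: the deepest node whose
    subtree has conditional probability at least 1/2.

    p[i] is P(Y = i | X = x); children maps a node to its list of children."""
    def subtree_mass(v):              # probability mass of the subtree rooted at v
        return p[v] + sum(subtree_mass(c) for c in children.get(v, []))

    node = root                       # the root's subtree always has mass 1 >= 1/2
    while True:
        heavy = [c for c in children.get(node, []) if subtree_mass(c) > 0.5]
        if not heavy:
            return node               # every child subtree has mass <= 1/2
        node = heavy[0]               # at most one child can exceed mass 1/2

# Hypothetical hierarchy from above: root 0 with children 1, 2; node 2 with children 3, 4.
children = {0: [1, 2], 2: [3, 4]}
print(bayes_tree_distance(np.array([0.05, 0.10, 0.15, 0.40, 0.30]), children))  # -> 2
```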
We consider a separate but related loss matrix called the abstain loss, used in the problem
of multiclass classification with an abstain option, in which the prediction space contains the
set of class labels and a special option (denoted by n + 1) called the ‘abstain’ option. It is
given as `_? : [n] × [n + 1] → R_+ ,

    `_?(y, t) =  0     if y = t ,
                 1/2   if t = n + 1 ,
                 1     otherwise .

Its Bayes classifier h^* : X → [n + 1] is given as

    h^*(x) =  argmax_{y∈[n]} p_y(x)   if max_{y∈[n]} p_y(x) ≥ 1/2 ,
              n + 1                   otherwise .
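A direct sketch of this Bayes classifier (0-indexed labels, with index n playing the role of the abstain option n + 1):

```python
import numpy as np

def bayes_abstain(p, n):
    """Bayes classifier for the abstain loss: predict argmax_y p_y if that
    probability is at least 1/2, otherwise abstain (index n in 0-indexed code)."""
    y = int(np.argmax(p))
    return y if p[y] >= 0.5 else n

print(bayes_abstain(np.array([0.7, 0.2, 0.1]), n=3))    # 0: confident enough to predict
print(bayes_abstain(np.array([0.4, 0.35, 0.25]), n=3))  # 3: abstain
```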

Using the expressions for the Bayes classifiers of the tree distance loss g^* and the abstain
loss h^* , we show that the hierarchical classification problem can be reduced to ‘depth of tree’
many sub-problems; in each of these sub-problems, one is required to solve a multiclass
classification problem with an abstain option.
We then develop several convex calibrated surrogates for the problem of multiclass clas-
sification with an abstain option, which are also of independent interest, and use these sur-
rogates as a sub-routine to construct convex calibrated surrogates for the tree distance loss.
One such surrogate, whose surrogate minimization procedure simply requires solving multi-
ple binary SVM problems, also gives superior empirical performance on several benchmark
hierarchical classification datasets.
This is covered in Chapters 7 and 8 of the thesis.

3.3 Complex Evaluation Metrics


So far, we have considered learning problems with loss matrix based evaluation metrics and
consistent algorithms for such learning problems. In this last part of the thesis, we consider
learning problems with more complicated evaluation metrics like the F1-measure in binary
classification that cannot be expressed via a loss matrix.
The evaluation metrics we consider depend on the confusion matrix of a classifier. The
confusion matrix of a (randomized) classifier h : X → ∆_k w.r.t. a distribution D is denoted
by C_D[h] ∈ [0, 1]^{n×k} , and has entries given by^2

    C^D_{i,j}[h] = P( Y = i, h(X) = j ) ,

where the probability is over the randomness in X, Y and also in the classifier h, whose
prediction on an instance x is drawn from the distribution h(x) ∈ ∆_k . For any continuous
penalty function ϕ : [0, 1]^{n×k} → R_+ , define the evaluation metric called the ϕ-risk of h
w.r.t. D as follows:

    L^ϕ_D[h] = ϕ(C_D[h]) .

^2 Randomized classifiers h : X → ∆_k are in general necessary for these more complicated evaluation metrics.

The ϕ-risk defined above is exactly analogous to the `-risk defined for a loss matrix `. In
particular, if the penalty function is ϕ(C) = ⟨L, C⟩ for some matrix L ∈ R_+^{n×k} , then the
ϕ-risk is exactly equal to the `-risk, with the values of ` being the appropriate entries in L. For
other penalty functions ϕ we get other interesting evaluation metrics like the harmonic-
mean measure, geometric mean measure, and quadratic mean measure used in multiclass
and binary problems with class imbalance; the Fβ measures used in information retrieval;
and the min-max measure used in hypothesis testing [23, 27, 9, 10, 12, 13, 25]. The notion
of consistency is very much valid even for such evaluation metrics.
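As a small illustration of how such metrics act on the confusion matrix, the sketch below (an illustrative addition; the sample, the deterministic predictions and the choice of positive class are hypothetical) computes an empirical confusion matrix and evaluates both a linear penalty, which recovers the 0-1 risk, and the binary F1-measure, which is a non-linear function of the same matrix and would be maximized rather than minimized (or, equivalently, 1 − F1 taken as the penalty).

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n, k):
    """Empirical confusion matrix C[i, j] = fraction of examples with Y = i, h(X) = j."""
    C = np.zeros((n, k))
    for y, t in zip(y_true, y_pred):
        C[y, t] += 1.0
    return C / len(y_true)

def phi_linear(C, L):
    """Loss-matrix metrics are the special case phi(C) = <L, C>."""
    return float(np.sum(L * C))

def f1_binary(C):
    """Binary F1 as a (non-linear) function of the 2x2 confusion matrix,
    with class 1 treated as the positive class."""
    tp, fp, fn = C[1, 1], C[0, 1], C[1, 0]
    return 2 * tp / (2 * tp + fp + fn)

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
C = confusion_matrix(y_true, y_pred, n=2, k=2)
print(phi_linear(C, L=1 - np.eye(2)))   # 0-1 risk: 1/3
print(f1_binary(C))                     # F1 of the positive class: 2/3
```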
A natural question then is the following:

Can one construct efficient consistent algorithms for such complex evaluation metrics given
by an arbitrary penalty function ϕ?

While this question has been studied for the special case of binary classification [11, 16],
it remains unanswered for multiclass problems with n > 2. We answer this question in
the affirmative for a large family of such complex multiclass evaluation metrics [15], by
constructing consistent algorithms, in the third and final part of the thesis.
Let C_D ⊆ [0, 1]^{n×k} be the set of feasible confusion matrices for the distribution D,

    C_D = { C_D[h] : h : X → ∆_k } .

It can be seen that finding the best classifier for such complex evaluation metrics is equivalent
to optimizing the penalty function over C_D , i.e.

    inf_{h : X → ∆_k} L^ϕ_D[h] = inf_{C ∈ C_D} ϕ(C) .

It is difficult to construct membership and separation oracles, but easy to construct
linear minimization oracles for C_D . Hence, standard optimization methods such as projected
gradient descent are not possible, but the Frank-Wolfe algorithm is a viable option. We
adapt the Frank-Wolfe algorithm for this problem, and show that the resulting algorithm is
consistent for such complex evaluation metrics, if the corresponding penalty function ϕ is
convex.
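A rough sketch of the resulting procedure is given below, under several simplifying assumptions that are ours rather than the thesis': the class-probability function eta is taken as known (in practice it is estimated from data), the confusion matrices are empirical ones on a sample, the convex penalty phi is an arbitrary illustrative choice (the sum of squared off-diagonal entries), and the sketch returns the final confusion-matrix iterate, whereas the actual algorithm outputs the corresponding randomized combination of the plug-in classifiers produced by the linear minimization oracle.

```python
import numpy as np

def lmo(G, eta, X):
    """Linear minimisation oracle: minimising <G, C> over achievable confusion
    matrices is a loss-matrix problem with loss matrix G, solved pointwise by the
    plug-in rule t(x) = argmin_t sum_y eta_y(x) G[y, t]."""
    return np.argmin(eta(X) @ G, axis=1)

def empirical_confusion(y, preds, n, k):
    C = np.zeros((n, k))
    for yi, ti in zip(y, preds):
        C[yi, ti] += 1.0 / len(y)
    return C

def frank_wolfe(phi_grad, eta, X, y, n, k, T=50):
    """Frank-Wolfe over the (empirical) feasible set of confusion matrices."""
    C = empirical_confusion(y, lmo(np.zeros((n, k)), eta, X), n, k)   # initial vertex
    for t in range(1, T + 1):
        G = phi_grad(C)                       # gradient of the convex penalty at C
        C_new = empirical_confusion(y, lmo(G, eta, X), n, k)
        gamma = 2.0 / (t + 2)                 # standard Frank-Wolfe step size
        C = (1 - gamma) * C + gamma * C_new
    return C

# Toy usage: two 1-d Gaussian classes whose true posteriors play the role of eta.
rng = np.random.default_rng(0)
n = k = 2
y = rng.integers(0, 2, size=500)
X = rng.normal(loc=2.0 * y, scale=1.0)[:, None]

def eta(X):
    a = np.exp(-0.5 * (X - 0.0) ** 2)         # unnormalised likelihood of class 0
    b = np.exp(-0.5 * (X - 2.0) ** 2)         # unnormalised likelihood of class 1
    return np.hstack([a, b]) / (a + b)

phi_grad = lambda C: 2 * C * (1 - np.eye(n))  # gradient of phi(C) = sum of squared off-diagonals
print(np.round(frank_wolfe(phi_grad, eta, X, y, n, k), 3))
```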
This is covered in Chapter 9 of the thesis.

4 Conclusion
In the first part of the thesis, we presented the foundations of a framework to study consistent
surrogate minimizing algorithms for general multiclass learning problems with arbitrary loss
matrix based evaluation metrics. The framework constructed includes several important
and useful tools that can be used to check whether a surrogate is calibrated w.r.t. a given
loss matrix, to characterize the difficulty of constructing convex calibrated surrogates
for a given loss matrix, and, most importantly, to motivate and construct novel convex
calibrated surrogates for specific learning problems.
In the second part of the thesis, we focused particularly on the problem of hierarchical
classification, with the tree distance loss matrix, and gave a template to design convex
calibrated surrogates. In particular, the reduction to the problem of multiclass classification
with an abstain option allows the construction of several SVM-like consistent algorithms
for hierarchical classification, one of which also performs well empirically on benchmark
hierarchical classification datasets.
In the third part of the thesis, we considered complex evaluation metrics more general
than loss matrix based evaluation metrics. We showed that finding the classifier with the
smallest such error is equivalent to a finite dimensional optimization problem with the linear
minimization oracle being the only useful primitive available, and constructed a learning
algorithm based on the Frank-Wolfe optimization algorithm that is consistent for a large
family of such complex evaluation metrics.
In conclusion, in this thesis, we have developed a deep understanding of, and fundamental
results in, the theory of supervised multiclass learning. These insights have allowed us to
develop computationally efficient and statistically consistent algorithms for a variety of mul-
ticlass learning problems of practical interest, in many cases significantly outperforming the
state-of-the-art algorithms for these problems.

References
[1] Rohit Babbar, Ioannis Partalas, Eric Gaussier, and Massih-Reza Amini. On flat versus
hierarchical classification in large-scale taxonomies. In Advances in Neural Information
Processing Systems, 2013.

[2] Peter L. Bartlett, Michael Jordan, and Jon McAuliffe. Convexity, classification and risk
bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

[3] David Buffoni, Clément Calauzènes, Patrick Gallinari, and Nicolas Usunier. Learning
scoring functions with order-preserving losses and standardized supervision. In Proceed-
ings of International Conference on Machine Learning, 2011.

[4] Lijuan Cai and Thomas Hofmann. Hierarchical document categorization with support
vector machines. In International Conference on Information and Knowledge Manage-
ment (CIKM), 2004.

[5] Clément Calauzènes, Nicolas Usunier, and Patrick Gallinari. On the (non-)existence
of convex, calibrated surrogate losses for ranking. In Advances in Neural Information
Processing Systems, 2012.

[6] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass
kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

[7] John Duchi, Lester Mackey, and Michael Jordan. On the consistency of ranking algo-
rithms. In Proceedings of International Conference on Machine Learning, 2010.

[8] Siddharth Gopal, Bing Bai, Yiming Yang, and Alexandru Niculescu-Mizil. Bayesian
models for large-scale hierarchical classification. In Advances in Neural Information
Processing Systems 25, 2012.

[9] K. Kennedy, B.M. Namee, and S.J. Delany. Learning without default: A study of
one-class classification and the low-default portfolio problem. In ICAICS, 2009.

[10] J-D. Kim, Y. Wang, and Y. Yasunori. The GENIA event extraction shared task, 2013
edition - overview. Association for Computational Linguistics, 2013.

[11] Oluwasanmi Koyejo, Nagarajan Natarajan, Pradeep Ravikumar, and Inderjit Dhillon.
Consistent binary classification with generalized performance metrics. In Advances in
Neural Information Processing Systems, 2014.

[12] S. Lawrence, I. Burns, A. Back, A-C. Tsoi, and C.L. Giles. Neural network classification
and prior class probabilities. In Neural Networks: Tricks of the Trade, LNCS 1524, pages
299–313, 1998.

[13] D.D. Lewis. Evaluating text categorization. In Proceedings of the Workshop on Speech
and Natural Language, HLT ’91, 1991.

[14] Ye Nan, Kian Ming Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measures:
A tale of two approaches. In Proceedings of International Conference on Machine Learn-
ing, 2012.

[15] Harikrishna Narasimhan*, Harish G. Ramaswamy*, Aadirupa Saha, and Shivani Agar-
wal. Consistent multiclass algorithms for complex performance measures. In Proceedings
of International Conference on Machine Learning, 2015.

[16] Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consis-
tency of plug-in classifiers for non-decomposable performance measures. In Advances in
Neural Information Processing Systems, 2014.

[17] Harish G. Ramaswamy and Shivani Agarwal. Classification calibration dimension for
general multiclass losses. In Advances in Neural Information Processing Systems, 2012.

[18] Harish G. Ramaswamy, Shivani Agarwal, and Ambuj Tewari. Convex calibrated surro-
gates for low-rank loss matrices with applications to subset ranking losses. In Advances
in Neural Information Processing Systems, 2013.

[19] Harish G. Ramaswamy, Shivani Agarwal, and Ambuj Tewari. Convex calibrated sur-
rogates for hierarchical classification. In Proceedings of International Conference on
Machine Learning, 2015.

[20] Harish G. Ramaswamy, Balaji S. Babu, Shivani Agarwal, and Robert C. Williamson.
On the consistency of output code based learning algorithms for multiclass learning
problems. In Proceedings of International Conference on Learning Theory, 2014.

[21] Pradeep Ravikumar, Ambuj Tewari, and Eunho Yang. On NDCG consistency of listwise
ranking methods. In International Conference on Artificial Intelligence and Statistics,
2011.

[22] Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of
Machine Learning Research, 5:101–141, 2004.

[23] Y. Sun, M.S. Kamel, and Y. Wang. Boosting for learning multiple classes with imbal-
anced class distribution. In Proceedings of International Conference on Data Mining,
2006.

[24] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification
methods. Journal of Machine Learning Research, 8:1007–1025, 2007.

[25] H. Vincent Poor. An Introduction to Signal Detection and Estimation. Springer-Verlag
New York, Inc., 1994.

[26] K. Wang, S. Zhou, and S. C. Liew. Building hierarchical classifiers using class proximity.
In International Conference on Very Large Data Bases, 1999.

[27] S. Wang and X. Yao. Multiclass imbalance problems: Analysis and potential solu-
tions. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics,
42(4):1119–1130, 2012.

[28] Tong Zhang. Statistical analysis of some multi-category large margin classification
methods. Journal of Machine Learning Research, 5:1225–1251, 2004.

[29] Tong Zhang. Statistical behavior and consistency of classification methods based on
convex risk minimization. Annals of Statistics, 32(1):56–134, 2004.
