A Supervised Learning Approach For Heading Detection
1 Introduction
As the amount of information stored in PDF documents increases worldwide, large-scale
text-based analysis requires increasingly automated processes, because manual document
processing is time consuming and labour intensive for human professionals. Systematic
processing and extraction of textual structure is increasingly necessary and useful, as
demonstrated in El-Haj et al.'s work involving 1500 financial statements [7]. Categorizing
data into separate sections is quite easy for humans, as they rely on visual cues such as
headings to process textual information. Machines, despite being able to process large
amounts of information at high speed, require effort to classify and interpret text-based
data. This paper explores the application of supervised classifiers to operationalize a
system that aids in the identification of headings. A PDF document is a visually exact
digital copy that displays text by drawing characters at specific locations [10]; it presents
a challenge for analysis because the file does not provide enough information on how the
text is organized and formatted.
A supervised classifier that is trained on labelled data provides one solution to
categorizing PDF text as it tells the classifier how to make predictions based on
the data provided. This research involved comparing and systematically testing
2 S. Budhiraja and V. Mago
a variety of classifiers for the purpose of selecting the classifiers best suited to this
application. Recursive feature elimination [8] is used to ensure that each classifier uses
only the smallest set of features that supports good predictions. Cross validation is used
to tune the hyperparameters of a given machine learning algorithm for increased performance
before evaluating it on test data. The final trained classifier is currently being applied
to detect headings in course outline documents and extract learning outcomes. The extracted
learning outcomes are being used to automate the process of developing university/college
transfer credit agreements by using semantic similarity algorithms [3].
2 Related Work
Previous research provides insight into processes related to extracting the heading
layout of an HTML document [6]. In Manabe's work, headings are used to divide a document
at locations that indicate a change in topic. Document Object Model (DOM) trees are used
to sort candidate headings based on their significance and to define blocks. A recursive
approach is applied for document segmentation using the list of candidate headings, and it
is evaluated with good results on a manually labelled dataset.
Current research has taken steps towards a system which analyses a document's textual
structure. However, there is still a need for an approach that can efficiently and
accurately analyse the textual layout of a document and divide it into content sections,
so that extracting text from PDF documents can be automated. We present our supervised
learning approach for heading detection as a solution.
3 Methodology
Our data set consisted of 500 documents¹ downloaded from Google using the Google Custom
Search API [11]. To extract the corresponding formatting/style information, the documents
were converted from PDF to HTML using pdf2txt, which is a PDFMiner wrapper available in
Python [12]. This is illustrated in Fig. 1, which shows some sample text and the
corresponding HTML tags generated by the conversion process. The final data points, which
were generated by parsing the HTML tags using regular expressions, are also shown in
Fig. 1. A regular expression is a string of characters used to define a search pattern [13].
The regular expression used for parsing the tags is as follows:
r'<\s*?span[^>]*font-size:(\w*)px[^>]*>(.*?)<\/span\b[^>]*>'
To check if text is bold we look for the following regular expression for
the word bold in the starting tag:
r'[Bb]old'
Each data point contains some text, its font size and a flag that is 1 if the text is
bold and 0 otherwise. The whole process yielded 83,194 data points, which were then
exported to an Excel file for further pre-processing.
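The parsing step can be sketched in Python as follows. This is a minimal illustration rather than the authors' script: the function name `extract_data_points` and the sample HTML string are assumptions made for the example, and the sample only imitates the kind of `<span>` tags a PDF-to-HTML conversion may emit.

```python
import re

# Pattern from the text: captures the font size (in px) and the inner text of a <span>.
SPAN_RE = re.compile(
    r'<\s*?span[^>]*font-size:(\w*)px[^>]*>(.*?)<\/span\b[^>]*>',
    re.DOTALL,
)
# Pattern from the text: the word "bold" in the starting tag marks bold text.
BOLD_RE = re.compile(r'[Bb]old')


def extract_data_points(html):
    """Return (text, font_size, bold_flag) tuples for every matching <span>.

    Hypothetical helper: the name and the tuple layout are illustrative.
    """
    points = []
    for match in SPAN_RE.finditer(html):
        starting_tag = match.group(0).split('>', 1)[0]  # text of the opening tag only
        points.append((
            match.group(2).strip(),                      # the text of the data point
            int(match.group(1)),                         # font size in pixels
            1 if BOLD_RE.search(starting_tag) else 0,    # bold flag
        ))
    return points


# Illustrative input, not taken from the data set.
sample = (
    '<span style="font-family: Arial-BoldMT; font-size:14px">1 Introduction</span>'
    '<span style="font-family: ArialMT; font-size:10px">Body text.</span>'
)
data_points = extract_data_points(sample)
```

Restricting the bold check to the starting tag avoids flagging a data point merely because the word "bold" occurs in its text.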
The process of transforming raw data into usable training data is referred to
as data preprocessing. The steps of data preprocessing for this research are as
follows:
Data Labelling: Data labelling refers to the process of assigning labels to data points,
which makes the data suitable for training supervised machine learning models. All the
83914 data points are manually labelled by cross-referencing the
1
Repository available at: https://github.com/sahib-s/Heading-Detection-PDF-Files
documents, as both training and testing data need to be labelled. If the text in a data
point was a heading, the label was set to 1, otherwise 0. Labelling is one of the most
important preprocessing steps because the performance of the model depends on how well
the data is labelled. An example of labelled data points is provided in Fig. 1(c).
(all letters in upper case), lower case (all letters in lower case), title case
(first letter of all words in uppercase) or sentence case (only the first
letter of the text in uppercase).
• Features From Parts of Speech (POS) Tagging: POS tagging is the process of
assigning a part of speech (such as verb, adverb, adjective or noun) to each
word; the words are referred to as tokens. The text from each data point is
first tokenized and then each token is assigned a POS label [9].
The POS frequencies provide the model with information on the grammatical
aspect of the text, and the frequency of these labels can be exploited to
identify headings and contribute to the accuracy of the model. For example,
headings tend to have no verbs; some do contain verbs, but the absence of
verbs increases the probability that the text is a heading. The frequency of
each POS tag in the text is calculated for each data point, and these
frequencies serve as potential features for the model. All frequency data
collected from POS tagging is analysed in the feature selection process to
separate useful features from irrelevant ones.
Together, these features bring the total number of features generated from the text to
11: 9 from POS tagging and 2 from its physical properties. All features are integers,
except for Bold or Not and Font Threshold Flag, which are binary.
Feature Name                 Description
Characters                   Number of characters in the text.
Words                        Number of words in the text.
Text Case                    Assumes the value 0, 1, 2 or 3 depending on the text being in lower case,
                             upper case, title case or none of the three, respectively.
Bold or Not                  Assumes the value 1 or 0 depending on the text being bold or not.
Font Threshold Flag          Assumes the value 1 or 0 depending on the font size of the text being
                             greater than the threshold or not.
Verbs                        Number of verbs in the text.
Nouns                        Number of nouns in the text.
Adjectives                   Number of adjectives in the text.
Adverbs                      Number of adverbs in the text.
Pronouns                     Number of pronouns in the text.
Cardinal Numbers             Number of cardinal numbers in the text.
Coordinating Conjunctions    Number of coordinating conjunctions in the text.
Predeterminers               Number of predeterminers in the text.
Interjections                Number of interjections in the text.
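The non-POS features in the table can be computed directly from a data point. The sketch below is illustrative only: the function names and the `font_threshold` parameter are assumptions (the threshold value is not taken from the text), and Python's built-in string predicates are used as one possible way to decide the text case.

```python
def text_case(text):
    """Encode the case as in the table: 0 lower, 1 upper, 2 title, 3 none of the three."""
    if text.islower():
        return 0
    if text.isupper():
        return 1
    if text.istitle():
        return 2
    return 3


def physical_features(text, font_size, is_bold, font_threshold):
    """Build the non-POS features for one data point.

    font_threshold is a hypothetical parameter chosen by the caller.
    """
    return {
        "Characters": len(text),
        "Words": len(text.split()),
        "Text Case": text_case(text),
        "Bold or Not": int(bool(is_bold)),
        "Font Threshold Flag": int(font_size > font_threshold),
    }


# Illustrative data point: a bold, 14 px heading with an assumed threshold of 12 px.
features = physical_features("Related Work", font_size=14, is_bold=1, font_threshold=12)
```

The POS counts would be appended to the same dictionary once the text has been tagged.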
Cross validation is performed with 10 folds of the training set, and one feature is
removed per iteration. According to this analysis, the accuracy of the Decision Tree
classifier does not increase when it is trained with more than the following seven
features:
– Bold or Not
– Font Threshold Flag
– Number of words
– Text Case
– Verbs
– Nouns
– Cardinal Numbers
The same process is repeated for all the classifiers and their individual set of
chosen features are listed in Table 2.
Classifier                         Selected features
SVM                                Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns,
                                   Adjectives, Adverbs
k-Nearest Neighbors                Bold or Not, Font Threshold Flag, Words, Verbs, Nouns, Adjectives,
                                   Cardinal Numbers, Coordinating Conjunctions
Random Forest                      Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns,
                                   Adverbs, Cardinal Numbers, Coordinating Conjunctions
Gaussian Naive Bayes               Bold or Not, Font Threshold Flag, Words, Verbs, Nouns, Adjectives,
                                   Cardinal Numbers, Coordinating Conjunctions
Quadratic Discriminant Analysis    Bold or Not, Font Threshold Flag, Words, Verbs, Nouns, Adjectives,
                                   Coordinating Conjunctions
Logistic Regression                Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns,
                                   Adverbs, Coordinating Conjunctions
Gradient Boosting                  Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns,
                                   Cardinal Numbers
Neural Net                         Bold or Not, Font Threshold Flag, Words, Text Case, Verbs, Nouns,
                                   Cardinal Numbers
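The 10-fold partition used during feature selection can be illustrated with a short, library-free sketch. The function name `k_fold_indices` is hypothetical and the folds here are contiguous and unshuffled, which is an assumption; the elimination loop itself depends on the classifier and is not shown.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) index lists for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Each sample is used for validation exactly once across the 10 folds.
folds = list(k_fold_indices(25, k=10))
```

For each candidate feature set, a classifier would be trained on every `train_idx` split and scored on the matching `val_idx` split, and the ten scores averaged.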
3.5 Training
After the most suitable features and parameters for each classifier have been
selected, we can proceed with training the classifiers using scikit-learn [18].
Decision Tree Decision trees are among the most widely used classifiers, as they
have a simple flow-chart-like structure that starts from a root node, branches off to
further nodes and terminates at leaf nodes. At each non-leaf node a decision is made,
which selects the branch to follow. The process continues until a leaf node is reached,
which contains the corresponding decision [14]. Gini impurity is used as the measure of
split quality; it indicates whether a split made the dataset more pure. Using Gini is
computationally less expensive than entropy, which involves computing logarithms.
The “best” option for the splitting strategy chooses the best split at each node.
The minimum number of samples required to split an internal node is set to 2
and the minimum number of samples required at a leaf node is set to 3.
The code snippet for training this classifier with the chosen parameters is given
in Box 1.
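To make the split criterion concrete, the following sketch computes Gini impurity for a set of binary labels and the weighted impurity of a candidate split. It is a worked illustration under the standard definition (one minus the sum of squared class proportions), not the classifier's internal code; the function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left_labels) + len(right_labels)
    return (
        len(left_labels) / n * gini_impurity(left_labels)
        + len(right_labels) / n * gini_impurity(right_labels)
    )


# A node holding two headings (1) and two non-headings (0) is maximally impure for
# two classes; a split that separates them perfectly has impurity 0.
parent = gini_impurity([0, 0, 1, 1])
perfect_split = split_gini([0, 0], [1, 1])
```

A split is preferred when its weighted impurity is lower than the impurity of the parent node; no logarithm is needed, which is the cost advantage mentioned above.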
Random Forest This classifier works by choosing random data points from the
training set and creating a set of decision trees. The final decision regarding the
class is made by aggregating the outputs of all the trees [19]. The number
of trees in the forest is set to 2 and ‘gini’ is used as the measure of split
quality. The maximum depth of the trees is set to 5 and the maximum number of
features considered when searching for the best split is set to ‘auto’. The
minimum number of samples required to split an internal node is set to 2 and
the minimum number of samples required at a leaf node is set to 3. The
number of parallel jobs to run for both fit and predict is set to 1. The code
snippet for training this classifier with the chosen parameters is given in Box 4.
Gaussian Naive Bayes This classifier works by applying Bayes' theorem with the
assumption of strong independence between the predictors (features). It is very
useful for large data sets as it is quite simple to build and has no complicated
iterative parameters [22]. This classifier has few parameters to configure.
The prior probabilities of the classes are set to [0.5, 0.5], as the number of
headings is small compared to other text. The code snippet for
training this classifier with the chosen parameters is given in Box 5.
Neural Net This classifier works by imitating the neural structure of the brain.
One data point is processed at a time and the actual classification is compared to
the classification made by the classifier. Any errors recorded in the classification
process are fed back into the algorithm to improve classification performance in
future iterations [27,25]. The classifier is configured to have one hidden layer with
100 units. The activation function used for the hidden layer is ‘tanh’. The solver
used for weight optimization is ‘lbfgs’. The batch size is set to ‘auto’ and the
initial learning rate is set to 0.001. The parameter ‘max_iter’ is set to 300; this is
the maximum number of iterations, and for the ‘adam’ solver it defines the number of
epochs. Sample shuffle is set to true, which enables sample shuffling in each iteration.
The exponential decay rates for the estimates of the first and second moment vectors are
set to 0.9 and 0.999 respectively. The code snippet for training this classifier with the
chosen parameters is given in Box 9.
4 Test Results
Training and Prediction Time: When dealing with a large number of documents,
the time required to train a model and make predictions is important; it depends
on the type of classifier used, the number of features and the number of data points.
In this research all classifiers are trained using the same number of features and
data points, therefore ‘time taken’ provides a good measure of the variation in
training and prediction speed across classifiers. Of note, the training time for a
classifier should be considered in context, as training only needs to be performed
once and the trained model can be saved for later use. Therefore, a model that takes
a long time to train can still be practical so long as it does not take long to make
predictions. Fig. 3 shows the time required for training and making predictions using
these classifiers. The time shown is the average of 10 observations, which reduces the
effect of programs running in the background on the comparison.
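Averaging over repeated runs can be sketched as below. This is a generic timing helper written for illustration, not the measurement code used in the study; `average_runtime` and the stand-in workload are assumptions.

```python
import time


def average_runtime(func, repeats=10):
    """Call func() `repeats` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


# Stand-in workload; in practice this would be a call such as fitting a classifier
# or predicting on the test set.
mean_seconds = average_runtime(lambda: sum(i * i for i in range(10_000)), repeats=10)
```

Training and prediction would be timed separately by passing each call as its own `func`.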
Fig. 3. Time required to train classifiers and run predictions on test data
classifiers used in this research. The discussion section provides more information
on how we used AUC score to select the best classifier.
Classifier AUC
Decision Tree 0.98
SVM 0.97
K-Nearest Neighbors 0.96
Random Forest 0.97
Gaussian Naive Bayes 0.96
Quadratic Discriminant Analysis 0.96
Logistic Regression 0.95
Gradient Boosting 0.98
Neural Net 0.98
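For reference, AUC can be computed without plotting the ROC curve: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below implements that pairwise definition; the function name and the toy scores are illustrative and unrelated to the results in the table.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return correct / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of examples, so a rank-based implementation would be preferable for large test sets.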
5 Discussion
We recorded the time (in seconds) required to train each classifier and the time
required to make predictions, as shown in Fig. 3. The time a classifier takes to make
predictions is important when processing documents in bulk, as it adds to the overall
processing time. Training only has to be done once, so training time is given less
importance. The Decision Tree classifier took the least time to train, while Gradient
Boosting took the most. Comparing prediction times, Logistic Regression takes the least
time and Random Forest the most. While prediction time is not the most important factor
when choosing a classifier, we take it into consideration when two classifiers perform
approximately the same.
The top three classifiers based on net accuracy are Decision Tree, Gradient Boosting,
and Neural Network; however, classifier selection cannot rely solely on accuracy [28,29].
Therefore, we also weigh metrics such as AUC, F1 score, sensitivity, and specificity to
choose the classifier best suited to detecting headings. The top three classifiers in
terms of F1 score, precision, sensitivity and specificity are Decision Tree, Gradient
Boosting, and Neural Network, and the top three in terms of AUC, as shown in Table 4,
are again Decision Tree, Gradient Boosting, and Neural Network. The system will be
dealing with documents in bulk, and the prediction time for Decision Tree is better than
that of both Gradient Boosting and Neural Network. Therefore, we choose our
configuration of the Decision Tree for making the classifications.
Category Value
Total Data points 12919
Sensitivity 0.928
Specificity 0.966
Precision 0.964
F1 SCORE 0.946
Accuracy 94.73 %
AUC 0.97
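The metrics in this table all derive from the four confusion-matrix counts. The sketch below shows the standard formulas; the function name and the counts in the example are hypothetical and are not the counts behind the reported values.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts (headings = positive class)."""
    sensitivity = tp / (tp + fn)          # recall on headings
    specificity = tn / (tn + fp)          # recall on non-headings
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```

As a consistency check on the table, inserting the reported precision (0.964) and sensitivity (0.928) into the F1 formula gives approximately 0.946, which agrees with the reported F1 score.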
Table 6. Pearson Correlation Coefficient Between Each Feature Used in the Selected
Classifier and Final Decision Labels
for all the features used in the selected classifier and the final decision label. The
list is in descending order of Pearson correlation coefficient, therefore the top
feature in the table contributes the most towards the final decision. Each feature
was also removed from the classifier one at a time, and the resulting drop in the
evaluation metrics confirms the order of contribution given by the Pearson correlation
coefficient. The top three contributing features are the ones that rely
on the physical attributes of the text.
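The Pearson correlation coefficient between one feature column and the label column can be computed as in the following sketch. It uses the textbook definition (covariance divided by the product of the standard deviations); the function name and the toy columns are illustrative and are not data from the study.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Toy columns: a binary feature (e.g. a bold flag) against binary heading labels.
bold_flags = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 0, 0, 0]
r = pearson_correlation(bold_flags, labels)
```

Sorting the features by this coefficient in descending order reproduces the kind of ranking the table presents.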
9 Conclusion
This research has provided a structured methodology and systematic evaluation
of a heading detection system for PDF documents. The detected headings provide
information on how the text is structured in a document. This structural
information is used for extracting specific text from these documents based on
the requirements of the field of application. This supervised learning approach
has demonstrated good results, and we are currently applying our configuration
of the Decision Tree classifier in the field of post-secondary curriculum analysis
to identify headings and extract learning outcomes from course outlines for
research being conducted at DATALAB, Lakehead University, Canada.
Acknowledgment
This research would not have been possible without the financial support pro-
vided by Ontario Council on Articulation and Transfer (ONCAT) through Project
Number-2017-17-LU. We would also like to express our gratitude towards the
datalab.science team and Andrew Heppner for their support.
References
1. Khusro, Shah, Asima Latif, and Irfan Ullah. "On methods and tools of table detection,
extraction and annotation in PDF documents." Journal of Information Science 41.1 (2015):
41-57.
2. Jiang, Deliang, and Xiaohu Yang. "Converting PDF to HTML approach based on
Text Detection." Proceedings of the 2nd International Conference on Interaction
Sciences: Information Technology, Culture and Human. ACM, 2009.
26. Fawcett, Tom. "An introduction to ROC analysis." Pattern Recognition Letters 27.8
(2006): 861-874.
27. Mago, Vijay Kumar, ed. Cross-Disciplinary Applications of Artificial Intelligence
and Pattern Recognition: Advancing Technologies. IGI Global, 2011.
28. Huang, Jin, and Charles X. Ling. "Using AUC and accuracy in evaluating learning
algorithms." IEEE Transactions on Knowledge and Data Engineering 17.3 (2005):
299-310.
29. Ling, Charles X., Jin Huang, and Harry Zhang. "AUC: a better measure than
accuracy in comparing learning algorithms." Conference of the Canadian Society for
Computational Studies of Intelligence. Springer, Berlin, Heidelberg, 2003.
A snapshot on nonstandard supervised learning problems
Taxonomy, relationships and methods
This is a pre-print of an article published in Progress in Artificial Intelligence. The final authenticated version is available online at:
https://doi.org/10.1007/s13748-018-00167-7
Abstract Machine learning is a field which studies how machines can alter and adapt their
behavior, improving their actions according to the information they are given. This field is
subdivided into multiple areas, among which the best known are supervised learning (e.g.
classification and regression) and unsupervised learning (e.g. clustering and association rules).
Within supervised learning, most studies and research are focused on well known standard
tasks, such as binary classification, multiclass classification and regression with one
dependent variable. However, there are many other less known problems. These are what we
generically call nonstandard supervised learning problems. The literature about them is much
more sparse, and each study is directed to a specific task. Therefore, the definitions,
relations and applications of this kind of learners are hard to find.
The goal of this paper is to provide the reader with a broad view on the distinct variations
of nonstandard supervised problems. A comprehensive taxonomy summarizing their traits is
proposed. A review of the common approaches followed to accomplish them and their main
applications is provided as well.
Keywords Machine learning · Supervised learning · Nonstandard learning
Mathematics Subject Classification (2010) MSC 68T05 · MSC 68T10
D. Charte
Universidad de Granada, Granada, Spain
E-mail: fdavidcl@ugr.es
F. Charte
Universidad de Jaén, Jaén, Spain
E-mail: fcharte@ujaen.es
S. García
Universidad de Granada, Granada, Spain
E-mail: salvagl@decsai.ugr.es
F. Herrera
Universidad de Granada, Granada, Spain
E-mail: herrera@decsai.ugr.es
1 Introduction
According to Mitchell [80], a machine is said to learn from experience E related to a class of
tasks T and performance metric P, when its performance at tasks in T improves according to P
after experience E.
Supervised learning is one of the fundamental areas of machine learning [78]. From object
detection to ecological modeling to emotion recognition, it covers all kinds of applications.
It essentially consists in learning a function by training with a set of input-output pairs.
The training stage can be seen as E in the previous definition, and the specific task T may
vary, but usually involves predicting an appropriate output given a new input.
Traditionally, supervised learning problems have been spread into two categories:
classification and regression [43, 60]. In the first, information is divided into discrete
categories, while the latter involves patterns associated to a value in a continuous spectrum.
These problems can be processed by learning from a training dataset, which is composed of
instances. Typically, these instances or samples take the form (x, y) where x is a vector of
values in the space of input variables and y is a value in the target variable. Each problem
can be described by the type of its instances: inputs will usually belong to a subset of
$\mathbb{R}^n$, and outputs will take values in a specific one-dimensional set, finite or
continuous. Once trained, the obtained model can be used to predict the target variable on
unseen instances.
2 David Charte et al.
Standard classification problems are those where labels are either binary or multiclass
[33, 105]. In the binary case, an instance can only be associated with one of two values:
positive or negative, which is equivalent to 0 or 1. For example, email messages may be
classified into spam or legit, and tumours can be categorized as either benign or malign.
Multiclass problems, on the other hand, involve any finite number of classes. That is, any
given instance will belong to one of possibly many categories, which is equivalent to it being
assigned a natural number below a convenient threshold. As an example, a photograph of a plant
or a sound recording from an animal could correspond to one of a variety of species.
A standard regression problem [61, 99] consists in finding a function which is able to
predict, for a given example, a real value among a continuous range, usually an interval or
the set of real numbers $\mathbb{R}$. For example, the height of a person may be estimated out
of several characteristics such as age or country of origin.
Even though these standard problems are applicable in a multitude of cases, there are
situations whose correct modeling requires modifications of their structure. For example, a
newspaper article can be categorized according to its contents, but it could be desirable to
assign several categories simultaneously. Similarly, a social media post could be described by
not one but two input vectors, an image and a piece of text. These special circumstances
cannot be covered by the traditional one-vector input and one-dimensional output schema. As a
consequence, since performance metrics which measure improvements in standard tasks assume the
common structure, they lose applicability or sense in these cases. Thus, not only new
techniques are needed to tackle the problems, but also new ways of measuring and comparing
their success.
This work studies variations on classic supervised problems where the traditional structure
is not obeyed, which we call nonstandard variations. These emerge when the structure of the
classical components of the problems does not suffice to describe complex situations, such as
multiplicity of inputs or outputs, or order restrictions. As a consequence, this manuscript
does not cover other singular supervised problems, such as high dimensionality of the feature
space [10] or unbalanced training sets [40, 67], nor time-dependent problems, such as data
streams [46, 98] or time series [58].
The rest of the paper is structured as follows. Section 2 formally defines and describes each
nonstandard variation. This is followed by Section 3 establishing relations among the
introduced problems and proposing a taxonomy of them. Section 4 describes the most common
techniques used to solve them. After that, Section 5 enumerates popular applications of each
problem. Section 6 covers other variations further from the ones previously detailed. Lastly,
Section 7 draws some conclusions.
2 Definitions of nonstandard variations
The problems introduced in this section are generalizations over the traditional versions of
classification and regression. The focus is on fully supervised problems, where inputs are
always paired with outputs during training. An alternative taxonomy based on different
supervision models is introduced in [54].
2.1 Notation
In this work we will establish a notation which intends to be as simple to understand as
possible, while being able to encompass every nonstandard variation. First, any supervised
learning problem consists in finding a function which will classify, rank or perform
regression. It will be noted as
$$f : \mathcal{X} \to \mathcal{Y} \tag{1}$$
where $\mathcal{X}$ is an input set, or domain, and $\mathcal{Y}$ is an output set, or
codomain. It will be assumed that a training dataset $S$ is provided, including a finite
number of input-output pairs:
$$(x, y) \in S \subset \mathcal{X} \times \mathcal{Y}. \tag{2}$$
This way, a learning algorithm will be able to generate the desired function $f$. An
additional notation will be the set of labels $L$ where convenient.
For example, in standard binary classification $\mathcal{X} \subset \mathbb{R}^n$ and
$\mathcal{Y} = L = \{0, 1\}$. Similarly, standard regression problems can be defined with the
same kind of $\mathcal{X}$ set and $\mathcal{Y} \subset \mathbb{R}$. Thus, we can define very
distinct supervised problems by particularizing sets $\mathcal{X}$ or $\mathcal{Y}$ in
different ways.
Other usual notations are based in probability theory, thus involving random variables and
probability distributions [115, 83]. In that case, $\mathcal{X}$ and $\mathcal{Y}$ would be the
sample spaces of the input and output variables $X$ and $Y$, respectively. Predictors would
usually attempt to infer a discriminant model $P(Y \mid X)$ from the training dataset.
2.2 Multi-instance
The multi-instance (MI) framework [56] assumes a single feature space for all instances, but
each training pattern may consist of more than one instance. In this case, a training pattern
is composed of a finite multiset or bag of instances and a label. Formally, assuming instances
are drawn from a set $A \subset \mathbb{R}^n$, the domain can be described as follows:
$$\mathcal{X} = \{ b \subset A \mid b \text{ finite} \}. \tag{3}$$
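A bag as in (3) can be pictured as a finite multiset of feature vectors that carries a single label. The sketch below is only an illustration: it assumes instances are tuples of floats, uses the common binary convention that a bag is positive when at least one of its instances is positive, and the helper names are hypothetical.

```python
from collections import Counter


def make_bag(instances):
    """Represent a bag as a multiset: each distinct instance maps to its multiplicity."""
    return Counter(tuple(instance) for instance in instances)


def bag_label(instance_labels):
    """Binary MI convention: the bag is positive (1) iff some instance is positive."""
    return int(any(instance_labels))


# Illustrative bag with a repeated instance; the hidden instance labels are made up.
bag = make_bag([(0.1, 0.2), (0.1, 0.2), (0.5, 0.9)])
label = bag_label([0, 0, 1])
```

During training, only the pair (`bag`, `label`) would be visible to the learner; the per-instance labels passed to `bag_label` are hidden in a real MI dataset.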
In this case, the learning algorithm will not know labels associated to each instance but to
a bag of them. In addition to this, not all instances may share the same relevance or are
equally related to the label.
Some MI problems assume that hidden labels are present for each instance in a bag: for
example, a training set of drug tests where, for each test, several drug types are analyzed.
Additionally, a typical MI assumption in the binary scenario states that a bag is positive
when at least one of its instances is positive, and it is negative otherwise [41].
Other MI problems differ in that a per-instance labeling may not be possible or may not make
sense: for example, if each bag represents an image and instances are image segments, class
beach can only apply to bags with water and sand segments, but it cannot apply to an
individual instance.
2.3 Multi-view
A learning problem is considered to be multi-view (MV) [120] when inputs are composed of
several components of very different nature.
For example, if a learning pattern consists of an image as well as a piece of text
representing the same instance, they can be seen as two views on it. In that case, images and
texts would belong to distinct feature spaces $A$ and $B$ respectively, an input pattern being
$(a, b) \in A \times B$. More generally, we can describe the input space as:
$$\mathcal{X} = \prod_{i=1}^{t} A_i, \quad \text{where } A_i \subset \mathbb{R}^{n_i}, \tag{4}$$
where $t$ is the number of views offered by the problem and $n_i$ is the dimension of the
feature space of the $i$-th view.
[...] a photograph labeled ocean is less likely to have the mountains label rather than beach.
Methods may take advantage of label co-occurrence [18] in order to reduce the search space
when predicting a labelset.
2.5 Multi-dimensional
Multi-dimensional (MD) learning [96] is a generalized classification problem where
categorization is performed simultaneously along several dimensions. Each instance can belong
to one of many classes in each dimension, thus the output space can be formally described as:
$$\mathcal{Y} = L_1 \times L_2 \times \cdots \times L_p, \tag{6}$$
where $L_i$ is the label space for the $i$-th dimension.
As with ML learning, label dimensions may be related in some way and treating them
independently would only be a naive solution to the problem.
2.6 Label distribution learning
In label distribution learning (LDL) problems [47], otherwise known as probabilistic class
label problems [75], any instance can be described in different degrees by each label. This
can be modeled as a discrete distribution over the labels, where the probability of a label
given a specific instance is called its degree of description. Analytically, the objective is,
for each instance, to predict a real-valued vector which sums to exactly 1:
$$\mathcal{Y} = \left\{ y \in [0, 1]^p : \sum_{i=1}^{p} y_i = 1 \right\}. \tag{7}$$
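Membership in the output set (7) can be checked, and raw non-negative description degrees can be rescaled into it, as in the following sketch. The function names and the tolerance are assumptions made for the illustration.

```python
def to_label_distribution(degrees):
    """Rescale non-negative description degrees so that they sum to 1."""
    if any(d < 0 for d in degrees):
        raise ValueError("description degrees must be non-negative")
    total = sum(degrees)
    if total == 0:
        raise ValueError("at least one degree must be positive")
    return [d / total for d in degrees]


def is_label_distribution(y, tol=1e-9):
    """Check that y lies in [0, 1]^p and sums to 1 (up to a numerical tolerance)."""
    return all(0.0 <= v <= 1.0 for v in y) and abs(sum(y) - 1.0) <= tol


# Three labels described with raw degrees 2, 1 and 1.
y = to_label_distribution([2, 1, 1])
```

A tolerance is used in the check because sums of floating-point values rarely equal 1 exactly.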
[...] is said to be a sorting or permutation, and can be formulated as follows:
$$\mathcal{Y} = \{ \sigma : \{1, \dots, p\} \to L \mid \sigma \text{ is bijective} \}, \tag{8}$$
where $p$ is the amount of labels. $\mathcal{Y}$ can also be seen as the set of all
permutations of the labels in $L$, usually known as the symmetric group of order $p$, and
noted as $S_p$.
2.8 Multi-target regression
A regression problem where the output space has more than just one dimension is usually called
multi-target regression (MTR) and is also known as multi-output, multi-variate or
multi-response [11]. In this case, a formal description is simply that the codomain is a
continuous multi-dimensional real set:
$$\mathcal{Y} = \prod_{i=1}^{p} Y_i, \quad \text{where } Y_i \subset \mathbb{R} \ \forall i \tag{9}$$
and $p$ is the number of target variables.
As with other multiple target extensions, the key difference with single-target regression in
this case is the possible interactions among output variables.
2.9 Ordinal regression
A problem where the target space is discrete but ordered is called ordinal regression (OR) or,
alternatively, ordinal classification [52]. It can be located midway between classification
and regression. More specifically, it consists in labeling instances with a finite number of
choices where these are ordered
$$\mathcal{Y} = \{1, 2, \dots, c\}, \quad 1 < 2 < \cdots < c. \tag{10}$$
In OR, the training phase consists in learning from a set of feature vectors which have a
specific label associated to them, and testing can be performed over individual instances.
This means that, although labels are ordered, the main objective is not to rank or sort
instances as in learning to rank [13], but to simply classify them. The labels themselves do
not provide any metric information either, they only carry qualitative information about the
order among themselves.
2.10 Monotonicity constraints
Order relations can exist not only in the label space but in the feature space as well.
Partial orders among real-valued feature vectors are always possible, and there may be cases
[...]
When inputs as well as outputs are at least partially ordered, it is common to look for
predictions which respect their order relations. In that case, the objective is to obtain a
classifier or regression function which enforces the following constraint:
$$x_1 < x_2 \Rightarrow f(x_1) < f(x_2) \quad \forall x_1, x_2 \in \mathcal{X}. \tag{11}$$
When $\mathcal{Y}$ is discrete the problem is usually called monotone classification (MC),
monotonic classification or ordinal classification with monotonicity constraints [51]. If, on
the contrary, $\mathcal{Y}$ is continuous, it is known as isotonic regression (IR) [6].
2.11 Absence or partiality of information
Some problems do not directly alter the structure of $\mathcal{X}$ and $\mathcal{Y}$ from the
standard supervised problem. Instead, they restrict which data can belong to a training set,
or remove labelings from training examples. In this case, training information is presented
partially or with some exclusions.
According to which kind of information is missing from the training set, a learning task can
usually be categorized as semi-supervised [16], one-class learning [81], PU-learning [37],
zero-shot learning [86] or one-shot learning [39]. These are described further in Section 6.1.
2.12 Variation combinations
Some of the components described above can be combined to compose a more complex problem
overall. Usually, one of these combinations will take components from different variation
types, for example, simultaneous multiplicity of inputs and outputs.
More specifically, there exist several studies involving MI ML scenarios [122, 103]. In this
case, examples from the input space are composed of several feature vectors and are associated
to various labels. As a consequence, this model can represent many complicated problems where
inputs and outputs have more structure than usual.
Other more uncommon situations are MV MI ML problems [84], where patterns have several
instances which may or may not belong to the same space, a multi-output version of OR named
graded ML classification [22] and more complex input structures such as multi-layer MI MV
[116], where a hierarchy of instances is present in each example.
3 Taxonomy
A first categorization of the variations analyzed in this work
where the order among instances is determined by just one can be made according to how they differ from the standard
or a few of their attributes. problem. There can be multiplicity in the input space or the
A snapshot on nonstandard supervised learning problems 5
output space, order constraints may exist, or only partial information may be given in some cases. Fig. 1 shows ways in which the traditional problems can be generalized.

There are also problems where there exists a partial or total order among instances, which is coupled with an order constraint in relation to the outputs. These are MC and IR.

Fig. 2 summarizes these structural traits in a hierarchy and indicates problems where these traits are present.
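The order constraint that couples ordered inputs with ordered outputs in MC and IR can be checked directly on a labeled dataset. The following Python sketch is only an illustration under stated assumptions: feature vectors are compared with the component-wise (product) partial order, the helper names `precedes` and `monotonicity_violations` and the toy data are hypothetical, and by default the check uses the non-strict form of the constraint (equal outputs are tolerated), which is weaker than a strict inequality on the outputs; passing `strict=True` requires strictly increasing outputs.

```python
from itertools import combinations


def precedes(a, b):
    """True if a < b in the component-wise partial order (a <= b everywhere, a != b)."""
    return all(x <= y for x, y in zip(a, b)) and tuple(a) != tuple(b)


def monotonicity_violations(X, y, strict=False):
    """Return index pairs (i, j) with X[i] < X[j] whose outputs break the order.

    strict=False: a violation needs y[i] > y[j] (ties are allowed).
    strict=True:  a violation needs y[i] >= y[j] (outputs must strictly increase).
    """
    bad = []
    for i, j in combinations(range(len(X)), 2):
        for a, b in ((i, j), (j, i)):
            if precedes(X[a], X[b]):
                broken = y[a] >= y[b] if strict else y[a] > y[b]
                if broken:
                    bad.append((a, b))
    return bad


# Toy data (assumed for illustration): four instances with two features each.
X = [(1, 1), (2, 1), (2, 3), (0, 5)]
y = [0, 1, 0, 2]

non_strict = monotonicity_violations(X, y)
strict = monotonicity_violations(X, y, strict=True)
```

Instances that are incomparable under the partial order, such as (2, 3) and (0, 5) above, impose no constraint at all, which is why only comparable pairs are inspected.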
[Fig. 1 (diagram): the standard problem generalizes towards multiple inputs (MI, MV), multiple outputs (ML, LR, MD, LDL, MTR), order constraints (OR, MC, IR) and partial information (SSL, PU, 0shot, 1shot, 1class).]

[Fig. 2 (diagram): input structure traits. Unordered (standard) or ordered (MC, IR); single feature vector or multiple feature vectors, the latter in the same space (MI) or in different spaces (MV).]

[Fig. 3 (diagram): discrete outputs are scalar, either unordered (standard classification) or ordered (OR, MC), or multiple, either binary (ML), ranking (LR) or finite (MD). Continuous outputs are scalar (standard regression) or multiple, either distribution (LDL) or unrestricted (MTR).]

Fig. 3 Traits that can be found on the output structure of supervised problems.
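As a rough illustration of the output traits in Fig. 3, the following Python sketch gives one example target value per output structure, together with small validity checks for the structured cases. The label set, the concrete values and the helper names are all hypothetical. The check for LDL additionally assumes the usual convention that description degrees sum to 1, which is not stated in this section.

```python
# Hypothetical label set with p = 3 labels (illustration only).
LABELS = ["beach", "sunset", "urban"]

# One example target per output structure.
TARGETS = {
    "standard classification": "sunset",  # discrete, scalar, unordered
    "OR": 2,                              # discrete, scalar, ordered: 1 < 2 < 3
    "standard regression": 0.73,          # continuous, scalar
    "ML": [1, 1, 0],                      # one binary value per label
    "LR": [2, 0, 1],                      # a permutation of the label indices
    "MD": ["sunset", "evening"],          # one finite-valued class per dimension
    "LDL": [0.6, 0.3, 0.1],               # one description degree per label
    "MTR": [0.73, -1.2],                  # one unrestricted real value per target
}


def is_labelset(target):
    """ML target: every component is 0 or 1."""
    return all(v in (0, 1) for v in target)


def is_ranking(target, p):
    """LR target: a bijection over the p label indices."""
    return sorted(target) == list(range(p))


def is_distribution(target, tol=1e-9):
    """LDL target: degrees in [0, 1] that sum to 1 (assumed convention)."""
    return all(0.0 <= v <= 1.0 for v in target) and abs(sum(target) - 1.0) <= tol
```

Under these checks an ML labelset such as [1, 1, 0] is not a valid distribution, while every 0/1 vector with exactly one positive entry is both a labelset and a distribution.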
variable makes it possible to generalize binary problems to multiclass, and ordinal to single-target regression, as well as ML ones to MD and these to MTR. LDL can be seen as a generalization of ML where real numbers between 0 and 1 are also allowed as values for a label. LR is a generalization of ML by the argument discussed before.

[Fig. 4 (diagram): nodes Binary, Multiclass, Ordinal, Standard regression, Multi-label, Multi-dimensional, Multi-target regression, Label ranking and Label distribution learning, connected by arrows.]

Fig. 4 Relations among supervised problems according to output structure. Arrows follow natural generalizations from one problem to another. Continuous arrows denote generalizations based on adding more variables of the same type. Dashed arrows indicate generalizations based on modifying existing target variables.

3.3 Summary

In this section input and output variations of standard supervised problems have been categorized and related. Table 1 makes it possible to identify specific problems according to which input and output traits are present.

4 Common approaches to tackle nonstandard problems

When tackling a nonstandard problem, most techniques follow one of two main approaches: problem transformation or algorithm adaptation. The first one relies on appropriate transformations of the data which result in one or more simpler, standard problems. The latter implies an extension or development of previously existing algorithms, in order to adapt them to the complexities induced by the structure of the data.

In the following subsections several methods based on both approaches are enumerated for each analysed problem.

4.1 Problem transformation

Problem transformation methods assume that a solution can be achieved by extracting one or more simpler problems out of the original one. For example, a problem with multi-dimensional targets could be transformed into many problems with scalar outputs. Then, these problems could be solved independently by a classical algorithm. A solution for the original problem would be the concatenation of those extracted from the simpler ones.

Next, the most common transformation techniques are described for each nonstandard supervised learning task previously introduced.

– MI. The taxonomy proposed in [3] describes an Embedded Space paradigm, where each bag is transformed into a single feature vector representing the relevant information about the whole bag. This transformation turns the MI problem into a single-instance one. Most of these methods are vocabulary-based, which means that the embedding uses a set of concepts to classify each bag according to its instances, resulting in a single vector with one component per concept.

– MV. Some naive transformations consist of ignoring every view except one, or concatenating feature vectors from all views, thus training a single-view model in both cases [68]. A preprocessing step based on Canonical Correlation Analysis [19] is able to project data from multiple views onto a lower-dimensional, single-view space.

– ML. Transformation methods for ML classification [118] are diverse: Binary Relevance trains separate binary classifiers for each label. Label Powerset reduces the problem to a multiclass one by treating each individual labelset as an
Table 1 Identification of problems according to their input traits (vertical axis) and output traits (horizontal axis).
independent class label, and Random k-Labelsets [108] extracts an ensemble of multiclass problems similarly. Classifier chains [91] trains subsequent binary classifiers accumulating previous predictions as inputs. ML problems can also be transformed to LR [44].

– MD. In some cases, independent classifiers can be trained for several dimensions [96, 87] but this method ignores possible correlations among dimensions. An alternative transformation, building a different label from each combination of classes, would produce a much larger label space and thus is not typically applied.

– LDL. An LDL problem can be reduced to multiclass classification by extracting as many single-label examples as labels for each one of the training instances [47]. These new examples are assigned a class corresponding to each label and weighted according to its degree of description. During the prediction process, the classifier must be able to output the score/confidence for each label, which can be used as its description degree.

– LR. A reduction of this problem to several binary problems can be achieved by learning pairwise preferences [57]. This transforms a c-label problem into c(c − 1)/2 binary problems, each describing a comparison between two labels. An alternative reduction by means of constraint classification [53] builds a single binary classification dataset by expanding each label preference into a new positive instance and a new negative instance. The feature space of the new binary problem has dimension nc, where n is the original dimension and c the number of labels, due to the constraints embedded in it by Kesler's construction [85].

– MTR. There are several ways to transform an MTR problem into several single-target regression ones. Some of them are inspired by the ML field, such as a one-vs-all single-target reduction, multi-target stacking and regressor chains [101]. All of them train single-target regressors for several extracted problems, and then combine the obtained predictions. A different approach based on support vectors [119] extends the feature space, which expresses the multi-output problem as a single-target one that can be solved using least squares support vector regression machines.

– OR. An ordinal problem with c classes can be transformed into c − 1 binary classification problems by using each class from the second to the last one as a threshold for the positive class [42]. This decomposition can be called ordered partitions and is not the only possible one: others are one-vs-next, one-vs-followers and one-vs-previous [52]. Several 3-class problems can also be obtained by using, for the i-th problem, classes "l_i", "< l_i" and "> l_i".

– MC. The authors in [65] describe a procedure to tackle binary MC problems by means of IR. Multiclass MC cases can be reduced to several binary MC ones, which in turn are solved as IR problems.

4.2 Algorithm adaptation

Existing methods for classical problems can be extended in order to introduce the necessary complexities of nonstandard variations. As an example, nearest neighbor methods could be coupled with new distance metrics in order to be able to measure similarity among multiple inputs.

The rest of this section presents some algorithm adaptations which can be used to tackle nonstandard supervised tasks.

– MI. Methods that work on instance level are adaptations of algorithms from single-instance classification whose responses are then aggregated to build the bag-level classification [3]. They typically assume that one positive instance implies a positive bag. Adaptations of common algorithms have been proposed with support vector machines (SVM) [4] and neural networks [90], whereas some original methods in this area are Axis-Parallel Rectangles [31] and Diverse Density [77]. In the bag-space paradigm, methods treat bags as a whole and use specific distance metrics with distance-based as well as kernel-based classifiers, such as k-nearest neighbor (k-NN) [114] or SVM [121].

– MV. Supervised methods for MV are comparatively less developed than semi-supervised ones. Nonetheless, there is an extension of SVM [38] which simultaneously looks for two SVMs, one in each of the feature spaces of a two-view problem. There is an extension of Fisher discriminant analysis as well [20].
8 David Charte et al.
– ML. The most relevant algorithm adaptations [118] are based on standard classification algorithms with added support for choosing more than one class at a time: adaptations exist for k-NN [117], decision trees [24], SVMs [36], association rules [106] and ensembles [82].

– MD. Specific Bayesian networks have been proposed for the MD scenario [8, 26], as well as Maximum Entropy-based algorithms [96, 87].

– LDL. Proposals in [47] are adaptations of k-NN, with a special derivation of the label distribution of an unseen instance given its neighbors, and backpropagated neural networks, where the output layer indicates the label distribution of an instance. Other proposed methods are based on the optimization algorithms BFGS and Improved Iterative Scaling.

– LR. Boosting methods have been adapted to LR [28], as well as the SVM proposed in [36] for ML which can be naturally extended to LR [113]. An adaptation of online learning algorithms such as the perceptron has also been developed

Task  Problem transformation               Algorithm adaptation
MI    Embedded-space [3]                   SVM [4, 121]
                                           Neural networks [90]
                                           k-NN [114]
MV    Canonical correlation analysis [19]  SVM [38]
                                           Fisher discriminant analysis [20]
ML    Binary Relevance [118]               k-NN [117]
      Label Powerset [118]                 Decision trees [24]
      Classifier chains [91]               SVM [36]
                                           Association rules [106]
                                           Ensembles [82]
MD    Independent classifiers [96, 87]     Bayesian networks [8, 26]
                                           Maximum Entropy [96, 87]
LDL   Multiclass reduction [47]            k-NN [47]
                                           Neural networks [47]
LR    Pairwise preferences [57]            Boosting [28]
      Constraint classification [53]       SVM [113]
                                           Perceptron [95]
MTR   ML [101]                             Generalizations [59, 111]
      Support vectors [119]                Support vector regression [112, 93]
                                           Kernel-based [79, 1]
                                           Regression trees [27]
                                           Random forests [64]
OR    Ordered partitions [42]              Neural networks [25, 21]
      One-vs-next, One-vs-followers,       Extreme learning machines [30, 94]
      One-vs-previous [52]                 Decision trees [14]
      3-class problems [52]                Gaussian processes [23]
                                           AdaBoost [73]
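
Several of the problem transformations from Section 4.1 amount to simple rewrites of the target values. The following Python sketch illustrates three of them on toy data: Binary Relevance and Label Powerset for ML, and ordered partitions for OR, plus the list of label pairs compared by the pairwise-preference reduction of LR. It is a minimal illustration rather than a reference implementation: the function names and the example targets are hypothetical, and no classifier is trained, only the transformed targets are produced.

```python
from itertools import combinations


def binary_relevance(Y):
    """ML -> one binary target vector per label (columns of the label matrix)."""
    n_labels = len(Y[0])
    return [[row[k] for row in Y] for k in range(n_labels)]


def label_powerset(Y):
    """ML -> multiclass: each distinct labelset becomes one class id.

    Class ids are assigned in order of first appearance.
    """
    class_of = {}
    classes = [class_of.setdefault(tuple(row), len(class_of)) for row in Y]
    return classes, class_of


def ordered_partitions(y, c):
    """OR with classes 1..c -> c - 1 binary targets.

    The target for threshold k (k = 2..c) is 1 when the label is >= k.
    """
    return {k: [int(label >= k) for label in y] for k in range(2, c + 1)}


def pairwise_problems(c):
    """LR with c labels -> the c(c - 1)/2 label pairs to be compared."""
    return list(combinations(range(c), 2))


# Toy targets (assumed for illustration).
Y_ml = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
y_ord = [1, 3, 2, 4]

br_targets = binary_relevance(Y_ml)
lp_classes, lp_mapping = label_powerset(Y_ml)
or_targets = ordered_partitions(y_ord, c=4)
lr_pairs = pairwise_problems(4)
```

Each resulting target vector can then be paired with the unchanged feature vectors and passed to any standard binary or multiclass learner, which is the sense in which these methods reduce a nonstandard task to classical ones.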
– LR. The field known as preference learning has been gaining interest [57], and LR is one of the problems that fall under this term. LR is also frequently applied in ML scenarios [45], where a threshold can be applied in order to transform an obtained ranking into a labelset.

– LDL. Data with relative importance of each label appears in applications such as analysis of gene expression levels in yeast [35], or emotion description from facial expressions [76], where a face can depict several emotions in different grades.

– MTR. Applications modeled as MTR problems are diverse, including modeling of vegetation condition in ecosystems by assigning several scores which depend on the vegetation type [63], prediction of audio spectra of wind tunnel tests [69], and estimation of several biophysical parameters from remote sensing images [109].

– OR. The most salient fields where OR can be found are text classification [5], where the predicted variable may be an opinion scale or a degree of satisfaction; image categorization [107]; medical research [7]; credit rating [70]; and age estimation [15].

6.1 Learning with partial information

In a standard supervised classification setting, it is assumed that every training example is labeled accordingly and that there exist examples for every class that may appear in the testing phase. When only a fraction of the training instances are labeled, the problem is considered semi-supervised [16], but generally there still exist labeled samples for each class.

In positive-unlabeled learning [37, 74], however, labeled examples provided within the training set are only positive. This means the learning algorithm only knows about the class of positive instances, and unlabeled ones can have either class.

A different scenario arises when the training set only consists of negative (or only positive) instances, and no unlabeled examples are provided. This is known as one-class classification [81], and data of this nature can be obtained from outlier detection applications, where positive examples are hardly recorded.

A problem which may be seen as a generalization of one-class classification is zero-shot learning [86], a situation where unseen classes are to be predicted in the testing stage. That is, the label space Y includes some values which are not present in any training pattern, but the classifier must be able to predict them. For example, if in a speech recognition problem Y is the set of all words in English, the training set is unlikely to have at least one instance for each word, thus the classifier will only succeed if it is capable of assigning unlearned words to test examples.

A relaxation of the obstacles of zero-shot learning is present in one-shot learning [39], where algorithms attempt to generalize from very few (1 to 5) examples of each class. This is a common circumstance in the field of image classification, where the cost of collecting and labeling data samples is high.

A classification of these problems according to the type of missing information can be found in Table 3.

The nonstandard variations described in this work generalize traditional supervised problems where the predicted output is at most a vector whose components take values in either a finite set or R. Further generalizations are possible if other kinds of structures are allowed. For example, the target may take the form of an ordered sequence or a tree. In this case, the problem usually enters the scope of structured prediction [104], a generalization of supervised learning where methods must build structured data associated with input instances.

A particular case of supervised problem which can be seen under the umbrella of structured prediction is learning to rank [13], which does not involve a label space as
such. Instead, training consists of learning from a set of feature vectors with a series of preferences among them, that is, a partial or total order in the training set. During testing a set of feature vectors is provided and the desired output is a ranking (with a predefined number of relevance levels, allowing ties) or a sorting (simply an ordering of the instances). This problem differs from OR in that individual classifications are usually meaningless: only relative distances among ranked instances matter.

7 Conclusions

Traditional supervised learning comprises two well-known problems in machine learning: classification and regression. However, the multitude of applications which do not strictly fit the structure of the standard versions of those problems has favored the development of alternative versions which are more flexible and allow the analysis of more complex situations.

In this work an overview of nonstandard variations of supervised learning problems has been presented. A novel taxonomy under several criteria has described relationships among these variations, where the main differentiating properties are multiplicity of inputs, multiplicity of outputs, presence of order relations and constraints, and partial information. Afterwards, common methods for tackling these problems have been outlined and their main applications have been mentioned. Finally, some additional variants which were left out of the scope of the previous analysis have been introduced.

Design of novel algorithms for nonstandard supervised tasks is scarcer than adaptations and transformations, but there exist some approximations and even more open possibilities for tackling these from classical algorithmic perspectives, such as probabilistic and heuristic methods, information theory and linear algebra, among others.

Acknowledgements D. Charte is supported by the Spanish Ministry of Science, Innovation and Universities under the FPU National Program (Ref. FPU17/04069). This work has been partially supported by projects TIN2017-89517-P (FEDER Funds) of the Spanish Ministry of Economy and Competitiveness and TIN2015-68454-R of the Spanish Ministry of Science, Innovation and Universities.

References

1. Alvarez, M.A., Rosasco, L., Lawrence, N.D.: Kernels for vector-valued functions: A review. In: Foundations and Trends in Machine Learning. Now Publishers (2012). DOI 10.1561/2200000036
2. Amini, M., Usunier, N., Goutte, C.: Learning from multiple partially observed views-an application to multilingual text categorization. In: Advances in neural information processing systems, pp. 28–36 (2009)
3. Amores, J.: Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence 201, 81–105 (2013). DOI 10.1016/j.artint.2013.06.003
4. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: Advances in neural information processing systems, pp. 577–584 (2003)
5. Baccianella, S., Esuli, A., Sebastiani, F.: Feature selection for ordinal text classification. Neural computation 26(3), 557–591 (2014). DOI 10.1162/NECO_a_00558
6. Barlow, R.E.: Statistical inference under order restrictions; the theory and application of isotonic regression. Wiley (1972)
7. Bender, R., Grouven, U.: Ordinal logistic regression in medical research. Journal of the Royal College of physicians of London 31(5), 546–551 (1997)
8. Bielza, C., Li, G., Larranaga, P.: Multi-dimensional classification with bayesian networks. International Journal of Approximate Reasoning 52(6), 705–727 (2011)
9. Błaszczyński, J., Słowiński, R., Szelag, M.: Sequential covering rule induction algorithm for variable consistency rough set approaches. Information Sciences 181(5), 987–1002 (2011). DOI 10.1016/j.ins.2010.10.030
10. Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: Feature Selection for High-Dimensional Data. Springer International Publishing, Cham (2015). DOI 10.1007/978-3-319-21858-8. URL https://doi.org/10.1007/978-3-319-21858-8
11. Borchani, H., Varando, G., Bielza, C., Larrañaga, P.: A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5(5), 216–233 (2015). DOI 10.1002/widm.1157
12. Boutell, M., Luo, J., Shen, X., Brown, C.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757–1771 (2004). DOI 10.1016/j.patcog.2004.03.009
13. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proceedings of the 22nd international conference on Machine learning, pp. 89–96. ACM (2005). DOI 10.1145/1102351.1102363
14. Cardoso, J.S., Sousa, R.: Classification models with global constraints for ordinal data. In: 2010 Ninth International Conference on Machine Learning and Applications, pp. 71–77. IEEE (2010). DOI 10.1109/ICMLA.2010.18
15. Chang, K.Y., Chen, C.S., Hung, Y.P.: Ordinal hyperplanes ranker with cost sensitivities for age estimation. In: Computer vision and pattern recognition (cvpr), 2011 ieee conference on, pp. 585–592. IEEE (2011). DOI 10.1109/CVPR.2011.5995437
16. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning, 1st edn. The MIT Press (2010)
17. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Quinta: A question tagging assistant to improve the answering ratio in electronic forums. In: EUROCON 2015 - International Conference on Computer as a Tool (EUROCON), IEEE, pp. 1–6 (2015). DOI 10.1109/EUROCON.2015.7313677
18. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Dealing with difficult minority labels in imbalanced mutilabel data sets. Neurocomputing (2017). DOI 10.1016/j.neucom.2016.08.158
19. Chaudhuri, K., Kakade, S.M., Livescu, K., Sridharan, K.: Multi-view clustering via canonical correlation analysis. In: Proceedings of the 26th annual international conference on machine learning, pp. 129–136. ACM (2009). DOI 10.1145/1553374.1553391
20. Chen, Q., Sun, S.: Hierarchical multi-view fisher discriminant analysis. In: International Conference on Neural Information Processing, pp. 289–298. Springer (2009). DOI 10.1007/978-3-642-10684-2_32
21. Cheng, J., Wang, Z., Pollastri, G.: A neural network approach to ordinal regression. In: Neural Networks, 2008. IJCNN
2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, pp. 1279–1284. IEEE (2008). DOI 10.1109/IJCNN.2008.4633963
22. Cheng, W., Hüllermeier, E., Dembczynski, K.J.: Graded multi-label classification: The ordinal case. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp. 223–230 (2010)
23. Chu, W., Ghahramani, Z.: Gaussian processes for ordinal regression. Journal of machine learning research 6(Jul), 1019–1041 (2005)
24. Clare, A., King, R.D.: Knowledge discovery in multi-label phenotype data. In: European Conference on Principles of Data Mining and Knowledge Discovery, pp. 42–53. Springer (2001). DOI 10.1007/3-540-44794-6_4
25. Costa, M.: Probabilistic interpretation of feedforward network outputs, with relationships to statistical prediction of ordinal quantities. International journal of neural systems 7(05), 627–637 (1996). DOI 10.1142/S0129065796000610
26. De Waal, P.R., Van Der Gaag, L.C.: Inference and learning in multi-dimensional bayesian network classifiers. In: European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, pp. 501–511. Springer (2007). DOI 10.1007/978-3-540-75256-1_45
27. De'Ath, G.: Multivariate regression trees: a new technique for modeling species–environment relationships. Ecology 83(4), 1105–1117 (2002). DOI 10.1890/0012-9658(2002)083[1105:MRTANT]2.0.CO;2
28. Dekel, O., Singer, Y., Manning, C.D.: Log-linear models for label ranking. In: Advances in neural information processing systems, pp. 497–504 (2004)
29. Dembczyński, K., Kotłowski, W., Słowiński, R.: Ensemble of decision rules for ordinal classification with monotonicity constraints. In: International Conference on Rough Sets and Knowledge Technology, pp. 260–267. Springer (2008). DOI 10.1007/978-3-540-79721-0_38
30. Deng, W.Y., Zheng, Q.H., Lian, S., Chen, L., Wang, X.: Ordinal extreme learning machine. Neurocomputing 74(1-3), 447–456 (2010). DOI 10.1016/j.neucom.2010.08.022
31. Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence 89(1-2), 31–71 (1997). DOI 10.1016/S0004-3702(96)00034-3
32. Diplaris, S., Tsoumakas, G., Mitkas, P., Vlahavas, I.: Protein classification with multiple algorithms. In: Proc. 10th Panhellenic Conference on Informatics, Volos, Greece, PCI05, pp. 448–456 (2005). DOI 10.1007/11573036_42
33. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification. John Wiley & Sons (2012)
34. Duivesteijn, W., Feelders, A.: Nearest neighbour classification with monotonicity constraints. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 301–316. Springer (2008). DOI 10.1007/978-3-540-87479-9_38
35. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95(25), 14863–14868 (1998)
36. Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: Advances in neural information processing systems, pp. 681–687 (2002)
37. Elkan, C., Noto, K.: Learning classifiers from only positive and unlabeled data. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220. ACM (2008). DOI 10.1145/1401890.1401920
38. Farquhar, J., Hardoon, D., Meng, H., Shawe-taylor, J.S., Szedmak, S.: Two view learning: Svm-2k, theory and practice. In: Advances in neural information processing systems, pp. 355–362 (2006)
39. Fei-Fei, L., et al.: A bayesian approach to unsupervised one-shot learning of object categories. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pp. 1134–1141. IEEE (2003). DOI 10.1109/ICCV.2003.1238476
40. Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F.: Learning from Imbalanced Data Sets. Springer International Publishing (2018). DOI 10.1007/978-3-319-98074-4
41. Foulds, J., Frank, E.: A review of multi-instance learning assumptions. The Knowledge Engineering Review 25(1), 1–25 (2010). DOI 10.1017/S026988890999035X
42. Frank, E., Hall, M.: A simple approach to ordinal classification. In: European Conference on Machine Learning, pp. 145–156. Springer (2001). DOI 10.1007/3-540-44795-4_13
43. Fukunaga, K.: Introduction to statistical pattern recognition. Elsevier (2013)
44. Fürnkranz, J., Hüllermeier, E., Mencía, E.L., Brinker, K.: Multilabel classification via calibrated label ranking. Machine learning 73(2), 133–153 (2008). DOI 10.1007/s10994-008-5064-8
45. Fürnkranz, J., Hüllermeier, E., Mencía, E.L., Brinker, K.: Multilabel classification via calibrated label ranking. Machine learning 73(2), 133–153 (2008). DOI 10.1007/s10994-008-5064-8
46. Gama, J.: Knowledge discovery from data streams. Chapman and Hall/CRC (2010)
47. Geng, X.: Label distribution learning. IEEE Transactions on Knowledge and Data Engineering 28(7), 1734–1748 (2016). DOI 10.1109/TKDE.2016.2545658
48. Gibaja, E., Ventura, S.: A tutorial on multilabel learning. ACM Computing Surveys (CSUR) 47(3), 52 (2015). DOI 10.1145/2716262
49. Greco, S., Matarazzo, B., Slowinski, R.: A new rough set approach to evaluation of bankruptcy risk. In: Operational tools in the management of financial risks, pp. 121–136. Springer (1998). DOI 10.1007/978-1-4615-5495-0_8
50. Greco, S., Matarazzo, B., Słowiński, R.: Rough set approach to customer satisfaction analysis. In: International Conference on Rough Sets and Current Trends in Computing, pp. 284–295. Springer (2006). DOI 10.1007/11908029_31
51. Gutiérrez, P.A., García, S.: Current prospects on ordinal and monotonic classification. Progress in Artificial Intelligence 5(3), 171–179 (2016). DOI 10.1007/s13748-016-0088-y. URL https://doi.org/10.1007/s13748-016-0088-y
52. Gutiérrez, P.A., Pérez-Ortiz, M., Sánchez-Monedero, J., Fernández-Navarro, F., Hervás-Martínez, C.: Ordinal regression methods: Survey and experimental study. IEEE Transactions on Knowledge and Data Engineering 28(1), 127–146 (2016). DOI 10.1109/TKDE.2015.2457911
53. Har-Peled, S., Roth, D., Zimak, D.: Constraint classification for multiclass classification and ranking. In: Advances in neural information processing systems, pp. 809–816 (2003)
54. Hernández-González, J., Inza, I., Lozano, J.A.: Weak supervision and other non-standard classification problems: A taxonomy. Pattern Recognition Letters 69, 49–55 (2016). DOI 10.1016/j.patrec.2015.10.008
55. Herrera, F., Charte, F., Rivera, A.J., Del Jesus, M.J.: Multilabel classification. Springer (2016)
56. Herrera, F., Ventura, S., Bello, R., Cornelis, C., Zafra, A., Sánchez-Tarragó, D., Vluymans, S.: Multiple instance learning: foundations and algorithms. Springer (2016). DOI 10.1007/978-3-319-47759-6
57. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning pairwise preferences. Artificial Intelligence 172(16-17), 1897–1916 (2008). DOI 10.1016/j.artint.2008.08.002
58. Hyndman, R.J., Athanasopoulos, G.: Forecasting: principles and practice. OTexts (2018)
59. Izenman, A.J.: Reduced-rank regression for the multivariate lin- 77. Maron, O., Lozano-Pérez, T.: A framework for multiple-instance
ear model. Journal of multivariate analysis 5(2), 248–264 (1975). learning. In: Advances in neural information processing systems,
DOI 10.1016/0047-259X(75)90042-1 pp. 570–576 (1998)
60. Jain, A.K., Duin, R.P., Mao, J.: Statistical pattern recognition: 78. Marsland, S.: Machine Learning: An Algorithmic Perspective.
A review. IEEE Transactions on pattern analysis and machine Chapman & Hall (2014)
intelligence 22(1), 4–37 (2000) 79. Micchelli, C.A., Pontil, M.: On learning vector-valued func-
61. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction tions. Neural computation 17(1), 177–204 (2005). DOI 10.1162/
to Statistical Learning: with Applications in R. Springer New 0899766052530802
York, New York, NY (2013). DOI 10.1007/978-1-4614-7138-7 80. Mitchell, T.M.: Machine learning. McGraw Hill series in com-
62. Katakis, I., Tsoumakas, G., Vlahavas, I.: Multilabel text classifi- puter science. McGraw-Hill (1997)
cation for automated tag suggestion. In: Proc. ECML PKDD08 81. Moya, M.M., Koch, M.W., Hostetler, L.D.: One-class classifier
Discovery Challenge, Antwerp, Belgium, pp. 75–83 (2008) networks for target recognition applications. NASA STI/Recon
63. Kocev, D., Džeroski, S., White, M.D., Newell, G.R., Griffioen, Technical Report N 93 (1993)
P.: Using single-and multi-target regression trees and ensembles 82. Moyano, J.M., Gibaja, E.L., Cios, K.J., Ventura, S.: Review of
to model a compound index of vegetation condition. Ecological ensembles of multi-label classifiers: Models, experimental study
Modelling 220(8), 1159–1168 (2009). DOI 10.1016/j.ecolmodel. and prospects. Information Fusion 44, 33–45 (2018). DOI 10.
2009.01.037 1016/j.inffus.2017.12.001
64. Kocev, D., Vens, C., Struyf, J., Džeroski, S.: Tree ensembles for 83. Murphy, K.P.: Machine Learning: A Probabilistic Perspective.
predicting structured outputs. Pattern Recognition 46(3), 817– The MIT Press (2012)
833 (2013). DOI 10.1016/j.patcog.2012.09.023 84. Nguyen, C.T., Wang, X., Liu, J., Zhou, Z.H.: Labeling compli-
65. Kotlowski, W., Slowinski, R.: On nonparametric ordinal classi- cated objects: Multi-view multi-instance multi-label learning. In:
fication with monotonicity constraints. IEEE Transactions on AAAI, pp. 2013–2019 (2014)
Knowledge and Data Engineering 25(11), 2576–2589 (2013). 85. Nilsson, N.J.: Learning machines: foundations of trainable
DOI 10.1109/TKDE.2012.204 pattern-classifying systems. McGraw-Hill (1965)
66. Kotsiantis, S., Kanellopoulos, D., Tampakas, V.: Financial appli- 86. Palatucci, M., Pomerleau, D., Hinton, G.E., Mitchell, T.M.: Zero-
cation of multi-instance learning: two greek case studies. Journal shot learning with semantic output codes. In: Advances in neural
of Convergence Information Technology 5(8), 42–53 (2010) information processing systems, pp. 1410–1418 (2009)
67. Krawczyk, B.: Learning from imbalanced data: open challenges 87. Pan, F.: Multi-dimensional fragment classification in biomedical
and future directions. Progress in Artificial Intelligence 5(4), text. Queen’s University (2006)
221–232 (2016). DOI 10.1007/s13748-016-0094-0. URL 88. Pan, S.J., Kwok, J.T., Yang, Q., Pan, J.J.: Adaptive localization
https://doi.org/10.1007/s13748-016-0094-0 in a dynamic wifi environment through multi-view learning. In:
68. Kumar, A., Rai, P., Daume, H.: Co-regularized multi-view spec- AAAI, pp. 1108–1113 (2007)
tral clustering. In: Advances in neural information processing 89. Potharst, R., Feelders, A.J.: Classification trees for problems with
systems, pp. 1413–1421 (2011) monotonicity constraints. ACM SIGKDD Explorations Newslet-
69. Kuznar, D., Mozina, M., Bratko, I.: Curve prediction with kernel ter 4(1), 1–10 (2002). DOI 10.1145/568574.568577
regression. In: Proceedings of the 1st Workshop on Learning 90. Ramon, J., De Raedt, L.: Multi instance neural networks. In:
from Multi-Label Data, pp. 61–68 (2009) Proceedings of the ICML-2000 workshop on attribute-value and
70. Kwon, Y.S., Han, I., Lee, K.C.: Ordinal pairwise partitioning relational learning, pp. 53–60 (2000)
(opp) approach to neural networks training in bond rating. In- 91. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains
telligent Systems in Accounting, Finance & Management 6(1), for multi-label classification. Machine learning 85(3), 333
23–40 (1997). DOI 10.1002/(SICI)1099-1174(199703)6:1h23:: (2011). DOI 10.1007/s10994-011-5256-5
AID-ISAF113i3.0.CO;2-4 92. Ryu, Y.U., Chandrasekaran, R., Jacob, V.S.: Breast cancer pre-
71. Laghmari, K., Marsala, C., Ramdani, M.: An adapted incremen- diction using the isotonic separation technique. European Jour-
tal graded multi-label classification model for recommendation nal of Operational Research 181(2), 842–854 (2007). DOI
systems. Progress in Artificial Intelligence 7(1), 15–29 (2018). 10.1016/j.ejor.2006.06.031
DOI 10.1007/s13748-017-0133-5 93. Sánchez-Fernández, M., de Prado-Cumplido, M., Arenas-Garcı́a,
72. Li, S.Z., Zhu, L., Zhang, Z., Blake, A., Zhang, H., Shum, H.: Sta- J., Pérez-Cruz, F.: Svm multiregression for nonlinear channel es-
tistical learning of multi-view face detection. In: European Con- timation in multiple-input multiple-output systems. IEEE trans-
ference on Computer Vision, pp. 67–81. Springer (2002). DOI actions on signal processing 52(8), 2298–2307 (2004). DOI
10.1007/3-540-47979-1\ 5 10.1109/TSP.2004.831028
73. Lin, H.T., Li, L.: Combining ordinal preferences by boosting. 94. Sánchez-Monedero, J., Gutiérrez, P.A., Hervás-Martı́nez, C.:
In: Proceedings ECML/PKDD 2009 Workshop on Preference Evolutionary ordinal extreme learning machine. In: International
Learning, pp. 69–83 (2009) Conference on Hybrid Artificial Intelligence Systems, pp. 500–
74. Liu, B., Dai, Y., Li, X., Lee, W.S., Yu, P.S.: Building text clas- 509. Springer (2013). DOI 10.1007/978-3-642-40846-5\ 50
sifiers using positive and unlabeled examples. In: Data Mining, 95. Shalev-Shwartz, S., Singer, Y.: A unified algorithmic approach
2003. ICDM 2003. Third IEEE International Conference on, pp. for efficient online label ranking. In: Artificial Intelligence and
179–186. IEEE (2003). DOI 10.1109/ICDM.2003.1250918 Statistics, pp. 452–459 (2007)
75. López-Cruz, P.L., Bielza, C., Larrañaga, P.: Learning conditional 96. Shatkay, H., Pan, F., Rzhetsky, A., Wilbur, W.J.: Multi-
linear gaussian classifiers with probabilistic class labels. In: Con- dimensional classification of biomedical text: Toward automated,
ference of the Spanish Association for Artificial Intelligence, pp. practical provision of high-utility text to diverse users. Bioinfor-
139–148. Springer (2013). DOI 10.1007/978-3-642-40643-0\ matics 24(18), 2086–2093 (2008). DOI 10.1093/bioinformatics/
15 btn381
76. Lyons, M., Akamatsu, S., Kamachi, M., Gyoba, J.: Coding facial 97. Sill, J.: Monotonic networks. In: Advances in neural information
expressions with gabor wavelets. In: Automatic Face and Ges- processing systems, pp. 661–667 (1998)
ture Recognition, 1998. Proceedings. Third IEEE International 98. Silva, J.A., Faria, E.R., Barros, R.C., Hruschka, E.R., De Car-
Conference on, pp. 200–205. IEEE (1998). DOI 10.1109/AFGR. valho, A.C., Gama, J.: Data stream clustering: A survey. ACM
1998.670949 Computing Surveys (CSUR) 46(1), 13 (2013)
A snapshot on nonstandard supervised learning problems 13
99. Smola, A.J., Schölkopf, B.: On a kernel-based method for pattern 117. Zhang, M.L., Zhou, Z.H.: Ml-knn: A lazy learning approach
recognition, regression, approximation, and operator inversion. to multi-label learning. Pattern recognition 40(7), 2038–2048
Algorithmica 22(1-2), 211–231 (1998) (2007). DOI 10.1016/j.patcog.2006.12.019
100. Sousa, R., Gama, J.: Multi-label classification from high-speed 118. Zhang, M.L., Zhou, Z.H.: A review on multi-label learning al-
data streams with adaptive model rules and random rules. gorithms. IEEE transactions on knowledge and data engineering
Progress in Artificial Intelligence 7(3), 177–187 (2018). DOI 26(8), 1819–1837 (2014). DOI 10.1109/TKDE.2013.39
10.1007/s13748-018-0142-z 119. Zhang, W., Liu, X., Ding, Y., Shi, D.: Multi-output ls-svr ma-
101. Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., Vlahavas, chine in extended feature space. In: Computational Intelligence
I.: Multi-label classification methods for multi-target regression. for Measurement Systems and Applications (CIMSA), 2012
arXiv preprint arXiv 1211 (2012) IEEE International Conference on, pp. 130–134. IEEE (2012).
102. Sun, S., Chao, G.: Multi-view maximum entropy discrimination. DOI 10.1109/CIMSA.2012.6269600
In: IJCAI, pp. 1706–1712 (2013) 120. Zhao, J., Xie, X., Xu, X., Sun, S.: Multi-view learning overview:
103. Surdeanu, M., Tibshirani, J., Nallapati, R., Manning, C.D.: Recent progress and new challenges. Information Fusion 38, 43–
Multi-instance multi-label learning for relation extraction. In: 54 (2017). DOI 10.1016/j.inffus.2017.02.007
Proceedings of the 2012 joint conference on empirical meth- 121. Zhou, Z.H., Sun, Y.Y., Li, Y.F.: Multi-instance learning by treat-
ods in natural language processing and computational natural ing instances as non-iid samples. In: Proceedings of the 26th
language learning, pp. 455–465. Association for Computational annual international conference on machine learning, pp. 1249–
Linguistics (2012) 1256. ACM (2009). DOI 10.1145/1553374.1553534
104. Taskar, B., Chatalbashev, V., Koller, D., Guestrin, C.: Learning 122. Zhou, Z.H., Zhang, M.L., Huang, S.J., Li, Y.F.: Multi-instance
structured prediction models: A large margin approach. In: Pro- multi-label learning. Artificial Intelligence 176(1), 2291–2320
ceedings of the 22nd international conference on Machine learn- (2012). DOI 10.1016/j.artint.2011.10.002
ing, pp. 896–903. ACM (2005). DOI 10.1145/1102351.1102464
105. Tax, D.M., Duin, R.P.: Using two-class classifiers for multiclass
classification. In: Pattern Recognition, 2002. Proceedings. 16th
International Conference on, vol. 2, pp. 124–127. IEEE (2002)
106. Thabtah, F.A., Cowling, P., Peng, Y.: Mmac: A new multi-class,
multi-label associative classification approach. In: Data Mining,
2004. ICDM’04. Fourth IEEE International Conference on, pp.
217–224. IEEE (2004). DOI 10.1109/ICDM.2004.10117
107. Tian, Q., Chen, S., Tan, X.: Comparative study among three
strategies of incorporating spatial structures to ordinal image re-
gression. Neurocomputing 136, 152–161 (2014). DOI 10.1016/
j.neucom.2014.01.017
108. Tsoumakas, G., Vlahavas, I.: Random k-labelsets: An ensemble
method for multilabel classification. In: European conference on
machine learning, pp. 406–417. Springer (2007). DOI 10.1007/
978-3-540-74958-5\ 38
109. Tuia, D., Verrelst, J., Alonso, L., Pérez-Cruz, F., Camps-Valls,
G.: Multioutput support vector regression for remote sensing
biophysical parameter estimation. IEEE Geoscience and Re-
mote Sensing Letters 8(4), 804–808 (2011). DOI 10.1109/LGRS.
2011.2109934
110. Tzortzis, G., Likas, A.: Kernel-based weighted multi-view clus-
tering. In: Data Mining (ICDM), 2012 IEEE 12th International
Conference on, pp. 675–684. IEEE (2012). DOI 10.1109/ICDM.
2012.43
111. Van Der Merwe, A., Zidek, J.: Multivariate regression analysis
and canonical variates. Canadian Journal of Statistics 8(1), 27–
39 (1980). DOI 10.2307/3314667
112. Vazquez, E., Walter, E.: Multi-output support vector regression.
In: 13th IFAC Symposium on System Identification, pp. 1820–
1825. Citeseer (2003)
113. Vembu, S., Gärtner, T.: Label ranking algorithms: A survey. In:
Preference learning, pp. 45–64. Springer (2010). DOI 10.1007/
978-3-642-14125-6\ 3
114. Wang, J., Zucker, J.D.: Solving multiple-instance problem: a lazy
learning approach. In: International Conference on Machine
Learning, pp. 1119–1126. Morgan Kaufmann Publishers (2000)
115. Williams, C.K., Barber, D.: Bayesian classification with gaussian
processes. IEEE Transactions on Pattern Analysis and Machine
Intelligence 20(12), 1342–1351 (1998)
116. Wu, B., Zhong, E., Horner, A., Yang, Q.: Music emotion recogni-
tion by multi-label multi-layer multi-instance multi-view learn-
ing. In: Proceedings of the 22nd ACM international confer-
ence on Multimedia, pp. 117–126. ACM (2014). DOI 10.1145/
2647868.2654904
Heterogeneous Uncertainty Sampling for Supervised Learning
[Figure 2 plots: error rate (per cent) and number of errors versus loss ratio (1, 2, 3, 5, 10, 20).]
Figure 2: Average error rate for C4.5 rules trained on uncertainty samples of size 299 (black dots) and 999 (white dots), at
various loss ratio values. The average error rates for C4.5 rules trained with random samples of size 1,000 (large dashes)
and 10,000 (small dashes) are shown as dashed lines. The percentage of positive instances on the training set follows the
category name; triangles indicate the percentage on the test set.
                       3 + 996 uncertainty                        3 + 9997 random
            Reject     C4.5 (loss ratio 5)  prob. (loss ratio 1)  C4.5 (loss ratio 1)  prob. (loss ratio 1)
Category    All        Average  SD          Average  SD           Average  SD          Average  SD
tickertalk  0.077      0.077    (0.000)     0.078    (0.001)      0.078    (0.003)     0.109    (0.044)
boxoffice   0.081      0.047    (0.002)     0.048    (0.008)      0.061    (0.018)     0.077    (0.021)
bonds       0.115      0.064    (0.002)     0.069    (0.006)      0.076    (0.020)     0.145    (0.069)
nielsens    0.167      0.094    (0.011)     0.062    (0.005)      0.107    (0.006)     0.100    (0.026)
burma       0.179      0.090    (0.008)     0.098    (0.006)      0.115    (0.040)     0.193    (0.046)
dukakis     0.206      0.197    (0.014)     0.208    (0.020)      0.210    (0.039)     0.235    (0.036)
ireland     0.225      0.188    (0.005)     0.189    (0.011)      0.220    (0.024)     0.228    (0.016)
quayle      0.256      0.161    (0.009)     0.222    (0.012)      0.143    (0.010)     0.263    (0.035)
budget      0.379      0.336    (0.010)     0.361    (0.009)      0.350    (0.014)     0.392    (0.016)
hostages    0.439      0.415    (0.024)     0.360    (0.016)      0.466    (0.039)     0.431    (0.018)
Table 2: Average and standard deviation of percentage error of various classifiers. Reject all is a classifier that deems all
instances non-members of the category. Two types of training set were used: an uncertainty sample of size 999 and a
random sample of size 10,000. Two types of classifier are built from each training set: a decision rule classifier trained
using C4.5, and the probabilistic classifier described in the text. When C4.5 was used on the uncertainty sample, a loss
ratio of 5 was used; for the random sample a loss ratio of 1 was used (original C4.5). Figures are averages over 20 runs for
classifiers built from random samples using the probabilistic method, and over 10 runs for the other three combinations.
Table 3: Average number of false positives (FP) and false negatives (FN) for each of 10 categories and 5 conditions.
Experiment conditions are the same as for Table 2.
Table 2 lists error rates for both C4.5 and the probabilistic classifier used during uncertainty sampling. C4.5 figures are for a loss ratio of 5 for uncertainty samples and 1 (the unmodified C4.5) for random samples. The probabilistic classifier uses a loss ratio of 1.0 in both cases. Table 3 shows how the errors divide into false positives and false negatives.

8 Discussion

As Figure 2 shows, an uncertainty sample of 999 instances was in most situations as good for training C4.5 rules as a random sample of 1,000 or even 10,000 instances. At a loss ratio of 5, it was even significantly better (p=.03) than a random sample of 10,000 instances (see footnote 3). In some cases, an uncertainty sample of 299 instances is also as good, though this was less reliable. As expected, it is often necessary to use a loss ratio greater than 1 in training rules. Fortunately, there is some leeway in choosing the loss ratio: good error rates are produced for values from 3 to 20 (the highest value we tried) for our data. These results show that heterogeneous uncertainty sampling can indeed be effective. Table 2 presents the data for the larger uncertainty samples and random samples in tabular form.

To point out the extremely low category frequencies, Figure 2 and Table 2 also indicate the error rate of a strategy that classifies all instances as nonmembers. While such a strategy has a low error rate, it is not useful. In most cases the classifiers did manage to beat this error rate, and an evaluation measure that penalized false negatives would show an even greater advantage for the trained classifiers.

Table 2 also shows error rates for the probabilistic classifier, both on the samples it selected and on random samples of size 10,000. C4.5 is significantly better (p=.01) than the probabilistic classifier on the random sample, but only insignificantly better (p=.30) on the uncertainty sample. This suggests that C4.5 is actually more suitable for this text categorization task than the probabilistic classifier, and that there is some penalty in accuracy for heterogeneity in uncertainty sampling.

Table 3 is similar to Table 2 but shows false positives and false negatives separately. This shows that while the total numbers of errors produced by our classifiers were sometimes not substantially smaller than the total number for a strategy that rejects all instances, the errors were more balanced between false positives and false negatives.

Footnote 3: Significance by t-score. The null hypothesis was that differences in average error rate across the 10 runs for each category were normally distributed with mean zero and a category-specific variance.

9 Future Work

In this section we discuss a few unexplored directions in what we believe is a rich area for study.

We believe uncertainty sampling and other sequential, active, or exploratory approaches to learning [12, 25] enable both learning research and learning applications on large, complex, real-world data sets where fixed training sets are impracticable. Natural language processing, where there is great interest in inducing knowledge to support tagging, parsing, semantic interpretation, and other forms of analysis, is a particularly fruitful application area.

Heterogeneous approaches are likely to become common, in response to both resource limitations and the desire to train new algorithms on previously generated uncertainty samples. A better understanding of how to minimize the problems caused by a heterogeneous approach would be desirable.

Note that we treated our large but finite set of instances as if it were infinite. By adapting results from sequential sampling [32] it may be possible both to improve uncertainty sampling and to tell when additional iterations are no longer providing any benefit, that is, when all the juice has been squeezed out of a data set.

Finally, in contrast to the assumptions made in most theoretical work on querying, our categories are stochastic rather than deterministic. A classifier may indicate that the probability of category membership is 0.5 not because the classifier is incompletely trained, but because the expert may really classify such instances as category members 50% of the time. Indeed, we have seen some evidence of such instances being selected in the later iterations of an uncertainty sampling run.

These murky instances are not the best ones for training [17, 20]. This suggests a goal of producing a classifier that estimates the probability of category membership accurately rather than simply classifying accurately. The variance of this estimate becomes important, and it may be more appropriate to treat the problem as one of regression or interpolation [21, 25] rather than classification.

10 Summary

Using partially formed classifiers to select training data incrementally can reduce the number of instances the expert must label to achieve a given error rate. Our experiments show that some reduction is possible even if this uncertainty sampling is heterogeneous: the classifiers used to select instances were of a very different type from the one built from the final sample. The decision rules C4.5 produced from uncertainty samples of roughly 1,000 instances chosen by a probabilistic classifier were significantly more accurate than those from random samples ten times larger. The ability to use cheap classifiers to select data for training expensive classifiers makes uncertainty sampling even more attractive for a variety of applications where large amounts of unlabeled data are available.
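The heterogeneous procedure summarized above can be illustrated with a small sketch. It is not the experimental setup of the paper: the one-dimensional logistic selector, the threshold rule standing in for C4.5, the synthetic pool, and all function names are assumptions made for illustration. The loss ratio is assumed here to mean the cost of a false negative relative to a false positive.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps=200, lr=0.5):
    """Cheap probabilistic selector: 1-D logistic regression by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        gb = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * gw
        b -= lr * gb
    return w, b


def fit_stump(xs, ys, loss_ratio=1.0):
    """Final classifier of a different type: the rule 'x >= t' with least weighted loss.

    Assumption: a false negative costs loss_ratio times a false positive.
    """
    best_t, best_loss = float("inf"), float("inf")
    for t in sorted(set(xs)) + [float("inf")]:
        fn = sum(1 for x, y in zip(xs, ys) if y == 1 and x < t)
        fp = sum(1 for x, y in zip(xs, ys) if y == 0 and x >= t)
        loss = loss_ratio * fn + fp
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t


def uncertainty_sample(pool, oracle, seed_size=3, batch=4, rounds=5, rng=None):
    """Grow a labeled sample by querying the instances the selector is least sure about."""
    rng = rng or random.Random(0)
    labeled = rng.sample(range(len(pool)), seed_size)
    labels = {i: oracle(pool[i]) for i in labeled}
    for _ in range(rounds):
        xs = [pool[i] for i in labeled]
        ys = [labels[i] for i in labeled]
        w, b = fit_logistic(xs, ys)
        unlabeled = [i for i in range(len(pool)) if i not in labels]
        # Most uncertain = predicted probability closest to 0.5.
        unlabeled.sort(key=lambda i: abs(sigmoid(w * pool[i] + b) - 0.5))
        for i in unlabeled[:batch]:
            labels[i] = oracle(pool[i])  # ask the expert
            labeled.append(i)
    return labeled, labels


# Hypothetical pool: one feature per instance, positives are the rarer class.
data_rng = random.Random(42)
pool = [data_rng.random() for _ in range(500)]
oracle = lambda x: 1 if x > 0.7 else 0

labeled, labels = uncertainty_sample(pool, oracle)
xs = [pool[i] for i in labeled]
ys = [labels[i] for i in labeled]
threshold = fit_stump(xs, ys, loss_ratio=5.0)
```

The selector and the final rule are trained by different procedures, which is the sense of "heterogeneous" used in the text; the selector is discarded once the sample has been chosen.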
Acknowledgements

We thank William Cohen, Eileen Fitzpatrick, Yoav Freund, William Gale, Trevor Hastie, Doug McIlroy, Robert Schapire, and Sebastian Seung for advice and useful comments on this work, and Ken Church for help with his text processing tools.

References

[1] Dana Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
[2] I. Bratko, I. Mozetic, and N. Lavrac. KARDIO: a study in deep and qualitative knowledge for expert systems. MIT Press, Cambridge, Massachusetts, 1989.
[3] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
[4] J. Catlett. Megainduction: a test flight. In Machine Learning: Proceedings of the Eighth International Workshop, pages 596–599, San Mateo, CA, 1991. Morgan Kaufmann.
[5] William G. Cochran. Sampling Techniques. John Wiley & Sons, New York, 3rd edition, 1977.
[6] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with self-directed learning, 1992. To appear in Machine Learning.
[7] Stuart L. Crawford, Robert M. Fung, Lee A. Appelbaum, and Richard M. Tong. Classification trees for information retrieval. In Eighth International Workshop on Machine Learning, pages 245–249, 1991.
[8] Daniel T. Davis and Jenq-Neng Hwang. Attentional focus training by boundary region data selection. In International Joint Conference on Neural Networks, pages I–676 to I–681, Baltimore, MD, June 7–11 1992.
[9] James P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, New York, 1975.
[10] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Information, prediction, and query by committee. In Advances in Neural Information Processing Systems 5, San Mateo, CA, 1992. Morgan Kaufmann.
[11] William A. Gale, Kenneth W. Church, and David Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439, 1993.
[12] B. K. Ghosh. A brief history of sequential analysis. In B. K. Ghosh and P. K. Sen, editors, Handbook of Sequential Analysis, chapter 1, pages 1–19. Marcel Dekker, New York, 1991.
[13] Norm Goldstein, editor. The Associated Press Stylebook and Libel Manual. Addison-Wesley, Reading, MA, 1992.
[14] Donna Harman. Ranking algorithms. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 363–392. Prentice Hall, Englewood Cliffs, NJ, 1992.
[15] Peter E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14:515–516, May 1968. Reprinted in Agrawala, Machine Recognition of Patterns, IEEE Press, New York, 1977.
[16] Jenq-Neng Hwang, Jai J. Choi, Seho Oh, and Robert J. Marks II. Query-based learning applied to partially trained multilayer perceptrons. IEEE Transactions on Neural Networks, 2(1):131–136, January 1991.
[17] Igor Kononenko, Ivan Bratko, and Esidija Roskar. Experiments in automatic learning of medical diagnostic rules. Technical report, Jozef Stefan Institute, Ljubljana, Slovenia, 1984.
[18] David D. Lewis and William A. Gale. Training text classifiers by uncertainty sampling. In Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994. To appear.
[19] David D. Lewis and Philip J. Hayes. Editorial. ACM Transactions on Information Systems. Special Issue on Text Categorization, 1994. To appear.
[20] David J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4:720–736, 1992.
[21] David J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):589–603, 1992.
[22] Michel Manago. Knowledge intensive induction. In Machine Learning: Proceedings of the Sixth International Workshop, pages 151–155, 1989.
[23] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, London, 2nd edition, 1989.
[24] Tom M. Mitchell. Generalization as search. Artificial Intelligence, 18:203–226, 1982.
[25] Mark Plutowski and Halbert White. Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4(2):305–318, March 1993.
[26] J. R. Quinlan. Discovering rules by induction from large collections of examples. In Expert systems in the micro-electronic age, Edinburgh, UK, 1979. Edinburgh University Press.
[27] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[28] J.R. Quinlan. Decision trees as probabilistic classifiers. In Proceedings of the Fourth International Workshop on Machine Learning, pages 31–37, Irvine, California, 1987.
[29] Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.
[30] Claude Sammut, Scott Hurst, Dana Kedzier, and Donald Michie. Learning to fly. In Ninth International Workshop on Machine Learning, pages 385–393, 1992.
[31] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 287–294, 1992.
[32] Bikas Kumar Sinha. Sequential methods for finite populations. In B. K. Ghosh and P. K. Sen, editors, Handbook of Sequential Analysis, chapter 1, pages 1–19. Marcel Dekker, New York, 1991.
[33] Paul E. Utgoff. Improved training via incremental learning. In Sixth International Workshop on Machine Learning, pages 362–365, 1989.
[34] Sholom M. Weiss, Robert S. Galen, and Prasad V. Tadepalli. Maximizing the predictive value of production rules. Artificial Intelligence, 45(1–2):47–71, September 1990.
[35] P. H. Winston. Learning structural descriptions from examples. In P. H. Winston, editor, The Psychology of Computer Vision, pages 157–209. McGraw-Hill, New York, 1975.
[36] J. Wirth and J. Catlett. Costs and benefits of windowing in ID3. In Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, Michigan, 1988. Morgan Kaufmann.
Wasserstein Propagation for Semi-Supervised Learning
Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

Probability distributions and histograms are natural representations for product ratings, traffic measurements, and other data considered in many machine learning applications. Thus, this paper introduces a technique for graph-based semi-supervised learning of histograms, derived from the theory of optimal transportation. Our method has several properties making it suitable for this application; in particular, its behavior can be characterized by the moments and shapes of the histograms at the labeled nodes. In addition, it can be used for histograms on non-standard domains like circles, revealing a strategy for manifold-valued semi-supervised learning. We also extend this technique to related problems such as smoothing distributions on graph nodes.

1. Introduction

Graph-based semi-supervised learning is an effective approach for learning problems involving a limited amount of labeled data (Singh et al., 2008). Methods in this class typically propagate labels from a subset of nodes of a graph to the rest of the nodes. Usually each node is associated with a real number, but in many applications labels are more naturally expressed as histograms or probability distributions. For instance, the traffic density at a given location can be seen as a histogram over the 24-hour cycle; these densities may be known only where a service has cameras installed but need to be propagated to the entire map. Product ratings, climatic measurements, and other data sources exhibit similar structure.

While methods for numerical labels, such as Belkin & Niyogi (2001); Zhu et al. (2003); Belkin et al. (2006); Zhou & Belkin (2011); Ji et al. (2012) (also see the survey by Zhu (2008) and references therein), can be applied bin-by-bin to propagate normalized frequency counts, this strategy does not model interactions between histogram bins. As a result, a fundamental aspect of this type of data is ignored, leading to artifacts even when propagating Gaussian distributions.

Among the first works directly addressing semi-supervised learning of probability distributions is Subramanya & Bilmes (2011), which propagates distributions representing class memberships. Their loss function, however, is based on Kullback-Leibler divergence, which cannot capture interactions between histogram bins. Talukdar & Crammer (2009) allow interactions between bins by essentially modifying the underlying graph to its tensor product with a prescribed bin interaction graph; this approach loses probabilistic structure and tends to oversmooth. Similar issues have been encountered in the mathematical literature (McCann, 1997; Agueh & Carlier, 2011) and in vision/graphics applications (Bonneel et al., 2011; Rabin et al., 2012) involving interpolating probability distributions. Their solutions attempt to find weighted barycenters of distributions, which is insufficient for propagating distributions along graphs.

The goal of our work is to provide an efficient and theoretically sound approach to graph-based semi-supervised learning of probability distributions. Our strategy uses the machinery of optimal transportation (Villani, 2003). Inspired by Solomon et al. (2013), we employ the two-Wasserstein distance between distributions to construct a regularizer measuring the “smoothness” of an assignment of a probability distribution to each graph node. The final assignment is produced by optimizing this energy while fitting the histogram predictions at labeled nodes.

Our technique has many notable properties. As certainty in the known distributions increases, it reduces to the method of label propagation via harmonic functions (Zhu et al., 2003). Also, the moments and other characteristics of the
we replace the square distance between scalar function values appearing in the classical Dirichlet energy (namely the quantity $|f_v - f_w|^2$) with an appropriate distance between the distributions $\rho_v$ and $\rho_w$. Rather than using the bin-by-bin KL divergence, however, we use the Wasserstein distance with quadratic cost between probability distributions with finite second moment on $\mathbb{R}$. This distance is defined as

\[ W_2(\rho_v, \rho_w) := \inf_{\pi \in \Pi(\rho_v, \rho_w)} \left( \iint_{\mathbb{R}^2} |x - y|^2 \, d\pi(x, y) \right)^{1/2} \]

\[ W_2^2(\rho_0, \rho_1) = \int_0^1 \left( F_1^{-1}(s) - F_0^{-1}(s) \right)^2 ds. \tag{2} \]

By applying (2) to the minimization problem (1), we obtain a linear strategy for our propagation problem.

Proposition 2. Wasserstein propagation can be characterized in the following way. For each $v \in V_0$ let $F_v$ be the CDF of the distribution $\rho_v$. Now suppose that for each $s \in [0, 1]$ we determine $g_s : V \to \mathbb{R}$ as the solution of the classical Dirichlet problem
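
Equation (2) and the per-level strategy of Proposition 2 can be prototyped in a few lines. The following is only a sketch under stated assumptions: each distribution is represented by its inverse CDF sampled at K equally spaced levels, edges carry unit weights, the Dirichlet problem is solved by plain Gauss-Seidel sweeps, and the function names and the small path graph are hypothetical.

```python
def w2_squared(q0, q1):
    """Equation (2) by midpoint quadrature.

    q0[k] and q1[k] are inverse-CDF values at the same levels s_k = (k + 0.5) / K.
    """
    assert len(q0) == len(q1)
    return sum((b - a) ** 2 for a, b in zip(q0, q1)) / len(q0)


def harmonic_extension(neighbors, boundary, sweeps=500):
    """Scalar graph Dirichlet problem with unit edge weights (an assumption):
    every unlabeled node is repeatedly set to the average of its neighbors."""
    g = {v: boundary.get(v, 0.0) for v in neighbors}
    for _ in range(sweeps):
        for v in neighbors:
            if v not in boundary:
                g[v] = sum(g[u] for u in neighbors[v]) / len(neighbors[v])
    return g


def propagate_quantiles(neighbors, labeled_quantiles):
    """Solve one scalar Dirichlet problem per quantile level s_k."""
    K = len(next(iter(labeled_quantiles.values())))
    result = {v: [0.0] * K for v in neighbors}
    for k in range(K):
        boundary = {v: q[k] for v, q in labeled_quantiles.items()}
        g = harmonic_extension(neighbors, boundary)
        for v in neighbors:
            result[v][k] = g[v]
    return result


# Hypothetical path graph 0 - 1 - 2 - 3 - 4 with labeled end nodes.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labeled = {0: [0.0, 1.0, 2.0, 3.0], 4: [4.0, 6.0, 8.0, 10.0]}

quantiles = propagate_quantiles(neighbors, labeled)
means = {v: sum(q) / len(q) for v, q in quantiles.items()}
```

Because each level is handled independently by a linear solve, the unlabeled nodes of this path graph receive inverse CDFs that interpolate linearly between the two labeled ends, and the mean at every node stays between the means at the labeled nodes.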
Distribution-valued maps $\rho : V \to \mathrm{Prob}(\mathbb{R})$ propagated by optimizing (1) satisfy many analogs of functions extended using the classical Dirichlet problem. Two results of this kind concern the mean $m(v)$ and the variance $\sigma(v)$ of the distributions $\rho_v$ as functions of $V$. These are defined as

\[ m(v) := \int_{-\infty}^{\infty} x \rho_v(x) \, dx \]

\[ \sigma^2(v) := \int_{-\infty}^{\infty} (x - m(v))^2 \rho_v(x) \, dx. \]

Proposition 3. Suppose the distribution-valued map $\rho : V \to \mathrm{Prob}(\mathbb{R})$ is obtained using Wasserstein propagation. Then for all $v \in V$ the following estimates hold.
• $\inf_{v_0 \in V_0} m(v_0) \le m(v) \le \sup_{v_0 \in V_0} m(v_0)$.
• $0 \le \sigma(v) \le \sup_{v_0 \in V_0} \sigma(v_0)$.

Proof. Both estimates can be derived from the following formula. Let $\rho \in \mathrm{Prob}(\mathbb{R})$ and let $\phi : \mathbb{R} \to \mathbb{R}$ be any integrable function. If we apply the change of variables $s = F(x)$ where $F$ is the CDF of $\rho$ in the integral defining the expectation value of $\phi$ with respect to $\rho$, we get

\[ \int_{-\infty}^{\infty} \phi(x) \rho(x) \, dx = \int_0^1 \phi(F^{-1}(s)) \, ds. \]

Thus $m(v) = \int_0^1 F_v^{-1}(s) \, ds$ and $\sigma^2(v) = \int_0^1 (F_v^{-1}(s) - m(v))^2 \, ds$ where $F_v$ is the CDF of $\rho_v$ for each $v \in V$.

Assume $\rho$ minimizes (1) with fixed boundary constraints on $V_0$. By Proposition 2, we then have $\Delta F_v^{-1} = 0$ for all $v \in V$. Therefore $\Delta m(v) = \int_0^1 \Delta F_v^{-1}(s) \, ds = 0$, so $m$ is a harmonic function on $V$. The estimates for $m$ follow by the maximum principle for harmonic functions. Also,

\[ \Delta[\sigma^2(v)] = \int_0^1 \Delta \left( F_v^{-1}(s) - m(v) \right)^2 ds = \sum_{(v, v') \in E} \int_0^1 \left( a(v, s) - a(v', s) \right)^2 ds \ge 0, \]

where $a(v, s) := F_v^{-1}(s) - m(v)$, since $\Delta F_v^{-1}(s) = \Delta m(v) = 0$. Thus $\sigma^2$ is a subharmonic function and the upper bound for $\sigma^2$ follows by the maximum principle for subharmonic functions.

Finally, we check that if we encode a classical interpolation problem using Dirac delta distributions, we recover

of the classical Dirichlet problem and the Wasserstein propagation problem coincide in the following way. Suppose that $f : V \to \mathbb{R}$ satisfies the classical Dirichlet problem with boundary data $u$. Then $\rho_v(x) := \delta(x - f(v))$ minimizes (1) subject to the fixed boundary constraints.

Proof. The boundary data for $\rho$ given here yields the boundary data $g_s(v) = u(v)$ for all $v \in V_0$ and $s \in [0, 1)$ in the Dirichlet problem (3). The solution of this Dirichlet problem is thus also constant in $s$, let us say $g_s(v) = f(v)$ for all $s \in [0, 1)$ and $v \in V$. The only distributions whose inverse CDFs are of this form are $\delta$-distributions; hence $\rho_v(x) = \delta(x - f(v))$ as desired.

3.2. Application to Smoothing

Using the connection to the classical Dirichlet problem in Proposition 2 we can extend our treatment to other differential equations. There is a large space of differential equations that have been adapted to graphs via the discrete Laplacian $\Delta$; here we focus on the heat equation, considered e.g. in Chung et al. (2007).

The heat equation for scalar functions is applied to smoothing problems; for example, in $\mathbb{R}^n$ solving the heat equation is equivalent to Gaussian convolution. Just as the Dirichlet equation on $F^{-1}$ is equivalent to Wasserstein propagation, heat diffusion on $F^{-1}$ is equivalent to gradient flows of the energy $E_D$ in (1), providing a straightforward way to understand and implement such a diffusive process.

Proposition 5. Let $\rho : V \to \mathrm{Prob}(\mathbb{R})$ be a distribution-valued map and let $F_v : [0, 1] \to \mathbb{R}$ be the CDF of $\rho_v$ for each $v \in V$. Then these two procedures are equivalent:
• Mass-preserving flow of $\rho$ in the direction of steepest descent of the Dirichlet energy.
• Heat flow of the inverse CDFs.

Proof. A mass-preserving flow of $\rho$ is a family of distribution-valued maps $\rho_\varepsilon : V \to \mathrm{Prob}(\mathbb{R})$ with $\varepsilon \in (-\varepsilon_0, \varepsilon_0)$ that satisfies the equations

\[ \frac{\partial \rho_{v,\varepsilon}(t)}{\partial \varepsilon} + \frac{\partial}{\partial t} \big( Y_v(\varepsilon, t) \rho_{v,\varepsilon}(t) \big) = 0, \qquad \rho_{v,0}(t) = \rho_v(t) \qquad \forall v \in V \]

where $Y_v : (-\varepsilon_0, \varepsilon_0) \times \mathbb{R} \to \mathbb{R}$ is an arbitrary function that governs the flow. By applying the change of variables $t = F_{v,\varepsilon}^{-1}(s)$ using the inverse CDFs of the $\rho_{v,\varepsilon}$, we find that
the classical solution. The essence of this result is that this flow is equivalent to the equations
if the boundary data for Wasserstein propagation has zero
variance, then the solution must also have zero variance. −1
∂Fv,ε (s)
−1
= Yv (ε, Fv,ε (s))
Proposition 4. Suppose that there exists u : V0 → R such ∂ε ∀v ∈ V .
that ρv (x) = δ(x−u(v)) for all v ∈ V0 . Then, the solutions −1 −1
Fv,0 (s) = Fv (s)
Wasserstein Propagation
A short calculation starting from (1) now leads to the derivative of the Dirichlet energy under such a flow, namely
\[ \frac{dE_D(\rho_\varepsilon)}{d\varepsilon} = -2 \sum_{v \in V} \int_0^1 \Delta(F_{v,\varepsilon}^{-1}) \cdot Y_v(\varepsilon, F_{v,\varepsilon}^{-1}(s)) \, ds . \]
Thus, steepest descent for the Dirichlet energy is achieved by choosing $Y_v(\varepsilon, F_{v,\varepsilon}^{-1}(s)) := \Delta(F_{v,\varepsilon}^{-1}(s))$ for each $v, \varepsilon, s$. As a result, the equation for the evolution of $F_{v,\varepsilon}^{-1}$ becomes
\[ \begin{aligned} \frac{\partial F_{v,\varepsilon}^{-1}(s)}{\partial \varepsilon} &= \Delta(F_{v,\varepsilon}^{-1}(s)) \\ F_{v,0}^{-1}(s) &= F_v^{-1}(s) \end{aligned} \qquad \forall v \in V \]
which is exactly heat flow of $F_{v,\varepsilon}^{-1}$.

4. Generalization
Our preceding discussion involves distribution-valued maps into $\mathrm{Prob}(\mathbb{R})$, but in a more general setting we might wish to replace $\mathrm{Prob}(\mathbb{R})$ with $\mathrm{Prob}(\Gamma)$ for an alternative domain $\Gamma$ carrying a distance metric $d$. Our original formulation of Wasserstein propagation easily handles such an extension by replacing $|x - y|^2$ with $d(x, y)^2$ in the definition of $W_2$. Furthermore, although proofs in this case are considerably more involved, some key properties proved above for $\mathrm{Prob}(\mathbb{R})$ extend naturally.

In this case, we no longer can rely on the computational benefits of Propositions 2 and 5 but can solve the propagation problem directly. If $\Gamma$ is discrete, then Wasserstein distances between $\rho_v$'s can be computed using a linear program. Suppose we represent two histograms as $\{a_1, \ldots, a_m\}$ and $\{b_1, \ldots, b_m\}$ with $a_i, b_i \ge 0 \ \forall i$ and $\sum_i a_i = \sum_i b_i = 1$. Then, the definition of $W_2$ yields the optimization:
\[ \begin{aligned} W_2^2(\{a_i\}, \{b_j\}) = \min \ & \sum_{ij} d_{ij}^2 x_{ij} \\ \text{s.t.} \ & \sum_j x_{ij} = a_i \ \forall i, \quad \sum_i x_{ij} = b_j \ \forall j, \quad x_{ij} \ge 0 \ \forall i, j \end{aligned} \tag{4} \]
Here $d_{ij}$ is the distance from bin $i$ to bin $j$, which need not be proportional to $|i - j|$.

From this viewpoint, the energy $E_D$ from (1) remains convex in $\rho$ and can be optimized using a linear program simply by summing terms of the form (4) above:
\[ \begin{aligned} \min_{\rho, x} \ & \sum_{e \in E} \sum_{ij} d_{ij}^2 x_{ij}^{(e)} \\ \text{s.t.} \ & \sum_j x_{ij}^{(e)} = \rho_{vi} \quad \forall e = (v, w) \in E, \ i \in S \\ & \sum_i x_{ij}^{(e)} = \rho_{wj} \quad \forall e = (v, w) \in E, \ j \in S \\ & \sum_i \rho_{vi} = 1 \quad \forall v \in V, \qquad \rho_{vi} \text{ fixed} \quad \forall v \in V_0 \\ & \rho_{vi} \ge 0 \quad \forall v \in V, \ i \in S, \qquad x_{ij}^{(e)} \ge 0 \quad \forall i, j \in S \end{aligned} \]
where $S = \{1, \ldots, m\}$.

5. Algorithm Details
We handle the general case from §4 by optimizing the linear programming formulation directly. Given the size of these linear programs, we use large-scale barrier method solvers.

The characterizations in Propositions 2 and 5, however, suggest a straightforward discretization and accompanying set of optimization algorithms in the linear case. In fact, we can recover propagated distributions by inverting the graph Laplacian $\Delta$ via a sparse linear solve, leading to near-real-time results for moderately-sized graphs $G$.

For a given graph $G = (V, E)$ and subset $V_0 \subseteq V$, we discretize the domain $[0, 1]$ of $F_v^{-1}$ for each $v$ using a set of evenly-spaced samples $s_0 = 0, s_1, \ldots, s_m = 1$. This representation supports any $\rho_v$ provided it is possible to sample the inverse CDF from Proposition 1 at each $s_i$. In particular, when the underlying distributions are histograms, we model $\rho_v$ using $\delta$ functions at evenly-spaced bin centers, which have piecewise constant CDFs; we model continuous $\rho_v$ using piecewise linear interpolation. Regardless, in the end we obtain a non-decreasing set of samples $(F^{-1})_v^1, \ldots, (F^{-1})_v^m$ with $(F^{-1})_v^1 = 0$ and $(F^{-1})_v^m = 1$.

Now that we have sampled $F_v^{-1}$ for each $v \in V_0$, we can propagate to the remainder $V \setminus V_0$. For each $i \in \{1, \ldots, m\}$, we solve the system from (3):
\[ \begin{aligned} \Delta g &= 0 && \forall v \in V \setminus V_0 \\ g(v) &= (F^{-1})_v^i && \forall v \in V_0 . \end{aligned} \tag{5} \]
In the diffusion case, we replace this system with implicit time stepping for the heat equation, iteratively applying $(I - t\Delta)^{-1}$ to $g$ for diffusion time step $t$. In either case, the linear solve is sparse, symmetric, and positive definite; we apply Cholesky factorization to solve the systems directly.

This process propagates $F^{-1}$ to the entire graph, yielding samples $(F^{-1})_v^i$ for all $v \in V$. We invert once again to yield samples $\rho_v^i$ for all $v \in V$. Of course, each inversion incurs some potential for sampling and discretization error, but in practice we are able to oversample sufficiently to overcome most potential issues. When the inputs $\rho_v$ are discrete histograms, we return to this discrete representation by integrating the resulting $\rho_v \in \mathrm{Prob}([0, 1])$ over the width of the bin about the center defined above.

This algorithm is efficient even on large graphs and is easily parallelizable. For instance, the initial sampling steps for obtaining $F^{-1}$ from $\rho$ are parallelizable over $v \in V_0$, and the linear solve (5) can be parallelized over samples $i$. Direct solvers can be replaced with iterative solvers for particularly
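
The linear-case procedure described in this section (sample the inverse CDFs on $V_0$, solve the Dirichlet system (5) once per sample, read off the propagated inverse CDFs) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an unweighted graph Laplacian, uses a dense `numpy.linalg.solve` in place of a sparse Cholesky factorization, and the function name `propagate_inverse_cdfs` and the five-vertex path graph are hypothetical.

```python
import numpy as np


def propagate_inverse_cdfs(n_vertices, edges, boundary):
    """Harmonic extension of sampled inverse CDFs (hypothetical helper).

    boundary maps each labeled vertex in V0 to an array of m inverse-CDF
    samples; the return value has one row of m samples per vertex.
    """
    # Assumption: unweighted graph Laplacian L = D - A.
    lap = np.zeros((n_vertices, n_vertices))
    for v, w in edges:
        lap[v, v] += 1.0
        lap[w, w] += 1.0
        lap[v, w] -= 1.0
        lap[w, v] -= 1.0

    fixed = sorted(boundary)
    free = [v for v in range(n_vertices) if v not in boundary]
    m = len(next(iter(boundary.values())))

    f_inv = np.zeros((n_vertices, m))
    for v in fixed:
        f_inv[v] = boundary[v]

    # Rows of the Laplacian at free vertices must vanish:
    #   L_ff g_f + L_fb g_b = 0, solved for all m samples at once.
    lap_ff = lap[np.ix_(free, free)]
    lap_fb = lap[np.ix_(free, fixed)]
    f_inv[free] = np.linalg.solve(lap_ff, -lap_fb @ f_inv[fixed])
    return f_inv


# Path graph 0-1-2-3-4 with labeled endpoints.
s = np.linspace(0.0, 1.0, 11)            # evenly spaced samples of [0, 1]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
boundary = {
    0: 0.5 * s,                          # inverse CDF of Uniform[0, 0.5]
    4: 0.5 + 0.5 * s,                    # inverse CDF of Uniform[0.5, 1]
}
F_inv = propagate_inverse_cdfs(5, edges, boundary)
print(F_inv[2])                          # propagated inverse CDF at the middle vertex
```

On a path graph the harmonic extension is linear interpolation between the endpoints, so the propagated distribution at an interior vertex is a uniform distribution whose support slides from one endpoint's support to the other's, rather than a two-peaked mixture.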
Boundary Value Problems Figure 3 illustrates our algorithm on a less trivial graph G. To mimic a typical test case for classical Dirichlet problems, our graph is a mesh of the

Alternative Target Domain Figure 5 shows an example in which the target is Prob(S^1), where S^1 is the unit circle, rather than Prob([0, 1]). We optimize the E_D using the
Figure 6. We propagate histograms of temperatures collected over time to a map of the United States: (a) Average error at propagated sites
as a function of the number of nodes with labeled distributions; (b) means of the histograms at the propagated sites from a typical trial in
(a); (c) standard deviations at the propagated sites. Vertices with prescribed distributions are shown in blue and comprise ∼ 2% of V .
Figure 7. (a) Interpolating histograms of wind directions using the PDF and Wasserstein propagation methods, illustrated using the same
scheme as Figure 5; (b) entropy values from the same distributions.
Summary

In this paper we present a novel approach to detect salt bodies based on seismic attributes and supervised learning. We report on the use of a machine learning algorithm, Extremely Randomized Trees, to automatically identify and classify salt regions. We have worked with a complex synthetic seismic dataset from the phase I model of the SEG Advanced Modeling Corporation (SEAM) that corresponds to deep water regions of the Gulf of Mexico. This dataset has very low frequency and contains sediments bearing amplitude values similar to those of salt bodies. In the first step of our methodology, where machine learning is applied directly to the seismic data, we obtained accuracy values of around 80%. A second (post-processing) smoothing step improved accuracy to around 95%. We conclude that machine learning is a promising mechanism to identify salt bodies on seismic data, especially with models that can produce complex decision boundaries, while being able to control the associated variance component of error.

Introduction

Seismic-data interpretation has as its main goal the identification of compartments, faults, fault sealing, and trapping mechanisms that hold hydrocarbons; it additionally tries to understand the depositional history of the environment to describe the relationship between seismic data and a priori geological information. Data mining or knowledge discovery in databases (KDD) has become a significant area both in academia and industry. Data mining is the process of extracting novel, useful and understandable patterns from a large collection of data. Automated tools for knowledge discovery are frequently invoked in databases to unveil patterns that show how objects group into some classification scheme; algorithms make use of higher order statistics, feature extraction methods, pattern recognition, clustering methods, and unsupervised and supervised classification. A major strategy in this field is to apply data mining algorithms (Hastie, 2011) to classify points or parts of the 3D seismic data to reinforce correct data interpretations. Multiple studies have shown the benefits of using data mining techniques for seismic-data interpretation. For example, previous work has shown how to generate a set of seismic traces from velocity models containing faults with varying locality, using machine learning to identify the presence of a fault in previously unseen traces (Zhang et al., 2014). Other techniques segment a seismic image into structural and stratigraphic geologic units (Hale, 2002), which is best done using global optimization methods (Shi et al., 2000; Hale et al., 2003). Another solution is to use unsupervised learning techniques (Coléou et al., 2003), often relying on the application of Self Organizing Maps (Castro de Matos et al., 2007). Our new approach is essentially a novel salt body detection workflow. The workflow as a whole envisions the creation of a software solution that can automatically identify, classify and delineate salt bodies from seismic data using seismic attributes and supervised learning algorithms. A comparison between the detected salt body and its interpretation from a 3D synthetic data set testifies to the effectiveness of our approach.

Method

Automated classification of salt bodies using machine learning

Our approach aims at automatically identifying and delineating geological elements from seismic data. Specifically, we focus on the automatic classification of salt bodies using supervised learning techniques. In supervised learning we assume each element of study is represented as an n-component vector-valued random variable (X1, X2, ..., Xn), where each Xi represents an attribute or feature; the space of all possible feature vectors is called the input space X. We also consider a set {w1, w2, ..., wk} corresponding to the possible classes; this forms the output space W. A classifier or learning algorithm typically receives as input a set of training examples from a source domain, T = {(xi, wi)}, where x = (x1, x2, ..., xn) is a vector in the input space, and w is a value in the (discrete) output space. We assume the training or source sample T consists of independently and identically distributed (i.i.d.) examples obtained according to a fixed but unknown joint probability distribution, P(x, w), in the input-output space. The outcome of the classifier is a hypothesis or function f(x) mapping the input space to the output space, f: X → W. We commonly choose the hypothesis that minimizes the expected value of a loss function (e.g., zero-one loss).

The challenge behind classification of seismic data

Our workflow takes as input a cube of seismic data where each voxel stands as a feature vector (we used three informative features as described below). From the whole cube we take a small fraction of representative voxels to form a training set T = {(xi, wi)}, where x = (x1, x2, x3); we assume only two classes: w1 and w2, corresponding to voxels inside and outside the salt body, respectively. This
Supervised learning to detect salt body
workflow is challenging because 1) the sheer size of the 3D data cube precludes training predictive models with more than just 1% of the available training data; this implies several regions of the cube may not be fairly represented in the training set; 2) many learning algorithms are unable to cope with millions of training examples; it took days to complete the entire data processing; 3) classification is difficult because many voxels inside and outside salt bodies have very similar appearance. In machine learning terminology this is a problem known as high Bayes error. The success of this workflow is clearly contingent on finding useful and informative features to appropriately discriminate among classes.

Informative attributes to generate predictive models in seismic data

A proper characterization of voxels can be attained with useful and informative features. We selected three features for our study exhibiting high correlation with the target class: signal amplitude (directly from seismic data), second derivative, and curve length; the last two are derived from amplitude. The second derivative is instrumental in detecting edges in images, and curve length captures patterns which characterize different features observed inside a salt structure and in its surroundings.

Supervised learning algorithms

Our data analysis phase receives as input a body of seismic data with the task of automatically identifying salt regions. We randomly sample a small fraction (0.5%) of the total data; the sample is then assigned class labels by an expert (aided by a software tool that simplifies the labeling process). To achieve a class-balanced problem, we made sure exactly one half of the subset corresponded to salt, and the other half to non-salt (the task exhibited equal class priors). The model was built using 2 million training voxels. Accuracy is estimated using 10-fold cross validation (Hastie, 2011). The classification model was subsequently used to automatically label the entire body of seismic data (376,752,501 voxels). Our top performing learning algorithms were the following: Gradient Boosting Trees (Accuracy 80%), Extremely Randomized Trees (Accuracy 80%), and Random Forests (Accuracy 79%). All our learning algorithms are ensemble methods; these techniques have shown remarkable performance due to their ability to attain low bias (using complex decision boundaries), and low variance (achieved by averaging over various models).

Example

We have tested our proposed technique using SEAM I (SEG Advanced Modeling Corporation) data. This comes from a marine acquisition and poses strong challenges to the geophysical community. The model was inspired by deep-water (600 – 2000 meters) US Gulf of Mexico (GOM) salt structures; its major structural features are a salt body with a rugose top and overhangs, twelve radial faults near the salt root, an overturned sediment raft proximate to the salt root, internal sutures, and a heterogeneous salt cap. The migrated seismic volume was obtained with very low frequency, and there are sediment locations with amplitude values similar to those of the salt body. A migrated seismic volume with these kinds of features makes detection of the salt body very difficult. Mathematical and machine learning algorithms were taken from Python's Numpy and Scikit-learn libraries, respectively. Our final predictive model of choice was Extremely Randomized Trees, which was used to predict the labels of 376,752,501 samples; this resulted in a Boolean mask. The accuracy reported was essentially the same as in cross validation (80%). After that, we removed outliers and misclassifications using mathematical morphological operations and a 3D interactive guided (manual intervention) tool developed in house; finally, we applied threshold segmentation with a local average threshold to get better detection results.

Results

We describe our results by visually comparing our predictions on a cube of seismic data. Figure 1(a) shows a cross section of the seismic data, figure 1(b) shows the classification obtained with our proposed methodology, and figure 1(c) shows the classification after the post-processing step.

Figure 1: (a) Seismic data, (b) classification using our method, (c) results obtained with a post-processing step.

Figure 2 shows the overlapping between seismic data and salt body (white color) detected on different inline
locations. We can observe that seismic attributes used in combination with the machine learning algorithm allow capturing and classifying different patterns and features between sediments and salt body.

Figure 2: Overlapping between seismic data and salt body detected.

To measure accuracy, we count the number of hits between the detected salt body and the interpretation in the following way: using both volumes, we have counted the number of hits voxel by voxel. We refer to this number as NH. The effectiveness ratio is calculated as: (NH/TS) * 100, where TS is the total number of voxels in the volume. Following this technique, we have obtained an accuracy of 95.22%.

Figure 3 shows a comparison between our detected salt body (white color) and its interpretation (red color). We can see the promising quality of our detection for the synthetic seismic dataset used in this work.

Conclusions

We have shown an efficient approach to classify salt bodies from a very complex synthetic seismic dataset using machine learning techniques. Results show very high accuracy when machine learning algorithms are used to predict class labels of voxels on a seismic cube; this is true even after training with a very small portion of the data (0.5%). After a first step, where machine learning is applied directly to the data, we obtained accuracy values of around 80%. A second (post-processing) step increased accuracy to around 95%. We conclude that machine learning is a promising mechanism to identify geological bodies on seismic data when the selected model has high capacity, and is able to control the variance component of error by model averaging (using ensemble techniques).

Acknowledgments

The authors thank Repsol for its support and for the authorization to present this work. We would like to acknowledge also SEG Advanced Modeling Corporation (SEAM) for their initiative on creating a realistic salt model and seismic data used for this study.

References

Castro de Matos, M., Manassi, P. L., Osorio, Schroeder, P. R. 2007. Unsupervised Seismic Facies Analysis using Wavelet Transform and Self Organizing Maps. GEOPHYSICS, 72 (1) pp. 9-21.

Coléou, T., Poupon, M., Azbel, K. 2003. Unsupervised Seismic Facies Classification: A Review and Comparison of Techniques and Implementation: The Leading Edge, 22, pp. 942–953.

Hale, D., Emanuel, J. 2003. Seismic Interpretation using Global Image Segmentation. 73rd Annual International Meeting, Society of Exploration Geophysicists.

Hale, D., 2002. Atomic Meshes from Seismic Imaging to Reservoir Simulation. Proceedings of the 8th European Conference on the Mathematics of Oil Recovery, Freiberg, Germany.

Hastie, T., Tibshirani, R., Friedman, J. 2011. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition, Springer.
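
The classification step from "Supervised learning algorithms" and the effectiveness ratio (NH/TS) * 100 from the Results can be sketched as follows. This is only an illustration under stated assumptions, not the workflow used in the study: the three features are synthetic Gaussian stand-ins rather than amplitude, second derivative, and curve length from the SEAM data, the sample sizes are tiny, a "hit" is assumed to mean a voxel whose predicted label matches the reference label, and all variable names are hypothetical. The scikit-learn calls (`ExtraTreesClassifier`, `cross_val_score`) are standard library functions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)


def synthetic_voxels(n_per_class):
    """Balanced synthetic voxels with three features (NOT seismic data)."""
    salt = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, 3))
    non_salt = rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, 3))
    features = np.vstack([salt, non_salt])
    labels = np.concatenate([
        np.ones(n_per_class, dtype=bool),    # True = salt
        np.zeros(n_per_class, dtype=bool),   # False = non-salt
    ])
    return features, labels


# Class-balanced training sample: half salt, half non-salt.
X_train, y_train = synthetic_voxels(500)

# Extremely Randomized Trees, scored with 10-fold cross validation.
clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("mean 10-fold CV accuracy:", scores.mean())

# Label a separate "volume" and keep the result as a Boolean mask.
X_volume, reference = synthetic_voxels(1000)
clf.fit(X_train, y_train)
mask = clf.predict(X_volume).astype(bool)

# Effectiveness ratio (NH / TS) * 100, counting hits voxel by voxel.
NH = int(np.sum(mask == reference))   # number of hits (assumed: matching labels)
TS = reference.size                   # total number of voxels
effectiveness = NH / TS * 100
print("effectiveness ratio (%):", effectiveness)
```

The morphological cleanup and local-threshold segmentation used as post-processing in the study are not reproduced here.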